10/4 Tue
Things learned:
Machine learning: training artificial intelligence from data
Deep learning: machine learning that uses neural networks
Key components of deep learning:
- data
- model
- loss
- algorithm
Timeline of deep learning:
2012 - AlexNet
2013 - DQN (the beginning of DeepMind)
2014 - Encoder/Decoder (encode the input, then decode it into the desired output)
2014 - Adam (an optimizer that reliably produces good results)
2015 - Generative Adversarial Network
2015 - ResNet(Residual Networks)
2017 - Transformer
2018 - BERT (Bidirectional Encoder Representations from Transformers)
2019 - Big Language Models (GPT-X)
2020 - Self Supervised Learning
Retrospective:
For the competition, NLP really does seem to be the answer..
10/5 Wed
Things learned:
Generalization: how well the learned model will behave on unseen data
- avoiding overfitting
cross-validation: cycling through folds of train data and selecting one as validation data
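A minimal NumPy sketch of this fold cycling (the helper name `k_fold_indices` and the fold count are my own, for illustration only):

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index splits; each fold takes one turn as validation data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# example: 5-fold split over 20 samples
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(20, k=5)):
    print(f"fold {fold}: train={len(train_idx)} samples, val={len(val_idx)} samples")
```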
Bias and variance tradeoff: reducing bias tends to increase variance and vice versa
Bootstrapping: using random sampling with replacement
Bagging (Bootstrap aggregating): multiple models are trained and their predictions aggregated
Boosting: focus on samples that are hard to classify
- bringing together a set of learners in which the learner learns from the mistake of the previous weak learner
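A rough NumPy sketch of bootstrapping and bagging as described above, using the sample mean as a stand-in "model" (the dataset and model choice are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)        # toy dataset

def bootstrap_sample(data, rng):
    """Bootstrapping: random sampling with replacement."""
    idx = rng.integers(0, len(data), size=len(data))
    return data[idx]

# Bagging: train several models on bootstrap samples and aggregate their predictions
estimates = [bootstrap_sample(data, rng).mean() for _ in range(10)]
print("individual estimates:", np.round(estimates, 3))
print("bagged estimate:", round(float(np.mean(estimates)), 3))
```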
Using a small batch size tends to work better, but training takes longer
SGD: update the parameters by the gradient scaled by the learning rate
Momentum: like inertia, carries the previous batch's gradient over into the current update
Nesterov accelerated gradient: takes the gradient once more at the point the momentum step would land on, which keeps momentum from wandering around near the minimum
Adagrad: the more a parameter has changed so far, the smaller its learning rate; the less it has changed, the larger its learning rate
Adadelta: a variant of Adagrad with no learning rate
RMSprop: Exponential Moving Average를 이용한다
Adam(Adaptive Moment Estimation): momentum + adaptive learning rate
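The update rules above, written out as a minimal NumPy sketch on a one-dimensional quadratic loss (the hyperparameter values are illustrative, not from the lecture):

```python
import numpy as np

grad = lambda w: 2 * (w - 3.0)           # gradient of the loss (w - 3)^2, minimum at w = 3
lr = 0.1

# SGD: move by the gradient scaled by the learning rate
w = 0.0
for _ in range(100):
    w -= lr * grad(w)
print("SGD:", round(w, 4))

# Momentum: carry the previous update over like inertia
w, v, beta = 0.0, 0.0, 0.9
for _ in range(100):
    v = beta * v + grad(w)
    w -= lr * v
print("Momentum:", round(w, 4))

# Adam: momentum (first moment) + adaptive per-parameter learning rate (second moment)
w, m, s = 0.0, 0.0, 0.0
b1, b2, eps = 0.9, 0.999, 1e-8
for t in range(1, 101):
    g = grad(w)
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * g * g
    m_hat, s_hat = m / (1 - b1**t), s / (1 - b2**t)
    w -= lr * m_hat / (np.sqrt(s_hat) + eps)
print("Adam:", round(w, 4))
```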
Regularization:
- Early stopping
- stopping training when validation errors start to increase
- Parameter Norm Penalty
- makes sure the parameter does not become too big, smooths function space
- Data Augmentation
- more data, the better
- rotating or flipping images to create data that is similar but not identical
- Noise Robustness
- add random noise to images
- Label Smoothing
- mix-up: mixing the inputs and outputs of two randomly selected training samples (a small sketch follows this list)
- cut-mix: similar to mix-up but replaces part of an image with a patch from another sample
- Dropout
- randomly set some neurons to zero
- Batch Normalization
- normalize the activations within each batch
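A minimal NumPy sketch of the mix-up idea referenced in the list above (the beta-distribution parameter, image size, and label format are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2, rng=rng):
    """Blend two training samples and their one-hot labels with the same random ratio."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# toy 4x4 "images" with one-hot labels for a 3-class problem
img_a, img_b = rng.random((4, 4)), rng.random((4, 4))
lab_a, lab_b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
mixed_x, mixed_y = mixup(img_a, lab_a, img_b, lab_b)
print("mixed (soft) label:", np.round(mixed_y, 3))
```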
Convolution: can blur, emboss, outline an image
CNN: consists of convolution layer, pooling layer, and fully connected layer
- the convolution and pooling layers are used for feature extraction
- Fully connected layer is used for decision making
Stride: how far the filter moves at each step (a larger stride skips pixels)
Padding: adding extra values around the border so the filter can properly cover the edge and corner pixels
Number of parameters: (width of filter * height of filter * depth of input channel + 1) * number of output channels, where the +1 is each filter's bias
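A quick worked example of that parameter count (the filter size and channel numbers are made up):

```python
# parameters of one convolution layer:
# (filter width * filter height * input channels + 1) * output channels, the +1 being the bias
filter_w, filter_h, in_ch, out_ch = 3, 3, 128, 256
params = (filter_w * filter_h * in_ch + 1) * out_ch
print(params)  # 295168
```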
1x1 convolution: dimension reduction to reduce the number of parameters while increasing the depth of CNN
Why AlexNet succeeded: ReLU, 2 GPUs, local response normalization, dropout, data augmentation, overlapping pooling
ReLU:
- preserves the property of a linear model
- easy to optimize
- good generalization
- overcome the vanishing gradient problem
VGGNet: only 3x3, dropout, 1x1 convolution
- using two 3x3 convolutions is better than one 5x5 because it covers the same receptive field with fewer parameters
GoogLeNet: uses inception blocks (to reduce the number of parameters)
- using 1x1 convolutions reduces the number of parameters
ResNet: tackled the problem that a deeper neural network (with an excessive number of parameters, prone to overfitting) is harder to train
- utilized skip connections (see the small sketch after this list)
- bottleneck architecture (using 1x1 convolutions)
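A minimal NumPy sketch of a skip connection, with a single toy layer `f` standing in for the convolution layers inside the block (shapes and weights are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 8))   # toy layer weights

def f(x):
    """Stand-in for the conv/batch-norm layers inside a residual block."""
    return np.maximum(0.0, W @ x)        # linear map + ReLU

def residual_block(x):
    return f(x) + x                      # skip connection: add the input back to the output

x = rng.normal(size=8)
print(residual_block(x))
```

Because the block only has to learn the residual f(x), gradients can also flow straight through the identity path, which is what makes very deep networks trainable.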
DenseNet: concatenate instead of addition
- transition block: reduces the number of parameters
Semantic segmentation: classify objects pixel by pixel
- fully convolutional network (no dense layer)
- this allows the network to output a heatmap
Deconvolution: inverse of convolution
Detection: creating a bounding box
- R-CNN: generate many region proposals, run each through a CNN (AlexNet), classify with an SVM
- SPPNet: the CNN runs only once per image
- Fast R-CNN: learns the bounding box regressor inside the network
- Faster R-CNN: region proposal network+Fast R-CNN
- YOLO: simultaneously predicts multiple bounding boxes and class probabilities
Sequential model: the length of the input is not known in advance
RNN(Recurrent Neural Network): feed the output of the past as input
- problem: short-term dependencies, vanishing/exploding gradient
LSTM (Long Short Term Memory): solves the problems of the RNN by introducing a cell state alongside the hidden state (a small sketch follows this list)
- forget gate: decide what to throw away
- input gate: decide what to store
- update cell
- output gate
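A minimal NumPy sketch of one LSTM step following the gate list above (the weight shapes, initialization, and toy dimensions are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One LSTM step; W maps [h_prev, x] to the four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x]) + b
    f, i, o, g = np.split(z, 4)
    f = sigmoid(f)                  # forget gate: decide what to throw away from the cell state
    i = sigmoid(i)                  # input gate: decide what new information to store
    g = np.tanh(g)                  # candidate values for the cell state
    o = sigmoid(o)                  # output gate: decide what part of the cell to expose
    c = f * c_prev + i * g          # update the cell state
    h = o * np.tanh(c)              # new hidden state
    return h, c

# toy run: input size 3, hidden size 4, sequence length 5
rng = np.random.default_rng(0)
hidden, inp = 4, 3
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + inp))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(5, inp)):
    h, c = lstm_cell(x, h, c, W, b)
print(h)
```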
GRU (Gated Recurrent Unit): no separate cell state, only a hidden state (reset gate and update gate)
Retrospective:
We learned so much that I will need to review it all step by step.
10/6 Thu
Things learned:
Transformer: first sequence transduction model based entirely on attention
- structure: change a sequence to another sequence
- can encode all the words of a sequence at once, unlike an RNN
- stacks of encoder and decoder
Encoder: self-attention and feed-forward neural network
Steps for encoder:
- represent words with embedding vectors (each word gets a unique index that is mapped to a vector)
- transformer encodes each word to feature vectors with self-attention
- self-attention uses the information of other words while they are put into the encoder as inputs
- the feed-forward network is independent while the path of self-attention is dependent upon each other
- self-attention looks for relationships between words (ex. "it" in a sentence refers to "the animals")
- for each word vector, make three vectors, each with its own weight matrix
- queries (Q)
- keys (K)
- values (V)
- compute the score as the dot product of the word's own query with every key (see the attention sketch after this list)
- the score tells how much attention this word should pay to each of the other words
- divide the score by $\sqrt{d_x}$
- find softmax of score
- multiply the softmax with the values
- add all the values to get a final representation of the word
- $softmax \left(\frac{Q \times K^T}{\sqrt{d_x}} \right)\times V=Z$
- concatenate all the Z matrices
- multiply with the weight matrix to produce the outcome dimension that is the same as the input
- add positional encodings to make sure the order is taken into account
- apply layer normalization
- feed-forward
- repeat
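A minimal NumPy sketch of the scaled dot-product self-attention described in the steps above (dimensions and weights are made up; a real encoder adds multiple heads, residual connections, and layer norm on top):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """softmax(Q K^T / sqrt(d)) V, giving one encoded vector Z per word."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how much each word attends to every other word
    return softmax(scores) @ V                # weighted sum of the values

# toy setup: 4 words, model dimension 8
rng = np.random.default_rng(0)
n_words, d_model = 4, 8
X = rng.normal(size=(n_words, d_model))       # word embeddings (plus positional encodings)
Wq, Wk, Wv = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3)]
Z = self_attention(X, Wq, Wk, Wv)
print(Z.shape)                                # (4, 8): one encoded vector per word
```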
The time complexity of transformer: $O(N^2)$ because we need to iterate through all words for each word
With multi-head attention, $i$ heads each encode the input, producing $i$ attention outputs (encoded vectors)
key vector and value vector are sent to the decoder
In the decoder, a self-attention layer is only allowed to attend to earlier positions, which is enforced by masking
In encoder-decoder attention, the query matrix comes from the layer below, while the keys and values come from the encoder
The Vision Transformer applies the same idea to image classification
Retrospective:
We learned the hardest topics, GAN and Transformer; I will need to study them separately later.
10/7 Fri
Things learned:
Generative Model: learning a probability distribution $p(x)$
- Bernoulli distribution
- Categorical distribution
Modeling the variables as independent lowers the number of parameters, but it throws away the dependencies
- this can be addressed with the Markov assumption
Autoregressive models leverage this conditional independence through the Markov assumption
Autoregressive model: predicting next term based on previous terms
- needs ordering of random variables
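In symbols, once an ordering $x_1, \dots, x_n$ is fixed, the joint distribution factorizes by the chain rule, and the Markov assumption shortens each condition to just the previous variable:

$$p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}) \;\xrightarrow{\text{Markov}}\; \prod_{i=1}^{n} p(x_i \mid x_{i-1})$$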
NADE(Neural Autoregressive Density Estimator): is an explicit model that can compute the density of given input
Summary of Autoregressive Model: easy to sample, easy to compute the probability, easy to be extended to continuous variables
Maximum likelihood learning: minimizing KL-divergence maximizes the expected log-likelihood
- approximate the expected log-likelihood with the empirical log-likelihood
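Spelled out, with $p_{data}$ the true distribution, $p_\theta$ the model, and $x^{(1)}, \dots, x^{(N)}$ the training data:

$$D_{KL}(p_{data} \,\|\, p_\theta) = \mathbb{E}_{x \sim p_{data}}[\log p_{data}(x)] - \mathbb{E}_{x \sim p_{data}}[\log p_\theta(x)]$$

The first term does not depend on $\theta$, so minimizing the KL-divergence is the same as maximizing the expected log-likelihood, which is approximated by the empirical average over the training data:

$$\arg\min_\theta D_{KL}(p_{data} \,\|\, p_\theta) = \arg\max_\theta \mathbb{E}_{x \sim p_{data}}[\log p_\theta(x)] \approx \arg\max_\theta \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x^{(i)})$$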
ERM (Empirical Risk Minimization): a commonly used method for maximum likelihood learning
- prone to overfitting
- reduce model space
Autoencoder is not a generative model
Variational Autoencoder aims to maximize $p(x)$
GAN(Generative Adversarial Networks): discriminator and generator
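The two-player objective behind this, from the original GAN formulation: the discriminator $D$ tries to tell real data from generated data, while the generator $G$ tries to fool it.

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$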
Diffusion Model: make an image from noise progressively
- diffusion process: inject noise
- reverse process: denoise the image
Retrospective:
This week has come to an end too. Let's keep up the hard work next week!