10/18 Tue
Things learned:
CNN visualization aims to see what is going on inside a CNN (a black box)
- CNN visualization can be used for debugging
Filter visualization: visualize the learned filters of the first convolution layer, and the activation maps an image produces through them
- this kind of visualization is hard beyond the first convolution layer, since deeper filters have many channels and no direct RGB interpretation (see the sketch below)
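A minimal sketch of first-layer filter visualization, assuming torchvision and matplotlib are available; AlexNet here is just an example of a network with large, easy-to-view first-layer filters.

```python
import torch
import torchvision
import matplotlib.pyplot as plt

# Grab the first conv layer's weights: shape (64, 3, 11, 11) for AlexNet.
model = torchvision.models.alexnet(weights="DEFAULT")
filters = model.features[0].weight.detach().clone()

# Normalize each filter to [0, 1] so it can be shown as an RGB patch.
filters -= filters.amin(dim=(1, 2, 3), keepdim=True)
filters /= filters.amax(dim=(1, 2, 3), keepdim=True)

grid = torchvision.utils.make_grid(filters, nrow=8, padding=1)
plt.imshow(grid.permute(1, 2, 0))  # CHW -> HWC for matplotlib
plt.axis("off")
plt.show()
```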
Two points of focus: focusing on the data vs. focusing on the model
Nearest neighbors in feature space: look for clusters that are semantically similar, not just similar pixel-wise
- each image is located as a point in a high-dimensional feature space
To reduce that high-dimensional space to an observable 2D space, a technique called t-SNE (t-distributed stochastic neighbor embedding) is used (see the sketch below)
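A minimal t-SNE sketch using scikit-learn; `features` and `labels` are placeholders standing in for real CNN features and their class labels.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.randn(500, 512)    # stand-in for (N, D) CNN features
labels = np.random.randint(0, 10, 500)  # stand-in for class labels

# Project the high-dimensional features down to 2D for plotting.
embedded = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)

plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=5)
plt.title("t-SNE of the CNN feature space")
plt.show()
```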
Use a channel's activations in a layer to see where the network is putting its attention
- crop the image around the maximum activation to make patches showing what that channel is focusing on
Class visualization: generating a synthetic image that triggers maximal class activation
- use gradient ascent
- get the prediction score for a dummy (blank or random) image, backpropagate the target class score all the way back to the image, and update the image with the gradient; repeat (see the sketch below)
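A minimal sketch of class visualization by gradient ascent on the input, assuming a pretrained torchvision classifier; the target class, step count, and regularization weight are arbitrary choices.

```python
import torch
import torchvision

model = torchvision.models.resnet18(weights="DEFAULT").eval()
target_class = 130  # an arbitrary ImageNet class index

# Start from a blank dummy image and optimize the image itself.
image = torch.zeros(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.SGD([image], lr=1.0)

for _ in range(100):
    optimizer.zero_grad()
    score = model(image)[0, target_class]
    # Maximize the class score (gradient ascent) with a small L2 penalty
    # so the image stays in a reasonable range.
    loss = -score + 0.01 * image.norm()
    loss.backward()
    optimizer.step()
```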
Saliency by occlusion map: mask part of the image, see how the class score changes depending on where the mask is, and slide the mask across the whole image to get a heatmap showing which parts matter (see the sketch below)
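A minimal occlusion-map sketch; `model`, `image` (a preprocessed 1x3xHxW tensor), and `target_class` are assumed to exist, and the patch size and stride are arbitrary.

```python
import torch

def occlusion_map(model, image, target_class, patch=32, stride=16):
    model.eval()
    _, _, H, W = image.shape
    heat = torch.zeros((H - patch) // stride + 1, (W - patch) // stride + 1)
    with torch.no_grad():
        base = model(image).softmax(dim=1)[0, target_class].item()
        for i, y in enumerate(range(0, H - patch + 1, stride)):
            for j, x in enumerate(range(0, W - patch + 1, stride)):
                occluded = image.clone()
                occluded[:, :, y:y + patch, x:x + patch] = 0.5  # grey patch
                score = model(occluded).softmax(dim=1)[0, target_class].item()
                heat[i, j] = base - score  # a large drop = an important region
    return heat
```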
Saliency by backpropagation: get the class score of the target image, backpropagate it all the way to the image, and visualize the gradient magnitude map (a sketch follows this list)
- when backpropagating through ReLU, apply ReLU to the gradient itself (deconvnet)
- save the forward ReLU pattern and apply it to the gradient on the way back (standard backprop)
- combine the two masks above (guided backpropagation)
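A minimal sketch of the vanilla backpropagation saliency map; `model`, `image`, and `target_class` are assumed as in the occlusion sketch.

```python
import torch

def saliency_map(model, image, target_class):
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image)[0, target_class]
    score.backward()  # gradient of the class score w.r.t. the input pixels
    # Take the max magnitude over the three colour channels -> (H, W) map.
    return image.grad.abs().amax(dim=1)[0]
```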
CAM (class activation mapping): check which part of the image contributes to the classification
- replaces the fully-connected layers with global average pooling (GAP) followed by a single FC layer
- can interpret why the network classified the input as that class; GAP enables localization without any location supervision
- ResNet and GoogLeNet already have a GAP layer, so CAM applies to them directly (see the sketch below)
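A minimal CAM sketch for a GAP-based network, using torchvision's ResNet-18 (the `layer4` and `fc` attribute names are torchvision-specific); the input tensor is random just to keep the snippet self-contained.

```python
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18(weights="DEFAULT").eval()

# Capture the last conv feature maps (1, 512, 7, 7) with a forward hook.
feature_maps = {}
model.layer4.register_forward_hook(
    lambda m, inp, out: feature_maps.update(last=out))

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
logits = model(image)
cls = logits[0].argmax().item()

# CAM = weighted sum of the feature maps, weighted by the FC weights of the
# predicted class (the weights sitting on top of global average pooling).
weights = model.fc.weight[cls]                                   # (512,)
cam = F.relu((weights[:, None, None] * feature_maps["last"][0]).sum(0))
cam = F.interpolate(cam[None, None], size=(224, 224), mode="bilinear")[0, 0]
```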
Grad-CAM: works on an unmodified model; backpropagate the class score to a chosen convolutional layer, apply global average pooling to the gradients to get channel weights, and take a weighted sum of the feature maps to get a CAM-like map
Guided Grad-CAM = Grad-CAM + guided backpropagation (see the sketch below)
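A minimal Grad-CAM sketch reusing `model`, `image`, and `feature_maps` from the CAM sketch above; a backward hook additionally captures the gradient flowing into `layer4`.

```python
import torch
import torch.nn.functional as F

grads = {}
model.layer4.register_full_backward_hook(
    lambda m, gin, gout: grads.update(last=gout[0]))  # grad w.r.t. layer4 output

logits = model(image)
logits[0, logits[0].argmax()].backward()

# Channel weights = global average pooling of the gradients, then a ReLU'd
# weighted sum of the feature maps.
alpha = grads["last"].mean(dim=(2, 3))[0]                        # (512,)
gradcam = F.relu((alpha[:, None, None] * feature_maps["last"][0]).sum(0))
gradcam = F.interpolate(gradcam[None, None], size=(224, 224),
                        mode="bilinear")[0, 0]
```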
GAN dissection: use these interpretation techniques not only for analysis but also for manipulating the generated output
Instance segmentation: semantic segmentation + distinguishing instances
Mask R-CNN: Faster R-CNN + Mask branch
- uses RoIAlign instead of RoI pooling
- the mask branch predicts a binary mask for each class
YOLACT (You Only Look At CoefficienTs): one-stage instance segmentation
- a protonet produces prototype masks, which are combined with per-instance mask coefficients to assemble the final instance masks
YolactEdge: extends YOLACT to video for real-time use on edge devices
Panoptic segmentation: stuff + instances of things
UPSNet: semantic head and instance head → panoptic head → panoptic logits
VPSNet: UPSNet for video
- fuses features at the pixel level
- tracks instances at the object level
Landmark localization: predicting the coordinates of key points
- Coordinate regression: inaccurate and biased
- Heatmap classification: better performance but computationally expensive
A landmark location can be converted to a Gaussian heatmap label (see the sketch below)
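A minimal sketch of the coordinate-to-Gaussian-heatmap conversion; the heatmap size and sigma are arbitrary choices.

```python
import torch

def gaussian_heatmap(x, y, size=64, sigma=2.0):
    # Build a size x size grid and place a Gaussian bump centred at (x, y).
    coords = torch.arange(size).float()
    grid_y, grid_x = torch.meshgrid(coords, coords, indexing="ij")
    return torch.exp(-((grid_x - x) ** 2 + (grid_y - y) ** 2) / (2 * sigma ** 2))

heatmap = gaussian_heatmap(x=20, y=35)  # (64, 64) map peaking at column 20, row 35
```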
Stacked hourglass modules allow for repeated bottom-up and top-down inference that refines the output of the previous hourglass module
- similar to U-Net, but the skip connections pass through a convolutional layer rather than being connected directly
UV map: flattened representation of 3D geometry, invariant to motion
DensePose R-CNN: Faster R-CNN + a 3D surface regression branch for dense 3D (UV) localization
RetinaFace: feature pyramid network + multi-task branches
Object detection can also be done through keypoint (landmark) detection
CornerNet: uses two corners (top-left, bottom-right) to define each bounding box
CenterNet (1): adds a center point to CornerNet's corner pair
CenterNet (2): uses the center point plus width and height to define the bounding box
10/19 Wed
Things learned:
Autograd: automatic gradient calculation API
- the requires_grad flag makes a tensor store its gradient for backpropagation
- the retain_graph argument keeps intermediate buffers from being freed, so gradients can be computed multiple times
- a hook lets you access the gradient at the moment it is computed
- inside a hook, do not modify the gradient argument in place; return a new tensor instead (see the sketch below)
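A minimal sketch exercising all three points above: requires_grad, retain_graph, and a hook that returns a new tensor instead of modifying its argument.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)  # gradient will be stored in x.grad
y = (x ** 2).sum()

# The hook sees the gradient as it is computed; return a NEW tensor rather
# than modifying `grad` in place.
x.register_hook(lambda grad: grad * 0.5)

y.backward(retain_graph=True)  # keep the graph so we can backprop again
print(x.grad)                  # tensor([2., 3.]) == 0.5 * dy/dx

y.backward()                   # second backward works thanks to retain_graph
print(x.grad)                  # gradients accumulate: tensor([4., 6.])
```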
Conditional generative model: explicitly generate an image corresponding to a given condition
- can be used for image translation, super-resolution, etc.
Plain regression with MAE/MSE loss produces safe, average-looking images, whereas a GAN loss implicitly judges whether an image looks real or fake and pushes the output toward the real distribution
Pix2Pix: translating an image to another style of image
- GAN loss induces more realistic output close to the real distribution
- examples: semantic map to photo, colorization
- requires paired data
CycleGAN: allows translation between domains with unpaired datasets
- loss: GAN loss (in both directions) + cycle-consistency loss
- the cycle-consistency loss converts X to Y and back again and penalizes the difference from the original (see the sketch below)
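A minimal sketch of the cycle-consistency term; `G_xy`, `G_yx` (the two generators) and `real_x`, `real_y` (batches from each domain) are assumed to exist, and the weight lam is arbitrary.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(G_xy, G_yx, real_x, real_y, lam=10.0):
    # X -> Y -> X should reconstruct the original X (and symmetrically for Y).
    rec_x = G_yx(G_xy(real_x))
    rec_y = G_xy(G_yx(real_y))
    return lam * (F.l1_loss(rec_x, real_x) + F.l1_loss(rec_y, real_y))
```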
Perceptual loss: leverages a pretrained classifier so the loss reflects perception more like a human's
- use a frozen VGG as a loss network to compute the loss for the image-translation network (see the sketch below)
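A minimal perceptual-loss sketch with a frozen VGG-16 as the loss network; slicing `features[:16]` (up to relu3_3) and comparing activations with MSE are assumptions, not the only possible choices.

```python
import torch
import torch.nn.functional as F
import torchvision

class PerceptualLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Frozen loss network: VGG-16 features up to relu3_3.
        vgg = torchvision.models.vgg16(weights="DEFAULT").features[:16].eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg

    def forward(self, generated, target):
        # Compare feature activations instead of raw pixels.
        return F.mse_loss(self.vgg(generated), self.vgg(target))
```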
10/20 Fri
Multimodal learning: using multiple modalities of input for one output
Problems of multimodal learning:
- the different modalities have different input shapes and representations
- there is an imbalance between the heterogeneous feature spaces
- models can become biased toward a specific modality
Text embedding: text is mapped to dense vectors
- learning a dense representation allows generalization
Word2vec: skip-gram model
- learns to predict the N neighboring words of a center word, which captures the relationships between words
Joint embedding: run two models in parallel and merge their outputs at the last layer into a shared embedding space
- example: image tagging
Cross-modal translation: converting one modality into another
- image captioning
- read the image with a CNN, attend to specific parts, and generate text from the attended features
- text-to-image
- a cGAN whose generator and discriminator both take the text embedding as an additional input
Cross-modal reasoning: using multiple modalities together to infer an answer
- visual question answering
Sound representation: use a short-time Fourier transform to convert the waveform into power spectra, and stack the spectra along the time axis to make a spectrogram for learning (see the sketch below)
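A minimal spectrogram sketch with torch.stft; the sample rate, FFT size, and hop length are arbitrary, and `waveform` is a random stand-in for real audio.

```python
import torch

waveform = torch.randn(16000)  # stand-in for 1 second of 16 kHz audio

# Short-time Fourier transform: Fourier transforms over short windows,
# stacked along the time axis.
spec = torch.stft(waveform, n_fft=512, hop_length=256,
                  window=torch.hann_window(512), return_complex=True)
power = spec.abs() ** 2                             # power spectrum per frame
log_spectrogram = 10 * torch.log10(power + 1e-10)   # (freq_bins, time_frames)
```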
SoundNet: learn audio representation from synchronized RGB frames
- teacher-student manner: knowledge from the visual model is transferred to the sound model
Speech2Face: trained in a self-supervised manner to make the voice and face features compatible
Image2Speech: image → CNN → attention → sub-word units → speech
Sound source localization: combine an audio net and a visual net through an attention net to visualize where the sound is coming from
3D data can be represented in many forms, such as meshes, volumetric grids, part assemblies, and point clouds
3D object recognition: use 3D CNN
3D object detection: useful for autonomous driving
3D object segmentation: useful for neuroimaging
Transformer: captures long-term dependencies via attention