연두색연필
LimePencil's Log
Boostcamp Week 5 Study Log - Computer Vision Basics

2022. 10. 18. 09:45

Tech stack used:

10/18 (Tue)

Things learned:

CNN visualization aims to see what is inside a CNN (a black box)

- CNN visualization can be used for debugging models

 

Filter visualization: can be used to show the learned filters and the activation maps they produce for an image

- it is hard to visualize this way after the first convolution layer, since deeper filters have many channels

 

Two areas of focus: focusing on the data / focusing on the model

 

Nearest neighbors in feature space: can find clusters that are semantically similar, not just similar by pixel-wise comparison

- each image is located in a high-dimensional feature space

 

To reduce that high-dimensional space to an observable 2D space, a technique called t-SNE (t-distributed stochastic neighbor embedding) is used

 

Use a channel's activation in a layer to look for where the network is putting its attention

- crop the image around the max activation to make patches showing what the channel is focusing on
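The patch-cropping step above can be sketched in a few lines. This is a hypothetical numpy illustration, not the lecture's code: `crop_max_activation_patch`, the dummy image, and the activation map are invented for the example, and a real pipeline would first resize the channel's activation map to the image size.

```python
import numpy as np

def crop_max_activation_patch(image, activation, patch=4):
    """Crop a square patch of `image` centered on the peak of `activation`.

    `activation` is assumed to be one channel's activation map, already
    resized to the spatial size of `image`.
    """
    y, x = np.unravel_index(np.argmax(activation), activation.shape)
    half = patch // 2
    top = int(np.clip(y - half, 0, image.shape[0] - patch))
    left = int(np.clip(x - half, 0, image.shape[1] - patch))
    return image[top:top + patch, left:left + patch]

img = np.arange(100).reshape(10, 10)
act = np.zeros((10, 10))
act[7, 7] = 1.0  # pretend this channel fires strongest here
patch = crop_max_activation_patch(img, act, patch=4)
print(patch.shape)  # (4, 4)
```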

 

Class visualization: generating a synthetic image that maximally activates a target class

- use gradient ascent

- get the prediction score of a dummy image, backpropagate to the image to maximize the target class score, and update the image with the gradient; repeat until the image strongly triggers the class
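Stripped of the CNN, the update rule above is plain gradient ascent. A toy 1-D sketch, where a made-up quadratic `score` and its hand-derived gradient stand in for the class score and backpropagation:

```python
# Toy gradient ascent: treat `x` as a one-pixel "image" and
# score(x) = -(x - 3)^2 as the class score we want to maximize.
def score(x):
    return -(x - 3.0) ** 2

def grad(x):
    # d(score)/dx, derived by hand for this toy score
    return -2.0 * (x - 3.0)

x = 0.0                # start from a blank dummy image
lr = 0.1
for _ in range(100):
    x += lr * grad(x)  # ascend: move x along the gradient direction

print(round(x, 3))     # converges near 3, where the score is maximal
```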

 

Saliency by occlusion map: occlude part of the image, observe how the score changes with the location of the mask, and slide the mask across the image to get a heatmap showing which parts are important
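The sliding-mask procedure might look like this minimal numpy sketch; `occlusion_map` and the sum-based `score_fn` are invented stand-ins for a real classifier's class score:

```python
import numpy as np

def occlusion_map(image, score_fn, mask_size=2):
    """Slide a zero-mask over `image`; record the score drop at each spot.

    A larger drop means that region mattered more (higher saliency).
    """
    base = score_fn(image)
    h, w = image.shape
    heat = np.zeros((h - mask_size + 1, w - mask_size + 1))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            occluded = image.copy()
            occluded[i:i + mask_size, j:j + mask_size] = 0.0
            heat[i, j] = base - score_fn(occluded)
    return heat

# Dummy "classifier": the score is just the sum of a bright region.
img = np.zeros((6, 6))
img[2:4, 2:4] = 1.0
heat = occlusion_map(img, score_fn=lambda x: x.sum(), mask_size=2)
print(np.unravel_index(np.argmax(heat), heat.shape))  # peak at (2, 2)
```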

 

Saliency by backpropagation: get the class score of the target image, backpropagate to the image, and visualize the gradient magnitude map

- deconvnet: when backpropagating through ReLU, apply ReLU to the gradient as well

- standard backprop: save the forward ReLU pattern and apply it backward

- guided backpropagation: combine the two methods above

 

CAM (class activation mapping): checks which part of the image contributes to the classification

- use global average pooling (GAP) instead of a fully connected layer

- can interpret why the network classified the input as that class; GAP enables localization without location supervision

- ResNet and GoogLeNet already have a GAP layer
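The CAM computation itself is just a weighted sum of the last conv layer's feature maps, using the FC weights of the target class. A hypothetical numpy sketch with random toy tensors (shapes and names are made up for the example):

```python
import numpy as np

def class_activation_map(features, fc_weights, cls):
    """CAM for class `cls`: weighted sum of the last conv feature maps.

    features:   (C, H, W) activations of the last conv layer
    fc_weights: (num_classes, C) weights of the FC layer after GAP
    """
    return np.tensordot(fc_weights[cls], features, axes=1)  # (H, W)

rng = np.random.default_rng(0)
features = rng.random((8, 7, 7))   # toy feature maps
fc_w = rng.random((10, 8))         # toy classifier weights
cam = class_activation_map(features, fc_w, cls=3)
print(cam.shape)  # (7, 7): a coarse heatmap over the image
```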

 

Grad-CAM: use the unmodified model, backpropagate the class score to a convolutional layer, and apply global average pooling to the gradients to obtain the CAM weights

 

Guided Grad-CAM = Grad-CAM + guided backpropagation

 

GAN dissection: use interpretation not only for analysis but also for manipulating the generated images

 

Instance segmentation: semantic segmentation + distinguishing instances

 

Mask R-CNN: Faster R-CNN + mask branch

- uses RoIAlign instead of RoI pooling

- the mask branch predicts a binary mask for each class

 

YOLACT (You Only Look At CoefficienTs): one-stage instance segmentation

- uses Protonet to produce prototype masks, which are combined per instance into the final output

 

YolactEdge: extends YOLACT to video

 

Panoptic segmentation: stuff + instances of things

 

UPSNet: semantic and instance head → Panoptic head → panoptic logits

 

VPSNet: UPSNet for video

- fusion at pixel level

- track instances at the object level

 

 

Landmark localization: predicting the coordinates of key points

- Coordinate regression: inaccurate and biased

- Heatmap classification: better performance but computationally expensive

 

A landmark location can be converted to a Gaussian heatmap
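Rendering a landmark as a Gaussian heatmap can be sketched as below; `landmark_to_heatmap` and its `size`/`sigma` values are illustrative choices, not a specific paper's settings:

```python
import numpy as np

def landmark_to_heatmap(x0, y0, size=64, sigma=2.0):
    """Render a landmark (x0, y0) as a 2-D Gaussian heatmap target."""
    xs = np.arange(size)            # column coordinates
    ys = np.arange(size)[:, None]   # row coordinates (broadcast to 2-D)
    return np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * sigma ** 2))

heat = landmark_to_heatmap(20, 30)
peak = np.unravel_index(np.argmax(heat), heat.shape)
print(peak)  # (30, 20): row = y, column = x
```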

 

Stacked hourglass modules allow repeated bottom-up and top-down inference, with each hourglass refining the output of the previous one

- similar to UNet, but the skip connections pass through a convolutional layer instead of being connected directly

 

 

UV map: a flattened representation of 3D geometry, invariant to motion

 

DensePose R-CNN: 3D landmark localization using Faster R-CNN and a 3D surface regression branch

 

RetinaFace: feature pyramid network + multi-task branches

 

Objects can also be detected from key points via landmark detection

 

CornerNet: use two corners (top-left, bottom-right) for the bounding box

 

CenterNet 1: add a center point to CornerNet

 

CenterNet 2: use the width, height, and center to find the bounding box
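The two keypoint parametrizations describe the same bounding box; a small illustrative sketch (the function names are made up):

```python
# Two ways key points can parametrize the same bounding box:
# CornerNet-style (two corners) vs. CenterNet-style (center + size).

def box_from_corners(top_left, bottom_right):
    (x1, y1), (x2, y2) = top_left, bottom_right
    return x1, y1, x2, y2

def box_from_center(center, width, height):
    cx, cy = center
    return cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2

a = box_from_corners((10, 20), (50, 60))
b = box_from_center((30, 40), 40, 40)
print(a == b)  # True: both describe the box (10, 20, 50, 60)
```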


10/19 (Wed)

Things learned:

 

Autograd: automatic gradient-calculation API

- the requires_grad argument makes a tensor store its gradient for backpropagation

- the retain_graph argument keeps intermediate buffers so gradients can be calculated multiple times

- hooks allow capturing the gradient while it is being calculated

- when writing a hook, do not modify the argument in place; return a new tensor instead
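To make the hook idea concrete without PyTorch, here is a toy scalar autograd (in the spirit of micrograd, heavily simplified, and not PyTorch's actual hook semantics): the hook receives each incoming gradient contribution and returns a new value rather than mutating it.

```python
class Value:
    """A one-operation toy autograd: tracks data, gradient, and a hook."""

    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self.hook = None          # optional: fn(grad) -> new grad
        self._backward_fn = None  # set when this Value is an op's output

    def __mul__(self, other):
        out = Value(self.data * other.data)
        def backward_fn(g):       # chain rule for z = a * b
            self._accumulate(g * other.data)
            other._accumulate(g * self.data)
        out._backward_fn = backward_fn
        return out

    def _accumulate(self, g):
        if self.hook is not None:
            g = self.hook(g)      # the hook returns a *new* gradient
        self.grad += g
        if self._backward_fn is not None:
            self._backward_fn(g)

    def backward(self):
        self._accumulate(1.0)

x = Value(3.0)
x.hook = lambda g: g * 2.0   # double each gradient contribution
y = x * x                    # y = x^2, so dy/dx = 2x = 6 without a hook
y.backward()
print(x.grad)                # 12.0: both contributions were doubled
```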

 

Conditional generative model: explicitly generates an image corresponding to a given condition

- can be used for image translation, super-resolution, etc.

 

Using regression produces a safe, average-looking image because of MAE/MSE losses, whereas a GAN loss implicitly checks whether it is seeing a fake or a real image
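The "safe average" effect of regression losses can be seen in one dimension: with two equally likely ground-truth modes, the MSE-optimal constant prediction is their mean, which resembles neither mode. A tiny illustrative scan:

```python
# Two plausible ground-truth outputs for the same input (two "modes").
modes = [0.0, 1.0]

def mse(pred):
    return sum((pred - m) ** 2 for m in modes) / len(modes)

# Scan candidate predictions; the minimizer lands at the mean (0.5).
candidates = [i / 100 for i in range(101)]
best = min(candidates, key=mse)
print(best)  # 0.5: a blurry average, not a sharp sample from either mode
```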

 

Pix2Pix: translating an image into another style of image

- the GAN loss induces more realistic output, closer to the real distribution

- e.g., semantic map to photo, colorization

- needs paired data

 

CycleGAN: allows translation between domains with unpaired datasets

- loss: GAN loss (in both directions) + cycle-consistency loss

- the cycle-consistency loss is calculated by translating X to Y and back again and measuring the difference
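A toy sketch of the cycle-consistency idea with 1-D "domains" and made-up linear generators `G` and `F`:

```python
# G maps X -> Y, F maps Y -> X. If F inverts G, the cycle loss
# |F(G(x)) - x| is zero; otherwise it penalizes the mismatch.
G = lambda x: 2 * x + 1    # stand-in generator X -> Y
F = lambda y: (y - 1) / 2  # stand-in generator Y -> X (inverse of G here)

def cycle_loss(xs):
    return sum(abs(F(G(x)) - x) for x in xs) / len(xs)

print(cycle_loss([0.0, 1.0, 2.0]))  # 0.0: perfectly cycle-consistent

F_bad = lambda y: y / 2  # not the inverse: the cycle loss becomes nonzero
bad_loss = sum(abs(F_bad(G(x)) - x) for x in [0.0, 1.0, 2.0]) / 3
print(bad_loss)  # 0.5
```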

 

Perceptual loss: by utilizing a pretrained classifier, the loss can reflect perception much like a human's

- build a loss network from VGG to provide the loss for the image translation network

 


10/20 (Fri)

Multimodal: using multiple modalities of input to produce an output

 

Problems of multimodal learning:

- the shapes of the inputs differ across modalities

- there is an imbalance between the heterogeneous feature spaces

- models can become biased toward a specific modality

 

Text embedding: text is mapped to dense vectors

- learning dense representations allows generalization

 

Word2vec: skip-gram model

- learns to predict the n neighboring words, capturing the relationships between words
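Generating the skip-gram (center, context) training pairs can be sketched as follows; `skipgram_pairs` is an illustrative helper, not Word2vec's actual training code:

```python
# Skip-gram training pairs: for each center word, predict its neighbors
# within a window of n words on each side.
def skipgram_pairs(tokens, n=1):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - n), min(len(tokens), i + n + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, n=1)
print(pairs[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```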

 

Joint embedding: combine two modality-specific models, merging them in the last layer

- e.g., image tagging

 

Cross-modal translation: converting one modality into another

- image captioning: read the image with a CNN, attend to specific parts, and use that to generate text

- text-to-image: with a cGAN, both the generator network and the discriminator network take the text data as an additional input

 

Cross-modal reasoning: using multiple modalities to infer something

- e.g., visual question answering

 

Sound representation: use the Fourier transform to convert the waveform into a power spectrum, and stack the spectra along the time axis to make a spectrogram for learning
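The spectrogram construction (a windowed FFT per frame, stacked over time) can be sketched with numpy; the frame and hop sizes here are arbitrary illustrative values:

```python
import numpy as np

def spectrogram(signal, frame=64, hop=32):
    """Stack per-frame power spectra along time (a minimal STFT sketch)."""
    window = np.hanning(frame)
    frames = [
        signal[start:start + frame] * window
        for start in range(0, len(signal) - frame + 1, hop)
    ]
    # rfft per frame; magnitude squared gives the power spectrum
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (time, freq bins)

t = np.arange(1024)
sig = np.sin(2 * np.pi * t * 8 / 64)  # pure tone at FFT bin 8 for frame=64
spec = spectrogram(sig)
print(spec.shape)                 # (31, 33): time frames x frequency bins
print(spec.mean(axis=0).argmax()) # 8: energy concentrates in bin 8
```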

 

SoundNet: learn audio representation from synchronized RGB frames

- teacher-student manner: knowledge from the visual model is transferred to the sound model

 

Speech2Face: trained in a self-supervised manner for making features compatible

 

Image2Speech: image → CNN → attention → sub-word units → speech

 

Sound source localization: use the audio net and visual net with an attention net to visualize where the sound comes from

 

3D data is represented in many styles, such as mesh, volumetric, part assembly, and point cloud

 

3D object recognition: use 3D CNN

 

3D object detection: useful for autonomous driving

 

3D object segmentation: useful for neuroimaging

 

Transformer: captures long-term dependencies via attention

 

 

 

 

Other posts in the '잡다한 것들 > 부스트캠프 AI Tech 4기' category

CV 기초대회 final retrospective  2022.11.04
Week 6 study log - CV 기초대회  2022.10.24
Boostcamp Week 4 study log - Computer Vision Basics  2022.10.11
Boostcamp Week 3 study log - Deep Learning Basics  2022.10.03
Boostcamp Week 2 study log - Pytorch Basics  2022.09.26