Boostcamp Week 9 Study Log - Object Detection 1

Tech stack used:

11/14 Mon

What I learned:

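PR curve: sort the detections by confidence score in descending order, accumulate the TP and FP counts down the list, and plot the precision and recall computed at each step

Average precision (AP): area under the PR curve, approximated with a right-rectangle sum

mAP (mean average precision): sum of the per-class APs / number of classes

- mAP50 counts a detection as a True Positive only when its IoU with the ground truth is at least 0.5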
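
The PR curve, AP, and mAP definitions above can be sketched in plain Python. This is only a minimal illustration: the function names and the input format (a list of `(confidence, is_true_positive)` pairs plus the number of ground-truth boxes) are assumptions made for this example, and it uses the plain right-rectangle sum rather than the interpolated AP used by COCO-style evaluators.

```python
def average_precision(detections, num_gt):
    """AP for one class.

    detections: list of (confidence, is_true_positive) pairs (assumed format).
    num_gt: number of ground-truth boxes for this class.
    """
    if num_gt == 0:
        return 0.0
    # Sort by confidence, highest first.
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for _, is_tp in ordered:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        # Right-rectangle rule: width is the recall step,
        # height is the precision at the right edge of that step.
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values (dict: class name -> AP)."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

For mAP50, `is_true_positive` would be decided beforehand by checking that the IoU with a ground-truth box is at least 0.5.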

IoU (Intersection over Union): $\frac{\text{area of overlap}}{\text{area of union}}$
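
As a small illustration of the IoU definition, here is a sketch for axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the function name `iou` are assumptions made for this example.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Two 2x2 boxes shifted by one pixel in each direction overlap in a 1x1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.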

FPS (frames per second) is an important measure for live video object detection

FLOPs (floating point operations): the number of floating point operations a model performs, used as a measure of its computational cost

MMDetection: open-source object detection toolbox built on PyTorch

Detectron2: Meta AI Research library for object detection and segmentation

YOLOv5: actively maintained detector that ships with COCO-pretrained weights

EfficientDet: object detection model by Google built on an EfficientNet backbone

R-CNN:

Extract region proposals
1. Sliding window: slide a fixed-size box across the image to get candidate bounding boxes
2. Selective search: start from an initial segmentation and merge those regions to get larger bounding boxes
  1. about 2000 RoIs per image
Compute CNN features for each warped RoI
1. AlexNet
Classify each region with an SVM
Adjust the bounding box with a regressor

R-CNN is not end-to-end

SPP-Net:

Forward the whole image through the ConvNet once
Extract RoIs
Spatial Pyramid Pooling layer
1. produces a fixed-length feature for each RoI by pooling, instead of warping the RoI to a fixed size
FC layer
Classify regions with an SVM

Fast R-CNN:

Forward the whole image through VGG16
RoI projection to get the RoIs on the feature map
1. project the selective search RoIs onto the output feature map of VGG16
2. a mini-batch holds RoIs sampled from the same image, so they share the convolution computation
RoI pooling to get features with the same size
1. equivalent to SPP with a single pyramid level and a 7x7 grid
FC layer
Softmax classifier + bounding box regressor
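
The RoI pooling step above can be sketched on a single-channel feature map stored as a list of lists. This is a simplified illustration, not the implementation used by any library: the function name `roi_max_pool`, the integer RoI format `(x1, y1, x2, y2)` with exclusive `x2`/`y2`, and the bin-boundary rounding are all assumptions made for this example.

```python
import math


def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool one RoI of a 2D feature map into a fixed out_h x out_w grid.

    feature: list of rows (single channel).
    roi: (x1, y1, x2, y2) in feature-map coordinates, x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Row range of bin i; floor/ceil keeps every bin non-empty.
        ys = y1 + math.floor(i * h / out_h)
        ye = y1 + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            xs = x1 + math.floor(j * w / out_w)
            xe = x1 + math.ceil((j + 1) * w / out_w)
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

RoIs of different sizes, for example a 4x4 one and a 3x3 one, both come out as the same fixed grid, which is what lets the following FC layer accept them.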

R-CNN, SPP-Net, and Fast R-CNN are not end-to-end

Faster R-CNN:

Forward images through a network to get feature maps
Use a Region Proposal Network (RPN) to get RoIs
1. Replaces selective search
2. Anchor boxes
  1. each cell of the feature map gets a set of anchor boxes with different scales and aspect ratios
  2. the RPN predicts whether each anchor contains an object and the transformation needed to fit the anchor to it
  3. apply a 3x3 convolution to get a 512-channel feature map
  4. 1x1 convolution with 2 outputs per anchor for binary classification (object or not)
  5. 1x1 convolution with 4 outputs per anchor for bounding box regression
3. NMS: keep the highest-scoring box and remove the boxes whose IoU with it exceeds a threshold
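
The NMS step can be sketched as a greedy loop in plain Python. The box format `(x1, y1, x2, y2)`, the helper `_iou`, and the default threshold of 0.5 are assumptions made for this example; real detectors use vectorized implementations.

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: returns the indices of the boxes that are kept."""
    # Visit boxes from the highest score to the lowest.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much
        # with any higher-scoring box that was already kept.
        if all(_iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

With `boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]` and `scores = [0.9, 0.8, 0.7]`, the second box overlaps the first with an IoU of about 0.68 and is removed, while the third box is kept.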

11/15 Tue

What I learned:

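MMDetection: supports many detection frameworks and is fast

- open-source library built on PyTorch

- Pipeline: input, backbone, neck, dense prediction, prediction

- configured through config files

- a config inherits from a base config and overrides only the parts that change

Basic config structure: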

dataset: COCO, VOC, Cityscapes
model: faster_rcnn, RetinaNet, RPN
- 2-stage model
  - type: type of model
  - backbone: a network that converts an image to feature maps
    - can add a custom backbone
  - neck: connects backbone and head
  - rpn_head: region proposal network
  - roi_head: region of interest head
schedule
default_runtime

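Detectron2: also supports tasks other than object detection

- Pipeline: set up config, set up trainer, start training

- the training workflow is similar to MMDetection

Neck: connects the backbone to the RPN

- by also using the intermediate features of the backbone, objects of various sizes can be detected better

- low-level features are semantically weak, so they need to exchange information with the semantically stronger high-level features

- mixes the features across levels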

featurized image pyramid: resize the image to several scales and extract features from each one

single feature map: pass the image through the network and use only the final output as the feature

pyramidal feature hierarchy: pass the image through like a single feature map, but also use the features of the intermediate layers

feature pyramid network (FPN): pass information from the high levels to the low levels by creating a top-down pathway

bottom-up and top-down features are added together: a 1x1 conv on the lateral connection matches the channels, and 2x upsampling on the top-down pathway matches the resolution
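
The top-down merge can be sketched on single-channel maps stored as lists of lists. This is a toy illustration under stated assumptions: the lateral input is taken to be the output of the 1x1 conv already, the upsampling is nearest-neighbor, and the function names are made up for this example.

```python
def upsample2x_nearest(fmap):
    """Nearest-neighbor 2x upsampling of a 2D map (list of rows)."""
    out = []
    for row in fmap:
        # Repeat every value twice horizontally...
        wide = [v for v in row for _ in range(2)]
        # ...and every row twice vertically.
        out.append(wide)
        out.append(list(wide))
    return out


def fpn_merge(top_down, lateral):
    """Add the upsampled higher-level map to the lateral map element-wise.

    lateral is assumed to be the bottom-up feature after its 1x1 conv
    and must be exactly twice the size of top_down.
    """
    up = upsample2x_nearest(top_down)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]
```

Repeating `fpn_merge` from the top level downward gives every pyramid level access to the semantic information of the levels above it.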

Path Aggregation Network (PANet): adds a bottom-up pathway after the top-down pathway, because in a deep CNN the low-level features otherwise have a long path to the top

RoI pooling is applied to the features of all levels

DetectoRS: looking and thinking twice

recursive feature pyramid: an FPN whose output is fed back into the backbone, so the pyramid is computed recursively
ASPP: applies convolutions with different dilation rates to enlarge the receptive field
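
To see why different dilation rates change the receptive field, the span covered by one dilated kernel can be computed directly. The function name and the example rates below are illustrative assumptions, not values taken from DetectoRS.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel along one axis.

    A dilation of d leaves (d - 1) gaps between neighboring taps,
    so a k-tap kernel covers k + (k - 1) * (d - 1) positions.
    """
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# A 3x3 kernel at a few illustrative dilation rates.
for rate in (1, 2, 6):
    print(rate, effective_kernel_size(3, rate))
```

The number of weights stays at 3x3 in every case; only the spacing between the sampled positions grows.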

EfficientDet: a PANet with the unneeded nodes removed (BiFPN)

weighted feature fusion: learn a weight for each input feature so that low-level and high-level features contribute differently
connect the lateral pathway to the bottom-up pathway too

NAS-FPN: find the FPN architecture by neural architecture search (NAS)

Not generalizable: the architecture is searched for a specific dataset and backbone

AugFPN: addresses the loss of information in the highest-level feature map

Residual Feature Augmentation: gives high-level semantic information directly to the top of the final pyramid
- ratio-invariant adaptive pooling
Soft RoI selection: pool the RoI from all feature levels and combine them with learned weights

1-stage detectors: localization and classification at the same time

- fast and simple design

- sees the whole image at once, so context is taken into account

- YOLO, SSD, RetinaNet

You Only Look Once (YOLO): first 1-stage detector

- modified GoogLeNet

divide the image into an S x S grid
predict B bounding boxes and a confidence score for each box in every grid cell
1. confidence score: Pr(Object) * IoU of truth and pred
predict the class probabilities for each grid cell
1. conditional class probability: Pr(Class|Object)
The output contains 30 channels
1. 5 channels each for 2 bboxes
  1. x coordinate of the bbox center, relative to the grid cell
  2. y coordinate of the bbox center, relative to the grid cell
  3. width of bbox
  4. height of bbox
  5. bbox confidence score
2. 20 channels of class probabilities
multiply the bbox confidence score by the class probabilities to get a class-specific confidence score for each bbox
set the scores under a certain threshold to zero and sort in descending order
use NMS to remove redundant bboxes
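
The channel count and the class-specific score from the steps above can be checked with a few lines of Python. The function names and the threshold value of 0.2 are assumptions made for this example.

```python
def output_channels(num_boxes, num_classes):
    """Channels per grid cell: 5 values per box plus the class probabilities."""
    return num_boxes * 5 + num_classes


def class_specific_scores(box_confidence, class_probs, threshold=0.2):
    """Multiply the bbox confidence by each conditional class probability.

    Scores below the threshold (an assumed value) are set to zero,
    which is the step that precedes sorting and NMS.
    """
    scores = [box_confidence * p for p in class_probs]
    return [s if s >= threshold else 0.0 for s in scores]


# 2 boxes and 20 classes give the 30 channels mentioned above.
print(output_channels(num_boxes=2, num_classes=20))
```

Because the confidence score is Pr(Object) * IoU and the class map is Pr(Class|Object), their product is Pr(Class) * IoU for that box.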

SSD: addresses the difficulty of detecting small objects, which comes from using only the last feature map

- use 6 feature maps of different scales: a large feature map predicts small objects while a small feature map predicts large objects

- use only convolution layers

- use anchor boxes

- VGG-16 as the backbone

YOLO v2:

- higher input resolution

- convolutional prediction with anchor boxes

- no FC layer

- batch normalization

- concatenate an earlier feature map with the later feature map (passthrough layer)

- multi-scale training

- Darknet-19 as the backbone

- used WordTree to combine ImageNet and COCO into a hierarchical dataset

YOLO v3:

- Darknet-53 as the backbone

- downsampling with stride-2 convolutions

- predictions at 3 different scales

- uses a Feature Pyramid Network

RetinaNet:

- addresses the class imbalance of 1-stage detectors, which produce far more negative (background) samples than positive ones

- new loss function (Focal Loss): cross-entropy loss multiplied by a scaling factor $(1 - p_t)^\gamma$ that puts more weight on the harder cases

- Improvement in performance
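
The focal loss idea can be sketched for a single binary prediction. The defaults `gamma=2.0` and `alpha=0.25` are the values commonly quoted from the RetinaNet paper; the function name, the scalar interface, and the clamping constant are assumptions made for this example.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one binary prediction.

    p: predicted probability of the positive class.
    y: ground-truth label, 1 for positive and 0 for negative.
    """
    # p_t is the probability the model assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0); the constant is an assumption of this sketch.
    p_t = min(max(p_t, 1e-12), 1.0)
    # Cross-entropy (-log p_t) scaled by (1 - p_t) ** gamma:
    # easy examples (p_t near 1) are down-weighted heavily.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` the scaling factor disappears and the loss reduces to alpha-weighted cross-entropy, which is a quick way to sanity-check an implementation.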

11/16 Wed

What I learned:

Width scaling: increase the number of channels; mostly used for small models and captures fine details well

Depth scaling: increase the number of layers; used in many models to get complex and rich features, but suffers from vanishing gradients

Resolution scaling: increase the input resolution; captures details very well

EfficientDet: efficiently scales the model

- balance the width, depth, and resolution to achieve great performance with low computational cost

- idea from EfficientNet

- efficiency is needed for real-time detection

Efficient multi-scale feature fusion
1. remove the nodes that have only one input edge
2. add an extra edge from the input to the output node at the same level
3. repeat the block multiple times
4. use a weighted sum of the features from various resolutions
  1. BiFPN: each weight passes through ReLU so that it is non-negative, and epsilon is added to the denominator so that it is non-zero; the result is basically a normalized weighted sum
model scaling: compound scaling like EfficientNet
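
The weighted fusion used in BiFPN can be sketched on flat feature vectors. This is a minimal illustration: the inputs are assumed to be already resized to the same resolution, the weights would be learned in a real model, and the function name and `eps` value are assumptions made for this example.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-sized feature vectors with normalized non-negative weights.

    features: list of equally long lists of floats.
    weights: one scalar per feature (learned in a real model).
    """
    # ReLU keeps every weight non-negative.
    relu_w = [max(0.0, w) for w in weights]
    # eps keeps the denominator non-zero even if all weights are zero.
    denom = sum(relu_w) + eps
    length = len(features[0])
    return [sum(w * f[i] for w, f in zip(relu_w, features)) / denom
            for i in range(length)]
```

With equal weights the result is close to a plain average, and a negative weight is clipped to zero so that its feature no longer contributes.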

Cascade R-CNN: explores what changes when the IoU threshold that separates positive and negative samples is changed

- the higher the IoU of the input proposals, the better a model trained with a higher threshold performs

- a model trained with a higher threshold performs better when the AP is evaluated at a higher IoU threshold

- train multiple RoI heads with a different IoU threshold for each head; the bounding boxes output by the previous head are fed to the next head

- Iterative + Integral = Cascade
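
How one proposal is labeled differently by each head can be sketched as follows. The thresholds 0.5, 0.6, and 0.7 are the values reported in the Cascade R-CNN paper; the function name and the boolean-list output are assumptions made for this example, and the box refinement between stages is not modeled.

```python
def stage_labels(iou_with_gt, thresholds=(0.5, 0.6, 0.7)):
    """Positive/negative label of one proposal at each cascade stage.

    iou_with_gt: IoU between the proposal and its best-matching
    ground-truth box. Returns one bool per stage (True = positive).
    """
    return [iou_with_gt >= t for t in thresholds]


# A proposal with IoU 0.65 is a positive sample for the first two
# heads but a negative sample for the strictest one.
print(stage_labels(0.65))
```

In the real model each head also regresses the box, so the IoU seen by the next head is usually higher than the one the previous head received.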

DCN (Deformable Convolutional Networks):

a normal CNN is weak against geometric transformations
- traditional methods: geometric augmentation, geometrically invariant feature selection
add a learned geometric offset to each sampling location of the convolution kernel
- has an offset field that contains the offset vectors
- the model learns the offsets from the features
good performance in object detection and segmentation

The problem with ViT is that it has a high computational cost and needs a lot of data to train

DETR (End-to-End Object Detection with Transformers):

removes the need for NMS
uses only the high-level feature map because the transformer is computationally expensive
Pipeline
- Input
- CNN
- encoder + positional encoding
- decoder
- feed-forward network
- N outputs
  - N > number of objects in the image
  - the ground truth is padded with "no object" entries to fill the difference between N and the number of objects
    - this lets the model output exactly as many objects as there are in the image
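
The padding of the ground truth up to N slots can be sketched as below. Only the padding is shown; the matching between predictions and targets that DETR performs is left out. The `NO_OBJECT` marker, the function name, and the example value of N are assumptions made for this example.

```python
NO_OBJECT = "no_object"  # hypothetical marker for the "no object" class


def pad_targets(labels, num_queries):
    """Pad the ground-truth labels with NO_OBJECT up to num_queries entries."""
    if len(labels) > num_queries:
        raise ValueError("num_queries must be at least the number of objects")
    return list(labels) + [NO_OBJECT] * (num_queries - len(labels))


# An image with 2 objects and N = 5 output slots.
print(pad_targets(["cat", "dog"], 5))
```

After padding, every one of the N outputs has a target, and the outputs matched to `NO_OBJECT` are simply dropped at inference time.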

Swin Transformer: computes attention inside local windows to reduce the computational cost

No class embedding
Two attention layers per transformer block: window attention (W-MSA) followed by shifted window attention (SW-MSA)
the embedding is divided into windows and attention is computed only within each window, which decreases the computational cost of the model
- attention inside a window cannot take the other windows into account, so Shifted Window Multi-Head Attention corrects that by dividing the windows differently
trains well with a low amount of data
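
The cost reduction from windowing can be made concrete with the complexity formulas given in the Swin Transformer paper: global attention grows quadratically with the number of patches, while window attention grows linearly. The function names and the example sizes below are assumptions made for this example.

```python
def msa_cost(h, w, c):
    """Global multi-head self-attention on an h x w map with c channels."""
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c


def wmsa_cost(h, w, c, m):
    """Window attention with m x m windows on the same map."""
    return 4 * h * w * c ** 2 + 2 * m ** 2 * h * w * c


# Illustrative sizes: a 56 x 56 patch grid, 96 channels, 7 x 7 windows.
ratio = msa_cost(56, 56, 96) / wmsa_cost(56, 56, 96, 7)
print(round(ratio, 1))
```

When a single window covers the whole map (m * m equal to h * w), the two formulas coincide, which is a useful sanity check.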
