11/14 Mon
Things I learned:
PR curve: precision and recall computed from the cumulative TP and FP counts, with detections sorted by confidence score
Average precision (AP): area under the PR curve, estimated with the right-rectangle rule
mAP (mean average precision): sum of the per-class APs / number of classes
- mAP50 counts a detection as a True Positive only when its IoU with the ground truth is at least 0.5
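A minimal sketch of the AP and mAP definitions above, assuming each detection has already been labeled TP or FP (for mAP50, by the IoU >= 0.5 rule). The function names and the input layout are my own, and real benchmarks such as PASCAL VOC or COCO use interpolated precision, which this sketch leaves out.

```python
def average_precision(detections, num_gt):
    """detections: list of (confidence, is_true_positive); num_gt: number of ground-truth boxes."""
    # Sort detections by confidence, highest first.
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for _, is_tp in ordered:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        # Right-rectangle rule: width = recall gained at this detection,
        # height = precision at the right edge of that step.
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """ap_per_class: dict mapping class name -> AP."""
    return sum(ap_per_class.values()) / len(ap_per_class)


dets = [(0.9, True), (0.8, False), (0.7, True)]
print(average_precision(dets, num_gt=2))
print(mean_average_precision({"cat": 0.5, "dog": 1.0}))
```

Only true positives increase recall, so false positives add zero-width rectangles and affect AP only by lowering the precision of later steps.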
IoU (Intersection over Union): $\frac{\text{area of overlap}}{\text{area of union}}$
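The IoU ratio can be sketched for axis-aligned boxes as follows. The `(x1, y1, x2, y2)` corner format is an assumption; other tools store boxes as center/width/height.

```python
def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter  # overlap is counted once
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7
```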
FPS is an important measure for live video object detection
FLOPs (floating point operations): number of floating-point operations the model performs, a hardware-independent measure of computational cost
MMDetection: open-source object detection toolbox written in PyTorch
Detectron2: Meta AI Research library for object detection and segmentation
YOLOv5: mature, well-maintained detector with COCO-pretrained weights
EfficientDet: object detection model by Google built on EfficientNet
R-CNN:
- Extract region proposals
- Sliding window: slide a fixed-size box across the image to get candidate bounding boxes
- Selective search: start from an initial over-segmentation and merge similar segments to build larger bounding boxes
- about 2000 RoIs per image
- Compute CNN features
- AlexNet
- Classify
- Adjust bounding box
R-CNN is not end-to-end
SPP-Net:
- Forward the whole image through the ConvNet once
- Extract RoIs
- Spatial Pyramid Pooling layer
- gives every RoI a fixed-size feature by pooling, instead of warping the image crop
- FC layer
- classify regions with SVM
Fast R-CNN:
- Forward the whole image through VGG16
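- RoI projection to get the RoI features
- project the selective search RoIs onto the output feature map of VGG16
- the RoIs in a mini-batch are sampled from the same few images, so they share the convolution computation
- RoI pooling to get features of the same size
- equivalent to SPP with a single pyramid level and a 7x7 grid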
- FC layer
- Softmax classifier + bounding box regressor
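A rough sketch of the RoI pooling step on a single-channel feature map, to show how RoIs of any size end up with the same output size. The bin rounding (floor for the start, ceil for the end) is one common choice and an assumption here; the example pools to 2x2 instead of 7x7 to keep it readable.

```python
import math


def roi_max_pool(feature, roi, out_h, out_w):
    """feature: 2D list [H][W]; roi: (x1, y1, x2, y2) in feature-map cells, end exclusive."""
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Row range of the i-th bin.
        r0 = y1 + math.floor(i * roi_h / out_h)
        r1 = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            # Column range of the j-th bin.
            c0 = x1 + math.floor(j * roi_w / out_w)
            c1 = x1 + math.ceil((j + 1) * roi_w / out_w)
            # Max over every cell that falls in the bin.
            row.append(max(feature[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(roi_max_pool(fmap, (0, 0, 4, 4), 2, 2))  # 4x4 RoI -> 2x2
print(roi_max_pool(fmap, (1, 1, 4, 3), 2, 2))  # 3x2 RoI -> 2x2
```

Both RoIs give a 2x2 output even though their sizes differ, which is what lets the FC layer take a fixed-size input.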
R-CNN, SPP-Net, and Fast R-CNN are not end-to-end
Faster R-CNN:
- Forward images through a network to get feature maps
- Use Region Proposal Network(RPN) to get ROI
- Replace selective search
- Anchor box
- each cell of the feature map gets a set of anchor boxes with different scales and aspect ratios
- for every anchor, the RPN predicts whether it contains an object and the transformation (offsets) to apply to the anchor box
- 3x3 conv to produce a 512-channel feature map
- 1x1 conv with 2 outputs per anchor: object vs. background classification
- 1x1 conv with 4 outputs per anchor: bounding box regression
- NMS: among boxes that overlap (IoU above a threshold), keep only the one with the highest score
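The NMS step can be sketched as a greedy loop. This is a plain-Python illustration, not the implementation any particular library uses, and the 0.5 threshold is just an example value.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first and is dropped
```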
11/15 Tue
Things I learned:
MMDetection: supports many detection frameworks and is fast
- open-source library built on PyTorch
- Pipeline: Input, backbone, neck, dense prediction, prediction
- configured through config files
- inherit from a base config and change only the parts that differ
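To illustrate the inherit-and-override idea, here is a small sketch of a recursive dict merge. `merge_config` and the keys in `base_cfg` are hypothetical and simplified; this is not MMDetection's own code. In real MMDetection configs the parent file is named with the `_base_` variable and the child file only lists the fields it changes.

```python
import copy


def merge_config(base, override):
    """Return a copy of `base` with the values in `override` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested dict: descend, so sibling keys in the base are kept.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical, simplified base config.
base_cfg = dict(
    model=dict(
        type="FasterRCNN",
        backbone=dict(type="ResNet", depth=50),
        roi_head=dict(num_classes=80),
    ),
    schedule=dict(max_epochs=12),
)

# The child config only states what is different.
child_cfg = merge_config(base_cfg, dict(model=dict(roi_head=dict(num_classes=10))))
print(child_cfg["model"]["roi_head"], child_cfg["model"]["backbone"])
```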
Basic config structure:
- dataset: COCO, VOC, Cityscapes
- model: faster_rcnn, RetinaNet, RPN
- 2stage model
- type: type of model
- backbone: a network that converts an image to feature map
- can add a custom backbone
- neck: connects backbone and head
- rpn_head: region proposal network head
- roi_head: head that classifies and refines each region of interest
- 2stage model
- schedule
- default_runtime
Detectron2: supports tasks other than object detection as well
- Pipeline: Setup config, setup trainer, start training
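- training workflow is similar to MMDetection
Neck: connects the backbone and the RPN
- also uses the backbone's intermediate features, which helps detect objects of various sizes
- low-level features are semantically weak, so they need to exchange information with the semantically stronger high-level features
- mixes the features from each level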
featurized image pyramid: resize the image to several scales and extract features from each one
single feature map: pass the image through the network once and use only the final feature map
pyramidal feature hierarchy: a single pass like the single feature map, but the intermediate layers' feature maps are used too
feature pyramid network (FPN): adds a top-down pathway that passes high-level information down to the low levels
- bottom-up and top-down features are added element-wise after a 1x1 conv on the lateral connection and 2x upsampling on the top-down pathway
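One FPN merge step can be sketched in plain Python with nested lists in `[channel][row][col]` layout. The helper names are my own, the weights are made-up numbers, and the 3x3 conv that FPN applies after the addition is left out.

```python
def conv1x1(feature, weight):
    """feature: [C_in][H][W]; weight: [C_out][C_in]. Mixes channels at each pixel."""
    c_in, h, w = len(feature), len(feature[0]), len(feature[0][0])
    return [[[sum(weight[o][c] * feature[c][y][x] for c in range(c_in))
              for x in range(w)] for y in range(h)] for o in range(len(weight))]


def upsample2x(feature):
    """Nearest-neighbor 2x upsampling of a [C][H][W] feature."""
    out = []
    for channel in feature:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row
        out.append(rows)
    return out


def fpn_merge(top_down, bottom_up, lateral_weight):
    """Upsample the coarser top-down map and add the 1x1-projected bottom-up map."""
    up = upsample2x(top_down)
    lateral = conv1x1(bottom_up, lateral_weight)
    return [[[up[c][y][x] + lateral[c][y][x] for x in range(len(up[0][0]))]
             for y in range(len(up[0]))] for c in range(len(up))]


top = [[[1.0]]]                                   # 1 channel, 1x1 (coarse level)
bottom = [[[1.0, 2.0], [3.0, 4.0]],               # 2 channels, 2x2 (finer level)
          [[10.0, 20.0], [30.0, 40.0]]]
print(fpn_merge(top, bottom, [[1.0, 0.5]]))       # 2 -> 1 channel, then add
```

The 1x1 conv is there to bring the bottom-up feature to the same channel count as the top-down feature so the two can be added.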
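Path Aggregation Network (PANet): adds a bottom-up pathway after the top-down pathway, because in a deep CNN the low-level information has a long way to travel to the top levels
- adaptive feature pooling: do RoI pooling on the features of all levels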
DetectoRS: looking and thinking twice
- recursive feature pyramid: FPN that is done recursively
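- ASPP (Atrous Spatial Pyramid Pooling): parallel convolutions with different dilation rates to enlarge the receptive field
EfficientDet (BiFPN): PANet-style neck with the nodes that contribute little removed
- weighted feature fusion: learn a weight per input feature, since features from different levels contribute unequally
- also add an edge from the lateral (input) feature to the bottom-up pathway node at the same level
NAS-FPN: finds the FPN architecture with neural architecture search
- Not generalizable: the searched architecture is specific to the setup it was searched on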
AugFPN: addresses the information loss at the highest-level feature map
- Residual Feature Augmentation: feed high-level semantic information directly into the top of the final pyramid
- ratio-invariant adaptive pooling
- Soft RoI selection: use all features to get RoI by using weights
1-stage detectors: localization and classification at the same time
- fast and easy design
- sees the whole image at once, so the context around each object is taken into account
- YOLO, SSD, RetinaNet
You Only Look Once(YOLO): first 1-stage detector
- modified GoogLeNet
- divide the image into an S x S grid
- predict B bounding boxes, each with a confidence score, for every grid cell
- confidence score: Pr(Object) * IoU between the ground truth and the prediction
- predict the class probabilities for each grid cell
- conditional class probability: Pr(Class|Object)
- The output has 30 channels per grid cell
- 5 channels for each of the 2 bboxes
- x coordinate of the bbox center, relative to the grid cell
- y coordinate of the bbox center, relative to the grid cell
- width of bbox
- height of bbox
- bbox confidence score
- class maps
- 20 channels, one conditional class probability per class (2 x 5 + 20 = 30)
- multiply the bbox confidence score by the class maps to get a class-specific confidence score for each bbox
- set the scores below a certain threshold to zero and sort in descending order
- use NMS to remove redundant bbox
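The scoring steps above can be sketched for one grid cell. The layout follows the 30-channel description (2 boxes x 5 values, then 20 class probabilities); the function name and the 0.2 threshold are assumptions for illustration, and NMS would still run on the result afterwards.

```python
NUM_BOXES = 2
NUM_CLASSES = 20


def cell_scores(cell, threshold=0.2):
    """cell: 30 values = 2 x (x, y, w, h, confidence) followed by 20 class probabilities."""
    class_probs = cell[NUM_BOXES * 5:]
    results = []
    for b in range(NUM_BOXES):
        x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
        # Class-specific confidence = box confidence * Pr(Class | Object).
        scores = [conf * p for p in class_probs]
        # Zero out scores below the threshold.
        scores = [s if s >= threshold else 0.0 for s in scores]
        results.append(((x, y, w, h), scores))
    return results


probs = [0.0] * NUM_CLASSES
probs[3] = 0.5
cell = [0.5, 0.5, 0.2, 0.3, 0.8,   # box 0: confident
        0.4, 0.6, 0.1, 0.1, 0.1]   # box 1: not confident
cell += probs
for box, scores in cell_scores(cell):
    print(box, scores[3])
```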
SSD: addresses the weakness on small objects that comes from predicting only from the last feature map
- use 6 feature maps at different scales: large (high-resolution) feature maps predict small objects, small feature maps predict large objects
- fully convolutional, no FC layers
- use anchor box
- VGG-16 as the backbone
YOLO v2:
- higher resolution
- convolution with anchor boxes
- no FC layer
- batch normalization
- passthrough: concatenate an earlier, higher-resolution feature map with the later feature map
- multi-scale training
- Darknet-19
- YOLO9000: used WordTree to combine ImageNet and COCO into one hierarchical dataset
YOLO v3:
- Darknet-53
- downsampling with stride-2 convolutions
- use 3 different scales
- use Feature Pyramid Network
RetinaNet:
- addresses the class imbalance of 1-stage detectors, which see far more negative (background) samples than positive ones
- new loss function (Focal Loss): cross-entropy loss multiplied by a scaling factor that down-weights easy examples, so hard examples matter more
- improved performance
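A small sketch of the binary focal loss, $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, compared with plain cross-entropy. The defaults `gamma=2.0` and `alpha=0.25` are the values reported in the Focal Loss paper; the function names are my own.

```python
import math


def cross_entropy(p, y):
    """p: predicted probability of the positive class; y: 1 (object) or 0 (background)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    p_t = p if y == 1 else 1.0 - p            # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma is close to 0 for easy examples and close to 1 for hard ones.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_negative = (0.01, 0)   # background, confidently predicted as background
hard_negative = (0.90, 0)   # background, wrongly predicted as object
for name, (p, y) in [("easy", easy_negative), ("hard", hard_negative)]:
    print(name, cross_entropy(p, y), focal_loss(p, y))
```

The easy negative is scaled down by several orders of magnitude while the hard one is barely reduced, so the many easy background samples no longer dominate the total loss.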
11/16 Wed
Things I learned:
Width scaling: widen the channels; mostly used for small models and good at capturing fine-grained features
Depth scaling: used in many models to get richer, more complex features, but deeper networks suffer from vanishing gradients
Resolution scaling: a higher input resolution captures fine details better
EfficientDet: efficiently scales the model
- balance width, depth, and resolution to get high accuracy at a low computational cost
- idea from EfficientNet
- efficiency is needed for real-time
- Efficient multi-scale feature fusion
- remove nodes that have only one input edge
- add an extra edge from the input to the output node at the same level
- repeat the same block several times
- fuse features of different resolutions with a weighted sum
- BiFPN: each weight passes through ReLU so that it cannot be negative, and a small epsilon is added to the denominator to keep it non-zero; the result is a normalized weighted sum
- model scaling: compound scaling like EfficientNet
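The BiFPN weighted fusion can be sketched as follows, with $O = \frac{\sum_i w_i I_i}{\epsilon + \sum_j w_j}$ and $w_i = \mathrm{ReLU}(w_i)$. Features are flattened to plain lists here and are assumed to be resized to the same resolution already; `eps=1e-4` is the value used in the EfficientDet paper, and the function name is my own.

```python
def fast_normalized_fusion(features, raw_weights, eps=1e-4):
    """features: list of equally sized lists; raw_weights: one learnable scalar per feature."""
    weights = [max(0.0, w) for w in raw_weights]   # ReLU keeps every weight >= 0
    total = sum(weights) + eps                     # eps keeps the denominator non-zero
    length = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) / total
            for i in range(length)]


low = [1.0, 2.0]
high = [3.0, 4.0]
print(fast_normalized_fusion([low, high], [1.0, 1.0]))   # roughly the average
print(fast_normalized_fusion([low, high], [1.0, -5.0]))  # negative weight is clipped to 0
print(fast_normalized_fusion([low, high], [0.0, 0.0]))   # no division by zero
```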
Cascade R-CNN: studies how performance changes with the IoU threshold used to split positive and negative samples during training
- the higher the IoU of the input proposals, the better a model trained with a higher threshold performs
- a model trained with a higher threshold does better when AP is evaluated at a higher IoU threshold
- train multiple RoI heads with a different IoU threshold for each; the bounding boxes refined by one head are the input to the next head
- Iterative + Integral = Cascade
DCN(Deformable Convolutional Networks):
- a regular CNN is weak against geometric transformations because its sampling grid is fixed
- traditional remedies: geometric augmentation, geometry-invariant feature selection
- when the kernel is applied, a geometric offset is added to each sampling location
- an offset field stores one offset vector per sampling location
- the offsets are learned from the features by an extra conv layer
- good performance in object detection and segmentation
The problem with ViT is that it has a high computational cost and needs a lot of data to train
DETR (End-to-End Object Detection with Transformers):
- removes the need for NMS
- uses only the highest-level feature map, because attention is computationally expensive
- Pipeline
- Input
- encoder + positional encoding
- decoder
- feed-forward network (prediction heads)
- N outputs
- N > number of objects in the image
- pad the ground truth with "no object" entries so that it also has N entries
- each ground-truth object is then matched to exactly one prediction, so the model outputs one box per object
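The padding idea can be sketched like this. DETR matches predictions to the padded ground truth with the Hungarian algorithm; the brute-force search below is only a stand-in that works for a tiny N, and the cost values are made-up numbers rather than DETR's real matching cost.

```python
from itertools import permutations

NO_OBJECT = "no_object"


def pad_targets(targets, num_queries):
    """Pad the ground-truth list with NO_OBJECT so it has exactly num_queries entries."""
    if len(targets) > num_queries:
        raise ValueError("N must be at least the number of objects")
    return targets + [NO_OBJECT] * (num_queries - len(targets))


def best_assignment(cost):
    """cost[i][j]: cost of giving prediction i to padded target j (N x N).
    Returns perm where perm[i] is the target index assigned to prediction i."""
    n = len(cost)
    return list(min(permutations(range(n)),
                    key=lambda perm: sum(cost[i][perm[i]] for i in range(n))))


padded = pad_targets(["cat", "dog"], num_queries=4)
# Hypothetical costs: columns are cat, dog, no_object, no_object.
cost = [[0.9, 0.1, 0.5, 0.5],
        [0.2, 0.8, 0.5, 0.5],
        [0.7, 0.9, 0.5, 0.5],
        [0.8, 0.7, 0.5, 0.5]]
assignment = best_assignment(cost)
print([padded[j] for j in assignment])
```

Because the matching is one-to-one, two predictions cannot both claim the same object, which is why the duplicate removal that NMS normally does is not needed.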
Swin Transformer: computes self-attention inside local windows to reduce the computational cost
- no class token (class embedding)
- two attention layers per pair of transformer blocks: window attention (W-MSA), then shifted-window attention (SW-MSA)
- the patch embeddings are grouped into non-overlapping windows and attention is computed only within each window, which lowers the computational cost
- plain window attention ignores everything outside the window, so Shifted Window Multi-Head Attention (SW-MSA) shifts the window partition in the next layer to connect neighboring windows
- trains well with a low amount of data
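The cost saving from windows can be checked with the complexity formulas from the Swin Transformer paper: $\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C$ for global attention and $\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC$ for window attention with window size $M$. The function names are my own; the example numbers (56x56 patches, C = 96, M = 7) correspond to the first stage of Swin-T.

```python
def global_msa_flops(h, w, c):
    """Global self-attention: quadratic in the number of patches h * w."""
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c


def window_msa_flops(h, w, c, m):
    """Window self-attention: linear in h * w for a fixed window size m."""
    return 4 * h * w * c ** 2 + 2 * m ** 2 * h * w * c


def num_windows(h, w, m):
    """Number of non-overlapping m x m windows (h and w assumed divisible by m)."""
    return (h // m) * (w // m)


h = w = 56
c, m = 96, 7
print(num_windows(h, w, m))
print(global_msa_flops(h, w, c) / window_msa_flops(h, w, c, m))
```

When a single window covers the whole feature map (M x M = h x w) the two formulas give the same number, so the saving comes entirely from keeping M small and fixed.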
