# Object Detection in Images

### Introduction

This post is a brief summary of three algorithms for object detection and segmentation:

1. Fast R-CNN
2. Faster R-CNN
3. Mask R-CNN

All of them are region-based algorithms. The following concepts are involved in the discussion and in the papers:

• Location proposals
• Region of interest (ROI)
• ROI pooling
• Quantization

### Fast R-CNN

The inputs of Fast R-CNN are:

• Image
• Location proposals

In Fast R-CNN, the location proposals are computed with a classical computer-vision technique, Selective Search. The figure below shows the architecture of Fast R-CNN.
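To make the ROI pooling and quantization concepts concrete, here is a minimal numpy sketch of ROI max-pooling. The bin-boundary rounding below is one common convention, assumed here for illustration; the `int()` rounding of the ROI borders is exactly the quantization step that Mask R-CNN's RoIAlign later removes. The sketch assumes the ROI is at least as large as the output grid.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool the region `roi` of `feature_map` into a fixed-size grid.

    feature_map: (H, W, C) array.
    roi:         (x1, y1, x2, y2) in feature-map coordinates.
    The int() casts below quantize the ROI borders and bin borders;
    this is the quantization that RoIAlign avoids.
    """
    x1, y1, x2, y2 = [int(v) for v in roi]          # quantize ROI borders
    region = feature_map[y1:y2, x1:x2, :]
    out_h, out_w = output_size
    h, w, _ = region.shape
    pooled = np.zeros((out_h, out_w, feature_map.shape[2]))
    for i in range(out_h):
        for j in range(out_w):
            # quantize the borders of each pooling bin as well
            r0 = int(np.floor(i * h / out_h))
            r1 = int(np.ceil((i + 1) * h / out_h))
            c0 = int(np.floor(j * w / out_w))
            c1 = int(np.ceil((j + 1) * w / out_w))
            pooled[i, j] = region[r0:r1, c0:c1].max(axis=(0, 1))
    return pooled
```

Whatever the ROI's size, the output always has shape `(out_h, out_w, C)`, which is what lets a fully connected head follow.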

### Faster R-CNN

The bottleneck of Fast R-CNN is the calculation of location proposals. The idea of Faster R-CNN is to replace the Selective Search algorithm with a neural network called the Region Proposal Network (RPN). In the paper, the authors also define a notion called an anchor. An anchor is essentially a tuple (position_of_bounding_box_center, scale_of_bounding_box, aspect_ratio_of_bounding_box), and it represents a bounding box on the original image.
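A small sketch of how such an anchor tuple maps to a concrete box on the image. The convention used here (scale = side length of the equivalent square box, aspect ratio = width/height) is an assumption for illustration; papers and codebases vary in the exact parameterization.

```python
import numpy as np

def anchor_to_box(center, scale, aspect_ratio):
    """Convert an anchor tuple into (x1, y1, x2, y2) on the original image.

    center:       (cx, cy) of the bounding box
    scale:        side length of the equivalent square box,
                  i.e. sqrt of the box area (assumed convention)
    aspect_ratio: width / height
    """
    cx, cy = center
    w = scale * np.sqrt(aspect_ratio)   # so that w * h == scale**2
    h = scale / np.sqrt(aspect_ratio)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

For example, `anchor_to_box((100, 100), 64, 4.0)` gives a wide, short box of the same area as the `aspect_ratio=1.0` square.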

Note that the regions proposed by Selective Search are regions that the algorithm believes to contain an object. Therefore, when we connect the RPN to Fast R-CNN, we should likewise pass on regions (or, more precisely, anchors) that are likely to contain objects.

#### A close look at RPN

Suppose the size of the original image is (height=800, width=800). The input of the RPN is the convolutional feature map. Since the convolutional network usually contains max pooling, for simplicity we assume the feature map has size (height=100, width=100, features=256). Following the paper, we add a conv layer with a $$3 \times 3$$ sliding window. We assume the output of this conv layer has size (height=100, width=100, features=120). This output is then used to predict whether each anchor contains an object and the location of its bounding box.

A brief summary of what we have so far: the output of the $$3 \times 3$$ conv layer has spatial size (height=100, width=100), and each of its cells is represented by a vector with 120 elements.

Now, for each cell, we make predictions for $$k$$ anchors. Each anchor is represented by a vector of 2 + 4 = 6 numbers. The 2 is the number of object classes (object vs. background); if we had $$M$$ object classes, this number would become $$M+1$$. The 4 is the number of parameters in the bounding-box parameterization. This means the output layer of the RPN has size (height=100, width=100, output=$$6k$$), i.e. $$100 \times 100 \times 6k$$ nodes, where $$k$$ is the number of anchors per cell in the feature map.

The output layer of the RPN can also be viewed as a ($$100 \times 100 \times k$$, 6) matrix, where the number of rows is the total number of anchors.
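This matrix view is just a reshape of the RPN output tensor. A minimal numpy sketch, assuming the 6 numbers of each anchor are contiguous along the channel axis (the exact memory layout depends on the implementation):

```python
import numpy as np

k = 9  # anchors per feature-map cell (the paper uses 3 scales x 3 aspect ratios)
rpn_out = np.random.rand(100, 100, 6 * k)  # RPN output layer from the text

# One row per anchor, one column per predicted number
# (2 class scores + 4 bounding-box parameters).
per_anchor = rpn_out.reshape(100 * 100 * k, 6)
```

With $$k = 9$$ this gives $$100 \times 100 \times 9 = 90{,}000$$ anchors per image, each scored and regressed independently.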

When we train the RPN, each image in the training data provides a mini-batch. In the paper, the authors mention that "[...] we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1". It seems that this random sampling is applied only when calculating the loss function, so the whole image is still fed into the RPN. We could write the loss function in the following way:

$$\textrm{loss of mini-batch} = \sum\limits_{\textrm{anchor} \in \textrm{sample}} \textrm{loss}(\textrm{anchor})$$
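The sampling in the quote can be sketched as follows. This is one plausible reading, assuming positives are capped at half the batch and the remainder is filled with negatives, which yields the "ratio of up to 1:1" from the paper; the label encoding (1 = positive, 0 = negative) is an assumption of this sketch.

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, rng=None):
    """Sample anchor indices for one mini-batch loss computation.

    labels: per-anchor labels, 1 = positive, 0 = negative
            (anchors ignored during training are assumed to be
            filtered out already).
    Positives are capped at batch_size // 2, so the sampled
    positive:negative ratio is at most 1:1.
    """
    rng = np.random.default_rng() if rng is None else rng
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), batch_size // 2)
    n_neg = min(len(neg), batch_size - n_pos)
    return np.concatenate([
        rng.choice(pos, n_pos, replace=False),
        rng.choice(neg, n_neg, replace=False),
    ])
```

The mini-batch loss is then the sum of the per-anchor losses over the returned indices, matching the formula above.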

### Mask R-CNN

Mask R-CNN is built on top of Faster R-CNN. There are two new features introduced in this architecture