Object Detection in Images


#deep learning  #machine learning  #image processing 


This post is a brief summary of three algorithms for object detection and segmentation:

  1. Fast R-CNN
  2. Faster R-CNN
  3. Mask R-CNN

All of them are region-based algorithms. The following concepts are involved in the discussion and in the papers.

Related Reading

Fast R-CNN

The inputs of Fast R-CNN are an image and a set of region proposals (candidate object locations) for that image.

In Fast R-CNN, the location proposals are computed with Selective Search, a traditional computer vision technique. The figure below shows the architecture of Fast R-CNN.


Faster R-CNN

The bottleneck of Fast R-CNN is the calculation of location proposals. The idea of Faster R-CNN is to replace the Selective Search algorithm with a neural network called the Region Proposal Network (RPN). In the paper, the authors also define a notion called an anchor. An anchor is essentially a tuple (position_of_bounding_box_center, scale_of_bounding_box, aspect_ratio_of_bounding_box), and it represents a bounding box on the original image.
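To make the anchor tuple concrete, here is a minimal sketch that turns one anchor into box corners. The function name `anchor_to_box` and the convention used (the box area equals scale squared, and the aspect ratio is height divided by width) are assumptions for illustration, not something fixed by this post.

```python
import math


def anchor_to_box(center, scale, aspect_ratio):
    """Convert an anchor tuple to corner coordinates (x1, y1, x2, y2).

    Assumed convention (hypothetical, for illustration only):
    the box area is scale**2 and aspect_ratio is height / width.
    """
    cx, cy = center
    width = scale / math.sqrt(aspect_ratio)
    height = scale * math.sqrt(aspect_ratio)
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


# One anchor position on an 800 x 800 image, in three shapes.
for ratio in (0.5, 1.0, 2.0):
    print(ratio, anchor_to_box((400, 400), 128, ratio))
```

With this convention, changing the aspect ratio reshapes the box while keeping its area, so the scale alone controls how large an object the anchor is meant to cover.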

Note that the regions proposed by Selective Search are the ones the algorithm believes contain an object. Therefore, when we connect the RPN to Fast R-CNN, we should likewise pass on only the regions (or, more precisely, the anchors) that are likely to contain objects.


A close look at RPN

Suppose the original image has size (height=800, width=800). The input of the RPN is the convolutional feature map. The convolutional network usually contains max pooling, so for simplicity we assume the feature map has size (height=100, width=100, feature=256), that is, a downsampling factor of 8. According to the paper, we add a convolutional layer with a \(3 \times 3\) sliding window on top of it. We assume the output of this layer has size (height=100, width=100, feature=120). We then use this output to predict, for each anchor, whether it contains an object and where its bounding box is.

To summarize what we have so far: the feature map has spatial size (height=100, width=100), and each cell in the feature map is represented by a vector with 120 elements.

Now, for each cell, we make predictions for \(k\) anchors. Each anchor is represented by a vector of 2 + 4 = 6 elements. The 2 is the number of classes (object vs. background); if we had \(M\) object classes, this number would become \(M+1\). The 4 is the number of parameters in the bounding box parameterization. This means the output layer of the RPN has size (height=100, width=100, output=\(6k\)), so it has \(100 \times 100 \times 6k\) nodes, where \(k\) is the number of anchors per cell of the feature map.

The output layer of RPN can also be viewed as a (\(100 \times 100 \times k\), 6) matrix. The number of rows is the number of anchors.
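The shape bookkeeping above can be checked with a few lines of plain Python. This is only a sketch: the helper names are hypothetical, the value \(k = 9\) is the default used in the Faster R-CNN paper (3 scales times 3 aspect ratios), and the layout in which the 6 values of each anchor sit next to each other in a cell's output vector is an assumption.

```python
def rpn_output_shapes(height, width, k, per_anchor=2 + 4):
    """Return the RPN output shape as a tensor and as a per-anchor matrix."""
    tensor_shape = (height, width, per_anchor * k)
    matrix_shape = (height * width * k, per_anchor)
    return tensor_shape, matrix_shape


def split_cell_vector(cell_vector, k, per_anchor=2 + 4):
    """Split one cell's output vector into k per-anchor vectors.

    Assumes (hypothetically) that each anchor's values are contiguous.
    """
    return [cell_vector[i * per_anchor:(i + 1) * per_anchor] for i in range(k)]


k = 9  # anchors per cell, the default in the Faster R-CNN paper
tensor_shape, matrix_shape = rpn_output_shapes(100, 100, k)
print(tensor_shape)   # (100, 100, 54)
print(matrix_shape)   # (90000, 6)

# One cell of the output: 54 numbers become 9 anchors with 6 numbers each.
cell = list(range(6 * k))
anchors = split_cell_vector(cell, k)
scores, box_params = anchors[0][:2], anchors[0][2:]
print(len(anchors), scores, box_params)
```

Both views hold the same \(100 \times 100 \times 6k\) numbers. The matrix view is convenient because each row then corresponds to exactly one anchor.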

When we train the RPN, each image in the training data provides one mini-batch. In the paper, the authors mention that "[...]we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1". It seems that this random sampling is applied only when the loss function is calculated, so the whole image is still fed into the RPN. We could write the loss function in the following way:

$$ \textrm{loss of mini-batch} = \sum\limits_{anchor \in sample} loss(anchor) $$
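A minimal sketch of this sampling step, under stated assumptions, could look like the following. The label encoding (1 for positive, 0 for negative, -1 for anchors that are ignored) and the function names are hypothetical. The per-anchor loss values are placeholders rather than the paper's actual classification and regression terms. Filling the rest of the mini-batch with negatives when there are fewer than 128 positives follows the paper's description.

```python
import random


def sample_anchors(labels, batch_size=256, seed=None):
    """Pick anchor indices for one mini-batch with at most a 1:1 ratio.

    labels[i] is 1 (positive), 0 (negative) or -1 (ignored); this
    encoding is an assumption made for the example.
    """
    rng = random.Random(seed)
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]
    # At most half of the batch is positive; negatives fill the remainder.
    num_pos = min(len(positives), batch_size // 2)
    num_neg = min(len(negatives), batch_size - num_pos)
    return rng.sample(positives, num_pos) + rng.sample(negatives, num_neg)


def mini_batch_loss(per_anchor_loss, sampled):
    """Sum the loss over the sampled anchors only."""
    return sum(per_anchor_loss[i] for i in sampled)


# Toy image: 20 positive, 1000 negative and 500 ignored anchors.
labels = [1] * 20 + [0] * 1000 + [-1] * 500
per_anchor_loss = [1.0] * len(labels)  # placeholder loss values

sampled = sample_anchors(labels, seed=0)
print(len(sampled), mini_batch_loss(per_anchor_loss, sampled))
```

Every anchor still gets a prediction from the forward pass, but only the sampled anchors contribute to the sum, which matches the reading that sampling happens at the loss rather than at the input.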

Mask R-CNN

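Mask R-CNN is built on top of Faster R-CNN. Two new features are introduced in this architecture: a branch that predicts a segmentation mask for each region of interest (ROI), and ROI alignment, which replaces the standard ROI pooling.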

The standard ROI pooling workflow is illustrated by the figure below:


Note that an image is represented by a matrix whose indexes are integers. Strictly speaking, a pixel is better pictured as a dot on a grid than as a square. In Faster R-CNN, the pixels of the projected ROI coincide with pixels of the feature map, because ROI pooling rounds the ROI coordinates to integers. In contrast, with ROI alignment the pixels of the projected ROI do not need to coincide with pixels of the feature map. This is illustrated by the figure on the right side. The orange dots are the values in the ROI pooling layer, and they lie in the blank area of the dotted paper, which means we cannot read their values from the feature map directly. To calculate the values of the orange dots, we can use bilinear interpolation, which is a weighted average of the four surrounding dots.
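The interpolation step can be sketched as follows. This only shows how a single off-grid point gets its value. The function name `bilinear_sample` is hypothetical, and the full ROI alignment layer in Mask R-CNN additionally aggregates several such sample points per output cell.

```python
import math


def bilinear_sample(feature_map, y, x):
    """Value at a real-valued position (y, x) of a 2D feature map.

    feature_map is a list of rows; (y, x) must lie inside the grid.
    The result is a weighted average of the four surrounding grid points.
    """
    height, width = len(feature_map), len(feature_map[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    # Clamp the far neighbours so points on the last row/column stay valid.
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature_map[y0][x0] + dx * feature_map[y0][x1]
    bottom = (1 - dx) * feature_map[y1][x0] + dx * feature_map[y1][x1]
    return (1 - dy) * top + dy * bottom


grid = [[1.0, 2.0],
        [3.0, 4.0]]
print(bilinear_sample(grid, 0.5, 0.5))  # 2.5, the centre of the four dots
print(bilinear_sample(grid, 1.0, 1.0))  # 4.0, exactly on a grid point
```

When the sample point falls exactly on a grid point, the weights collapse to that single value, so the two pictures (dots on the grid, dots between the grid) agree wherever they overlap.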

