Deep learning - Initializing a thin network from a wider pre-trained model



Models such as VGG-16 have a huge number of parameters, and it is not practical to train them on a home computer. However, it is well known that starting from a pre-trained model can improve performance. A natural idea is therefore to reduce a large pre-trained model to a reasonable size so that it can be trained on an ordinary computer.

In Fully-adaptive Feature Sharing in Multi-Task Networks with Applications in Person Attribute Classification, the authors proposed a method to build a thin network from a pre-trained network model.

First, we need a unified way to represent the layers of a neural network. Let \(x^l\) and \(y^l\) denote the input and output of layer \(l\); then both a fully connected layer and a convolutional layer can be represented by

$$ y^l = \sigma_{l} (P(W^l) x^l) $$

where \(W^l\) is the parameter matrix of layer \(l\), \(\sigma_{l}\) is the activation function, and \(P\) maps the parameters to the matrix of the layer's linear operation: for a fully connected layer \(P\) is the identity, while for a convolutional layer \(P(W^l)\) is the structured matrix that performs the convolution.
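To make this concrete, here is a minimal numpy sketch (mine, not from the paper) of the unified view: a fully connected layer applies \(W^l\) directly, while a convolution becomes the same matrix-multiply form once the input patches are unrolled (the usual im2col trick). All function names here are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fc_layer(W, x):
    # Fully connected layer: P is the identity, so y = sigma(W x).
    return relu(W @ x)

def conv_as_matmul(kernels, image):
    # Single-channel convolution (stride 1, no padding) rewritten as a
    # matrix multiply. kernels: (out_channels, kh, kw), image: (h, w).
    out_c, kh, kw = kernels.shape
    h, w = image.shape
    # Unroll every kh x kw patch of the input into a column (im2col).
    cols = np.stack([
        image[i:i + kh, j:j + kw].ravel()
        for i in range(h - kh + 1)
        for j in range(w - kw + 1)
    ], axis=1)                               # (kh*kw, num_patches)
    W = kernels.reshape(out_c, kh * kw)      # P(W^l): kernels flattened to a matrix
    return relu(W @ cols)                    # (out_c, num_patches)

rng = np.random.default_rng(0)
print(fc_layer(rng.normal(size=(7, 5)), rng.normal(size=5)).shape)               # (7,)
print(conv_as_matmul(rng.normal(size=(4, 3, 3)), rng.normal(size=(8, 8))).shape) # (4, 36)
```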

Our objective is to reduce the size of the network, which can be done by selecting a subnetwork. A neural network is characterized by its weights, so the question becomes how to select a subset of the weights. Given the layered structure of neural networks, it is not surprising that the authors proposed a layer-by-layer approach.

The main idea is to start from the first layer and select a subset of rows in its parameter matrix. Note that selecting rows of the weight matrix corresponds to selecting elements of the layer output.

For example, the figure below shows the weight matrix and the input of layer \(l\). As we can see, the input dimension is 5 and the output dimension is 7. To reduce the size of this layer, we select a subset of rows of the weight matrix; the selected rows are highlighted in blue.

[Figure: the weight matrix and input of layer l, with the three selected rows highlighted in blue]

In our example, three rows are selected, so the new weight matrix of layer \(l\) is \(3 \times 5\) and the new layer output has 3 dimensions. Now look at layer \(l+1\), whose input is the output of layer \(l\). Since the new output of layer \(l\) has 3 dimensions, we must adjust the weight matrix of layer \(l+1\) by selecting the corresponding columns. This gives a new weight matrix for layer \(l+1\), to which we can apply the same row-selection process described above.

[Figure: the weight matrix of layer l+1, with the columns corresponding to the selected rows of layer l highlighted]

In summary, row selection reduces the size of a layer, while column selection adjusts the next layer's dimensions to match the reduced output.
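The whole reduction step amounts to two slicing operations. Below is a minimal numpy sketch using the dimensions from the example above; the variable names and the particular row indices are mine (in the paper the rows are chosen by SOMP, discussed next).

```python
import numpy as np

rng = np.random.default_rng(0)
W_l   = rng.normal(size=(7, 5))   # layer l: 5 inputs -> 7 outputs
W_lp1 = rng.normal(size=(4, 7))   # layer l+1: 7 inputs -> 4 outputs

rows = [1, 3, 6]                  # rows kept in layer l (illustrative choice)

W_l_thin  = W_l[rows, :]          # (3, 5): reduced weight matrix of layer l
W_lp1_adj = W_lp1[:, rows]        # (4, 3): columns adjusted to the new input size

x = rng.normal(size=5)
h = np.maximum(W_l_thin @ x, 0.0)    # new 3-dimensional output of layer l
y = np.maximum(W_lp1_adj @ h, 0.0)   # layer l+1 still produces 4 outputs
print(h.shape, y.shape)              # (3,) (4,)
```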

The remaining question is how to select the rows of the weight matrix. The paper uses a greedy method called Simultaneous Orthogonal Matching Pursuit (SOMP). More details can be found in the paper Algorithms for Simultaneous Sparse Approximation. Part I: Greedy Pursuit.
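As a rough illustration, here is a minimal numpy sketch of greedy SOMP as I understand it: treat each row of \(W^l\) as both a signal and a dictionary atom, greedily pick the atom most correlated with the current residual, and re-fit by least squares after each pick. This is a sketch of the generic algorithm under my reading, not the authors' code.

```python
import numpy as np

def somp_row_select(W, k):
    # Greedily pick k rows of W whose span best reconstructs all of W.
    Y = W.T                     # columns of Y are the rows of W (the signals)
    D = W.T                     # the same rows also serve as dictionary atoms
    selected, residual = [], Y.copy()
    for _ in range(k):
        # Score each atom by its total correlation with the current residual.
        scores = np.abs(D.T @ residual).sum(axis=1)
        scores[selected] = -np.inf          # never re-pick a selected atom
        selected.append(int(np.argmax(scores)))
        # Re-fit: project the signals onto the span of the selected atoms.
        A, *_ = np.linalg.lstsq(D[:, selected], Y, rcond=None)
        residual = Y - D[:, selected] @ A
    return sorted(selected)

rng = np.random.default_rng(0)
W = rng.normal(size=(7, 5))
rows = somp_row_select(W, 3)
print(rows)   # indices of the 3 rows used to initialize the thin layer
```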

Pointers in the Paper

The paper also gives pointers to the literature on methods for compressing and accelerating convolutional networks.

----- END -----
