Deep learning - Initialize a thin network from a wider pre-trained network model
Models such as VGG-16 have a huge number of parameters, and it is not practical to train them from scratch on a home computer. However, starting from a pre-trained model usually improves performance. A natural idea is therefore to reduce a large pre-trained model to a reasonable size so that it becomes possible to train it on an ordinary computer.
In "Fully-adaptive Feature Sharing in Multi-Task Networks with Applications in Person Attribute Classification", the authors propose a method to build a thin network from a pre-trained network model.
First, we need a unified way to represent the layers of a neural network. Let \(x^l\) and \(y^l\) denote the input and output of layer \(l\). Then both fully connected and convolutional layers can be represented by
\[ y^l = \sigma(W^l x^l) \]
where \(W^l\) is the parameter matrix of layer \(l\) and \(\sigma\) is the activation function of the layer.
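To see why a convolutional layer fits the same matrix form as a fully connected one, here is a minimal NumPy sketch. The kernel shape (7 filters, 2 input channels, a 3x3 window) is made up for illustration, and only one input window is evaluated, before any activation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical convolution kernel: 7 output channels, 2 input channels, 3x3 window.
K = rng.standard_normal((7, 2, 3, 3))
patch = rng.standard_normal((2, 3, 3))  # one input window

# Direct evaluation: one dot product per output channel.
y_direct = np.array([(K[c] * patch).sum() for c in range(7)])

# Matrix form: flatten each filter into a row of W and the window into x.
W = K.reshape(7, -1)   # shape (7, 18)
x = patch.reshape(-1)  # shape (18,)
y_matrix = W @ x

assert np.allclose(y_direct, y_matrix)
```

With this layout, each row of the matrix is one filter, so selecting rows of the matrix amounts to selecting output channels.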
Our objective is to reduce the size of the network, which can be done by selecting a subset of it. A neural network is characterized by its weights, so the question becomes how to select a subset of the weights. Given the structure of neural networks, it's not surprising that the authors propose a layer-by-layer approach.
The main idea is that we start from the first layer and select a subset of rows in the parameter matrix. Note that a selection of rows in the weight matrix corresponds to a selection of elements in the layer output.
For example, the figure below shows the weight matrix and the input of layer \(l\). The input dimension is 5 and the output dimension is 7, so the weight matrix is \(7 \times 5\). To reduce the size of this layer, we select a subset of rows in the weight matrix. The selected rows are highlighted in blue in the figure below.
In our example, three rows are selected, which means the new weight matrix of layer \(l\) becomes \(3 \times 5\) and the new layer output has 3 dimensions. Now look at layer \(l+1\). The input of this layer is the output of the previous layer. Since the new output of layer \(l\) has 3 dimensions, we need to adjust the weight matrix of layer \(l+1\) by selecting the corresponding columns. This gives us a new weight matrix for layer \(l+1\), and we can apply the same row-selection process to it.
In summary, row selection reduces the size of a layer, and column selection adjusts the input dimension of the next layer to match.
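The row and column selection can be sketched in a few lines of NumPy. The sizes 5 and 7 follow the example above. The output size of layer \(l+1\) (4), the kept row indices, the random weights and the ReLU activation are all assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for pre-trained weights: layer l maps 5 -> 7, layer l+1 maps 7 -> 4.
W_l = rng.standard_normal((7, 5))
W_next = rng.standard_normal((4, 7))

# Hypothetical selection: keep rows 1, 3 and 6 of layer l.
keep = [1, 3, 6]

W_l_thin = W_l[keep, :]        # row selection: 7x5 -> 3x5
W_next_thin = W_next[:, keep]  # column selection: 4x7 -> 4x3

x = rng.standard_normal(5)
y_full = np.maximum(W_l @ x, 0.0)       # ReLU, assumed for illustration
y_thin = np.maximum(W_l_thin @ x, 0.0)

# The thin layer's output is exactly the kept entries of the full output.
assert np.allclose(y_thin, y_full[keep])
print(W_l_thin.shape, W_next_thin.shape)  # (3, 5) (4, 3)
```

The thin layer reproduces the kept outputs exactly. What changes is the output of layer \(l+1\), which no longer sees the dropped dimensions, and that is why the choice of rows matters.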
The remaining question is how to select the rows in the weight matrix. In the paper, a greedy method called Simultaneous Orthogonal Matching Pursuit (SOMP) is used. More details can be found in the paper "Algorithms for Simultaneous Sparse Approximation. Part I: Greedy Pursuit".
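To give a feel for the greedy selection, here is a minimal sketch of a simultaneous orthogonal matching pursuit applied to the rows of a weight matrix. It assumes the goal is to pick rows whose span reconstructs all rows of the matrix as well as possible. The function name `somp_select_rows`, the row normalization and the toy rank-3 matrix are our own choices for illustration, not details taken from either paper.

```python
import numpy as np

def somp_select_rows(W, k):
    """Greedily pick k rows of W whose span approximates all rows of W."""
    # Normalize candidate rows so the score is not dominated by row norm.
    norms = np.linalg.norm(W, axis=1)
    norms[norms == 0] = 1.0
    atoms = W / norms[:, None]

    residual = W.copy()
    selected = []
    for _ in range(k):
        # Score each candidate row by its total absolute correlation
        # with the current residuals of all rows.
        scores = np.abs(residual @ atoms.T).sum(axis=0)
        if selected:
            scores[selected] = -np.inf  # never pick the same row twice
        selected.append(int(np.argmax(scores)))

        # Project every row of W onto the span of the selected rows
        # (least squares) and update the residual.
        basis = W[selected]
        coeffs, *_ = np.linalg.lstsq(basis.T, W.T, rcond=None)
        residual = W - coeffs.T @ basis
    return sorted(selected)

# Toy check: a 7x5 matrix of rank 3 is fully recovered from 3 of its rows.
rng = np.random.default_rng(0)
W = rng.standard_normal((7, 3)) @ rng.standard_normal((3, 5))
rows = somp_select_rows(W, 3)
A, *_ = np.linalg.lstsq(W[rows].T, W.T, rcond=None)
err = np.linalg.norm(W - A.T @ W[rows])
```

In the toy check the matrix has rank 3, so `err` comes out at the level of floating-point noise. Real weight matrices are not exactly low-rank, and the residual then measures how much is lost by keeping only the selected rows.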
Pointers in the Paper
Methods for compressing and accelerating convolutional networks include
- knowledge distillation
- low-rank factorization
  - Training CNNs with Low-Rank Filters for Efficient Image Classification
  - Convolutional Neural Networks with Low-Rank Regularization
  - Low-Rank Matrix Factorization for Deep Neural Network Training with High-Dimensional Output Targets
- pruning and quantization
  - Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
  - Channel-Level Acceleration of Deep Face Representations
- structured matrices
  - An Exploration of Parameter Redundancy in Deep Networks with Circulant Projections
  - Structured Transforms for Small-Footprint Deep Learning
  - Tamp: A Library for Compact Deep Neural Networks with Structured Matrices
- dynamic capacity networks