Basics of Deep Learning


Model Training

Deep learning is, at its core, an optimization problem, and the standard setup of an optimization problem is the following: find the parameters \(\theta\) that minimize a cost function \(E(\theta)\),

$$ \min_{\theta} E(\theta) $$

Due to the multi-layer structure of a neural network, the evaluation of the cost function is called forward propagation and the calculation of gradients is called backpropagation. The training of a neural network therefore follows these steps in general:

  1. Forward propagation
  2. Error computation
  3. Backpropagation
  4. Parameter update

The following formula is used for parameter update:

$$ \theta = \theta - \alpha \nabla_{\theta}E $$

where \(\alpha\) is called the learning rate, which controls the size of each update step.

The general code structure used for training a neural network is the following:

for epoch in range(num_epochs):
    for inputs, targets in make_batches(X, Y):
        outputs = net(inputs)                     # 1. forward propagation
        error = loss_function(outputs, targets)   # 2. error computation
        gradients = backpropagation(net, error)   # 3. backpropagation
        update_parameters(net, gradients, alpha)  # 4. parameter update
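As a concrete illustration, here is a minimal, self-contained NumPy sketch of this loop for a one-parameter linear model trained with MSE. The names (`make_batches`, `w`, `b`, etc.) are illustrative, not from any particular framework; the four steps are marked in the comments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + 1 plus a little noise
X = rng.uniform(-1, 1, size=(100, 1))
Y = 2 * X + 1 + 0.01 * rng.normal(size=(100, 1))

w, b = 0.0, 0.0   # parameters theta
alpha = 0.1       # learning rate

def make_batches(X, Y, batch_size=10):
    # shuffle the data and yield it in mini-batches
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], Y[sel]

for epoch in range(50):
    for inputs, targets in make_batches(X, Y):
        outputs = w * inputs + b             # 1. forward propagation
        residual = outputs - targets         # 2. error computation
        grad_w = np.mean(residual * inputs)  # 3. backpropagation (chain rule)
        grad_b = np.mean(residual)
        w -= alpha * grad_w                  # 4. parameter update
        b -= alpha * grad_b

print(round(w, 2), round(b, 2))  # w approaches 2, b approaches 1
```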

In the following sections, we briefly describe some of the most important aspects of each step in the training process.

Activation Function

An activation function introduces non-linearity into the model; without it, a stack of linear layers would collapse into a single linear transformation.

The following activation functions are classic:

  * Sigmoid: \(\sigma(x) = 1 / (1 + e^{-x})\)
  * Tanh: \(\tanh(x)\)
  * ReLU: \(\max(0, x)\)

ReLU is arguably the most popular one because, unlike sigmoid and tanh, it does not saturate for positive inputs and therefore does not suffer from the vanishing gradient problem.
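These classic activations can be sketched in a few lines of NumPy; the derivative comments indicate why sigmoid and tanh saturate (their gradients shrink toward zero for large \(|x|\)) while ReLU does not.

```python
import numpy as np

def sigmoid(x):
    # squashes input into (0, 1); derivative sigmoid(x) * (1 - sigmoid(x)) is at most 0.25
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes input into (-1, 1); derivative 1 - tanh(x)**2 vanishes for large |x|
    return np.tanh(x)

def relu(x):
    # passes positive inputs unchanged; derivative is 1 for x > 0, so no saturation there
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))  # approximately [0.119, 0.5, 0.881]
print(relu(x))     # [0. 0. 2.]
```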

Loss Function

A loss function measures the difference between the predicted value and the expected value: the predicted value is the output of the neural network, and the expected value is provided by the training data.

The following loss functions are commonly used in deep learning:

  * Mean squared error (MSE), typically for regression
  * Cross-entropy, typically for classification
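Both of these common losses are short NumPy one-liners; this sketch assumes one-hot targets and predicted probabilities for the cross-entropy case.

```python
import numpy as np

def mse(y_true, y_pred):
    # mean squared error: average of squared differences, common for regression
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # cross-entropy for one-hot targets and predicted probabilities (rows sum to 1);
    # eps guards against log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])
print(mse(y_true, y_pred))            # 0.025
print(cross_entropy(y_true, y_pred))  # approximately 0.164
```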

Optimization Methods

Gradient Descent Algorithm

The gradient descent algorithm is commonly used for optimization in deep learning. The basic idea of gradient descent is to update the parameters in the direction opposite to the gradient:

$$ \theta = \theta - \alpha \nabla_{\theta}E $$

Suppose MSE is used, the dataset contains \(N\) data points, and \(k\) of them are used for one parameter update. The error over those \(k\) points is calculated by the formula below:

$$ error = \frac{1}{2} \sum_{i = 1}^{k} (y_i - \hat{y}_i)^2 $$

The gradient is given by

$$ \frac{\partial\, error}{\partial \theta} = - \sum_{i = 1}^{k} (y_i - \hat{y}_i) \frac{\partial \hat{y}_i}{\partial \theta} $$

Depending on the value of \(k\), we have:

  * Batch gradient descent: \(k = N\) (the whole dataset per update)
  * Stochastic gradient descent (SGD): \(k = 1\)
  * Mini-batch gradient descent: \(1 < k < N\)
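The choice of \(k\) only changes how many points feed each update. Here is a minimal sketch of one gradient step for a linear model \(\hat{y} = \theta x\) (names illustrative), run here in full-batch mode with \(k = N\):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=200)
Y = 3.0 * X  # true parameter: theta = 3

def gradient_step(theta, x, y, alpha=0.5):
    # one update using k = len(x) data points
    y_hat = theta * x                     # forward propagation
    grad = -np.sum((y - y_hat) * x)       # d(error)/d(theta) from the MSE formula
    return theta - alpha * grad / len(x)  # divide by k so alpha works for any batch size

theta = 0.0
for _ in range(100):                 # batch gradient descent: k = N
    theta = gradient_step(theta, X, Y)
# SGD would instead call gradient_step(theta, X[i:i+1], Y[i:i+1]) for random i;
# mini-batch gradient descent uses slices of intermediate size 1 < k < N
print(round(theta, 3))  # converges to 3.0
```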

Learning Rate Adjustment

One of the challenges in training a deep neural network is adjusting the learning rate \(\alpha\). When the parameters approach a local minimum, the gradient becomes very small and fixed-size updates become less efficient, so the learning rate should be adjusted during the training process. Here are some of the most classic learning rate adjustment methods:

  * Step decay: reduce \(\alpha\) by a fixed factor every few epochs
  * Exponential decay: \(\alpha_t = \alpha_0 e^{-kt}\)
  * Adaptive optimizers such as AdaGrad, RMSProp, and Adam, which adapt the effective learning rate per parameter
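The simplest schedules, step decay and exponential decay, can be written as plain functions of the epoch number; the decay constants below are illustrative, not prescribed values.

```python
import math

def step_decay(alpha0, epoch, drop=0.5, every=10):
    # multiply the learning rate by `drop` once every `every` epochs
    return alpha0 * (drop ** (epoch // every))

def exponential_decay(alpha0, epoch, k=0.05):
    # smooth decay: alpha0 * exp(-k * epoch)
    return alpha0 * math.exp(-k * epoch)

print(step_decay(0.1, 25))                     # 0.1 * 0.5**2 = 0.025
print(round(exponential_decay(0.1, 25), 4))    # 0.1 * exp(-1.25)
```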


Another challenge when training a deep neural network is overfitting: the model has a small error on the training data but does not generalize well to other data sets. One reason overfitting happens often in deep learning is that deep neural networks have a very large number of parameters. The following methods can be used to mitigate the issue:

  * L1/L2 regularization (weight decay)
  * Dropout
  * Data augmentation
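For example, L2 regularization adds a penalty proportional to the squared weights to the loss, and dropout randomly zeroes activations during training. A minimal NumPy sketch of both (function names illustrative):

```python
import numpy as np

def l2_penalty(weights, lam=0.01):
    # adds lam * ||w||^2 to the loss, discouraging large weights
    return lam * np.sum(weights ** 2)

def dropout(activations, p=0.5, rng=None):
    # zero each activation with probability p during training; scale survivors
    # by 1/(1-p) ("inverted dropout") so the expected activation is unchanged
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

w = np.array([1.0, -2.0])
print(l2_penalty(w))  # 0.01 * (1 + 4) = 0.05
```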

We can also apply the "early stopping" method. The idea is that if we see a consistent increase of the error on the validation data, we should stop the optimization.
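A common way to implement this is a patience counter: stop once the validation error has failed to improve for a few consecutive epochs. A sketch (names illustrative):

```python
def early_stopping_epoch(val_errors, patience=3):
    # stop once the validation error has not improved for `patience` consecutive
    # epochs; return the epoch index of the best (lowest-error) model seen
    best_err, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch, bad_epochs = err, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch

# validation error improves until epoch 2, then rises: training stops early
print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.8, 0.5]))  # 2
```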

----- END -----
