Model Training
Deep learning is, at its core, an optimization problem, and the standard setup of an optimization problem is the following:
- Determine a cost function: This is the function we want to minimize. Neural networks are no exception here; the choice of the cost function depends on the nature of the task.
- Define parameters: In a neural network, the parameters are called weights.
- Apply an optimization algorithm: Gradient descent is the algorithm most widely used in deep learning.
Due to the multi-layer structure of a neural network, the evaluation of the cost function is called forward propagation and the calculation of the gradients is called backpropagation. The training of a neural network therefore follows these steps in general:
- Forward propagation
- Error computation
- Backpropagation
- Parameter update
The following formula is used for the parameter update:
\(w \leftarrow w - \alpha \, \nabla_w E\)
where \(w\) denotes a weight, \(E\) the cost function, and \(\alpha\) is called the learning rate, which impacts
- the speed of convergence of the gradient descent algorithm
- the quality of convergence of the gradient descent algorithm
- the ability of the gradient descent algorithm to converge at all
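As a quick illustration with made-up numbers: if a weight is currently \(w = 1.0\), the gradient at that point is \(\nabla_w E = 2.0\), and the learning rate is \(\alpha = 0.1\), the update gives \(w \leftarrow 1.0 - 0.1 \times 2.0 = 0.8\). A larger \(\alpha\) takes bigger steps and may converge faster, but risks overshooting the minimum.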
The general code structure used for training a neural network is the following:

for epoch in range(numOfEpoch):
    for inputs, targets in make_batch(X, Y):        # iterate over mini-batches of the data
        output = net(inputs)                        # forward propagation
        error = lossFunction(output, targets)       # error computation
        backpropagation(net, error, lossFunction)   # backpropagation: compute gradients
        updateParameter(net)                        # parameter update
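For concreteness, here is a minimal sketch of the same loop written with PyTorch (my own choice of framework; the structure above is framework-agnostic). The toy data, the two-layer network, and the hyperparameters are placeholders chosen only for illustration.

import torch
import torch.nn as nn

# Toy regression data: 256 samples with 10 features each (illustrative only).
X = torch.randn(256, 10)
Y = torch.randn(256, 1)

# A small two-layer network with a ReLU activation.
net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

loss_function = nn.MSELoss()                             # L2-norm loss
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)   # gradient descent with learning rate alpha

num_epochs = 5
batch_size = 32

for epoch in range(num_epochs):
    for start in range(0, len(X), batch_size):           # mini-batches
        inputs = X[start:start + batch_size]
        targets = Y[start:start + batch_size]

        outputs = net(inputs)                             # forward propagation
        loss = loss_function(outputs, targets)            # error computation

        optimizer.zero_grad()                             # clear old gradients
        loss.backward()                                   # backpropagation
        optimizer.step()                                  # parameter update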
In the following sections, we briefly describe some of the most important aspects of each step in the training process.
Activation Function
An activation function is used to introduce non-linearity into the model.

The following activation functions are classic:
- sigmoid
- tanh
- ReLU
ReLU is arguably the most popular one because it does not saturate for positive inputs and therefore largely avoids the vanishing gradient problem.
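For reference, here is a minimal NumPy sketch of these three activation functions (the function names are mine, not taken from any particular library):

import numpy as np

def sigmoid(x):
    # Squashes inputs into (0, 1); saturates for large |x|.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes inputs into (-1, 1); zero-centered, but also saturates.
    return np.tanh(x)

def relu(x):
    # Identity for positive inputs, zero otherwise; gradient is 1 for x > 0.
    return np.maximum(0.0, x)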
Loss Function
A loss function is a measure of the difference between the predicted value and the expected value. The predicted value is the output of the neural network, and the expected value is provided by the training data.
The following metrics are commonly used in deep learning (a small sketch of each follows the list):
- L2 norm (MSE, mean squared error)
- L1 norm (MAE, mean absolute error)
- Cross entropy (used in classification problems, typically applied to the softmax of the network's outputs)
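A minimal NumPy sketch of these three losses, assuming y_pred and y_true are arrays of the same shape and, for the cross entropy, that logits holds the raw network outputs for each class (the names are mine, chosen for illustration):

import numpy as np

def mse(y_pred, y_true):
    # L2 norm: mean of squared differences.
    return np.mean((y_pred - y_true) ** 2)

def mae(y_pred, y_true):
    # L1 norm: mean of absolute differences.
    return np.mean(np.abs(y_pred - y_true))

def softmax_cross_entropy(logits, class_index):
    # Softmax turns raw outputs into probabilities; cross entropy is the
    # negative log-probability assigned to the correct class.
    shifted = logits - np.max(logits)                 # shift for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return -np.log(probs[class_index])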
Optimization Methods
Gradient Descent Algorithm
The gradient descent algorithm is commonly used for optimization in deep learning. The basic idea of gradient descent is to update the parameters in the direction opposite to the gradient of the cost function.
Suppose the MSE is used and there are \(N\) data points. The error is calculated by the formula below:
\(E = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2\)
where \(\hat{y}_i\) is the prediction and \(y_i\) the expected value. In practice the gradient is estimated on a batch of \(k\) samples:
\(\nabla_w E \approx \frac{1}{k}\sum_{i=1}^{k} \nabla_w (\hat{y}_i - y_i)^2\)
Depending on the value of \(k\), we have
- Full-batch gradient descent if \(k = N\). The gradient estimate is less noisy, but each epoch produces only a single parameter update, so training progresses slowly.
- Stochastic gradient descent (SGD) if \(k = 1\). Updates are frequent, but the gradient estimate is very noisy.
- Mini-batch gradient descent if \(1 < k < N\). This approach is a compromise between the full-batch approach and SGD and is the most common choice in practice; on the other hand, the batch size becomes a hyperparameter of the model (see the sketch after this list).
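To make the variants concrete, here is a small NumPy sketch of mini-batch gradient descent on a linear model with the MSE loss. The data, the model, and the hyperparameters are invented for illustration only.

import numpy as np

# Toy regression data: N samples with 3 features each (illustrative only).
rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

w = np.zeros(3)          # parameters (weights)
alpha = 0.1              # learning rate
k = 32                   # batch size: k = N -> full batch, k = 1 -> SGD

for epoch in range(20):
    indices = rng.permutation(N)                           # shuffle the data each epoch
    for start in range(0, N, k):
        batch = indices[start:start + k]
        Xb, yb = X[batch], y[batch]
        y_pred = Xb @ w                                     # forward pass
        grad = 2.0 / len(batch) * Xb.T @ (y_pred - yb)      # gradient of the batch MSE
        w = w - alpha * grad                                # gradient descent update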
Learning Rate Adjustment
One of the challenges in training a deep neural network is adjusting the learning rate \(\alpha\). When the parameters approach a local minimum, the gradient becomes very small and the parameter updates become less effective. That is why the learning rate needs to be adjusted during the training process. Here we list some of the most classic learning rate adjustment methods (a momentum sketch follows the list):
- Momentum
- Adagrad
- Adam
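As one example, here is a minimal sketch of the classic momentum update in its velocity-based form (the variable names are mine). Instead of following the raw gradient, the update accumulates a velocity, which helps the optimizer keep moving through regions where the gradient is small:

def momentum_step(w, grad, velocity, alpha=0.01, beta=0.9):
    # beta is the momentum coefficient, typically around 0.9.
    velocity = beta * velocity - alpha * grad   # accumulate an exponentially decaying velocity
    w = w + velocity                            # move along the velocity rather than the raw gradient
    return w, velocity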
Regularization
Another challenge when training a deep neural network is overfitting. Overfitting is a phenomenon in which the model has a small error on the training data but does not generalize well when applied to other data sets. One reason overfitting happens often in deep learning is that a deep neural network has a very large number of parameters. The following methods can be used to mitigate the issue:
- Weight decay (L2 regularization; see the sketch after this list)
- Dropout
- Parameter sharing
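A minimal sketch of weight decay: an L2 penalty on the weights is added to the loss, so large weights are discouraged. The names data_loss, weights, and lambda_wd are my own placeholders; weights is assumed to be a list of arrays (e.g. NumPy arrays or PyTorch tensors).

def loss_with_weight_decay(data_loss, weights, lambda_wd=1e-4):
    # Total loss = task loss + lambda * sum of squared weights.
    l2_penalty = sum((w ** 2).sum() for w in weights)
    return data_loss + lambda_wd * l2_penalty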
We can also apply the "early stopping" method. The idea is that if we see a consistent increase of the error on the validation data, we should stop the optimization.
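A minimal sketch of early stopping with a patience counter. The helpers train_one_epoch and validation_error, as well as max_epochs, are placeholders for whatever the training code provides.

best_error = float("inf")
patience, bad_epochs = 5, 0

for epoch in range(max_epochs):
    train_one_epoch(net)                 # one pass of forward/backward/update over the training data
    error = validation_error(net)        # error on held-out validation data
    if error < best_error:
        best_error, bad_epochs = error, 0
    else:
        bad_epochs += 1                  # validation error did not improve
        if bad_epochs >= patience:       # consistent increase -> stop the optimization
            break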