# Basics of Deep Learning

### Model Training

Training a deep learning model is an optimization problem, and the standard setup of an optimization problem is the following:

• Determine a cost function: This is the function we want to minimize, and neural networks are no exception. The choice of cost function depends on the nature of the task.
• Define parameters: In a neural network, the parameters are called weights.
• Apply an optimization algorithm: The gradient descent algorithm is widely used in deep learning.

Due to the multi-layer structure of a neural network, evaluating the cost function is called forward propagation and calculating the gradients is called backpropagation. Training a neural network therefore generally follows these steps:

1. Forward propagation
2. Error computation
3. Backpropagation
4. Parameter update

The following formula is used for parameter update:

$$\theta = \theta - \alpha \nabla_{\theta}E$$

where $$E$$ is the cost (error) and $$\alpha$$ is called the learning rate, which impacts

• the speed of convergence of the gradient descent algorithm
• the quality of convergence of the gradient descent algorithm
• the ability of the gradient descent algorithm to converge at all
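
As a rough illustration of these three effects, the sketch below applies the update rule to the one-dimensional cost $$E(\theta) = \theta^2$$, whose gradient is $$2\theta$$. The function name `gradient_descent`, the starting point, and the three learning rates are assumptions chosen for this example only.

```python
def gradient_descent(alpha, theta=1.0, steps=20):
    """Minimize E(theta) = theta**2 with a fixed learning rate alpha."""
    for _ in range(steps):
        grad = 2.0 * theta            # dE/dtheta for E = theta^2
        theta = theta - alpha * grad  # the update rule from the text
    return theta

small = gradient_descent(alpha=0.01)  # converges, but slowly
good = gradient_descent(alpha=0.4)    # converges quickly toward 0
large = gradient_descent(alpha=1.1)   # overshoots on every step and diverges

print(small, good, large)
```

With the small rate the parameter is still far from the minimum at 0 after 20 steps, with the moderate rate it is essentially at 0, and with the large rate its magnitude grows on every step.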

The general code structure used for training a neural network is the following:

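    # Pseudocode: make_batch, net, lossFunction, backpropagation and
    # updateParameter stand for framework-specific pieces.
    for epoch in range(numOfEpoch):
        for inputs, target in make_batch(X, Y):
            output = net(inputs)                        # 1. forward propagation
            error = lossFunction(output, target)        # 2. error computation
            backpropagation(net, error, lossFunction)   # 3. backpropagation
            updateParameter(net)                        # 4. parameter update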
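
The structure above can be made concrete with a deliberately tiny model. The sketch below is an assumption-laden stand-in, not a real network: it fits a single linear unit $$\hat{y} = wx + b$$ with half the mean squared error as the loss, so the gradients can be written by hand. The `train` function, the hyperparameter values, and the synthetic data are all hypothetical choices made for illustration.

```python
import random

def make_batch(X, Y, batch_size):
    """Yield shuffled mini-batches of (inputs, targets)."""
    indices = list(range(len(X)))
    random.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        yield [X[i] for i in chunk], [Y[i] for i in chunk]

def train(X, Y, num_epochs=300, batch_size=4, alpha=0.1):
    w, b = 0.0, 0.0  # parameters of the linear unit
    for _ in range(num_epochs):
        for inputs, targets in make_batch(X, Y, batch_size):
            # 1. forward propagation
            outputs = [w * x + b for x in inputs]
            # 2. error computation: derivative of 0.5 * (output - target)^2
            #    with respect to each output
            diffs = [o - t for o, t in zip(outputs, targets)]
            # 3. backpropagation: chain rule through y_hat = w * x + b,
            #    averaged over the batch
            n = len(inputs)
            grad_w = sum(d * x for d, x in zip(diffs, inputs)) / n
            grad_b = sum(diffs) / n
            # 4. parameter update
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b

random.seed(0)
X = [i / 10 for i in range(20)]   # inputs 0.0, 0.1, ..., 1.9
Y = [3.0 * x + 1.0 for x in X]    # noise-free targets from y = 3x + 1
w, b = train(X, Y)
print(w, b)                       # should end up close to 3 and 1
```

In a real framework, steps 3 and 4 are usually handled by automatic differentiation and an optimizer object rather than hand-written gradients.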


In the following sections, we briefly describe some of the most important aspects of each step in the training process.

### Activation Function

An activation function introduces non-linearity into the model. The following activation functions are classic:

• sigmoid
• tanh
• ReLU

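ReLU is arguably the most popular one because it is much less prone to the vanishing gradient problem: its gradient is exactly 1 for positive inputs, whereas sigmoid and tanh saturate, so their gradients shrink toward 0 for inputs of large magnitude.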
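
To make the vanishing gradient remark concrete, the following sketch defines the three activations for scalar inputs together with their derivatives and evaluates the derivatives at a few points. The helper names (`sigmoid_grad`, `tanh_grad`, `relu_grad`) are chosen for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # at most 0.25, reached at x = 0

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2  # at most 1, reached at x = 0

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0    # constant 1 for any positive input

for x in (-5.0, 0.5, 5.0):
    print(x, sigmoid_grad(x), tanh_grad(x), relu_grad(x))
```

At $$x = 5$$ the sigmoid and tanh derivatives are already well below 0.01, while the ReLU derivative is still 1. For negative inputs the ReLU derivative is 0, which is its own known weakness.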

### Loss Function

A loss function measures the difference between the predicted value and the expected value. The predicted value is the output of the neural network, and the expected value is provided by the training data.

The following loss functions are commonly used in deep learning:

• L2 loss (mean squared error, MSE)
• L1 loss (mean absolute error, MAE)
• Cross-entropy, typically combined with softmax (used in classification problems)
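
A minimal sketch of these three losses on plain Python lists is given below. The function names and the convention of passing raw scores (logits) plus the index of the true class to `cross_entropy` are assumptions of this example; libraries differ in the exact conventions they use.

```python
import math

def mse(pred, target):
    """Mean squared error (L2 loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def mae(pred, target):
    """Mean absolute error (L1 loss)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_class):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[true_class])

print(mse([1.0, 2.0], [1.0, 4.0]))        # 2.0
print(mae([1.0, 2.0], [1.0, 4.0]))        # 1.0
print(cross_entropy([2.0, 1.0, 0.1], 0))  # small, since class 0 has the top score
print(cross_entropy([2.0, 1.0, 0.1], 2))  # larger, since class 2 has the lowest score
```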

### Optimization Methods

The gradient descent algorithm is commonly used for optimization in deep learning. The basic idea of gradient descent is to move the parameters a small step in the direction opposite to the gradient of the error:

$$\theta = \theta - \alpha \nabla_{\theta}E$$

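Suppose the MSE is used and there are $$N$$ data points, of which $$k$$ are used for each parameter update. Writing the error as a sum with a factor of $$\frac{1}{2}$$ (which differs from the mean only by a constant factor), the error and its gradient are calculated by the formulas below:

$$error = \frac{1}{2} \sum_{i = 1}^{k} (y_i - \hat{y}_i)^2$$

$$\frac{\partial error}{\partial \theta} = - \sum_{i = 1}^{k} (y_i - \hat{y}_i) \frac{\partial \hat{y}_i}{\partial \theta}$$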
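
One way to sanity-check the gradient formula above is to compare it with a finite-difference estimate. The sketch below assumes the simplest possible model, $$\hat{y} = \theta x$$, so that $$\partial \hat{y} / \partial \theta = x$$. The model, the data, and the function names are illustrative assumptions.

```python
def error(theta, xs, ys):
    """0.5 * sum of squared differences for the model y_hat = theta * x."""
    return 0.5 * sum((y - theta * x) ** 2 for x, y in zip(xs, ys))

def analytic_grad(theta, xs, ys):
    """Gradient from the formula: -(y - y_hat) * d(y_hat)/d(theta), summed."""
    return -sum((y - theta * x) * x for x, y in zip(xs, ys))

def numeric_grad(theta, xs, ys, eps=1e-6):
    """Central finite-difference approximation of d(error)/d(theta)."""
    return (error(theta + eps, xs, ys) - error(theta - eps, xs, ys)) / (2 * eps)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
theta = 1.5
print(analytic_grad(theta, xs, ys))  # -7.0
print(numeric_grad(theta, xs, ys))   # approximately -7.0
```

The gradient is negative here because $$\theta = 1.5$$ is below the value 2 that fits the data, so the update $$\theta - \alpha \nabla_{\theta}E$$ increases $$\theta$$.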

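Depending on the value of $$k$$, we have

• Full-batch gradient descent if k == N. The gradient is exact for the training set, so this approach is the least noisy. However, the parameters are updated only once per epoch, so each update is expensive and progress per epoch is slow.
• Stochastic gradient descent (SGD) if k == 1. Updates are frequent and cheap, but each gradient is estimated from a single data point and is therefore very noisy.
• Mini-batch gradient descent if 1 < k < N. This approach sits between the full-batch approach and SGD, trading some gradient noise for more frequent updates, which makes it the usual choice in practice. On the other hand, the batch size becomes a hyperparameter of the model.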
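
The trade-off between noise and update frequency can be measured directly. The sketch below assumes the model $$\hat{y} = \theta x$$ on synthetic noisy data, uses the mean (rather than the sum) of the per-sample gradients so that different batch sizes are comparable, and reports how much the gradient estimate varies across randomly drawn batches. All names and constants are hypothetical choices for this example.

```python
import random
import statistics

def batch_gradient(theta, batch):
    """Mean gradient of 0.5 * (y - theta * x)^2 over one batch of (x, y) pairs."""
    return -sum((y - theta * x) * x for x, y in batch) / len(batch)

def gradient_spread(data, k, trials=500, theta=0.0, seed=0):
    """Standard deviation of the gradient estimate over random batches of size k."""
    rng = random.Random(seed)
    grads = [batch_gradient(theta, rng.sample(data, k)) for _ in range(trials)]
    return statistics.pstdev(grads)

# Synthetic data set with N = 40 points around the line y = 2x
noise = random.Random(1)
data = [(i / 10, 2.0 * (i / 10) + noise.gauss(0.0, 0.5)) for i in range(1, 41)]
N = len(data)

results = {}
for name, k in (("stochastic", 1), ("mini-batch", 8), ("full-batch", N)):
    updates_per_epoch = -(-N // k)  # ceiling division
    results[name] = (updates_per_epoch, gradient_spread(data, k))
    print(name, "k =", k, "updates per epoch =", updates_per_epoch,
          "gradient spread =", results[name][1])
```

The spread shrinks as $$k$$ grows and is zero, up to floating-point rounding, for the full batch, while the number of updates per epoch falls from $$N$$ to 1.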

One of the challenges in training a deep neural network is adjusting the learning rate $$\alpha$$. When the parameters approach a local minimum, the gradient becomes very small and the parameter updates become less efficient. That is why the learning rate needs to be adjusted during the training process. Here we list some of the most classic learning rate adjustment methods:

• Momentum