# Introduction to Deep Reinforcement Learning

## Introduction

In this post, we present a brief introduction to deep reinforcement learning. The content of this post is based on chapters 8-11 of the book Grokking Deep Reinforcement Learning.

We first present the NFQ (Neural Fitted Q) learning method. It's a natural transition from tabular learning methods to deep reinforcement learning methods. However, the setup in NFQ violates assumptions used in neural network training, such as the IID distribution of the data. Another issue with NFQ is the moving target value. These problems are addressed in DQN (deep Q-network). Similar to double Q-learning, the DQN architecture can be improved by using two neural networks, which leads to the double DQN architecture. We then present an improvement implemented at the neural network level: the dueling network architecture. An improvement to the replay buffer is also discussed.

In the last section, we briefly talk about policy gradient methods and how we can implement them using a neural network.

## Neural Fitted Q (NFQ)

We are talking about deep reinforcement learning, so we should expect to see a neural network somewhere in the solution. For the sake of simplicity, we could say a neural network is a powerful tool that can approximate any function. Now we have the following decisions to make:

• Which function to approximate? (V-function, Q-function, or A-function?)
• What's the neural network architecture to use? How do we represent states and actions in the neural network?
• What is the loss function?
• What should we optimize? Which optimization method should we use?

Recall that in a tabular reinforcement learning algorithm, we have two steps in each iteration:

• policy evaluation
• policy improvement

In the policy evaluation step, the agent estimates the expected return, which is part of the target value used in the policy improvement step. In the policy improvement step, the agent updates the Q-function values, which improves the policy. Updating the Q-function values is just another way of saying that the agent tracks the target value, as shown in the formula below:

$$Q(s,a) := Q(s,a) + \alpha(\textrm{targetValue} - Q(s,a))$$
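As a concrete illustration, this update rule can be written as a small function (the dictionary-based table and the default step size here are illustrative choices, not from the book):

```python
def q_update(q_table, s, a, target_value, alpha=0.1):
    """Move Q(s, a) a fraction alpha of the way toward the target value."""
    old = q_table.get((s, a), 0.0)
    q_table[(s, a)] = old + alpha * (target_value - old)
    return q_table[(s, a)]
```

For example, starting from `Q(s, a) = 0` with `alpha = 0.5` and a target of `2.0`, one update moves the estimate to `1.0`, i.e. halfway to the target.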

In this sense, the problem becomes a supervised learning problem: we want to fit a model $$Q(s, a)$$ so that it's close to the sampled target values.

The second question is which neural network architecture to use. More importantly, how can we represent states and actions in the neural network? A common setup is to use the input layer to represent states and to use the output layer to represent actions. For example, a simple neural network architecture is given as follows:

The input layer consists of nodes representing states and the output layer consists of nodes representing action values.
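A minimal sketch of such a network with a single hidden layer is shown below. The layer sizes, tanh activation, and random initialization are all illustrative assumptions, and a real implementation would use a deep learning library rather than plain Python:

```python
import math
import random

def init_q_network(state_dim, n_actions, hidden=16, seed=0):
    """Randomly initialize a one-hidden-layer Q-network.

    Input nodes represent the state vector; output nodes represent
    one action value per action."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(state_dim)],
        "b1": [0.0] * hidden,
        "W2": [[rng.gauss(0, 0.1) for _ in range(n_actions)] for _ in range(hidden)],
        "b2": [0.0] * n_actions,
    }

def q_values(params, state):
    """Forward pass: a state vector in, one value per action out."""
    h = [math.tanh(sum(s * w for s, w in zip(state, col)) + b)
         for col, b in zip(zip(*params["W1"]), params["b1"])]
    return [sum(hi * w for hi, w in zip(h, col)) + b
            for col, b in zip(zip(*params["W2"]), params["b2"])]
```

Calling `q_values(init_q_network(4, 2), state)` on a 4-dimensional state returns a list of 2 action values, one per output node.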

The third question is what loss function should we use? A typical choice is the Euclidean distance between $$Q(s,a)$$ and the target value.

The last question is about the optimization method, which is not directly related to the reinforcement learning context. We will not provide details in this post.

Overall, NFQ is a smooth transition from tabular RL methods to deep RL. The implementation of NFQ shares a similar structure to the one used in tabular RL methods. The only major difference is that instead of updating the Q-function values directly, we pass the target value to the neural network as labeled data and use backpropagation to update the parameters in the neural network.

NFQ has many drawbacks. First and foremost, reinforcement learning is not supervised learning. In supervised learning, we assume the sampled data is independently and identically distributed (IID). This is not true in reinforcement learning: sampled data within a trajectory is not independent. Moreover, when the policy is improved, it changes the data generation process, which in turn changes the distribution of the sampled data. There is another problem in NFQ: the target value is moving. When we improve the policy and update the Q-function, the target value changes as a side effect because the target value depends on the Q-function.

As we will see in the next section, these issues will be addressed in the deep Q-network.

## Deep Q-Network (DQN)

DQN aims to solve the two issues mentioned in the previous section:

• the moving targets
• the violation of the IID assumption

Let's focus on the first issue: the moving targets. The target values depend on the Q-function, and in a stochastic environment, the target values are stochastic. The objective is to make these target values more static. One solution is to use a separate neural network to calculate the target values; this network should react slowly to the environment. A simple way to achieve this is to periodically freeze the online Q-function network and save its parameters as the target value network. In this way, we don't need to train two neural networks.

```python
# The target network lags behind the online network by a fixed number of steps.
t_target = float('-inf')

for t in range(N_STEPS):

    # other code...

    if t - t_target > LAG_BETWEEN_TARGET_AND_ONLINE_MODEL:
        # Freeze the online neural network.
        targetValueNeuralNetwork = onlineNeuralNetwork.copy()
        t_target = t

    targetValue = calculateTarget(targetValueNeuralNetwork, nextState)
    q_sa = calculateValue(onlineNeuralNetwork, state, action)

    # other code...
```


To solve the IID violation issue, we can introduce a data structure called a replay buffer. The idea is that instead of learning from the immediate experiences provided by the environment, we store them in the replay buffer. Later, we draw samples from the replay buffer and use the sampled data to train the neural network. Data in the replay buffer is more independent because it comes from different sources: different trajectories, and different data generation processes (because the policies differ over time).
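A minimal replay buffer can be sketched with standard-library containers (the capacity and the shape of the stored experience tuples are illustrative choices):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores experience tuples, e.g. (state, action, reward, next_state, done),
    and samples uniformly at random."""

    def __init__(self, capacity=10000):
        # With maxlen set, the oldest experiences are evicted automatically.
        self.buffer = deque(maxlen=capacity)

    def store(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Uniform sampling across many trajectories breaks the temporal
        # correlation present within a single trajectory.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

During training, the agent calls `store` after every environment step and periodically calls `sample` to obtain a minibatch for the neural network update.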

## Double DQN

DQN is still a Q-learning method. Therefore, it tends to overestimate values. In double Q-learning, we use two Q-functions; similarly we can have two neural networks to represent these two Q-functions. Recall that in DQN, we already have two neural networks: one for the online model and the other for the target value estimate. In double DQN, we will leverage this structure in DQN.

Let's quickly review the overestimation issue. The problem is that we use the following formula to calculate the target value in Q-learning:

$$R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a)$$

and the $$\max$$ operator introduces a positive bias. The solution implemented in double DQN is based on the observation that this $$\max$$ operation consists of two steps:

$$\begin{eqnarray} (1) \;\; & a^{*} = \operatorname*{argmax}_{a} Q(S_{t+1}, a) \\ (2) \;\; & \max_{a} Q(S_{t+1}, a) = Q(S_{t+1}, a^{*}) \end{eqnarray}$$

Notice that the first step asks the Q-function to recommend an action and the second step asks the Q-function to put a price/value on the recommended action. The bias is caused by the fact that the Q-function plays two roles with a "conflict of interest" at the same time. The solution is simple: instead of having a single Q-function (in the case of DQN, a neural network) play both roles, we should have two Q-functions/neural networks, each playing a single role. We already have a neural network for the target value estimate and a neural network to represent the online model, so the question is which one should give the recommendation and which one should put a value on the recommended action.

This is not hard. The target value neural network should be responsible for the estimate because this is exactly why it's called a target value neural network in the first place and this estimate is part of the target value. The online model neural network should be responsible for recommending an action because (1) this is its job and (2) it has access to the most recent feedback from the environment, which means it has more up-to-date information.

Putting everything together, in double DQN we have

$$\begin{eqnarray} (1) \;\; & & a^{*} = \operatorname*{argmax}_{a} Q_{online}(S_{t+1}, a) \\ (2) \;\; & & \textrm{targetValue} = R_{t+1} + \gamma Q_{target}(S_{t+1}, a^{*}) \end{eqnarray}$$
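These two steps can be sketched as follows. The function and variable names are hypothetical; `q_online_next` and `q_target_next` stand for the two networks' value estimates at $$S_{t+1}$$:

```python
def double_dqn_target(reward, gamma, q_online_next, q_target_next, done=False):
    """Online network picks the action; target network prices it."""
    if done:
        # Terminal transition: no future value to bootstrap from.
        return reward
    # Step (1): the online network recommends an action.
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # Step (2): the target network puts a value on the recommended action.
    return reward + gamma * q_target_next[a_star]
```

For example, with `reward=1.0`, `gamma=0.9`, online values `[1.0, 3.0]`, and target values `[0.5, 2.0]`, the online network picks action 1 and the target is `1.0 + 0.9 * 2.0 = 2.8`.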

## Dueling Network Architecture

Dueling network architecture is an improvement at the neural network level and does not have an impact at the algorithm level. In other words, it's an implementation detail of the neural network.

The idea is simple: we want to highlight the fact that the action values $$Q(s,a_1)$$, $$Q(s, a_2)$$, ..., $$Q(s, a_m)$$ have a common factor which is the state value $$V(s)$$.

This relationship is not explicitly represented in DQN. In DQN, $$Q(s,a)$$ is represented by the output layer of the neural network and all we are saying is that the action values $$Q(s,a_1)$$, $$Q(s, a_2)$$, ..., $$Q(s, a_m)$$ depend on a common factor (because we have links between the hidden layer and the output layer).

Mathematically, we can split the action value function into a state value function and an advantage function:

$$Q(s,a) = V(s) + A(s,a)$$

and the neural network becomes:

There is a small issue here. We have $$m$$ actions and in the output layer we have $$m$$ nodes. However, after the split, we now have $$m$$ advantage nodes plus a state value node. Mathematically, we have $$m$$ equations and $$m+1$$ variables, which means we have an infinite number of solutions (because we can always add an offset to the state value and remove the same offset from the advantage value). Therefore, we need an additional constraint to have a unique solution. The constraint implemented in the dueling network architecture is:

$$\sum_{i=1}^{m} A(s, a_i) = 0$$

Now to construct the nodes in the output layer, we should do

$$q(s, a_{i}) = v(s) + (a(s, a_{i}) - \frac{1}{m} \sum_{k=1}^{m} a(s, a_{k}))$$

where $$v(s)$$ is the state value node and $$a(s,a_{i})$$ is the advantage value node.

We can simplify the above expression because $$s$$ is provided by the input layer and the neural network itself doesn't need to have the concept of action. This gives us:

$$q_i = v + (a_{i} - \frac{1}{m} \sum_{k=1}^{m} a_{k}) \phantom{.....} i = 1, 2, ..., m$$

As we can see, the additional constraint causes a shift in the action values. This shift is a bias in the action value estimate; however, it doesn't impact the ranking of the actions.
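The aggregation step defined by the formula above can be sketched directly (the value and advantage numbers in the usage example are arbitrary):

```python
def dueling_aggregate(v, advantages):
    """Combine the state value node and the advantage nodes into action values.

    Subtracting the mean enforces the zero-sum constraint on the advantages,
    which pins down a unique (v, advantages) decomposition."""
    mean_a = sum(advantages) / len(advantages)
    return [v + (a - mean_a) for a in advantages]
```

For example, `dueling_aggregate(2.0, [1.0, 3.0, 2.0])` returns `[1.0, 3.0, 2.0]`: the advantages already sum to their mean times `m`, so centering them leaves the ranking of actions unchanged.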

## Prioritized Experience Replay

A naive implementation of a replay buffer is to sample experiences uniformly. However, not all experiences are created equal. Some experiences are more important than others and thus should be preferred in some way during training.

There are two questions here:

• How can we characterize the importance of an experience?
• How do we do the sampling?

The importance of an experience can be characterized by the absolute value of its TD error. We can then sample experiences in proportion to their importance, and correct the resulting bias with a technique called importance sampling.

Conceptually, we have many instances of experience in the replay buffer, and each of them is associated with a value that can characterize the importance of the experience or the information contained in the experience. For example, in the diagram below, we have 4 experience instances in the replay buffer and E3 instance has the most information.

Now let's take a look at what we want and what we don't want. For starters, we don't want to sample instances uniformly because we know E3 is more important and has more value to offer. We obviously want to see E3 more often in the sampling. One way to achieve that is to sample instances based on their importance or value. For example, if we draw samples uniformly, the probability of E3 being selected is 1/4; if we draw samples based on importance, the probability of E3 being selected is 5/9.
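This sampling scheme can be sketched as follows. Since the diagram is not reproduced here, the importance values below are assumptions chosen so that E3's share comes out to 5/9, as in the example above:

```python
import random

# Hypothetical importance values (e.g. absolute TD errors) for E1..E4.
importance = {"E1": 1.0, "E2": 2.0, "E3": 5.0, "E4": 1.0}

def sampling_probabilities(importance):
    """Each experience is drawn with probability proportional to its importance."""
    total = sum(importance.values())
    return {k: v / total for k, v in importance.items()}

def sample_experience(importance):
    """Draw one experience id according to the importance-based distribution."""
    probs = sampling_probabilities(importance)
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

With these values, `sampling_probabilities(importance)["E3"]` is `5/9`, matching the example, while a uniform draw would give every experience probability `1/4`.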

This approach introduces bias to the system for an obvious reason: the distribution of experiences is changed. When the agent learns from the sampled experiences, it will overestimate experiences with relatively higher importance because from the agent's perspective, those experiences happen more frequently than in reality.

To compensate for this bias, when the agent receives a sampled experience instance, instead of consuming all the information contained in the experience, it can consume only a fraction of the information. This is implemented as a weighted target value, where the weight is inversely proportional (in a broad sense) to the importance of the experience.
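The correction can be sketched as importance-sampling weights. The exponent `beta` (which controls how much of the correction is applied, and is typically annealed toward 1 during training) and the normalization by the maximum weight are common choices borrowed from the prioritized-replay idea, not requirements:

```python
def importance_weights(probs, beta=1.0):
    """Down-weight experiences that are over-sampled relative to uniform.

    probs: sampling probabilities, one per experience in the buffer.
    With beta=1, an experience sampled k times more often than uniform
    gets a weight k times smaller, fully compensating the bias."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    max_w = max(weights)
    # Normalize so the weights stay in (0, 1] and only scale updates down.
    return [w / max_w for w in weights]
```

Continuing the example above with probabilities `[1/9, 2/9, 5/9, 1/9]`, E3's weight comes out to `0.2`: its target values are scaled down because it is seen far more often than under uniform sampling.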

Conceptually, we change the content in the replay buffer to the following form and then sample uniformly from the modified replay buffer.

## Policy-gradient and Actor-critic Methods

DQN is a value-based model because we use a neural network to represent the value function. As the agent learns from experiences, it updates the parameters in the neural network, which eventually converges to the real value function. From there, we can build a policy from this neural network.

It turns out we can learn the policy directly. This type of algorithm is called the policy-gradient method. Roughly speaking, there are two steps to implement a policy-gradient method:

• Choose a model of the policy function $$\Pi(s,a; \theta)$$, where $$\theta$$ is the parameter of the function.
• Learn from experiences and update parameters

Two questions remain:

• How do we update parameters?
• How do we know the function will eventually converge to the real policy?

The second question is too theoretical so we will just focus on parameter updates in this post.

Recall that a reinforcement learning problem is an optimization problem: we want to maximize the expected return. Our objective function is

$$\mathbb{E}_{\Pi_{\theta}}[R(s,a)]$$

Following the standard gradient ascent procedure (we are maximizing the objective), we can update the parameters using the formula below:

$$\theta_{t+1} = \theta_{t} + \alpha \nabla_{\theta} \mathbb{E}_{\Pi_{\theta}}[R(s,a)]$$

This formula is not practical though, because we cannot calculate $$\nabla_{\theta} \mathbb{E}_{\Pi_{\theta}}[R(s,a)]$$ directly. However, it can be shown that

$$\nabla_{\theta} \mathbb{E}_{\Pi_{\theta}}[R(s,a)] = \mathbb{E}_{\Pi_{\theta}}[R(s,a) \nabla_{\theta} \log \Pi_{\theta}(s,a)]$$

The right side of the equation is easier to calculate (or estimate). We can sample experiences and take the average of sampled $$R(s,a) \nabla_{\theta} \log \Pi_{\theta}(s,a)$$.

Put everything together, we have the following formula to update the parameters of the policy:

$$\begin{eqnarray} \theta_{t+1} & = & \theta_{t} + \alpha \nabla_{\theta} \mathbb{E}_{\Pi_{\theta}}[R(s,a)] \\ & = & \theta_{t} + \alpha \; \textrm{sampledAvgOf} \left\{ R_{t+1} \nabla_{\theta} \log \Pi_{\theta}(A_t|S_t) \right\} \\ & = & \theta_{t} + \alpha \; \nabla_{\theta} \left( \; \textrm{sampledAvgOf} \left\{ R_{t+1} \log \Pi_{\theta}(A_t|S_t) \right\} \; \right)\\ \end{eqnarray}$$
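For a softmax policy over discrete actions, the score term $$\nabla_{\theta} \log \Pi_{\theta}(A_t|S_t)$$ has a simple closed form: the one-hot vector of the chosen action minus the action probabilities. This gives a minimal REINFORCE-style sketch; treating the logits themselves as the parameters is a deliberately simplistic assumption:

```python
import math

def softmax(logits):
    """Convert logits into a probability distribution over actions."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(logits, action, reward, alpha=0.1):
    """One policy-gradient update on the logits of a softmax policy.

    grad of log pi(action) w.r.t. the logits is onehot(action) - pi,
    so the update raises the chosen action's probability in proportion
    to the observed reward."""
    probs = softmax(logits)
    grad_log_pi = [-p for p in probs]
    grad_log_pi[action] += 1.0
    return [l + alpha * reward * g for l, g in zip(logits, grad_log_pi)]
```

Starting from uniform logits `[0.0, 0.0]`, taking action 0 and receiving reward `1.0` shifts the logits toward action 0, i.e. the sampled action that paid off becomes more likely.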

The policy-gradient method described above is a general framework. We can choose any form to represent the policy. For example, we can use a neural network to represent the policy. The input layer consists of nodes that represent states and the output layer consists of nodes that represent the probabilities of the actions being selected. To train a neural network, we also need to specify a loss function. Recall that what backpropagation does is update the parameters in the neural network using the following formula

$$\theta_{t+1} = \theta_{t} - \alpha \nabla_{\theta} L$$

where $$\alpha$$ is the learning rate and $$L$$ is the loss function being minimized. Therefore, if we use a policy neural network, the loss function should be $$-R_{t+1} \log \textrm{OutputNode}(A_t)$$: minimizing this loss is equivalent to the gradient ascent update above.

### Actor-critic Methods

It can be shown that

$$\mathbb{E}[b(S_t)\nabla_{\theta} \log \Pi_{\theta}(A_t|S_t)] = 0$$

This means we can include a baseline value in the parameter update formula:

$$\theta_{t+1} = \theta_{t} + \alpha (R_{t+1} - b(S_t)) \nabla_{\theta} \log \Pi_{\theta}(A_t|S_t)$$

Actor-critic methods use a neural network to estimate the value of a state, which plays the role of $$b(s)$$ in the above expression. It's called an actor-critic method because the policy selects actions and thus acts as an actor, while the value function evaluates the current policy and thus is considered a critic.
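A single actor-critic update can be sketched as follows. The table-based critic and the softmax-over-logits actor are deliberate simplifications; as described above, a real critic would be a second neural network:

```python
import math

def softmax(logits):
    """Convert logits into a probability distribution over actions."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def actor_critic_step(logits, v_table, state, action, reward, alpha=0.1, alpha_v=0.1):
    """One actor update with a baseline-corrected reward, plus one critic update.

    The critic provides the baseline b(s); the actor is updated with
    (R - b(s)) * grad log pi, and the critic moves toward the observed reward."""
    baseline = v_table.get(state, 0.0)
    advantage = reward - baseline                      # R_{t+1} - b(S_t)
    probs = softmax(logits)
    grad_log_pi = [-p for p in probs]
    grad_log_pi[action] += 1.0
    new_logits = [l + alpha * advantage * g for l, g in zip(logits, grad_log_pi)]
    v_table[state] = baseline + alpha_v * advantage    # critic tracks the return
    return new_logits, v_table
```

Subtracting the baseline does not change the expected gradient (by the identity above), but it reduces the variance of the sampled updates, which is the practical motivation for the critic.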
