# Tabular Reinforcement Learning

## Introduction

In this post, we discuss some classic algorithms in tabular reinforcement learning. Before we present the details, we want to highlight assumptions in tabular RL problems.

• There is no model of the environment. This means the MDP is not accessible to the agent. (If the MDP is available, then the learning problem becomes a planning problem.)
• The state and action spaces are discrete and finite.
• The feedback is exhaustive. In other words, we assume all state-action pairs must be explored infinitely often.
• The policy must converge on a greedy policy. This typically means we should use a decayed epsilon greedy strategy.

The above assumptions, except the first one, are called GLIE (Greedy in the limit with infinite exploration) and are used to guarantee the convergence of the calculated policies.

## Problem Formulation

Now let's take a look at what we have:

• An agent
• An environment

That's it.

The objective of reinforcement learning is to train the agent so that it can perform certain tasks in the environment. The interactions between the agent and the environment are

• actions performed by the agent, which have impacts on the environment.
• feedback provided by the environment as responses to the agent's action.

What's the input of the agent/RL system?

Agents receive two types of inputs:

• Measurement data of the environment. For example, distance to an obstacle.
• Reward provided by the environment.

Technically, there is no reward concept in an environment. Rewards are evaluative. They are measures of the goodness of the agent's actions. Of course, there must be an evaluator because someone needs to provide judgments and the evaluator is the designer of the agent. To some extent, the reward is a bridge between the problem and the solution. It's part of the problem because it captures the "interesting part" of the problem we want to solve; it's part of the solution because it's part of the inputs fed to the reinforcement learning algorithm.

What's the output of the agent/RL system? The output of the system is a policy; ideally, it's the optimal policy. Recall that a policy tells the agent which action it should choose under a given state and this is all we need to control the agent.

A policy is backed up by value functions. We have state value function and action value function. Because the agent does not have access to MDP, state value function alone doesn't allow the agent to construct a policy. Therefore, we use action value function $$Q$$ in most RL implementations. Given a Q-function, it's easy to construct a greedy policy or an epsilon greedy policy.

The remaining question is how we can build a Q-function.

## Generalized Policy Iteration

GPI is a framework that uses an iterative approach to compute the optimal policy. Each iteration consists of two steps:

• policy evaluation
• policy improvement

In the policy evaluation step, the objective is to estimate the expected return of states under the current (behavior) policy. In the policy improvement step, as the name suggests, the policy is improved. These two steps are repeated until the policy converges to the optimal policy.

### Policy Evaluation

The objective is to estimate the expected return of states under the current (behavior) policy. When we want to estimate the value of a variable, there aren't many things we can do. Either we build a model of the variable, which falls into the supervised learning category, or we can estimate the value through sampling.

In tabular RL, we use the idea of sampling to estimate the value function. We will see in the other post that deep RL uses an approach more like supervised learning.

#### Monte Carlo Method

Recall the definition of return:

$$G_{t:T} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{T-t+1}R_{T}$$

which corresponds to a return at time step $$t$$ in an episode.

The Monte Carlo method is straightforward. Let the agent interact with the environment and collect sampling data. Each trajectory (which ends at a terminal state) represents an episode. In each episode, we visit many state-action pairs and we know the returns collected by those pairs because we have all the information for the calculation.

We may visit the same state-action pair multiple times in an episode. If we use the return from the first visit only, we get the First Visite Monte Carlo method; if we use the returns from every visit, we get the Every Visit Monte Carlo method.

To calculate $$Q(s,a)$$, we just take the average of returns from different episodes.

One of the drawbacks of MC method is that we need to wait until the end of an episode before we can update the Q-function.

#### Temporal Difference Method

The TD method uses $$R_{t+1} + \gamma V(S_{t+1})$$ or $$R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$$ to approximate $$G_{t:T}$$.

Note that this is an estimate based on another estimate, therefore, TD method is a bootstrapping method.

Unlike MC method, TD method doesn't need to wait until the end of an episode. In fact, there is no wait at all. Once the agent performs action $$a_{t}$$, the environment provides an experience that includes the next state $$s_{t+1}$$ and the reward $$R_{t+1}$$. Agent can use this information immediately to update the Q-function.

As we can see, MC method and TD method are two ends of a spectrum of algorithms. MC samples the whole episode and uses all the rewards while TD method only uses the first reward. Obviously, there are other similar algorithms that sit in between, which use $$N$$ rewards in an episode where $$N$$ ranges from 2 to $$\infty$$.

If we take a weighted average of all these approximate returns, we obtain the $$TD(\lambda)$$ method:

Note that $$TD(\lambda)$$ contains MC so we also need to wait until the end of an episode before updating the Q-function in a naive implementation. There is a technique called Eligibility Trace that mitigates this problem.

#### Model-Based Method

MC and TD are model-free methods because during the evaluation, the agent doesn't need a model of the environment and the agent collects the sampled data directly by interacting with the environment.

In a model-based method, the agent could benefit from having a model of the environment. Of course, the true model (i.e. MDP) of the environment is not accessible. Otherwise, we don't need to learn anything anymore. What the agent can do is to build a model of the environment by estimating $$p(s', r|s, a)$$. For example, a simple approach is to use count-based estimate of probability.

Once we build a model of the environment, the agent could simulate an experience instead of interacting with the environment. There are different ways to simulate experiences. We can simulate experiences independently or simulate a trajectory. The former approach gives Dyna-Q algorithm and the latter approach is called trajectory sampling.

### Policy Improvement

Policy is backed up by values function. Most of the time, the value function is the Q-function, i.e. action value function (because MDP is not available and V-function is not enough to construct a policy). Improving a policy means updating the Q-function.

Most policy improvement methods in RL take the general incremental form:

$$Q(s,a) = Q(s,a) + \alpha (\textrm{TargetValue} - Q(s,a))$$

where $$\alpha$$ is the learning rate.

Two classic RL algorithms are

$$\begin{eqnarray} \textrm{SARSA (on policy):} & \hspace{1cm} & \textrm{TargetValue} = R_{t+1} + \gamma Q(s,a) \\ \textrm{Q-learning (off policy):} & \hspace{1cm} & \textrm{TargetValue} = R_{t+1} + \gamma \max_{a} Q(s,a) \end{eqnarray}$$

In any RL algorithm, there are two policies involved:

• behavior policy: this is the policy used to generate sampling data. It's used to select an action that the agent performs to interact with the environment. Most of the time, this policy is a decayed epsilon green policy.
• target policy: this is the policy the agent wants to learn and it's the policy used when the agent updates the Q-function.

If the behavior policy and the target policy are the same, then the RL algorithm is on-policy; if they are different, the algorithm is off-policy.

Q-learning is off-policy because the behavior policy is an epsilon greedy policy while the target policy is a greedy policy.

One of the problems of Q-learning is that it tends to overestimate the value. The reason is that we always use the maximum value in the Q-function update process. An example given in the book is the following. Suppose the true value of all actions is zero. The estimated value of these actions might be positive or negative. For example, if we have 5 actions, we may have an estimate like [-1, 2, 3, -2, 1]. According to Q-learning, whenever there is a positive number in the action estimate array, the target value is positive. In other words, the negative numbers are ignored most of the time. That's why the final estimate of the action value is easily overestimated. There is an intrinsic bias in Q-learning.

(Quote from the book) One way to mitigate the issue is to use two Q-functions. At each time step, we choose one of them to determine the action (i.e. to determine which estimate is the highest according to that Q-function). Then we use the other Q-function to obtain that action's estimate. By doing this, there's a lower chance of always having a positive bias error. Then, to select an action for interacting with the environment, we use the average, or the sum, across the two Q-functions for that state.

## Summary

In this post, we take a look at tabular reinforcement learning algorithms. The general framework is the Generalized Policy Iteration. Each iteration consists of two steps: (1) policy evaluation and (2) policy improvement.

The objective of policy evaluation step is to estimate the expected return of states. The basic idea is Monte Carlo method: the agent interacts with the environment and collects data in multiple episodes. With all these data, the agent is able to estimate the expected return. However, MC requires that the agent needs to wait until the end of an episode before updating the value function and it's inconvenient. Temporal Difference method overcomes this issue by using incremental updates which use existing estimates to approximate the sampled return. TD is a bootstraping method because an estiamte is used to create another estimate. MC and TD are two ends of a spectrum of algorithms, by taking a weighted average, we get the $$TD(\lambda)$$ method.

In the policy improvement step, agents need to improve the policy by updating the Q-function. The update takes the general form $$Q(s,a) = Q(s,a) + \alpha (\textrm{TargetValue} - Q(s,a))$$. Depending on the target value, we can get SARSA and Q-learning algorithm. Combined with $$TD(\lambda)$$ method in the policy evaluation step and using the Eligibility Trace technique, we get SARSA($$\lambda$$) and Q($$\lambda$$) algorithm. The Q-learning algorithm often overestimates the expected value and one way to mitigate the issue is to use two Q-functions. This leads to the double Q-learning algorithm.

Although a model of the environment is not available, agents can build a model from experiences. Once this approximate model is available, agents can use it to simulate experiences. Depending on how agents simulate the experience, we get Dyna-Q algorithm or trajectory-sampling algorithm.

----- END -----

Welcome to join reddit self-learning community.

Want some fun stuff?