Reinforcement Learning - Glossary



Reward

Reward refers to the one-step reward signal the agent gets: the agent observes a state, selects an action, and receives a reward signal. The reward signal is the core of RL, but it is not what the agent is trying to maximize! Again, the agent isn't trying to maximize the reward! If your agent maximizes only the one-step reward, then in the long term it may be getting less than it could.
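To make this concrete, here's a tiny hand-rolled example (the setup is illustrative, not from the text): from the start state, one action pays +1 and ends the episode, while the other pays 0 now but leads to a +10 reward on the next step.

```python
# Toy setup: maximizing the one-step reward is not the same as
# maximizing long-term reward. Action "a" pays +1 and terminates;
# action "b" pays 0 now but reaches a +10 reward next step.

def episode_return(first_action):
    """Total (undiscounted) reward for each choice in this toy setup."""
    if first_action == "a":
        return 1          # +1, then terminal
    return 0 + 10         # +0 now, +10 on the next step, then terminal

greedy_one_step = episode_return("a")   # picks the larger immediate reward
far_sighted = episode_return("b")
assert far_sighted > greedy_one_step    # long-term beats one-step greed
```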


Return

Return refers to the total discounted rewards. Returns are calculated from any state and usually go until the end of the episode. That is, when a terminal state is reached, the calculation stops.

Returns are often referred to as total reward, cumulative reward, or sum of rewards, and are commonly discounted: total discounted reward, cumulative discounted reward, sum of discounted rewards. But it's basically the same: a return tells you how much reward your agent obtained in an episode. As you can see, returns are better indicators of performance because they contain a long-term sequence, a single-episode history of rewards. But the return isn't what an agent tries to maximize, either! An agent that attempts to obtain the highest possible return may find a policy that takes it through a noisy path; sometimes this path will provide a high return, but perhaps most of the time a low one.
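Computing a discounted return from a list of one-step rewards can be sketched as follows (the function name and defaults are illustrative):

```python
# Minimal sketch of a discounted return:
# G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one episode.

def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# With gamma=1.0 the return is just the plain sum of rewards.
assert discounted_return([1, 1, 1], gamma=1.0) == 3.0
```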

Value Function

Value function refers to the expectation of returns. Sure, we want high returns, but high in expectation. That way, if the agent is in a noisy environment, or if the agent is using a stochastic policy, the randomness is accounted for. The agent is trying to maximize the expected total discounted reward, after all: value functions.
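Because a value is an expectation, a simple way to estimate it is to average many sampled returns. A hedged sketch (the noisy return source here is made up for illustration):

```python
import random

# The value of a state is the *expected* return, so averaging many
# sampled returns from that state converges toward it.

def estimate_value(sample_return, n=10_000, seed=0):
    """Average n sampled returns; converges to the expectation."""
    rng = random.Random(seed)
    return sum(sample_return(rng) for _ in range(n)) / n

# A noisy source: the return is 10 or 0 with equal probability,
# so the expected return (the value) is 5.
noisy = lambda rng: 10.0 if rng.random() < 0.5 else 0.0
v = estimate_value(noisy)
assert abs(v - 5.0) < 0.5
```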

Incremental Methods

Incremental methods refer to the iterative improvement of the estimates. Dynamic programming is an incremental method: these algorithms iteratively compute the answers. They don't "interact" with an environment, but they reach the answers through successive iterations, incrementally.

Reinforcement learning is incremental. Depending on the specific algorithm, estimates are improved on an either per-episode or per-time-step basis, incrementally.
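The incremental pattern above can be sketched as a running-average update, applied after each episode or time step (names are illustrative):

```python
# Incremental update pattern common across RL methods:
# new_estimate = old_estimate + step_size * (target - old_estimate)

def incremental_update(estimate, target, alpha=0.1):
    return estimate + alpha * (target - estimate)

v = 0.0
for g in [10.0, 10.0, 10.0, 10.0]:   # repeatedly observe a return of 10
    v = incremental_update(v, g)
assert 0.0 < v < 10.0                # moves toward 10, incrementally
```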

Sequential Methods

Sequential methods refer to learning in an environment with more than one non-terminal (and reachable) state. Dynamic programming is a sequential method.

Trial-and-error Methods

Trial-and-error methods refer to learning from interaction with the environment. Dynamic programming is not trial-and-error learning.

True Value Function

True value function refers to the exact and perfectly accurate value function, as if given by an oracle. The true value function is the value function agents estimate through samples. If we had the true value function, we could easily estimate returns.

Actual Return

Actual return refers to the full experienced return, as opposed to an estimated return. Agents can only experience actual returns, but they can use estimated value functions to estimate returns.

Estimated Value Function or Estimated Return

It refers to the rough calculation of the true value function or actual return. "Estimated" means an approximation, a guess. True value functions let you estimate returns, and estimated value functions add bias to those estimates.
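One common way to estimate a return with an estimated value function is a one-step target: the observed reward plus the discounted value estimate of the next state. A hedged sketch (the numbers are illustrative) showing how an inaccurate estimate leaks in as bias:

```python
# Estimating a return with an estimated value function: any error
# in the next-state estimate biases the estimated return.

def one_step_estimated_return(reward, next_value_estimate, gamma=0.99):
    return reward + gamma * next_value_estimate

true_next_value = 10.0
biased_estimate = 7.0                 # an inaccurate value estimate
actual = one_step_estimated_return(1.0, true_next_value, gamma=1.0)
estimated = one_step_estimated_return(1.0, biased_estimate, gamma=1.0)
assert estimated != actual            # the estimate inherits the bias
```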

Planning Problems

Planning problems refer to problems in which a model of the environment is available and thus, there's no learning required. These types of problems can be solved with planning methods such as value iteration and policy iteration. The goal in these types of problems is to find, as opposed to learn, optimal policies.
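A minimal value-iteration sketch for this planning setting, where the model is known in advance. The tiny deterministic MDP below is made up for illustration:

```python
# Value iteration over a known model. model[state][action] = (reward, next_state).
# State 1 absorbs with zero reward.
model = {
    0: {"left": (0.0, 0), "right": (1.0, 1)},
    1: {"left": (0.0, 1), "right": (0.0, 1)},
}
gamma = 0.9
V = {0: 0.0, 1: 0.0}

for _ in range(100):                  # sweep until (practically) converged
    for s in model:
        V[s] = max(r + gamma * V[s2] for r, s2 in model[s].values())

# The found (not learned) greedy policy in state 0 is "right".
policy = {s: max(model[s], key=lambda a: model[s][a][0] + gamma * V[model[s][a][1]])
          for s in model}
assert policy[0] == "right"
```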

Learning Problems

Learning problems refer to problems in which learning from samples is required, usually because there isn't a model of the environment available or perhaps because it's impossible to create one. The main challenge of learning problems is that we estimate using samples, and samples can have high variance, which means they'll be of poor quality and difficult to learn from. Samples can also be biased, either because of being from a different distribution than the one estimating or because of using estimates to estimate, which can make our estimates incorrect altogether.

Non-interactive Learning Problems

Non-interactive learning problems refer to a type of learning problem in which there is no need for or possibility of interacting with an environment. In these types of problems, there's no interaction with an environment while learning, but there is learning from data previously generated.

The objective is to find something given the samples, usually a policy, but not necessarily. For instance, in inverse RL, the objective is to recover the reward function given expert-behavior samples. In apprenticeship learning, the objective is to go from this recovered reward function to a policy. In behavioral cloning, which is a form of imitation learning, the goal is to go from expert-behavior samples directly to policies using supervised learning.
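Behavioral cloning, at its simplest, can be sketched as supervised learning over expert (state, action) samples. Here the "learner" is just a majority vote per state; a real system would fit a classifier instead, and the demo data is made up:

```python
from collections import Counter, defaultdict

# Behavioral cloning sketch: go from expert (state, action) samples
# directly to a policy, via a trivial per-state majority vote.

expert_samples = [(0, "right"), (0, "right"), (0, "left"),
                  (1, "left"), (1, "left")]

def clone_policy(samples):
    votes = defaultdict(Counter)
    for state, action in samples:
        votes[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}

policy = clone_policy(expert_samples)
assert policy == {0: "right", 1: "left"}
```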

Note: non-interactive learning seems to leverage "replay" to learn.

Interactive Learning Problems

Interactive learning problems refer to a type of learning problem in which learning and interaction are interleaved. The interesting aspect of these problems is that the learner also controls the data-gathering process. Optimal learning from samples is one challenge, and finding samples for optimal learning is another.

Greedy Policy

Greedy policy refers to a policy that always selects the actions believed to yield the highest expected return from each and every state. It's essential to know that a "greedy policy" is greedy with respect to a value function. The "believed" part comes from the value function. The insight here is that when someone says, "the greedy policy", you must ask, greedy with respect to what? A greedy policy with respect to a random value function is a pretty bad policy.
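The "greedy with respect to what?" point can be shown in a few lines (the Q-tables are illustrative): the same greedy rule yields different policies under different value functions.

```python
# A greedy policy is always greedy *with respect to* some action-value
# function Q. Change Q and the "greedy" policy changes with it.

def greedy_action(Q, state):
    return max(Q[state], key=Q[state].get)

Q_good = {0: {"left": 0.2, "right": 0.9}}
Q_random = {0: {"left": 0.7, "right": 0.1}}   # a poor/random estimate

assert greedy_action(Q_good, 0) == "right"
assert greedy_action(Q_random, 0) == "left"   # same rule, different belief
```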

Epsilon-greedy Policy

It refers to a policy that selects, most of the time, the actions believed to yield the highest expected return from each and every state, and a random action otherwise. The same as before applies: an epsilon-greedy policy is epsilon-greedy with respect to a specific value function. Always make sure you understand which value function is being referenced.
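A minimal sketch of epsilon-greedy action selection (names and values are illustrative):

```python
import random

# With probability epsilon pick a uniformly random action,
# otherwise pick the action that is greedy with respect to Q.

def epsilon_greedy(Q_state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(list(Q_state))
    return max(Q_state, key=Q_state.get)

rng = random.Random(0)
Q_state = {"left": 0.1, "right": 0.9}
picks = [epsilon_greedy(Q_state, 0.1, rng) for _ in range(1000)]
assert picks.count("right") > picks.count("left")   # mostly greedy
```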

Optimal Policy

Optimal policy refers to a policy that always selects the actions actually yielding the highest expected return from each and every state. While a greedy policy may or may not be an optimal policy, an optimal policy must undoubtedly be a greedy policy. An optimal policy is a greedy policy with respect to a unique value function, the optimal value function.

Generalized Policy Iteration (GPI)

GPI is a general idea that the continuous interaction of policy evaluation and policy improvement drives policies towards optimality.

Prediction Problem

It refers to the problem of evaluating policies, of estimating value functions given a policy. Estimating value functions is nothing but learning to predict returns. State-value functions estimate expected returns from states, and action-value functions estimate expected returns from state-action pairs.

Control Problem

It refers to the problem of finding optimal policies. The control problem is usually solved by following the pattern of generalized policy iteration (GPI), where the competing processes of policy evaluation and policy improvement progressively move policies towards optimality. Reinforcement learning methods often pair an action-value prediction method with policy improvement and action-selection strategies.

Policy Evaluation

Policy evaluation refers to algorithms that solve the prediction problem. Note that there's a dynamic programming method called policy evaluation, but this term is also used to refer to all algorithms that solve the prediction problem.

Policy Improvement

Policy improvement refers to algorithms that make new policies that improve on an original policy by making it greedier than the original with respect to the value function of that original policy. Note that policy improvement by itself doesn't solve the control problem. Often a policy evaluation must be paired with a policy improvement to solve the control problem. Policy improvement only refers to the computation for improving a policy given its evaluation results.

(Comment: To be honest, this paragraph is confusing. How can we make a policy greedier if the value function is the same? The key is that the value function describes the original policy: acting greedily with respect to it generally produces a different, better policy. In GPI, the value-function estimate is then updated for the improved policy before the next improvement step.)
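One policy-improvement step can be sketched as follows: the value function V of the old policy is held fixed, and the new policy simply acts greedily with respect to it. The tiny deterministic model here is made up for illustration:

```python
# One policy-improvement step over a known model.
# mdp[state][action] = (reward, next_state); state 1 absorbs.
mdp = {0: {"left": (0.0, 1), "right": (1.0, 1)},
       1: {"left": (0.0, 1), "right": (0.0, 1)}}
gamma = 0.9
V_old = {0: 0.0, 1: 0.0}            # evaluation result for the old policy

def improve(V):
    return {s: max(mdp[s], key=lambda a: mdp[s][a][0] + gamma * V[mdp[s][a][1]])
            for s in mdp}

new_policy = improve(V_old)
assert new_policy[0] == "right"     # greedier than, say, an old "left" policy
```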

Batch Learning Problems and Methods

When you hear the term "batch learning", people are usually referring to one of two things: batch learning problems, in which learning happens from a fixed batch of previously collected samples, or batch learning methods, which process a whole batch of samples at once when updating their estimates.

Batch learning methods are typically studied with non-interactive learning problems, more specifically, batch learning problems. But batch learning methods can also be applied to interactive learning problems. For instance, growing batch methods are batch learning methods that also collect data: they "grow" the batch. Also, batch learning problems don't have to be solved with batch learning methods, the same way that batch learning methods aren't designed exclusively to solve batch learning problems.

Offline Learning Problems and Methods

When you hear the term "offline learning", people are usually referring to one of two things: offline learning problems, in which there's no interaction with the environment while learning, or offline learning methods, which optimize performance only after samples have been collected.

Note that, in offline learning methods, learning and interaction can still be interleaved, but performance is only optimized after samples have been collected, similar to the growing batch described previously, but with the difference that, unlike growing batch methods, offline methods commonly discard old samples; they don't grow a batch. Monte Carlo methods, for instance, are often considered offline because learning and interaction are interleaved on an episode-to-episode basis. There are two distinct phases, interacting and learning; the Monte Carlo method is interactive, but it is also an offline learning method.

Online Learning Problems and Methods

When you hear the term "online learning", people are usually referring to one of two things: online learning problems, in which samples are gathered through live interaction with the environment, or online learning methods, which update their estimates on every time step, as samples arrive.

On-policy Learning

On-policy learning refers to methods that attempt to evaluate or improve the policy used to make decisions. It is straightforward; think about a single policy. This policy generates behavior. Your agent evaluates that behavior and selects areas of improvement based on those estimates. Your agent learns to assess and improve the same policy it uses for generating the data.

Off-policy Learning

Off-policy learning refers to methods that attempt to evaluate or improve a policy different from the one used to generate the data. This one is more complex. Think about two policies. One produces the data, the experiences, the behavior; but your agent uses that data to evaluate, improve, and overall learn about a different policy, a different behavior. Your agent learns to assess and improve a policy different than the one used for generating the data.
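The on-policy versus off-policy distinction shows up clearly in TD targets. A hedged sketch contrasting SARSA (on-policy, bootstraps from the action actually taken) with Q-learning (off-policy, bootstraps from the greedy action), with made-up values:

```python
# SARSA target uses the action the behavior policy actually took;
# Q-learning target uses the greedy (target-policy) action instead.

gamma = 1.0
Q_next = {"left": 0.2, "right": 0.9}
reward = 1.0
action_taken = "left"               # e.g., an exploratory action

sarsa_target = reward + gamma * Q_next[action_taken]        # on-policy
q_learning_target = reward + gamma * max(Q_next.values())   # off-policy
assert q_learning_target > sarsa_target
```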


Planning

Planning refers to algorithms that require a model of the environment to produce a policy. Planning methods can be of the state-space planning type, which means they use the state space to find a policy, or they can be of the plan-space planning type, meaning they search in the space of all possible plans (think about genetic algorithms).

Model-free Reinforcement Learning

It refers to algorithms that don't use models of the environment but are still able to produce a policy. The unique characteristic here is that these methods obtain policies without the use of a map, a model, or an MDP. Instead, they use trial-and-error learning to obtain policies. Several examples of model-free RL algorithms are Monte Carlo (MC) methods, SARSA, and Q-learning.

Model-based Reinforcement Learning

It refers to algorithms that can learn, but don't require, a model of the environment to produce a policy. The distinction is that they don't require models in advance, but they can certainly make good use of them if available, and, more importantly, they attempt to learn the models through interaction with the environment. Several examples of model-based RL algorithms are Dyna-Q and trajectory sampling.

----- END -----