# Probabilistic Robotic Reading Note - Partially Observable Markov Decision Processes

## Introduction

In this post, we provide a step-by-step explanation of the Partially Observable Markov Decision Processes algorithm. We only cover the discrete case, meaning that the dimension of the belief space, the number of possible sensor data and the control are assumed to be all finite.

Recall that the objective of POMDP algorithm is to compute a value function that evaluates belief in horizon $$T$$. Like other value iteration algorithm, POMDP is also recursive. However, for simplicity purposes we will not cover the recursive aspect of the algorithm in this post.

## Explanation of Main Steps

### 1. Reward function of control with given state $$r(x,u)$$

The building block or the starting point of POMDP is the reward function. It's an evaluation of applying control $$u$$ when the robot is in state $$x$$.

### 2. Representation of belief.

As we only discuss the discrete case of POMDP in this post, we assume the dimension of the belief space is finite. Let $$N$$ denote the dimension of the belief space, then a belief can be characterized as avector of length $$N$$:

$$b = (p_1, p_2, ..., p_N)$$

Note that $$N$$ is also the nubmer of states of the robot.

### 3. Reward function of control with given belief $$r(b,u)$$

We mentioned earlier in section 1, the reward function $$r(x,u)$$ is an evaluation of applying control $$u$$ when the robot is in state $$x$$. Similarly, we can also define a reward function that evaluates the control $$u$$ when the belief of the robot is $$b$$. The state of the robot is not deterministic, therefore it is not surprising to see the reward function of the belief is a sort of expectation. Mathematically, we can write

$$r(b,u) = E_x[r(x,u)] = \sum_{i=1}^{N}r(x_i, u)p_i$$

In order to highlight the fact this is a function of $$(p_1, p_2, ..., p_N)$$, we can write the function argument explicitly.

$$\mathbf{r(b,u)}(p_1, p_2, ..., p_N) = \sum_{i=1}^{N}r(x_i, u)p_i$$

Note that the reward function is linear with respect to its arguments.

### 4. Value function $$V(b)$$

Value function is used to evalute a given belief. Intuitively, the value of a blief should be the maximum reward we can get when we hold that specific belief. Mathematically, we define the value function as

$$V(b) = \max_u r(b,u)$$

Again, the arguments of the value function is $$(p_1, p_2, ..., p_N)$$. By making the arguments explicit, we get

$$\mathbf{V(b)}(p_1, p_2, ..., p_N) = \max_u \left\{\mathbf{r(b,u)}(p_1, p_2, ..., p_N)\right\}$$

As shown in the book, the value function of belief is piecewise linear.

### 5. Effect of sensing

When the robot receives sensor data, it can update its belief. The new belief contains more information, thus is more accurate. To some extent, incorperating sensor data is a sort of value remapping. Again, we are dealing with discrete case so we assume there is only $$M$$ possible sensor results: $${z_1, z_2, ..., z_M}$$. Note that when we calcuate the value function, the robot does not actually measure the physical world and the value function should incorporate all the possible scenarios. Once again, we are calcuating an expected value here:

$$V(b|z) = \sum_{i=1}^{M}V(b|z_i)p(z_i)$$

Now let's consider the effect of receiving a sensor data $$z_k$$. As mentioned earlier, it's a kind of value remapping. Let $$b'=(p_1', p_2', ..., p_N')$$ denote the new belief after the robot receives the sensor data $$z_k$$. We have

$$\mathbf{V(b|z_k)}(p_1, p_2, ..., p_N) = \mathbf{V(b)}(p_1', p_2', ..., p_N')$$

Where

$$\begin{eqnarray} p_i' & = & p(x_i|z_k) \\ & = & \frac{1}{p(z_k)}p(z_k|x_i)p(x_i) \end{eqnarray}$$

This gives us

$$\begin{eqnarray} \mathbf{V(b|z_k)}(p_1, p_2, ..., p_N) & = & \mathbf{V(b)}(p_1', p_2', ..., p_N') \\ & = & \mathbf{V(b)}(\frac{1}{p(z_k)}p(z_k|x_1)p(x_1), \frac{1}{p(z_k)}p(z_k|x_2)p(x_2), ..., \frac{1}{p(z_k)}p(z_k|x_N)p(x_N)) \\ & = & \max_u \left\{\mathbf{r(b,u)}(\frac{1}{p(z_k)}p(z_k|x_1)p(x_1), \frac{1}{p(z_k)}p(z_k|x_2)p(x_2), ..., \frac{1}{p(z_k)}p(z_k|x_N)p(x_N))\right\} \\ & = & \frac{1}{p(z_k)} \max_u \left\{\mathbf{r(b,u)}(p(z_k|x_1)p(x_1), p(z_k|x_2)p(x_2), ..., p(z_k|x_N)p(x_N))\right\} \\ & = & \frac{1}{p(z_k)} \max_u \left\{\mathbf{r'(b,u)}(p(x_1), p(x_2), ..., p(x_N))\right\} \end{eqnarray}$$

where

$$\mathbf{r'(b,u)}(p(x_1), p(x_2), ..., p(x_N)) = \mathbf{r(b,u)}(p(z_k|x_1)p(x_1), p(z_k|x_2)p(x_2), ..., p(z_k|x_N)p(x_N))$$

The last equality holds because the reward function $$r(b,u)$$ is linear with respect to $$(p_1, p_2, ..., p_N)$$. It also indicates that the effect of sensing is an update in the reward function.

$$\begin{eqnarray} \mathbf{V(b|z)}(p_1, p_2, ..., p_N) & = & \sum_{i=1}^{M}V(b|z_i)p(z_i) \\ & = & \sum_{i=1}^{M}\left(\frac{1}{p(z_k)} \max_u \left\{\mathbf{r(b,u)}(p(z_k|x_1)p(x_1), p(z_k|x_2)p(x_2), ..., p(z_k|x_N)p(x_N))\right\}\right)p(z_i) \\ & = & \sum_{i=1}^{M} \max_u \left\{\mathbf{r(b,u)}(p(z_k|x_1)p(x_1), p(z_k|x_2)p(x_2), ..., p(z_k|x_N)p(x_N))\right\} \end{eqnarray}$$

An naive implemenation of calculating sum of max leads to an exponential complexity. To understand why, suppose we have four functions $$f_1, f_2, g_1, g_2$$, then we have

$$max \left\{ f_1, f_2 \right\} + max \left\{g_1, g_2\right\} = max\left\{ f_1 + g_1, f_1 + g_2, f_2 + g_1, f_2 + g_2 \right\}$$

Here is a brief summary of this section. Sensing has three major impact on the calculation of the value function

1. It updates the reward function $$r(x,u)$$.
2. It introduces a sum operation in the calculation of the value function.
3. It may introduce an exponential complexity in the calculation.

### 6. Effect of control

If the value function $$V(b)$$ is given, then the policy $$\pi(b)$$ can be calcuated in a deterministic way. What is probabilistic is the state of the robot after we apply the control $$\pi(b)$$. In a similar way to sensing, this probabilistic effect is equivalent to shifting belief in the belief space and updating the reward function.

Mathematically, the effect of control is given as

$$\mathbf{ V(b|u) }(p_1, p_2, ..., p_N) = \mathbf{ V(b) } ( p(x_1|u), p(x_2|u), ..., p(x_N|u) ) + \mathbf{ r(b, u) } (p_1, p_2, ..., p_N)$$

Now, let's calacuate $$p(x_k, u)$$

$$\begin{eqnarray} p(x_k|u) = p(\textrm{next state} = x_k | u) & = & \sum_{i=1}^{N} p( \textrm{next state} = x_k, \textrm{current state} = x_i | u) \\ & = & \sum_{i=1}^{N} p( \textrm{next state} = x_k | \textrm{current state} = x_i, u) \cdot p( \textrm{current state} = x_i | u)\\ & = & \sum_{i=1}^{N} p(x_k|u, x_i)p_i \end{eqnarray}$$

Put them together we get the following equation (Eq:1)

$$\begin{eqnarray} \mathbf{ V(b|u) }(p_1, p_2, ..., p_N) & = & \mathbf{ V(b) } ( \sum_{i=1}^{N} p(x_1|u, x_i)p_i, \sum_{i=1}^{N} p(x_2|u, x_i)p_i, ..., \sum_{i=1}^{N} p(x_N|u, x_i)p_i ) + \sum_{i=1}^{N}r(x_i, u)p_i\\ & = & \max_u \left\{\mathbf{r(b,u)} ( \sum_{i=1}^{N} p(x_1|u, x_i)p_i, \sum_{i=1}^{N} p(x_2|u, x_i)p_i, ..., \sum_{i=1}^{N} p(x_N|u, x_i)p_i ) \right\} + \sum_{i=1}^{N}r(x_i, u)p_i \end{eqnarray}$$

Recall that the reward function is a linear function with respect to the belief.

$$\mathbf{ r(b,u) } (p_1, p_2, ..., p_N) = \sum_{i=1}^{N} v_i p_i$$

It follows that

$$\begin{eqnarray} \mathbf{r(b,u)} ( \sum_{i=1}^{N} p(x_1|u, x_i)p_i, \sum_{i=1}^{N} p(x_2|u, x_i)p_i, ..., \sum_{i=1}^{N} p(x_N|u, x_i)p_i ) & = & \sum_{j=1}^{N} v_j \sum_{i=1}^{N} p(x_j|u, x_i)p_i \\ & = & \sum_{i=1}^{N} \left( \sum_{j=1}^{N} v_j p(x_j|u, x_i) \right) p_i \end{eqnarray}$$

Note that we can move the item outisde the max operator in to the bracket in equation (Eq:1) and we get

$$\mathbf{ V(b|u) }(p_1, p_2, ..., p_N) = \max_u \left\{ \sum_{i=1}^{N} \left( r(x_i, u) + \sum_{j=1}^{N} v_j p(x_j|u, x_i) \right) p_i \right\}$$

This expression is can be viewed as an update of the reward function.

The effect of control is similar to the effect of sensing. Both of them update the rewrad function. However, there are some differences. First, there is no exponential complexity involved in the control calculation. Second, the source of the sum operator in two cases are different. In sensing case, the sum comes from aggregating all possible sensing outcomes while in the control case, the sum comes from aggregating all possible states of the robot.

### 7. POMDP Algorithm ----- END -----  