CS285: Lecture 4

Lecture 4: Introduction to Reinforcement Learning

Lecture Slides & Videos


Definitions


Markov chain

$$ \mathcal{M} = \{\mathcal{S}, \mathcal{T}\} $$

  • $\mathcal{S}$ — state space

    • state $\mathbf{s} \in \mathcal{S}$ ( discrete or continuous )
  • $\mathcal{T}$ — transition operator

    • $p(s_{t+1}|s_t)$

    • let $\mu_{t,i}=p(s_{t}=i)$, then $\vec{\mu}_t$ is a vector of probabilities

      let $\mathcal{T}_{i,j}=p(s_{t+1}=i|s_t=j)$, then $\vec{\mu}_{t+1}=\mathcal{T}\vec{\mu}_t$
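
A minimal numerical sketch of the transition operator acting on a distribution over states ( the transition matrix and initial distribution below are made-up illustrative values ):

```python
import numpy as np

# Hypothetical 3-state Markov chain: T[i, j] = p(s_{t+1} = i | s_t = j).
# Each column sums to 1, since the current state j must transition somewhere.
T = np.array([[0.9, 0.2, 0.0],
              [0.1, 0.7, 0.3],
              [0.0, 0.1, 0.7]])

mu = np.array([1.0, 0.0, 0.0])  # initial state distribution: start in state 0

# Repeatedly apply mu_{t+1} = T mu_t.
for _ in range(50):
    mu = T @ mu

print(mu)  # approaches the chain's stationary distribution
```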

Markov decision process

$$ \mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, r\} $$

  • $\mathcal{S}$ — state space
    • state $\mathbf{s} \in \mathcal{S}$ ( discrete or continuous )
  • $\mathcal{A}$ — action space
    • action $\mathbf{a} \in \mathcal{A}$ ( discrete or continuous )
  • $\mathcal{T}$ — transition operator ( now a tensor )
    • let $\mu_{t,i}=p(s_{t}=i)$
    • let $\xi_{t,k}=p(a_t=k)$
    • let $\mathcal{T}_{i,j,k}=p(s_{t+1}=i|s_t=j,a_t=k)$
    • then $\mu_{t+1,i}=\sum_{j,k}\mathcal{T}_{i,j,k}\mu_{t,j}\xi_{t,k}$ ( see the sketch after this list )
  • $r$ — reward function
    • $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$
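
A minimal sketch of the tensor form of the transition update, $\mu_{t+1,i}=\sum_{j,k}\mathcal{T}_{i,j,k}\mu_{t,j}\xi_{t,k}$, with made-up sizes and random distributions ( all numbers are illustrative, not from the lecture ):

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions = 4, 2

# T[i, j, k] = p(s_{t+1} = i | s_t = j, a_t = k); normalize over the next-state axis.
T = rng.random((num_states, num_states, num_actions))
T /= T.sum(axis=0, keepdims=True)

mu = np.full(num_states, 1.0 / num_states)  # mu_{t, j} = p(s_t = j)
xi = np.array([0.3, 0.7])                   # xi_{t, k} = p(a_t = k)

# mu_{t+1, i} = sum_{j, k} T[i, j, k] * mu[j] * xi[k]
mu_next = np.einsum("ijk,j,k->i", T, mu, xi)
print(mu_next, mu_next.sum())  # a valid distribution: sums to 1
```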

Partially observed Markov decision process

$$ \mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{E}, r\} $$

  • $\mathcal{O}$ — observation space
    • observation $\mathbf{o} \in \mathcal{O}$ ( discrete or continuous )
  • $\mathcal{E}$ — emission probability
    • $p(o_t | s_t)$

The goal of reinforcement learning


The policy $\pi_\theta(\mathbf{a}_t|\mathbf{s}_t)$, together with the initial state distribution $p(\mathbf{s}_1)$ and the transition dynamics $p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)$, induces a distribution over trajectories:

$$ p_\theta(\mathbf{s}_1,\mathbf{a}_1,\ldots,\mathbf{s}_T,\mathbf{a}_T)=p(\mathbf{s}_1)\prod_{t=1}^T\pi_\theta(\mathbf{a}_t|\mathbf{s}_t)p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t) $$

where $\tau = \mathbf{s}_1,\mathbf{a}_1,\ldots,\mathbf{s}_T,\mathbf{a}_T$ is called a trajectory: a sequence of states and actions.

Therefore, we can define an objective for reinforcement learning: to find the parameters $\theta$ that define our policy so as to maximize the expected value of the sum of the rewards over the trajectory.

$$ \begin{aligned}\theta^\star=\arg\max_\theta E_{\tau\thicksim p_\theta(\tau)}\left[\sum_tr(\mathbf{s}_t,\mathbf{a}_t)\right]\end{aligned} $$

Finite horizon case: state-action marginal

We can group the state and action together into a kind of augmented state, which actually forms a Markov chain.

$$ p((\mathbf{s}_{t+1},\mathbf{a}_{t+1})|(\mathbf{s}_t,\mathbf{a}_t))=p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)\pi_\theta(\mathbf{a}_{t+1}|\mathbf{s}_{t+1}) $$

Therefore, we can write the objective in a slightly different way, in terms of the state-action marginals $p_\theta(\mathbf{s}_t,\mathbf{a}_t)$:

$$ \theta^\star=\arg\max_\theta\sum_{t=1}^TE_{(\mathbf{s}_t,\mathbf{a}_t)\sim p_\theta(\mathbf{s}_t,\mathbf{a}_t)}[r(\mathbf{s}_t,\mathbf{a}_t)] $$

Infinite horizon case: stationary distribution

As $T\to\infty$, if the augmented Markov chain is ergodic, the state-action marginal converges to a stationary distribution $\mu = p_\theta(\mathbf{s},\mathbf{a})$ satisfying $\mu=\mathcal{T}\mu$, and the objective ( averaged over time ) becomes the expected reward under that stationary distribution:

$$ \theta^\star=\arg\max_\theta E_{(\mathbf{s},\mathbf{a})\sim p_\theta(\mathbf{s},\mathbf{a})}[r(\mathbf{s},\mathbf{a})] $$

Expectations and stochastic systems

Reinforcement learning is really about optimizing expectations.

Although we talk about reinforcement learning in terms of choosing actions that lead to high rewards, we’re always really concerned about the expected values of rewards.

The interesting thing about expected values is that they can be continuous in the parameters of the corresponding distributions even when the function we are taking the expectation of is itself highly discontinuous. This is why reinforcement learning algorithms can use smooth optimization methods like gradient descent to optimize objectives that are seemingly non-differentiable, such as a binary reward for winning or losing a game.
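
A tiny worked example of this point ( the binary-reward "game" and its parameterization are illustrative assumptions, not from the lecture ): suppose a parameter $\theta$ is the probability of winning and the reward is the discontinuous indicator $r\in\{0,1\}$. Each individual outcome jumps between 0 and 1, yet the expectation $E[r]=\theta$ is linear, hence smooth, in $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_binary_reward(theta, num_samples=100_000):
    """Monte Carlo estimate of E[r] when r ~ Bernoulli(theta)."""
    wins = rng.random(num_samples) < theta  # each sample is a discontinuous 0/1 outcome
    return wins.mean()

# The individual rewards are binary, but the expectation varies smoothly
# ( here, linearly ) with the parameter theta.
for theta in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(theta, expected_binary_reward(theta))
```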

Algorithm


The anatomy of a reinforcement learning algorithm

Which parts are expensive?

  • generate samples
    • If we use a real robot in the real world, generating samples will be rather expensive.
    • If we use a simulator, the cost might be trivial.
  • fit a model/estimate the return
    • If we just sum up the rewards we obtained, this might be rather cheap.
    • If we are learning an entire model by training a whole neural network, this might require a big supervised learning run in the inner loop of the RL algorithm.
  • improve the policy
    • If we are just taking one gradient step, this might be rather cheap.
    • If we have to backpropagate through the model, this might be rather expensive.
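
A toy end-to-end sketch of these three parts on a made-up one-parameter problem ( the "environment", the linear policy, and the random-search improvement step are illustrative assumptions, chosen only to keep the loop short ):

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, horizon=10):
    """Generate one trajectory's rewards under a made-up linear 'policy'."""
    states = rng.normal(size=horizon)
    actions = theta * states
    return -(actions - states) ** 2  # reward peaks when theta = 1

def estimate_return(theta, num_rollouts=50):
    """Fit a model / estimate the return: here, just average summed rewards."""
    return np.mean([rollout(theta).sum() for _ in range(num_rollouts)])

theta = -2.0
for _ in range(200):
    # 1. generate samples and 2. estimate the return for the current policy
    current_return = estimate_return(theta)
    # 3. improve the policy: keep a random perturbation if it scores higher
    candidate = theta + rng.normal(scale=0.3)
    if estimate_return(candidate) > current_return:
        theta = candidate

print(theta)  # drifts toward 1, the reward-maximizing parameter
```

Real algorithms replace the random-search improvement step with the gradient-based, value-based, or model-based updates discussed below.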

Value Functions


How do we deal with all these expectations?

As mentioned above, the reinforcement learning objective can be defined as an expectation

$$ E_{\tau\sim p_\theta(\tau)}\left[\sum_{t=1}^Tr(\mathbf{s}_t,\mathbf{a}_t)\right] $$

We can actually write it out recursively

$$ E_{\mathbf{s}_{1}\sim p(\mathbf{s}_{1})}\left[E_{\mathbf{a}_{1}\sim\pi(\mathbf{a}_{1}|\mathbf{s}_{1})}\left[r(\mathbf{s}_{1},\mathbf{a}_{1})+E_{\mathbf{s}_{2}\sim p(\mathbf{s}_{2}|\mathbf{s}_{1},\mathbf{a}_{1})}\left[E_{\mathbf{a}_{2}\sim\pi(\mathbf{a}_{2}|\mathbf{s}_{2})}\left[r(\mathbf{s}_{2},\mathbf{a}_{2})+…|\mathbf{s}_{2}\right]|\mathbf{s}_{1},\mathbf{a}_{1}\right]|\mathbf{s}_{1}\right]\right] $$

What if we had some function that told us $Q(\mathbf{s}_1,\mathbf{a}_1)$, where

$$ Q(\mathbf{s}_1,\mathbf{a}_1)=r(\mathbf{s}_1,\mathbf{a}_1)+E_{\mathbf{s}_2\sim p(\mathbf{s}_2|\mathbf{s}_1,\mathbf{a}_1)}\left[E_{\mathbf{a}_2\sim\pi(\mathbf{a}_2|\mathbf{s}_2)}\left[r(\mathbf{s}_2,\mathbf{a}_2)+…|\mathbf{s}_2\right]|\mathbf{s}_1,\mathbf{a}_1\right] $$

  • Q-function: total reward from taking $\mathbf{a}_t$ in $\mathbf{s}_t$

$$ Q^\pi(\mathbf{s}_t,\mathbf{a}_t)=\sum_{t^{\prime}=t}^TE_{\pi_\theta}\left[r(\mathbf{s}_{t^{\prime}},\mathbf{a}_{t^{\prime}})|\mathbf{s}_t,\mathbf{a}_t\right] $$

  • Value function: total reward from $\mathbf{s}_t$

$$ V^{\pi}(\mathbf{s}_{t})=\sum_{t^{\prime}=t}^{T}E_{\pi_{\theta}}[r(\mathbf{s}_{t^{\prime}},\mathbf{a}_{t^{\prime}})|\mathbf{s}_{t}] = E_{\mathbf{a}_{t}\sim\pi(\mathbf{a}_{t}|\mathbf{s}_{t})}[Q^{\pi}(\mathbf{s}_{t},\mathbf{a}_{t})] $$

Therefore, the reinforcement learning objective can also be written as

$$ E_{\mathbf{s}_1\sim p(\mathbf{s}_1)}[V^\pi(\mathbf{s}_1)] $$
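
A minimal tabular sketch of these definitions, computing $Q^\pi$ and $V^\pi$ by backward recursion on a small made-up finite-horizon MDP ( the transition tensor, rewards, policy, and horizon are all illustrative assumptions ):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, T_horizon = 3, 2, 5

# Made-up MDP: P[i, j, k] = p(s_{t+1} = i | s_t = j, a_t = k), r[j, k] = r(s, a).
P = rng.random((S, S, A))
P /= P.sum(axis=0, keepdims=True)
r = rng.random((S, A))
pi = np.full((S, A), 1.0 / A)        # a fixed uniform policy pi(a | s)

Q = np.zeros((T_horizon + 1, S, A))  # zero after the horizon: no more reward
V = np.zeros((T_horizon + 1, S))
for t in reversed(range(T_horizon)):
    # Q^pi(s_t, a_t) = r(s_t, a_t) + E_{s_{t+1}}[ V^pi(s_{t+1}) ]
    Q[t] = r + np.einsum("ijk,i->jk", P, V[t + 1])
    # V^pi(s_t) = E_{a_t ~ pi(a_t | s_t)}[ Q^pi(s_t, a_t) ]
    V[t] = (pi * Q[t]).sum(axis=1)

# The RL objective for this policy: E_{s_1 ~ p(s_1)}[ V^pi(s_1) ]
p_s1 = np.full(S, 1.0 / S)
print(p_s1 @ V[0])
```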

Using Q-functions and value functions

Idea 1: if we have policy $\pi$, and we know $Q^{\pi}(\mathbf{s},\mathbf{a})$, then we can improve $\pi$:

  • set $\pi^{\prime}(\mathbf{a} | \mathbf{s}) = 1$ if $\mathbf{a} = \arg\max_\mathbf{a}Q^\pi(\mathbf{s},\mathbf{a})$
    • this policy is at least as good as $\pi$ and probably better
    • it doesn’t matter what $\pi$ is

Idea 2: compute gradient to increase probability of good actions $\mathbf{a}$

  • if $Q^\pi(\mathbf{s},\mathbf{a}) \gt V^\pi(\mathbf{s})$, then $\mathbf{a}$ is better than average
    • modify $\pi(\mathbf{a} | \mathbf{s})$ to increase probability of $\mathbf{a}$ if $Q^\pi(\mathbf{s},\mathbf{a}) \gt V^\pi(\mathbf{s})$
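
A minimal sketch of Idea 1 on a tabular Q-function ( the Q-table values are made-up ): the improved policy $\pi'$ puts all its probability on the argmax action in each state.

```python
import numpy as np

# Made-up Q^pi table: rows are states, columns are actions.
Q = np.array([[1.0, 2.5],
              [0.3, 0.1],
              [4.0, 2.0]])
num_states, num_actions = Q.shape

# Idea 1: pi'(a | s) = 1 if a = argmax_a Q^pi(s, a), and 0 otherwise.
greedy_actions = Q.argmax(axis=1)
pi_prime = np.zeros((num_states, num_actions))
pi_prime[np.arange(num_states), greedy_actions] = 1.0

print(pi_prime)  # deterministic greedy policy, at least as good as pi
```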

Types of Algorithms


As we have mentioned above, we are going to optimize the RL objective

$$ \begin{aligned}\theta^\star=\arg\max_\theta E_{\tau\thicksim p_\theta(\tau)}\left[\sum_tr(\mathbf{s}_t,\mathbf{a}_t)\right]\end{aligned} $$

  • Policy gradients: directly differentiate the above objective $\Rightarrow$ gradient ascent procedure
  • Value-based: estimate value function or Q-function of the optimal policy ( no explicit policy )
    • the value function or Q-function is itself typically represented by a neural network
  • Actor-critic: estimate value function or Q-function of the current policy, use it to improve policy
  • Model-based RL: estimate the transition model, and then…
    • Use it for planning ( no explicit policy )
    • Use it to improve a policy
    • Something else

Model-based RL algorithms

  • Learn $p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)$

  • This could be a neural net that either outputs a probability distribution over $\mathbf{s}_{t+1}$ or a deterministic model attempting to predict $\mathbf{s}_{t+1}$ directly ( a simplified fitting sketch follows this list ).

  • Just use the model to plan (no policy)

    • Trajectory optimization/optimal control ( primarily in continuous spaces ) – essentially backpropagation to optimize over actions.
    • Discrete planning in discrete action spaces, e.g., Monte Carlo tree search
  • Backpropagate gradients into the policy

    • Requires some tricks to make it work
  • Use the model to learn a value function

    • Dynamic programming
    • Generate simulated experience for model-free learner
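
A simplified fitting sketch for the first step above: instead of a neural network, a deterministic linear model $\mathbf{s}_{t+1}\approx W[\mathbf{s}_t;\mathbf{a}_t]$ is fit by least squares on made-up transition data ( all dimensions and data below are illustrative assumptions ):

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, num_transitions = 4, 2, 1000

# Made-up "true" linear dynamics, used only to generate training transitions.
A_true = rng.normal(scale=0.3, size=(state_dim, state_dim))
B_true = rng.normal(scale=0.3, size=(state_dim, action_dim))

states = rng.normal(size=(num_transitions, state_dim))
actions = rng.normal(size=(num_transitions, action_dim))
next_states = (states @ A_true.T + actions @ B_true.T
               + 0.01 * rng.normal(size=(num_transitions, state_dim)))

# Supervised model learning, reduced to least squares: s_{t+1} ≈ [s_t, a_t] W.
inputs = np.hstack([states, actions])
W, *_ = np.linalg.lstsq(inputs, next_states, rcond=None)

# The learned model predicts next states, which planning or policy learning can use.
predicted = inputs @ W
print(np.mean((predicted - next_states) ** 2))  # small residual error
```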

Value function based algorithms

  • fit a model/estimate: fit $V(\mathbf{s})$ or $Q(\mathbf{s}, \mathbf{a})$
  • improve the policy: set $\pi(\mathbf{s})=\arg\max_\mathbf{a}Q(\mathbf{s},\mathbf{a})$

Direct policy gradient

  • fit a model/estimate: $R_{\tau}=\sum_{t}r(\mathbf{s}_{t},\mathbf{a}_{t})$
  • improve the policy: $\theta\leftarrow\theta+\alpha\nabla_\theta E\left[\sum_tr(\mathbf{s}_t,\mathbf{a}_t)\right]$
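
A minimal sketch of one such update using the score-function ( REINFORCE-style ) gradient estimator on a one-step problem with a Gaussian policy; the reward, policy class, and step size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 5.0          # mean of a Gaussian policy over a 1-D action
alpha, sigma = 0.05, 1.0

def reward(action):
    return -(action - 2.0) ** 2  # made-up reward, maximized at action = 2

for _ in range(500):
    # sample actions from pi_theta(a) = N(theta, sigma^2)
    actions = theta + sigma * rng.normal(size=64)
    rewards = reward(actions)
    # grad_theta log pi_theta(a) = (a - theta) / sigma^2 for the Gaussian mean
    grad_log_pi = (actions - theta) / sigma**2
    # Monte Carlo estimate of grad_theta E[r], followed by a gradient ascent step
    theta += alpha * np.mean(grad_log_pi * rewards)

print(theta)  # approaches 2, the reward-maximizing action mean
```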

Actor-critic

  • fit a model/estimate: fit $V(\mathbf{s})$ or $Q(\mathbf{s}, \mathbf{a})$
  • improve the policy: $\theta\leftarrow\theta+\alpha\nabla_\theta E[Q(\mathbf{s}_t,\mathbf{a}_t)]$

Tradeoffs Between Algorithms


Comparison: sample efficiency

Sample efficiency: how many samples do we need to get a good policy

The most important question is whether the algorithm is off policy.

  • off policy: able to improve the policy without generating new samples from that policy
  • on policy: each time the policy is changed, even a little bit, we need to generate new samples

Comparison: stability and ease of use

Supervised learning: almost always gradient descent

Reinforcement learning: often not gradient descent

  • Q-learning: fixed point iteration

  • Model-based RL: model is not optimized for expected reward

  • Policy gradient: is gradient descent, but also often the least efficient!

  • As a result, convergence ( whether, and to what ) is often not guaranteed and not well understood.

Comparison: assumptions

  • full observability
    • Generally assumed by value function fitting methods
    • Can be mitigated by adding recurrence
  • episodic learning
    • Often assumed by pure policy gradient methods
    • Assumed by some model-based RL methods
  • continuity or smoothness
    • Assumed by some continuous value function learning methods
    • Often assumed by some model-based RL methods

Examples of Algorithms


  • Value function fitting methods
    • Q-learning, DQN
    • Temporal difference learning
    • Fitted value iteration
  • Policy gradient methods
    • REINFORCE
    • Natural policy gradient
    • Trust region policy optimization
  • Actor-critic algorithms
    • Asynchronous advantage actor-critic (A3C)
    • Soft actor-critic (SAC)
  • Model-based RL algorithms
    • Dyna
    • Guided policy search