Overview
The goal of reinforcement learning is to learn a policy \pi(a\vert s) = \Pr\left[ A=a\vert S=s\right] so that the agent knows what to do in each state s to maximize the expected cumulative reward it can receive over the next N steps (these steps are called an “episode”):
Q_\pi(s_t, a_t) = \mathbb{E}_\pi \left[ U_t \vert S_t=s_t, A_t=a_t\right] = \mathbb{E}_\pi \left[ \sum_{k=0}^{N-t} \gamma^k R_{t+k} \vert S_t=s_t, A_t=a_t\right]
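To make the return concrete, here is a minimal Python sketch that computes the discounted return U_t = \sum_{k} \gamma^k R_{t+k} for every step of a short episode; the reward list and discount factor are made-up values for illustration only.

```python
# Compute the discounted return U_t for every step of a toy episode.
# The rewards and gamma are made-up values for illustration only.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 3.0]  # r_t observed during one episode

returns = [0.0] * len(rewards)
running = 0.0
# Work backwards using the recursion U_t = r_t + gamma * U_{t+1}.
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

print(returns)  # returns[t] is the discounted return observed from step t onward
```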
Under the hood, there is an unknown state transition function p(s'\vert s, a) = \Pr\left[ S'=s'\vert S=s, A=a\right] that is governed by the environment. What we can observe is a sequence of triplets (s_1, a_1, r_1) \rightarrow (s_2, a_2, r_2) \rightarrow \cdots (aka a “trajectory”); we can collect such trajectory data to train the agent to find the best policy to execute.
Since the state-value function is V(s) = \sum_a \pi(a\vert s) Q_\pi(s, a), there are three ways to maximize V(s):
| Approach | Goal | Model | Category |
|---|---|---|---|
| Value-based | Learn the optimal action-value function Q^*(s, a) | DQN + TD | Off-policy |
| Policy-based | Learn the optimal policy function \pi^*(a\vert s) | REINFORCE (or policy-gradient algorithms) | On-policy |
| Hybrid | Learn both the policy and the value | Actor-Critic | |
Here is the distinction between on-policy and off-policy algorithms:
- On-policy: The policy we are using to control the agent is the same as the policy we are trying to improve.
- Off-policy: The policy we are using to control the agent is different from the policy we are trying to learn. For example, we could use a greedy or random policy to control the agent in the hope of learning a better policy than either of them (see the sketch below).
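To make the off-policy case concrete, here is a minimal sketch, assuming a tabular Q estimate stored as a Python dictionary; the names q_table, greedy, and epsilon_greedy, as well as the numbers, are hypothetical and only for illustration. The agent is controlled by an \epsilon-greedy behavior policy while the policy being improved is the greedy one.

```python
import random

# Hypothetical tabular Q estimate: q_table[state][action] -> estimated Q(s, a).
q_table = {
    "s0": [0.1, 0.5, 0.2],  # made-up values
    "s1": [0.0, 0.3, 0.9],
}

def greedy(state):
    """Target policy (the one we are trying to learn): pick argmax_a Q(s, a)."""
    values = q_table[state]
    return max(range(len(values)), key=lambda a: values[a])

def epsilon_greedy(state, epsilon=0.1):
    """Behavior policy (the one controlling the agent): explores with probability epsilon."""
    if random.random() < epsilon:
        return random.randrange(len(q_table[state]))  # random exploration
    return greedy(state)                              # otherwise act greedily

# Off-policy setup: actions come from epsilon_greedy, but the policy being
# evaluated and improved is greedy -- the two policies are different.
action = epsilon_greedy("s0")
```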
Value-Based
DQN + TD
The DQN uses a model (for example, a neural network) to approximate an imaginary oracle Q^*(s_t, a_t) = \max_\pi Q_\pi(s_t, a_t) (aka the optimal action-value function); the optimal action-value function tells you which action is the best in an “averaged” sense.
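As a minimal sketch of such a model in PyTorch, assuming a 4-dimensional state vector and 2 discrete actions (both sizes are made up for illustration), the network maps a state to one estimated action value per action:

```python
import torch
import torch.nn as nn

# Made-up problem sizes for illustration.
STATE_DIM, NUM_ACTIONS = 4, 2

class QNetwork(nn.Module):
    """Approximates Q*(s, a): outputs one estimated action value per discrete action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64),
            nn.ReLU(),
            nn.Linear(64, NUM_ACTIONS),  # one Q value per action
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
state = torch.randn(1, STATE_DIM)            # a dummy observed state
q_values = q_net(state)                      # estimated Q(s, a) for every action a
best_action = q_values.argmax(dim=1).item()  # act greedily w.r.t. the estimates
```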
The DQN can be trained with the Temporal Difference (TD) algorithm. The idea of the TD algorithm is that the same objective can be computed in two ways:
- Pure estimation: q_t = Q(s_t, a_t; \mathbf{w}_t)
- A partial observation plus estimation (aka the TD target): y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; \mathbf{w}_t).
Then we try to update \mathbf{w}_t by minimizing the difference between q_t and y_t so that in future iterations the model will perform better:
\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot (q_t - y_t) \cdot \frac{\partial Q(s_t, a_t;\mathbf{w})}{\partial \mathbf{w}} \rvert_{\mathbf{w} = \mathbf{w}_t}
The following are the steps of the DQN + TD algorithm; a minimal code sketch follows the steps:
- Step 1: Observe state s_t and perform an action a_t.
- Step 2: In this step, we obtain three things:
- Two estimations of the same target: q_t = Q(s_t, a_t;\mathbf{w}_t) and y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a;\mathbf{w}_t).
- The gradient of the value network: \mathbf{d}_t = \frac{\partial Q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}\rvert_{\mathbf{w} =\mathbf{w}_t}.
- Step 3: Update the value network: \mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot (q_t - y_t)\cdot \mathbf{d}_t. This update happens every time we observe a reward.
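Putting these steps together, here is a minimal single-transition sketch in PyTorch. The network architecture, the transition tuple, and the hyperparameters (GAMMA, LR) are made-up assumptions; a practical DQN would also add an experience replay buffer and a target network.

```python
import torch
import torch.nn as nn

# Made-up problem sizes and hyperparameters for illustration.
STATE_DIM, NUM_ACTIONS, GAMMA, LR = 4, 2, 0.99, 1e-3

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
optimizer = torch.optim.SGD(q_net.parameters(), lr=LR)

# Step 1: one observed transition (s_t, a_t, r_t, s_{t+1}) -- dummy values here.
s_t = torch.randn(1, STATE_DIM)
a_t = torch.tensor([[0]])
r_t = torch.tensor([[1.0]])
s_next = torch.randn(1, STATE_DIM)

# Step 2: pure estimation q_t = Q(s_t, a_t; w_t) ...
q_t = q_net(s_t).gather(1, a_t)
# ... and the TD target y_t = r_t + gamma * max_a Q(s_{t+1}, a; w_t), held fixed.
with torch.no_grad():
    y_t = r_t + GAMMA * q_net(s_next).max(dim=1, keepdim=True).values

# Step 3: because y_t is treated as a constant, the gradient of 0.5 * (q_t - y_t)^2
# is exactly (q_t - y_t) * dQ/dw, so one SGD step reproduces the update rule above.
loss = 0.5 * (q_t - y_t).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```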
Policy-Based
Overview
- If both the states and actions are discrete and there is a limited number of them, we could try to estimate the entries of a matrix of shape \vert S\vert \times \vert A\vert.
- If the state space is continuous but the action space is still discrete, we need to learn a distribution (typically a neural network) \pi(a\vert s; \theta) over the set of actions given each state s, as sketched below.
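Here is a minimal sketch of such a policy network in PyTorch for the continuous-state, discrete-action case; the state and action sizes are made-up assumptions.

```python
import torch
import torch.nn as nn

# Made-up sizes: a 4-dimensional continuous state and 3 discrete actions.
STATE_DIM, NUM_ACTIONS = 4, 3

class PolicyNetwork(nn.Module):
    """pi(a | s; theta): maps a state to a probability distribution over actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64),
            nn.ReLU(),
            nn.Linear(64, NUM_ACTIONS),
        )

    def forward(self, state):
        # Softmax turns the raw scores into action probabilities that sum to 1.
        return torch.softmax(self.net(state), dim=-1)

policy = PolicyNetwork()
state = torch.randn(1, STATE_DIM)     # a dummy state
probs = policy(state)                 # pi(a | s; theta) for every action a
action = torch.multinomial(probs, 1)  # sample an action from the distribution
```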
Policy-Gradient Algorithm
The goal is to learn a policy network \pi(a\vert s; \theta) that maximizes V(s;\theta) = \mathbb{E}_A \left[Q_\pi(s, A)\right] = \sum_a \pi(a\vert s;\theta) \cdot Q_\pi(s, a) by gradient ascent on \frac{\partial V(s;\theta)}{\partial \theta}. After some derivation, this gradient is equal to:
\frac{\partial V(s;\theta)}{\partial \theta}=\mathbb{E}_A\left[Q_\pi(s,A) \cdot \frac{\partial \log \pi(A\vert s;\theta)}{\partial \theta}\right]
Here are the steps for the policy-gradient algorithm; a minimal code sketch follows the list:
- Step 1: Observe state s_t and randomly sample an action a_t based on \pi(a\vert s_t;\theta_t).
- Step 2: Evaluate the two factors of the policy gradient:
- $q_t \approx Q_\pi(s_t, a_t)$.
- $\mathbf{d}_{\theta, t} = \frac{\partial \log \pi(a_t\vert s_t;\theta)}{\partial \theta}\rvert_{\theta=\theta_t}$
- Step 3: Update the policy network: \theta_{t+1} = \theta_t + \beta \cdot q_t \cdot \mathbf{d}_{\theta, t}
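Here is a minimal sketch of one such update step in PyTorch, assuming a softmax policy network; the sizes, the learning rate BETA, and the placeholder value of q_t are made up, and how q_t is actually estimated is left to the concrete algorithm (e.g., REINFORCE below).

```python
import torch
import torch.nn as nn

# Made-up sizes and hyperparameters for illustration.
STATE_DIM, NUM_ACTIONS, BETA = 4, 3, 1e-3

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
optimizer = torch.optim.SGD(policy.parameters(), lr=BETA)

# Step 1: observe a state and sample an action from pi(a | s_t; theta_t).
s_t = torch.randn(1, STATE_DIM)             # dummy state
probs = torch.softmax(policy(s_t), dim=-1)
a_t = torch.multinomial(probs, 1)

# Step 2: the two factors of the policy gradient.
q_t = 1.5                                   # placeholder estimate of Q_pi(s_t, a_t)
log_prob = torch.log(probs.gather(1, a_t))  # log pi(a_t | s_t; theta)

# Step 3: gradient ascent on q_t * log pi, implemented as descent on its negative.
loss = -(q_t * log_prob).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```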
REINFORCE
The REINFORCE algorithm is one instance of the policy-gradient algorithm. It requires collecting an entire trajectory before updating the policy network \pi(a\vert s;\theta), because it uses the observed discounted return u_t = \sum_{k=0}^{N-t} \gamma^k r_{t+k} as the estimate q_t \approx Q_\pi(s_t, a_t).
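Here is a minimal REINFORCE sketch in PyTorch, assuming one finished episode stored as tensors of states and actions plus a list of rewards (all dummy data), with the same made-up sizes and hyperparameters as the earlier sketches.

```python
import torch
import torch.nn as nn

# Made-up sizes and hyperparameters for illustration.
STATE_DIM, NUM_ACTIONS, GAMMA, BETA = 4, 3, 0.99, 1e-3

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
optimizer = torch.optim.SGD(policy.parameters(), lr=BETA)

# One finished episode (dummy data): states visited, actions taken, rewards received.
states = torch.randn(5, STATE_DIM)
actions = torch.randint(NUM_ACTIONS, (5, 1))
rewards = [1.0, 0.0, 0.5, 0.0, 2.0]

# Compute the observed discounted return u_t for every step (backwards recursion).
returns, running = [0.0] * len(rewards), 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + GAMMA * running
    returns[t] = running
returns = torch.tensor(returns).unsqueeze(1)

# Policy-gradient update with u_t used in place of Q_pi(s_t, a_t).
log_probs = torch.log_softmax(policy(states), dim=-1).gather(1, actions)
loss = -(returns * log_probs).mean()  # gradient ascent via descent on the negative
optimizer.zero_grad()
loss.backward()
optimizer.step()
```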
Glossary
| Name | Formula | Usage |
|---|---|---|
| Action-Value Function | $Q_\pi(s, a)$ | Action selection given a policy: how good it is for the agent to pick action a in state s. |
| State-Value Function | $V_\pi(s) = \mathbb{E}_A\left[ Q_\pi(s, A) \right] = \sum_a \pi(a\vert s) \cdot Q_\pi(s, a)$ | State evaluation given a policy: how good the current situation s is for a policy \pi. |
| Expected State-Value | $\mathbb{E}_S\left[ V_\pi(S) \right] = \mathbb{E}_{S, A}\left[ Q_\pi(S, A) \right]$ | Evaluation of a policy \pi. |