Overview
The goal of reinforcement learning is to learn a policy \pi(a\vert s) = \Pr\left[ A=a\vert S=s\right] so that the agent knows what to do in each state s to maximize the expected cumulative reward it can receive over the next N steps (these steps are called an “episode”):
Q_\pi(s_t, a_t) = \mathbb{E}_\pi \left[ U_t \vert S_t=s_t, A_t=a_t\right] = \mathbb{E}_\pi \left[ \sum_{k=0}^{N-t} \gamma^k R_{t+k} \vert S_t=s_t, A_t=a_t\right]
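To make the return concrete, here is a minimal Python sketch that computes the discounted return U_t = \sum_{k} \gamma^k R_{t+k} for every step of a short episode; the reward list and discount factor are made-up values for illustration only.

```python
# Compute the discounted return U_t for every step of a toy episode.
# The rewards and gamma are made-up values for illustration only.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 3.0]  # r_t observed during one episode

returns = [0.0] * len(rewards)
running = 0.0
# Work backwards using the recursion U_t = r_t + gamma * U_{t+1}.
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

print(returns)  # returns[t] is the discounted return observed from step t onward
```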
Under the hood, there is an unknown state transition function p(s'\vert s, a) = \Pr\left[ S'=s'\vert S=s, A=a\right] that is governed by the environment. What we can observe is a sequence of triplets (s_1, a_1, r_1) \rightarrow (s_2, a_2, r_2) \rightarrow \cdots (aka a “trajectory”); we can collect such trajectory data to train the agent to find the best policy to execute.
Since the state-value function is V(s) = \sum_a \pi(a\vert s) Q_\pi(s, a), there are three ways to maximize V(s):
| Approach | Goal | Model | Category |
|---|---|---|---|
| Value-based | Learn the optimal action-value function Q^*(s, a) | DQN + TD | Off-policy |
| Policy-based | Learn the optimal policy function \pi^*(a\vert s) | REINFORCE (or policy-gradient algorithms) | On-policy |
| Hybrid | Learn both the policy and the value | Actor-Critic | |
Here is the distinction between on-policy and off-policy algorithms:
- On-policy: The policy we are using to control the agent is the same as the policy we are trying to improve.
- Off-policy: The policy we are using to control the agent is different from the policy we are trying to learn. For example, we could use a greedy or random policy to control the agent in the hope of learning a better policy than either of them (see the sketch below).
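To make the off-policy case concrete, here is a minimal sketch, assuming a tabular Q estimate stored as a Python dictionary; the names q_table, greedy, and epsilon_greedy, as well as the numbers, are hypothetical and only for illustration. The agent is controlled by an \epsilon-greedy behavior policy while the policy being improved is the greedy one.

```python
import random

# Hypothetical tabular Q estimate: q_table[state][action] -> estimated Q(s, a).
q_table = {
    "s0": [0.1, 0.5, 0.2],  # made-up values
    "s1": [0.0, 0.3, 0.9],
}

def greedy(state):
    """Target policy (the one we are trying to learn): pick argmax_a Q(s, a)."""
    values = q_table[state]
    return max(range(len(values)), key=lambda a: values[a])

def epsilon_greedy(state, epsilon=0.1):
    """Behavior policy (the one controlling the agent): explores with probability epsilon."""
    if random.random() < epsilon:
        return random.randrange(len(q_table[state]))  # random exploration
    return greedy(state)                              # otherwise act greedily

# Off-policy setup: actions come from epsilon_greedy, but the policy being
# evaluated and improved is greedy -- the two policies are different.
action = epsilon_greedy("s0")
```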
Value-Based
DQN + TD
The DQN uses a model (for example, a neural network) to approximate an imaginary oracle Q^*(s_t, a_t) = \max_\pi Q_\pi(s_t, a_t) (aka the optimal action-value function); the optimal action-value function tells you which action is the best in an “averaged” sense.
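As a minimal sketch of such a model in PyTorch, assuming a 4-dimensional state vector and 2 discrete actions (both sizes are made up for illustration), the network maps a state to one estimated action value per action:

```python
import torch
import torch.nn as nn

# Made-up problem sizes for illustration.
STATE_DIM, NUM_ACTIONS = 4, 2

class QNetwork(nn.Module):
    """Approximates Q*(s, a): outputs one estimated action value per discrete action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64),
            nn.ReLU(),
            nn.Linear(64, NUM_ACTIONS),  # one Q value per action
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
state = torch.randn(1, STATE_DIM)            # a dummy observed state
q_values = q_net(state)                      # estimated Q(s, a) for every action a
best_action = q_values.argmax(dim=1).item()  # act greedily w.r.t. the estimates
```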
The DQN can be trained with the Temporal Difference (TD) algorithm. The idea of the TD algorithm is that the same objective can be computed in two ways:
- Pure estimation: q_t = Q(s_t, a_t; \mathbf{w}_t)
- A partial observation plus estimation (aka the TD target): y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; \mathbf{w}_t).
Then we try to update \mathbf{w}_t by minimizing the difference between q_t and y_t so that in future iterations the model will perform better:
\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot (q_t - y_t) \cdot \frac{\partial Q(s_t, a_t;\mathbf{w})}{\partial \mathbf{w}} \rvert_{\mathbf{w} = \mathbf{w}_t}
The following are the steps of the DQN + TD algorithm; a minimal code sketch follows the steps:
- Step 1: Observe state s_t and perform an action a_t.
- Step 2: In this step, we obtain three things:
- Two estimations of the same target: q_t = Q(s_t, a_t;\mathbf{w}_t) and y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a;\mathbf{w}_t).
- The gradient of the value network: \mathbf{d}_t = \frac{\partial Q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}\rvert_{\mathbf{w} =\mathbf{w}_t}.
- Step 3: Update the value network: \mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot (q_t - y_t)\cdot \mathbf{d}_t. This update happens every time we observe a reward.
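Putting these steps together, here is a minimal single-transition sketch in PyTorch. The network architecture, the transition tuple, and the hyperparameters (GAMMA, LR) are made-up assumptions; a practical DQN would also add an experience replay buffer and a target network.

```python
import torch
import torch.nn as nn

# Made-up problem sizes and hyperparameters for illustration.
STATE_DIM, NUM_ACTIONS, GAMMA, LR = 4, 2, 0.99, 1e-3

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
optimizer = torch.optim.SGD(q_net.parameters(), lr=LR)

# Step 1: one observed transition (s_t, a_t, r_t, s_{t+1}) -- dummy values here.
s_t = torch.randn(1, STATE_DIM)
a_t = torch.tensor([[0]])
r_t = torch.tensor([[1.0]])
s_next = torch.randn(1, STATE_DIM)

# Step 2: pure estimation q_t = Q(s_t, a_t; w_t) ...
q_t = q_net(s_t).gather(1, a_t)
# ... and the TD target y_t = r_t + gamma * max_a Q(s_{t+1}, a; w_t), held fixed.
with torch.no_grad():
    y_t = r_t + GAMMA * q_net(s_next).max(dim=1, keepdim=True).values

# Step 3: because y_t is treated as a constant, the gradient of 0.5 * (q_t - y_t)^2
# is exactly (q_t - y_t) * dQ/dw, so one SGD step reproduces the update rule above.
loss = 0.5 * (q_t - y_t).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```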
Policy-Based
Overview
- If both the states and actions are discrete and there is a limited number of them, we could try to estimate the entries of a matrix of shape \vert S\vert \times \vert A\vert.
- If the state space is continuous but the action space is still discrete, we need to learn a distribution (typically a neural network) \pi(a\vert s; \theta) over the set of actions given each state s, as sketched below.
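Here is a minimal sketch of such a policy network in PyTorch for the continuous-state, discrete-action case; the state and action sizes are made-up assumptions.

```python
import torch
import torch.nn as nn

# Made-up sizes: a 4-dimensional continuous state and 3 discrete actions.
STATE_DIM, NUM_ACTIONS = 4, 3

class PolicyNetwork(nn.Module):
    """pi(a | s; theta): maps a state to a probability distribution over actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64),
            nn.ReLU(),
            nn.Linear(64, NUM_ACTIONS),
        )

    def forward(self, state):
        # Softmax turns the raw scores into action probabilities that sum to 1.
        return torch.softmax(self.net(state), dim=-1)

policy = PolicyNetwork()
state = torch.randn(1, STATE_DIM)     # a dummy state
probs = policy(state)                 # pi(a | s; theta) for every action a
action = torch.multinomial(probs, 1)  # sample an action from the distribution
```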
Policy-Gradient Algorithm
The goal is to learn a policy network \pi(a\vert s; \theta) that maximizes V(s;\theta) = \mathbb{E}_A \left[Q_\pi(s, A)\right] = \sum_a \pi(a\vert s;\theta) \cdot Q_\pi(s, a) by gradient ascent on \frac{\partial V(s;\theta)}{\partial \theta}. After some derivation, this gradient is equal to:
\frac{\partial V(s;\theta)}{\partial \theta}=\mathbb{E}_A\left[Q_\pi(s,A) \cdot \frac{\partial \log \pi(A\vert s;\theta)}{\partial \theta}\right]
Here are the steps for the policy-gradient algorithm; a minimal code sketch follows the list:
- Step 1: Observe state s_t and randomly sample an action a_t based on \pi(a\vert s_t;\theta_t).
- Step 2: Evaluate the two factors of the policy gradient:
- $q_t \approx Q_\pi(s_t, a_t)$.
- $\mathbf{d}_{\theta, t} = \frac{\partial \log \pi(a_t\vert s_t;\theta)}{\partial \theta}\rvert_{\theta=\theta_t}$
- Step 3: Update the policy network: \theta_{t+1} = \theta_t + \beta \cdot q_t \cdot \mathbf{d}_{\theta, t}
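Here is a minimal sketch of one such update step in PyTorch, assuming a softmax policy network; the sizes, the learning rate BETA, and the placeholder value of q_t are made up, and how q_t is actually estimated is left to the concrete algorithm (e.g., REINFORCE below).

```python
import torch
import torch.nn as nn

# Made-up sizes and hyperparameters for illustration.
STATE_DIM, NUM_ACTIONS, BETA = 4, 3, 1e-3

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
optimizer = torch.optim.SGD(policy.parameters(), lr=BETA)

# Step 1: observe a state and sample an action from pi(a | s_t; theta_t).
s_t = torch.randn(1, STATE_DIM)             # dummy state
probs = torch.softmax(policy(s_t), dim=-1)
a_t = torch.multinomial(probs, 1)

# Step 2: the two factors of the policy gradient.
q_t = 1.5                                   # placeholder estimate of Q_pi(s_t, a_t)
log_prob = torch.log(probs.gather(1, a_t))  # log pi(a_t | s_t; theta)

# Step 3: gradient ascent on q_t * log pi, implemented as descent on its negative.
loss = -(q_t * log_prob).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```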
REINFORCE
The REINFORCE algorithm is one instance of the policy-gradient algorithm. It requires collecting an entire trajectory before updating the policy network \pi(a\vert s;\theta), because it uses the observed discounted return u_t = \sum_{k=0}^{N-t} \gamma^k r_{t+k} as the estimate q_t \approx Q_\pi(s_t, a_t).
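Here is a minimal REINFORCE sketch in PyTorch, assuming one finished episode stored as tensors of states and actions plus a list of rewards (all dummy data), with the same made-up sizes and hyperparameters as the earlier sketches.

```python
import torch
import torch.nn as nn

# Made-up sizes and hyperparameters for illustration.
STATE_DIM, NUM_ACTIONS, GAMMA, BETA = 4, 3, 0.99, 1e-3

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
optimizer = torch.optim.SGD(policy.parameters(), lr=BETA)

# One finished episode (dummy data): states visited, actions taken, rewards received.
states = torch.randn(5, STATE_DIM)
actions = torch.randint(NUM_ACTIONS, (5, 1))
rewards = [1.0, 0.0, 0.5, 0.0, 2.0]

# Compute the observed discounted return u_t for every step (backwards recursion).
returns, running = [0.0] * len(rewards), 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + GAMMA * running
    returns[t] = running
returns = torch.tensor(returns).unsqueeze(1)

# Policy-gradient update with u_t used in place of Q_pi(s_t, a_t).
log_probs = torch.log_softmax(policy(states), dim=-1).gather(1, actions)
loss = -(returns * log_probs).mean()  # gradient ascent via descent on the negative
optimizer.zero_grad()
loss.backward()
optimizer.step()
```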
Glossary
| Name | Formula | Usage |
|---|---|---|
| Action-Value Function | $Q_\pi(s, a)$ | Action selection given a policy: how good it is for the agent to pick action a in state s. |
| State-Value Function | $V_\pi(s) = \mathbb{E}_A\left[ Q_\pi(s, A) \right] = \sum_a \pi(a\vert s) \cdot Q_\pi(s, a)$ | State evaluation given a policy: how good the current situation s is for a policy \pi. |
| Expected State-Value | $\mathbb{E}_S\left[ V_\pi(S) \right] = \mathbb{E}_{S, A}\left[ Q_\pi(S, A) \right]$ | Evaluation of a policy \pi. |