Meta | Formatting Requirements of Automated WordPress Publishing

Overview

There are two ways of publishing blogs onto a personal WordPress site:

  • Manual Copy and Paste
  • Automatic Publishing based on the solution introduced on GitHub (written in Chinese).

The second approach is preferred because of its almost complete automation. However, we need to follow the guidelines below to make sure that the content is rendered correctly when Markdown is converted to HTML.

Note that the HTML rendering does not seem to fully comply with the official Markdown guidelines. It is therefore less useful to rely on a Markdown linting tool, such as markdownlint by David Anson.

Guidelines

There has to be a newline before and after a list of bullets. Here we start a number of complicated structures.

  • Always use 4 spaces rather than a tab when creating a lower level bullet point.
    • This is the test level 2.
      • This is the test level 3.
        • This is the test level 4.
  • If the content of a bullet point has multiple paragraphs, then it has to be organized as follows.

    There has to be 4 spaces before the subordinate paragraphs. For example, this paragraph is the second paragraph.

    This is the third paragraph of the bullet.

    • This is the second level bullet.

      This is a paragraph subordinate to the second level bullet.

Here we come back to the normal paragraph. For the best presentation of images and formulas:

  • They should not be subordinate to any bullet point; there should be no space before an image or a formula.
  • It is better to have a newline before and after an image or a formula.

a^2 + b^2 = c^2

Here we come back to the normal paragraph again.

Here we have a special note on the formula:

  • The rendering is likely to go wrong when there are multiple formulas with subscripts since the _ is also used for emphasis in Markdown syntax (see the reported issue on GitHub). One workaround is adding a space before and after the _, i.e., replacing _ with <space>_<space>; a small script to automate this replacement is sketched at the end of this note.

The following is the code used to generate the content above:

## Overview

There are two ways of publishing blogs onto a personal WordPress site:

- Manual Copy and Paste
- Automatic Publishing based on the solution introduced on [GitHub](https://github.com/zhaoolee/WordPressXMLRPCTools) (written in Chinese).

The second approach is preferred because of its almost complete automation. However, we need to follow the guidelines below to make sure that the content is rendered correctly when Markdown is converted to HTML.

Note that the HTML rendering does not **seem** to fully comply with the official Markdown guidelines. It is therefore less useful to rely on a Markdown linting tool, such as `markdownlint` by David Anson.

## Guidelines

There **has to** be a newline before and after a list of bullets. Here we start a number of complicated structures.

- Always use **4 spaces** rather than a `tab` when creating a lower level bullet point.
    - This is the test level 2.
        - This is the test level 3.
            - This is the test level 4.
- If the content of a bullet point has multiple paragraphs, then it has to be organized as follows.

    There has to be **4 spaces** before the subordinate paragraphs. For example, this paragraph is the second paragraph.

    This is the third paragraph of the bullet.

    - This is the second level bullet.

        This is a paragraph subordinate to the second level bullet.

Here we come back to the normal paragraph. For the best presentation of images and formulas:

- They should not be subordinate to any bullet point; there should be no space before an image or a formula.
- It is better to have a newline before and after an image or a formula.

![](https://raw.githubusercontent.com/guanqun-yang/remote-images/master/2023/08/upgit_20230827_1693175262.png)

a^2 + b^2 = c^2

Here we come back to the normal paragraph again.
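
As an aside, the underscore workaround mentioned in the guidelines could be applied automatically before a post is pushed for publishing. The following is a minimal sketch, assuming inline formulas are delimited by single $ signs (adjust the pattern to whatever delimiters your pipeline actually uses); it is not part of the original toolchain:

```python
import re

# A minimal sketch (not part of the original toolchain): inside every $...$ math
# span, replace "_" with " _ " so that Markdown does not treat it as emphasis.
# Assumption: inline formulas are delimited by single dollar signs.
def protect_subscripts(markdown_text: str) -> str:
    def fix_span(match: re.Match) -> str:
        return "$" + match.group(1).replace("_", " _ ") + "$"
    return re.sub(r"\$(.+?)\$", fix_span, markdown_text)

# Example: "$a_1 + b_2$" becomes "$a _ 1 + b _ 2$".
print(protect_subscripts("The formula $a_1 + b_2$ renders correctly."))
```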

Summary | Reinforcement Learning Basics

Overview

The goal of reinforcement learning is to learn a policy \pi(a\vert s) = \Pr\left[ A=a\vert S=s\right] so that the agent knows what to do at each state s to maximize the expected cumulative reward it could receive over the next N steps (these N steps are called an “episode”):

Q_\pi(s_t, a_t)
= \mathbb{E}_\pi \left[ U_t \vert S_t=s_t, A_t=a_t\right]
= \mathbb{E}_\pi \left[ \sum_{k=0}^{N-t} \gamma^k R_{t+k} \vert S_t=s_t, A_t=a_t\right]

Under the hood, there is an unknown state transition function p(s'\vert s, a) = \Pr\left[ S'=s'\vert S=s, A=a\right] that is governed by the environment. What we could observe is a list of triplets (s_1, a_1, r_1) \rightarrow (s_2, a_2, r_2) \rightarrow \cdots (aka. “trajectory”); we could collect the trajectory data to train the agent to find the best policy for it to execute.
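
As a toy illustration of these quantities, the following sketch (with made-up states, actions, and rewards) stores a trajectory as (s, a, r) triplets and computes the discounted return U_t = \sum_{k=0}^{N-t} \gamma^k R_{t+k} for a given step t:

```python
# A toy sketch with made-up states, actions, and rewards: a trajectory is a list
# of (s, a, r) triplets, and the discounted return from step t onward is
# U_t = sum_{k=0}^{N-t} gamma^k * r_{t+k}.
gamma = 0.9
trajectory = [("s1", "a1", 1.0), ("s2", "a2", 0.0), ("s3", "a3", 2.0)]

def discounted_return(trajectory, t, gamma):
    # Rewards from step t onward (t is 1-indexed to match the notation above).
    rewards = [r for (_, _, r) in trajectory[t - 1:]]
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return(trajectory, t=1, gamma=gamma))  # 1.0 + 0.9 * 0.0 + 0.9**2 * 2.0 = 2.62
```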

Since the value function is V(s) = \sum_a \pi(a\vert s) Q_\pi(s, a), there are three ways to maximize V(s):

| | Goal | Model | Category |
| --- | --- | --- | --- |
| Value-based | Learning the optimal action-value function Q^*(s, a) | DQN + TD | Off-policy |
| Policy-based | Learning the optimal policy function \pi^*(a\vert s) | REINFORCE (or the policy-gradient algorithm) | On-policy |
| Hybrid | Learning both the policy and the value | Actor-Critic | |

Here is the distinction between on-policy and off-policy algorithms (answer):

  • On-policy: The policy we are using to control the agent is the same as the policy we are trying to improve.
  • Off-policy: The policy we are using to control the agent is different from the policy we are trying to learn. For example, we could use a greedy policy or a random policy to control the agent in the hope of learning a better policy than either of them.

Value-Based

DQN + TD

The DQN tries to use a model (for example, a neural network) to approximate an imaginary oracle Q^*(s_t, a_t) = \max_\pi Q_\pi(s_t, a_t) (aka. the optimal action-value function); the optimal action-value function tells us which action is the best in an “averaged” sense.

The DQN could be trained with the Temporal Difference (TD) algorithm. The idea of the TD algorithm is that the same quantity could be computed in two ways:

  1. Pure estimation: q_t = Q(s_t, a_t; \mathbf{w}_t)

  2. A partial observation plus estimation (aka. the TD target): y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; \mathbf{w}_{t}).

Then we try to update \mathbf{w}_t by minimizing the difference between q_t and y_t so that in future iterations the model will perform better:

\mathbf{w}_{t+1} = \mathbf{w}_ t - \alpha \cdot (q_ t - y_ t) \cdot \frac{\partial Q(s_ t, a_ t;\mathbf{w})}{\partial \mathbf{w}} \rvert_{\mathbf{w} = \mathbf{w}_t}

The following are the steps of the DQN + TD algorithm; a minimal code sketch follows the list:

  • Step 1: Observe state s_t and perform an action a_t.
  • Step 2: In this step, we obtain three things:
    • Two estimations for the same target: q_t = Q(s_t, a_t;\mathbf{w}_ t) and y_t=r_t + \gamma \cdot \max_a Q(s_{t+1}, a;\mathbf{w}_t)
    • The gradient of the value network: \mathbf{d}_ t = \frac{\partial Q(s_ t, a_ t; \mathbf{w})}{\partial \mathbf{w}}\rvert_{\mathbf{w} =\mathbf{w}_ t}.
  • Step 3: Update the value network: \mathbf{w}_ {t+1} = \mathbf{w}_ t - \alpha \cdot (q_t - y_t)\cdot \mathbf{d}_ t. This update happens every time we observe a reward.
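
The following is a minimal sketch of one such update on a single transition (s_t, a_t, r_t, s_{t+1}). The tiny multi-layer perceptron, the learning rate, and the single-transition interface are illustrative assumptions rather than part of the algorithm description above:

```python
import torch
import torch.nn as nn

# Illustrative sizes and hyperparameters (not prescribed by the algorithm).
state_dim, n_actions, gamma, alpha = 4, 2, 0.99, 1e-3
q_net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.SGD(q_net.parameters(), lr=alpha)

def td_update(s_t, a_t, r_t, s_next):
    # q_t = Q(s_t, a_t; w_t): the pure estimation.
    q_t = q_net(torch.as_tensor(s_t, dtype=torch.float32))[a_t]
    # y_t = r_t + gamma * max_a Q(s_{t+1}, a; w_t): the TD target, treated as a constant.
    with torch.no_grad():
        y_t = r_t + gamma * q_net(torch.as_tensor(s_next, dtype=torch.float32)).max()
    # Minimizing 0.5 * (q_t - y_t)^2 with one SGD step reproduces
    # w_{t+1} = w_t - alpha * (q_t - y_t) * dQ(s_t, a_t; w)/dw.
    loss = 0.5 * (q_t - y_t) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```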

Policy-Based

Overview

  • If both the actions and states are discrete and there is a limited number of them, we could try to estimate the entries of a matrix of shape \vert S\vert \times \vert A\vert.
  • If the state space is continuous but the action space is still discrete, we need to learn a distribution (typically parameterized by a neural network) \pi(a\vert s; \theta) over the set of actions given each state s.

Policy-Gradient Algorithm

The goal is to maximize V(s;\theta) = \mathbb{E}_ A \left[Q_\pi(s, A)\right] = \sum_a \pi(a\vert s;\theta) \cdot Q_\pi(s, a) by learning a policy network \pi(a\vert s; \theta) and performing gradient ascent on \frac{\partial V(s;\theta)}{\partial \theta}. After some derivation, this gradient is equal to:

\frac{\partial V(s;\theta)}{\partial \theta}=\mathbb{E}_ A\left[Q_\pi(s,A) \cdot \frac{\partial \log \pi(A\vert s;\theta)}{\partial \theta}\right]

Here are the steps of the policy-gradient algorithm; a minimal code sketch follows the list:

  • Step 1: Observe state s_t and randomly sample an action a_t based on \pi(a\vert s_t;\theta_t).
  • Step 2: Evaluate the first part and the second part of the policy gradient:
    • $q_t \approx Q_\pi(s_t, a_t)$.
    • $\mathbf{d}_ {\theta, t} = \frac{\partial \log \pi(a_t\vert s_t;\theta)}{\partial \theta}\rvert_{\theta=\theta_t}$
  • Step 3: Update the policy network: \theta_{t+1} = \theta_t + \beta \cdot q_t \cdot \mathbf{d}_{\theta, t}
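
The following is a minimal sketch of these three steps for a single state s_t. The network size, the learning rate, and the externally supplied q_t (some estimate of Q_\pi(s_t, a_t), e.g. an observed return or a critic's output) are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Illustrative sizes and hyperparameters (not prescribed by the algorithm).
state_dim, n_actions, beta = 4, 2, 1e-3
policy_net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.SGD(policy_net.parameters(), lr=beta)

def policy_gradient_step(s_t, q_t):
    probs = torch.softmax(policy_net(torch.as_tensor(s_t, dtype=torch.float32)), dim=-1)
    dist = Categorical(probs)
    a_t = dist.sample()  # Step 1: sample a_t from pi(a | s_t; theta_t).
    # Steps 2-3: ascend q_t * d log pi(a_t | s_t; theta) / d theta by taking
    # an SGD step of size beta on the negated objective.
    loss = -q_t * dist.log_prob(a_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return a_t.item()
```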

REINFORCE

The REINFORCE algorithm is one instance of the policy-gradient algorithm. It approximates Q_\pi(s_t, a_t) with the observed discounted return u_t = \sum_{k=0}^{N-t} \gamma^k r_{t+k}, which is why it requires collecting the entire trajectory before training the policy network \pi(a\vert s;\theta).
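
Under the same illustrative setup as the sketch above, REINFORCE could look as follows: after a whole episode is collected, the observed discounted return u_t plays the role of q_t \approx Q_\pi(s_t, a_t) in the policy-gradient update (the episode format follows the (s, a, r) trajectory notation earlier in this note):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Illustrative sizes and hyperparameters (not prescribed by the algorithm).
state_dim, n_actions, gamma, beta = 4, 2, 0.99, 1e-3
policy_net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.SGD(policy_net.parameters(), lr=beta)

def reinforce_update(episode):
    """episode: list of (state, action, reward) triplets collected with the current policy."""
    # Compute u_t for every step by accumulating the rewards backwards.
    returns, u = [], 0.0
    for (_, _, r) in reversed(episode):
        u = r + gamma * u
        returns.insert(0, u)
    # One gradient step per time step: ascend u_t * d log pi(a_t | s_t; theta) / d theta.
    for (s_t, a_t, _), u_t in zip(episode, returns):
        probs = torch.softmax(policy_net(torch.as_tensor(s_t, dtype=torch.float32)), dim=-1)
        loss = -u_t * Categorical(probs).log_prob(torch.as_tensor(a_t))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```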

Glossary

| Name | Formula | Usage |
| --- | --- | --- |
| Action-Value Function | $Q_\pi(s, a)$ | Action selection given a policy: how good it is if an agent picks action a in state s. |
| State-Value Function | $V_\pi(s) = \mathbb{E}_ A\left[ Q_\pi(s, A) \right] = \sum_a \pi(a\vert s) \cdot Q_\pi(s, a)$ | State evaluation given a policy: how good the current situation s is for a policy \pi. |
| | $\mathbb{E}_ S\left[ V_ \pi(S) \right] = \mathbb{E}_ {S, A}\left[ Q_\pi(S, A) \right]$ | Evaluation of a policy \pi. |