<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>SelianBlog</title>
<link>https://selianu.github.io/SelianBlog/posts.html</link>
<atom:link href="https://selianu.github.io/SelianBlog/posts.xml" rel="self" type="application/rss+xml"/>
<description>A blog built with Quarto</description>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Tue, 14 Apr 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Reinforcement Learning 1</title>
  <dc:creator>Selian</dc:creator>
  <link>https://selianu.github.io/SelianBlog/posts/Reinforcement Learning 1/</link>
  <description><![CDATA[ 




<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p><strong>RL Applications</strong></p>
<ul>
<li>Robotics &amp; Autonomous Driving</li>
<li>Game AI</li>
<li>Finance &amp; Trading</li>
<li>Recommendation Systems</li>
<li>Optimization Systems</li>
</ul>
<section id="ml-algorithms" class="level3">
<h3 class="anchored" data-anchor-id="ml-algorithms">ML Algorithms</h3>
<p>State(<img src="https://latex.codecogs.com/png.latex?S_t">), Action(<img src="https://latex.codecogs.com/png.latex?A_t">), Environment, Reward(<img src="https://latex.codecogs.com/png.latex?R_%7Bt+1%7D">)</p>
<p><img src="https://selianu.github.io/SelianBlog/posts/Reinforcement Learning 1/RL.png" class="img-fluid"></p>
</section>
</section>
<section id="markov-decision-process" class="level2">
<h2 class="anchored" data-anchor-id="markov-decision-process">Markov Decision Process</h2>
<section id="grid-world" class="level3">
<h3 class="anchored" data-anchor-id="grid-world">Grid World</h3>
<ul>
<li><strong>Deterministic</strong> grid world <code>vs</code> <strong>Stochastic</strong> grid world</li>
</ul>
</section>
<section id="markov-property" class="level3">
<h3 class="anchored" data-anchor-id="markov-property">Markov Property</h3>
<p>Stochastic Process</p>
<ul>
<li>Discrete-time random process: <img src="https://latex.codecogs.com/png.latex?S_0,%20S_1,%20%5Ccdots,%20S_t,%20%5Ccdots"></li>
<li>Continuous-time random process: <img src="https://latex.codecogs.com/png.latex?%5C%7BS_t%7Ct%5Cge0%5C%7D"></li>
<li>Markov Property: <img src="https://latex.codecogs.com/png.latex?P(S_%7Bt+1%7D=s'%7CS_t=s)=P(S_%7Bt+1%7D=s'%7CS_0=s_0,S_1=s_1,%5Ccdots,S_t=s_t)">
<ul>
<li><img src="https://latex.codecogs.com/png.latex?S_%7Bt+1%7D=s'"> does not depend on past states.</li>
</ul></li>
</ul>
<p>Markov Process <img src="https://latex.codecogs.com/png.latex?(S,P)"></p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?S">: A finite set of states.</li>
<li><img src="https://latex.codecogs.com/png.latex?P">: A state transition probability matrix <img src="https://latex.codecogs.com/png.latex?%5BP_%7Bij%7D%5D">, <img src="https://latex.codecogs.com/png.latex?P_%7Bij%7D=P(s_j%7Cs_i)%20=%20P(S_%7Bt+1%7D=s_j%7CS_t=s_i)">
<ul>
<li>Each row sums to 1</li>
</ul></li>
</ul>
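<p>As a small numerical sketch of a Markov process (the 3-state transition matrix below is made up for illustration), each row of the transition matrix is a probability distribution, and a state distribution evolves by one matrix multiplication:</p>

```python
import numpy as np

# A tiny Markov process over 3 states with a hypothetical transition
# matrix P, where P[i, j] = P(S_{t+1} = s_j | S_t = s_i).
P = np.array([
    [0.7, 0.2, 0.1],
    [0.0, 0.5, 0.5],
    [0.3, 0.3, 0.4],
])

# Each row is a probability distribution, so it must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# One-step evolution of a state distribution, starting in state 0.
mu0 = np.array([1.0, 0.0, 0.0])
mu1 = mu0 @ P
```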
</section>
<section id="markov-decision-process-1" class="level3">
<h3 class="anchored" data-anchor-id="markov-decision-process-1">Markov Decision Process</h3>
<ul>
<li>MDP: <img src="https://latex.codecogs.com/png.latex?(S,%20A,%20P,%20R,%20%5Cgamma)">
<ul>
<li><img src="https://latex.codecogs.com/png.latex?S">: State space</li>
<li><img src="https://latex.codecogs.com/png.latex?A">: Action space</li>
<li><img src="https://latex.codecogs.com/png.latex?P">: State transition probability, <img src="https://latex.codecogs.com/png.latex?P%5E%7Ba%7D_%7Bss'%7D%20=%20p(s'%7Cs,a)%20=%20P(S_%7Bt+1%7D=s'%7CS_t=s,A_t=a)"></li>
<li><img src="https://latex.codecogs.com/png.latex?R">: Reward function <img src="https://latex.codecogs.com/png.latex?R_%7Bss'%7D%5Ea,R_s,R%5Ea_s"></li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cgamma%5Cin%5B0,1%5D">: Discount factor</li>
</ul></li>
<li>Model-based: known MDP</li>
<li>Model-free: unknown MDP</li>
</ul>
<p>In an MDP, the state space <img src="https://latex.codecogs.com/png.latex?S"> and action space <img src="https://latex.codecogs.com/png.latex?A"> may also be continuous.</p>
</section>
<section id="reward-policy" class="level3">
<h3 class="anchored" data-anchor-id="reward-policy">Reward &amp; Policy</h3>
<section id="reward" class="level4">
<h4 class="anchored" data-anchor-id="reward">Reward</h4>
<ul>
<li>Reward <img src="https://latex.codecogs.com/png.latex?R_t">: scalar feedback</li>
<li>Agent’s goal: maximize the cumulative sum of rewards</li>
</ul>
<blockquote class="blockquote">
<p>All goals can be described by the maximization of the expected value of the cumulative sum of rewards. - Reward Hypothesis</p>
</blockquote>
<ul>
<li>State Transition Probability
<ul>
<li><img src="https://latex.codecogs.com/png.latex?P_%7Bss'%7D%5Ea%20=%20p(s'%7Cs,a)%20=%20P(S_%7Bt+1%7D=s'%7CS_t=s,A_t=a)%20=%20%5Csum%5Climits_%7Br%5Cin%20R%7Dp(s',r%7Cs,a)"></li>
</ul></li>
<li>Expected Reward for State-Action Pair
<ul>
<li><img src="https://latex.codecogs.com/png.latex?R_s%5Ea%20=%20r(s,a)%20=%20E%5BR_%7Bt+1%7D%7CS_t=s,A_t=a%5D%20=%20%5Csum%5Climits_%7Br%5Cin%20R%7Dr%5Csum%5Climits_%7Bs'%5Cin%20S%7Dp(s',r%7Cs,a)"></li>
</ul></li>
<li>Expected Reward for State-Action-Next State Triple
<ul>
<li><img src="https://latex.codecogs.com/png.latex?R_%7Bss'%7D%5Ea%20=%20r(s,a,s')%20=%20E%5BR_%7Bt+1%7D%7CS_t=s,A_t=a,S_%7Bt+1%7D=s'%5D%20=%20%5Cfrac%7B%5Csum%5Climits_%7Br%5Cin%20R%7Drp(r%7Cs,a,s')%7D%7Bp(s'%7Cs,a)%7D"></li>
</ul></li>
</ul>
</section>
<section id="return" class="level4">
<h4 class="anchored" data-anchor-id="return">Return</h4>
<ul>
<li>Return <img src="https://latex.codecogs.com/png.latex?G_t">: total discounted reward</li>
<li><img src="https://latex.codecogs.com/png.latex?G_t%20=%20R_%7Bt+1%7D%20+%20%5Cgamma%20R_%7Bt+2%7D%20+%20%5Cgamma%5E2%20R_%7Bt+3%7D%20+%20%5Ccdots%20=%20%5Csum%5Climits_%7Bk=0%7D%5E%7B%5Cinfty%7D%5Cgamma%5Ek%20R_%7Bt+k+1%7D"></li>
</ul>
<p>Why MDPs use discounting: it is mathematically convenient, and it accounts for uncertainty about the future.</p>
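<p>As a small sketch, the return can be computed by folding the recursion G_t = R_(t+1) + γ·G_(t+1) backwards over a reward sequence (the rewards below are made-up numbers):</p>

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1} for a hypothetical
# reward sequence, computed via the recursion G_t = R_{t+1} + gamma * G_{t+1}.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):  # fold from the end of the episode
        g = r + gamma * g
    return g

print(discounted_return([1, 1, 1], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```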
</section>
<section id="policy" class="level4">
<h4 class="anchored" data-anchor-id="policy">Policy</h4>
<ul>
<li>A stochastic policy <img src="https://latex.codecogs.com/png.latex?%5Cpi(a%7Cs)%20=%20P(A_t=a%7CS_t=s)"></li>
<li>A deterministic policy <img src="https://latex.codecogs.com/png.latex?%5Cpi(s)%20=%20a"></li>
</ul>
<!-- -->
<ul>
<li>Under a known MDP, <img src="https://latex.codecogs.com/png.latex?%5Cpi_*(s)"> exists.</li>
<li>Under an unknown MDP, an <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-greedy policy is needed
<ul>
<li><img src="https://latex.codecogs.com/png.latex?1-%5Cepsilon">: choose the greedy action (best current estimate)</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cepsilon">: choose an action at random</li>
</ul></li>
</ul>
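<p>A minimal sketch of the ε-greedy rule just described; <code>q_values</code> is a hypothetical list of action-value estimates for one state:</p>

```python
import random

# Epsilon-greedy action selection: with probability 1 - epsilon exploit
# the greedy action, otherwise explore uniformly at random.
def epsilon_greedy(q_values, epsilon, rng=random):
    if rng.random() < epsilon:                 # explore
        return rng.randrange(len(q_values))
    return max(range(len(q_values)),           # exploit
               key=lambda a: q_values[a])
```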
</section>
<section id="summary-of-notations" class="level4">
<h4 class="anchored" data-anchor-id="summary-of-notations">Summary of Notations</h4>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi">: Policy</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi(a%7Cs)">: Stochastic policy</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi(s)">: Deterministic policy</li>
</ul>
<!-- -->
<ul>
<li><img src="https://latex.codecogs.com/png.latex?v_%7B%5Cpi%7D(s)">: State-value function</li>
<li><img src="https://latex.codecogs.com/png.latex?v_*(s)">: Optimal state-value function</li>
</ul>
<!-- -->
<ul>
<li><img src="https://latex.codecogs.com/png.latex?q_%7B%5Cpi%7D(s,a)">: Action-value function</li>
<li><img src="https://latex.codecogs.com/png.latex?q_*(s,a)">: Optimal action-value function</li>
</ul>
</section>
</section>
</section>
<section id="bellman-equation" class="level2">
<h2 class="anchored" data-anchor-id="bellman-equation">Bellman Equation</h2>
<section id="bellman-equation-1" class="level3">
<h3 class="anchored" data-anchor-id="bellman-equation-1">Bellman Equation</h3>
<section id="value-functions" class="level4">
<h4 class="anchored" data-anchor-id="value-functions">Value Functions</h4>
<p>Goodness of each state <img src="https://latex.codecogs.com/png.latex?s"> (or pair <img src="https://latex.codecogs.com/png.latex?(s,a)">) when following policy <img src="https://latex.codecogs.com/png.latex?%5Cpi">, in terms of the expectation of <img src="https://latex.codecogs.com/png.latex?G_t">.</p>
<ul>
<li>State-Value function
<ul>
<li><img src="https://latex.codecogs.com/png.latex?v_%5Cpi(s)%20=%20E_%5Cpi%5BG_t%7CS_t=s%5D%20=%20%5Csum%5Climits_a%20%5Cpi(a%7Cs)q_%5Cpi(s,a)"></li>
</ul></li>
<li>Action-Value function
<ul>
<li><img src="https://latex.codecogs.com/png.latex?q_%5Cpi(s,a)%20=%20E_%5Cpi%5BG_t%7CS_t=s,A_t=a%5D"></li>
</ul></li>
</ul>
<!-- -->
<ul>
<li><img src="https://latex.codecogs.com/png.latex?A_%5Cpi(s,a)%20=%20q_%5Cpi(s,a)%20-%20v_%5Cpi(s)"></li>
</ul>
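<p>The identity v_π(s) = Σ_a π(a|s) q_π(s,a) and the advantage A_π can be checked numerically; the policy and action values below are made-up numbers for a single state:</p>

```python
import numpy as np

# Hypothetical action values and stochastic policy at one state s.
q = np.array([1.0, 3.0, 2.0])    # q_pi(s, a) for three actions
pi = np.array([0.2, 0.5, 0.3])   # pi(a | s); sums to 1

v = float(pi @ q)                # v_pi(s) = sum_a pi(a|s) q_pi(s, a)
advantage = q - v                # A_pi(s, a) = q_pi(s, a) - v_pi(s)
```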
</section>
<section id="bellman-equation-2" class="level4">
<h4 class="anchored" data-anchor-id="bellman-equation-2">Bellman Equation</h4>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0Av_%5Cpi(s)%20&amp;=%20E_%5Cpi%5BG_t%7CS_t=s%5D%20%5C%5C%0A&amp;=%20%5Csum%5Climits_a%20E_%5Cpi%5BG_t%7CS_t=s,%20A_t=a%5D%20%5Ccdot%20P(A_t=a%7CS_t=s)%20%5C%5C%0A&amp;=%20%5Csum%5Climits_a%20%5Cpi(a%7Cs)%20E_%5Cpi%5BR_%7Bt+1%7D%20+%20%5Cgamma%20G_%7Bt+1%7D%7CS_t=s,%20A_t=a%5D%20%5C%5C%0A&amp;=%20%5Csum%5Climits_a%20%5Cpi(a%7Cs)%20%5Csum%5Climits_%7Bs',r%7D%20E_%5Cpi%5BR_%7Bt+1%7D%20+%20%5Cgamma%20G_%7Bt+1%7D%7CS_t=s,%20A_t=a,%20R_%7Bt+1%7D=r,%20S_%7Bt+1%7D=s'%5D%20%5C%5C%20&amp;%5Cqquad%5Cqquad%5Cqquad%5Ccdot%20P(S_%7Bt+1%7D=s',%20R_%7Bt+1%7D%20=%20r%7CS_t=s,%20A_t=a)%20%5C%5C%0A&amp;=%20%5Csum%5Climits_a%20%5Cpi(a%7Cs)%20%5Csum%5Climits_%7Bs',r%7D%20p(s',r%7Cs,a)%20%5Cleft%5Br%20+%20%5Cgamma%20E_%5Cpi%5BG_%7Bt+1%7D%7CS_%7Bt+1%7D=s'%5D%5Cright%5D%20%5C%5C%0A&amp;=%20%5Csum%5Climits_a%20%5Cpi(a%7Cs)%20%5Csum%5Climits_%7Bs',r%7D%5Cleft%5Brp(s',r%7Cs,a)%20+%20%5Cgamma%20E_%5Cpi%5BG_%7Bt+1%7D%7CS_%7Bt+1%7D=s'%5D%5Ccdot%20p(s',r%7Cs,a)%5Cright%5D%20%5C%5C%0A&amp;=%20%5Csum%5Climits_a%20%5Cpi(a%7Cs)%20%5Cleft%5B%20R%5Ea_s%20+%20%5Cgamma%5Csum%5Climits_%7Bs',r%7Dp(s',r%7Cs,a)v_%5Cpi(s')%20%5Cright%5D%20%5C%5C%0A&amp;=%20%5Csum%5Climits_a%20%5Cpi(a%7Cs)%20%5Cleft%5BR%5Ea_s+%5Cgamma%5Csum%5Climits_%7Bs'%7DP%5Ea_%7Bss'%7Dv_%5Cpi(s')%5Cright%5D%20%5C%5C%0A&amp;=%20%5Csum%5Climits_a%20%5Cpi(a%7Cs)%20E%5BR_%7Bt+1%7D%20+%20%5Cgamma%20v_%5Cpi(S_%7Bt+1%7D)%7CS_t=s,A_t=a%5D%20%5C%5C%0A&amp;=%20E%5BR_%7Bt+1%7D%20+%20%5Cgamma%20v_%5Cpi(S_%7Bt+1%7D)%7CS_t=s%5D%0A%5Cend%7Baligned%7D%0A"></p>
<p><br>
<!-- --></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0Aq_%5Cpi(s,a)%20&amp;=%20E_%5Cpi%5BG_t%7CS_t=s,%20A_t=a%5D%20%5C%5C%0A&amp;=%20%5Csum%5Climits_%7Bs',r%7D%20p(s',r%7Cs,a)%20%5Cleft%5Br%20+%20%5Cgamma%20E_%5Cpi%20%5BG_%7Bt+1%7D%20%7C%20S_%7Bt+1%7D=s'%5D%5Cright%5D%20%5C%5C%0A&amp;=%20%5Csum%5Climits_%7Bs',r%7D%20p(s',r%7Cs,a)%20%5Cleft%5Br+%5Cgamma%5Csum_%7Ba'%7D%5Cpi(a'%7Cs')q_%5Cpi(s',a')%5Cright%5D%20%5C%5C%0A&amp;=%20E_%5Cpi%5Cleft%5BR_%7Bt+1%7D%20+%20%5Cgamma%20v_%5Cpi(S_%7Bt+1%7D)%7CS_t=s,A_t=a%5Cright%5D%20%5C%5C%0A&amp;=%20E%5Cleft%5BR_%7Bt+1%7D%20+%20%5Cgamma%20E_%5Cpi%5Cleft%5Bq_%5Cpi(S_%7Bt+1%7D,%20A_%7Bt+1%7D)%7CS_%7Bt+1%7D=s'%5Cright%5D%7CS_t=s,%20A_t=a%5Cright%5D%20%5C%5C%0A&amp;=%20E%5Cleft%5BR_%7Bt+1%7D%20+%20%5Cgamma%20q_%5Cpi(S_%7Bt+1%7D,%20A_%7Bt+1%7D)%7CS_t=s,%20A_t=a%5Cright%5D%0A%5Cend%7Baligned%7D%0A"></p>
</section>
</section>
<section id="optimal-policy" class="level3">
<h3 class="anchored" data-anchor-id="optimal-policy">Optimal Policy</h3>
<section id="optimal-value-functions-and-policy" class="level4">
<h4 class="anchored" data-anchor-id="optimal-value-functions-and-policy">Optimal Value Functions and Policy</h4>
<ul>
<li>Optimal state-value function: <img src="https://latex.codecogs.com/png.latex?v_*(s)%20=%20%5Cmax%5Climits_%5Cpi%20v_%5Cpi(s)"></li>
<li>Optimal action-value function: <img src="https://latex.codecogs.com/png.latex?q_*(s,a)%20=%20%5Cmax%5Climits_%5Cpi%20q_%5Cpi(s,a)"></li>
</ul>
<div class="callout callout-style-default callout-important callout-titled" title="Theorem">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Important</span>Theorem
</div>
</div>
<div class="callout-body-container callout-body">
<p><img src="https://latex.codecogs.com/png.latex?%5Cpi'%20%5Cge%20%5Cpi"> means <img src="https://latex.codecogs.com/png.latex?v_%7B%5Cpi'%7D(s)%20%5Cge%20v_%5Cpi(s)"> for all <img src="https://latex.codecogs.com/png.latex?s">.</p>
<p>An optimal policy satisfies <img src="https://latex.codecogs.com/png.latex?%5Cpi_*%20%5Cge%20%5Cpi"> for all <img src="https://latex.codecogs.com/png.latex?%5Cpi">.</p>
<p><img src="https://latex.codecogs.com/png.latex?v_%7B%5Cpi_*%7D(s)%20=%20v_*(s)">, <img src="https://latex.codecogs.com/png.latex?q_%7B%5Cpi_*%7D(s,a)%20=%20q_*(s,a)"></p>
</div>
</div>
</section>
<section id="finding-an-optimal-policy" class="level4">
<h4 class="anchored" data-anchor-id="finding-an-optimal-policy">Finding an Optimal Policy</h4>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cpi_*(a%7Cs)%20=%20%5Cbegin%7Bcases%7D%201%20&amp;%20%5Ctext%7Bif%20%7D%20a%20=%20%5Carg%5Cmax%5Climits_a%20q_*(s,a)%20%5C%5C%200%20&amp;%20%5Ctext%7Botherwise%7D%20%5Cend%7Bcases%7D%0A"></p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?v_*(s)%20=%20%5Cmax%5Climits_a%20q_*(s,a)%5Cquad%20%5Cbecause%20v_%5Cpi(s)%20=%20%5Csum%5Climits_a%5Cpi(a%7Cs)q_%5Cpi(s,a)"></li>
<li><img src="https://latex.codecogs.com/png.latex?q_*(s,a)%20=%20r(s,a)%20+%20%5Cgamma%5Csum%5Climits_%7Bs'%7Dp(s'%7Cs,a)v_*(s')=R_s%5Ea%20+%20%5Cgamma%5Csum%5Climits_%7Bs'%7DP%5Ea_%7Bss'%7Dv_*(s')"></li>
</ul>
<!-- -->
<ul>
<li><img src="https://latex.codecogs.com/png.latex?v_*(s)"> can be obtained directly from <img src="https://latex.codecogs.com/png.latex?q_*(s,a)"></li>
<li><img src="https://latex.codecogs.com/png.latex?q_*(s,a)"> cannot be obtained directly from <img src="https://latex.codecogs.com/png.latex?v_*(s)">
<ul>
<li>Instead, we need to know the transition probability <img src="https://latex.codecogs.com/png.latex?p(s'%7Cs,a)"> under model-based settings.</li>
</ul></li>
</ul>
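<p>A small sketch of the two directions above: given a (hypothetical) optimal action-value table, v_* and a greedy optimal policy follow directly, whereas recovering q_* from v_* would additionally require the transition model p(s'|s,a):</p>

```python
import numpy as np

# Hypothetical optimal action-value table: 2 states x 3 actions.
q_star = np.array([
    [0.0, 1.0, 0.5],
    [2.0, 1.5, 1.9],
])

v_star = q_star.max(axis=1)      # v_*(s) = max_a q_*(s, a)
pi_star = q_star.argmax(axis=1)  # deterministic greedy optimal policy
```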


</section>
</section>
</section>

 ]]></description>
  <category>RL</category>
  <guid>https://selianu.github.io/SelianBlog/posts/Reinforcement Learning 1/</guid>
  <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://selianu.github.io/SelianBlog/posts/Reinforcement Learning 1/RL.png" medium="image" type="image/png" height="58" width="144"/>
</item>
<item>
  <title>Reinforcement Learning 2</title>
  <dc:creator>Selian</dc:creator>
  <link>https://selianu.github.io/SelianBlog/posts/Reinforcement Learning 2/</link>
  <description><![CDATA[ 




<section id="dynamic-programming" class="level2">
<h2 class="anchored" data-anchor-id="dynamic-programming">Dynamic Programming</h2>
<p>A method for solving Markov Decision Processes (MDPs)</p>
<ul>
<li>DP assumes
<ul>
<li>Model-based</li>
<li>The Markov property</li>
</ul></li>
</ul>
<!-- -->
<ul>
<li>DP uses the Bellman equation to iteratively update value functions.</li>
</ul>
<!-- -->
<ul>
<li>Goal
<ul>
<li>Compute the optimal value function</li>
<li>Derive the optimal policy</li>
</ul></li>
</ul>
<!-- -->
<p>Two main approaches</p>
<ul>
<li><strong>Value-based approach</strong>
<ul>
<li>Directly update value functions</li>
<li>Leads to <strong>Value Iteration</strong></li>
</ul></li>
<li><strong>Policy-based approach</strong>
<ul>
<li>Evaluate and improve policies</li>
<li>Leads to <strong>Policy Iteration</strong></li>
</ul></li>
</ul>
<section id="value-iteration" class="level3">
<h3 class="anchored" data-anchor-id="value-iteration">Value Iteration</h3>
<p><strong>Bellman Optimality Equation</strong> <img src="https://latex.codecogs.com/png.latex?v_*(s)%20=%20%5Cmax%5Climits_a%5Csum%5Climits_%7Bs',r%7Dp(s',r%7Cs,a)%5Br+%5Cgamma%20v_*(s')%5D"></p>
<section id="value-iteration-procedure" class="level4">
<h4 class="anchored" data-anchor-id="value-iteration-procedure">Value Iteration Procedure</h4>
<p><img src="https://latex.codecogs.com/png.latex?V_%7Bk+1%7D(s)%5Cleftarrow%5Cmax%5Climits_a%5Csum%5Climits_%7Bs',r%7D%20p(s',r%7Cs,a)%5Br+%5Cgamma%20V_k(s')%5D"></p>
<p><strong>Initialize <img src="https://latex.codecogs.com/png.latex?%5Crightarrow"> Update <img src="https://latex.codecogs.com/png.latex?%5Crightarrow"> Compute the Optimal Policy</strong></p>
<p>Disadvantages</p>
<ol type="1">
<li>The policy often converges long before the values do, so late value updates rarely change the resulting policy.</li>
<li>Slow: <img src="https://latex.codecogs.com/png.latex?O(S%5E2A)"> per iteration, and many iterations are needed to converge.</li>
</ol>
<p>Convergence criterion: stop when <img src="https://latex.codecogs.com/png.latex?%5Clvert%20V_%7Bk+1%7D(s)%20-%20V_%7Bk%7D(s)%20%5Crvert%20%3C%20%5Cepsilon"> for all <img src="https://latex.codecogs.com/png.latex?s">.</p>
</section>
<section id="value-iteration-pseudo-code" class="level4">
<h4 class="anchored" data-anchor-id="value-iteration-pseudo-code">Value Iteration Pseudo Code</h4>
<div id="algo-value-iteration" class="pseudocode-container quarto-float" data-line-number-punc=":" data-indent-lines="false" data-pseudocode-number="1" data-no-end="false" data-comment-delimiter="//" data-indent-size="1.2em" data-caption-prefix="Algorithm" data-line-number="true">
<div class="pseudocode">
\begin{algorithm} \caption{Value Iteration for estimating $\pi \approx \pi_*$} \begin{algorithmic} \State \textbf{Hyperparameter:} small threshold $\epsilon &gt; 0$ for the convergence check \State Initialize $V(s)$ arbitrarily for all $s \in \mathcal{S}$, except $V(\text{terminal}) = 0$ \Repeat \State $\Delta \gets 0$ \For{each $s \in \mathcal{S}$} \State $v \gets V(s)$ \State $V(s) \gets \max_{a} \sum_{s', r} p(s', r \mid s, a) [r + \gamma V(s')]$ \State $\Delta \gets \max(\Delta, |v - V(s)|)$ \EndFor \Until{$\Delta &lt; \epsilon$} \State \textbf{Output} a deterministic policy $\pi \approx \pi_*$ such that \State $\pi(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) [r + \gamma V(s')]$ \end{algorithmic} \end{algorithm}
</div>
</div>
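<p>The pseudocode above can be sketched in Python on a toy problem; the two-state transition model <code>P</code> below (triples of probability, next state, reward) is an assumption for illustration, not part of the original notes:</p>

```python
import numpy as np

# Value Iteration on a tiny hypothetical MDP.
# P[s][a] is a list of (prob, next_state, reward) triples, i.e. p(s', r | s, a).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
gamma, epsilon = 0.9, 1e-8
V = np.zeros(2)

while True:
    delta = 0.0
    for s in P:
        v = V[s]
        # Bellman optimality backup: V(s) <- max_a sum p(s',r|s,a)[r + gamma V(s')]
        V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                   for a in P[s])
        delta = max(delta, abs(v - V[s]))
    if delta < epsilon:
        break

# Extract the greedy policy from the converged values.
pi = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                     for p, s2, r in P[s][a]))
      for s in P}
```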
</section>
</section>
<section id="policy-iteration" class="level3">
<h3 class="anchored" data-anchor-id="policy-iteration">Policy Iteration</h3>
<section id="policy-iteration-procedure" class="level4">
<h4 class="anchored" data-anchor-id="policy-iteration-procedure">Policy Iteration Procedure</h4>
<ol type="1">
<li><strong>Policy Evaluation</strong>
<ul>
<li>Compute <img src="https://latex.codecogs.com/png.latex?V%5E%5Cpi"> from the deterministic policy <img src="https://latex.codecogs.com/png.latex?%5Cpi"></li>
<li><img src="https://latex.codecogs.com/png.latex?V_%7Bk+1%7D(s)%5Cleftarrow%5Csum%5Climits_%7Bs',r%7D%20p(s',r%7Cs,%5Cpi(s))%20%5Br+%5Cgamma%20V_k(s')%5D"></li>
</ul>
<ol type="1">
<li>Initialize <img src="https://latex.codecogs.com/png.latex?V_0(s)=0"> for all states <img src="https://latex.codecogs.com/png.latex?s">.</li>
<li>Update <img src="https://latex.codecogs.com/png.latex?V_%7Bk+1%7D(s)"> from all <img src="https://latex.codecogs.com/png.latex?V_k(s')"> (full backup) <img src="https://latex.codecogs.com/png.latex?%5Crightarrow"> until convergence to <img src="https://latex.codecogs.com/png.latex?V%5E%5Cpi(s)"></li>
</ol></li>
<li><strong>Policy Improvement</strong>
<ul>
<li>Improve policy <img src="https://latex.codecogs.com/png.latex?%5Cpi"> to <img src="https://latex.codecogs.com/png.latex?%5Cpi'"> using a greedy policy based on <img src="https://latex.codecogs.com/png.latex?V%5E%5Cpi"></li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi'(s)%20=%20%5Carg%5Cmax%5Climits_a%5Csum%5Climits_%7Bs',r%7D%20p(s',r%7Cs,a)%20%5Br+%5Cgamma%20V%5E%5Cpi(s')%5D%20=%20%5Carg%5Cmax%5Climits_a%20Q%5E%5Cpi(s,a)"></li>
</ul></li>
</ol>
</section>
<section id="value-vs-policy-iteration" class="level4">
<h4 class="anchored" data-anchor-id="value-vs-policy-iteration">Value vs Policy Iteration</h4>
<ul>
<li>In Value Iteration, <img src="https://latex.codecogs.com/png.latex?%5Cpi_*"> is computed at the end using <img src="https://latex.codecogs.com/png.latex?V%5E*">.</li>
<li>In Policy Iteration, improvement is done at every step.</li>
</ul>
<p><strong>Comparison to Value Iteration</strong></p>
<ul>
<li>Fewer iterations are needed to reach the optimal policy.</li>
<li>Faster convergence, because each value update is based on a fixed policy.</li>
</ul>
<!-- -->
<ul>
<li>Since <img src="https://latex.codecogs.com/png.latex?Q%5E%5Cpi(s,%5Cpi'(s))%5Cge%20V%5E%5Cpi(s)%20=%20%5Csum%5Climits_a%5Cpi(a%7Cs)%20Q%5E%5Cpi(s,a)">, always either
<ol type="1">
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi'"> is strictly better than <img src="https://latex.codecogs.com/png.latex?%5Cpi">, or</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi'"> is optimal when <img src="https://latex.codecogs.com/png.latex?%5Cpi=%5Cpi'"></li>
</ol></li>
</ul>
<p><img src="https://latex.codecogs.com/png.latex?%5CRightarrow"> Policy Improvement Theorem</p>
</section>
<section id="policy-improvement" class="level4">
<h4 class="anchored" data-anchor-id="policy-improvement">Policy Improvement</h4>
<div class="callout callout-style-default callout-important callout-titled" title="Policy Improvement Theorem">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Important</span>Policy Improvement Theorem
</div>
</div>
<div class="callout-body-container callout-body">
<p>Let <img src="https://latex.codecogs.com/png.latex?%5Cpi"> and <img src="https://latex.codecogs.com/png.latex?%5Cpi'"> be two policies.</p>
<p>If <img src="https://latex.codecogs.com/png.latex?Q%5E%5Cpi(s,%5Cpi'(s))%5Cge%20V%5E%5Cpi(s)"> for all <img src="https://latex.codecogs.com/png.latex?s%5Cin%20S">, <img src="https://latex.codecogs.com/png.latex?V%5E%7B%5Cpi'%7D(s)%5Cge%20V%5E%5Cpi(s)"> for all <img src="https://latex.codecogs.com/png.latex?s%5Cin%20S">.</p>
<p>This implies that <img src="https://latex.codecogs.com/png.latex?%5Cpi'"> is at least as good a policy as <img src="https://latex.codecogs.com/png.latex?%5Cpi">.</p>
</div>
</div>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0Av_%5Cpi(s)%20&amp;%5Cle%20q_%5Cpi(s,%5Cpi'(s))%5C%5C%0A&amp;=%5Cmathbb%7BE%7D%5BR_%7Bt+1%7D+%5Cgamma%20v_%5Cpi(S_%7Bt+1%7D)%7CS_t=s,A_t=%5Cpi'(s)%5D%5C%5C%0A&amp;=%5Cmathbb%7BE%7D_%7B%5Cpi'%7D%5BR_%7Bt+1%7D+%5Cgamma%20v_%5Cpi(S_%7Bt+1%7D)%7CS_t=s%5D%5C%5C%0A&amp;%5Cle%5Cmathbb%7BE%7D_%7B%5Cpi'%7D%5BR_%7Bt+1%7D+%5Cgamma%20q_%5Cpi(S_%7Bt+1%7D,%5Cpi'(S_%7Bt+1%7D))%7CS_t=s%5D%5C%5C%0A&amp;=%5Cmathbb%7BE%7D_%7B%5Cpi'%7D%5BR_%7Bt+1%7D+%5Cgamma%20%5Cmathbb%7BE%7D_%7B%5Cpi'%7D%5BR_%7Bt+2%7D+%5Cgamma%20v_%7B%5Cpi%7D(S_%7Bt+2%7D)%7CS_%7Bt+1%7D,%20A_%7Bt+1%7D=%5Cpi'(S_%7Bt+1%7D)%5D%7CS_t=s%5D%5C%5C%0A&amp;=%5Cmathbb%7BE%7D_%7B%5Cpi'%7D%5BR_%7Bt+1%7D+%5Cgamma%20R_%7Bt+2%7D+%5Cgamma%5E2v_%5Cpi(S_%7Bt+2%7D)%7CS_t=s%5D%5C%5C%0A&amp;%5C%20%5C%20%5Cvdots%5C%5C%0A&amp;%5Cle%5Cmathbb%7BE%7D_%7B%5Cpi'%7D%5BR_%7Bt+1%7D+%5Cgamma%20R_%7Bt+2%7D+%5Cgamma%5E2%20R_%7Bt+3%7D+%5Ccdots%7CS_t=s%5D%5C%5C%0A&amp;=v_%7B%5Cpi'%7D(s)%0A%5Cend%7Baligned%7D%0A"></p>
</section>
<section id="policy-iteration-1" class="level4">
<h4 class="anchored" data-anchor-id="policy-iteration-1">Policy Iteration</h4>
<p><img src="https://latex.codecogs.com/png.latex?%5Cpi_0%20%5Crightarrow%20%5Ctext%7Bpolicy%20evaluation%7D%20%5Crightarrow%20V%5E%7B%5Cpi_0%7D%20%5Crightarrow%20%5Ctext%7Bpolicy%20improvement%7D%20%5Crightarrow%20%5Cpi_1%20%5Crightarrow%20%5Ccdots%20%5Crightarrow%20%5Cpi_*%20%5Crightarrow%20V%5E*"></p>
<ul>
<li>A <strong>finite MDP</strong> has finitely many policies, so this process converges to an <strong>optimal policy</strong> and an optimal value function in <strong>finitely many iterations</strong>.</li>
<li>If <img src="https://latex.codecogs.com/png.latex?%5Cpi'"> is as good as, but not better than, <img src="https://latex.codecogs.com/png.latex?%5Cpi">, then <img src="https://latex.codecogs.com/png.latex?v_%5Cpi=v_%7B%5Cpi'%7D"> and it satisfies the Bellman optimality equation
<ul>
<li><img src="https://latex.codecogs.com/png.latex?v_%7B%5Cpi'%7D(s)=%5Cmax%5Climits_a%5Csum%5Climits_%7Bs',r%7Dp(s',r%7Cs,a)%20%5Br+%5Cgamma%20v_%7B%5Cpi'%7D(s')%5D"></li>
</ul></li>
<li>Thus, <img src="https://latex.codecogs.com/png.latex?%5Cpi'"> is optimal.</li>
</ul>
</section>
<section id="pseudo-code" class="level4">
<h4 class="anchored" data-anchor-id="pseudo-code">Pseudo Code</h4>
<div id="algo-policy-iteration" class="pseudocode-container quarto-float" data-line-number-punc=":" data-indent-lines="false" data-pseudocode-number="2" data-no-end="false" data-comment-delimiter="//" data-indent-size="1.2em" data-caption-prefix="Algorithm" data-line-number="true">
<div class="pseudocode">
\begin{algorithm} \caption{Policy Iteration for estimating $\pi \approx \pi_*$} \begin{algorithmic} \State \textbf{1. Initialization} \State $V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$ \State \State \textbf{2. Policy Evaluation} \Repeat \State $\Delta \gets 0$ \For{each $s \in \mathcal{S}$} \State $v \gets V(s)$ \State $V(s) \gets \sum_{s', r} p(s', r \mid s, \pi(s)) [r + \gamma V(s')]$ \State $\Delta \gets \max(\Delta, |v - V(s)|)$ \EndFor \Until{$\Delta &lt; \epsilon$} \State \State \textbf{3. Policy Improvement} \State $policy-stable \gets \text{true}$ \For{each $s \in \mathcal{S}$} \State $old-action \gets \pi(s)$ \State $\pi(s) \gets \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) [r + \gamma V(s')]$ \If{$old-action \neq \pi(s)$} \State $policy-stable \gets \text{false}$ \EndIf \EndFor \State \If{$policy-stable$} \State \textbf{stop} and return $V \approx v_*$ and $\pi \approx \pi_*$ \Else \State \textbf{go to 2} \EndIf \end{algorithmic} \end{algorithm}
</div>
</div>
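<p>Similarly, the policy-iteration pseudocode can be sketched on a toy MDP; the transition model and rewards below are made up for illustration:</p>

```python
import numpy as np

# Policy Iteration on a tiny hypothetical MDP.
# P[s][a] is a list of (prob, next_state, reward) triples, i.e. p(s', r | s, a).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
gamma, epsilon = 0.9, 1e-8
pi = {s: 0 for s in P}        # arbitrary initial deterministic policy
V = np.zeros(len(P))

while True:
    # Policy Evaluation: iterate V to convergence under the fixed policy pi.
    while True:
        delta = 0.0
        for s in P:
            v = V[s]
            V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][pi[s]])
            delta = max(delta, abs(v - V[s]))
        if delta < epsilon:
            break
    # Policy Improvement: act greedily with respect to V.
    stable = True
    for s in P:
        old = pi[s]
        pi[s] = max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                            for p, s2, r in P[s][a]))
        if pi[s] != old:
            stable = False
    if stable:
        break
```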
</section>
</section>
<section id="summary" class="level3">
<h3 class="anchored" data-anchor-id="summary">Summary</h3>
<p>Finding an optimal policy by solving Bellman optimality equation requires</p>
<ul>
<li><strong>Markov property</strong></li>
<li><strong>Accurate knowledge</strong> of environment dynamics (known MDP)</li>
<li><strong>Enough space and time</strong> to do the computation</li>
</ul>
<p><strong>Dynamic Programming</strong></p>
<ul>
<li>Under <strong>model-based</strong>, each iteration updates <strong>every value function</strong> in the table using <strong>full backup</strong> <img src="https://latex.codecogs.com/png.latex?%5Crightarrow"> effective for <strong>medium-sized</strong> problems</li>
<li>Usually evaluate <img src="https://latex.codecogs.com/png.latex?V(s)"> instead of <img src="https://latex.codecogs.com/png.latex?Q(s,a)"> because <img src="https://latex.codecogs.com/png.latex?%7CS%7C%20%5Cll%20%7CS%5Ctimes%20A%7C"></li>
<li>For large problems, DP suffers from the <strong>curse of dimensionality</strong>:
<ul>
<li>The number of states grows exponentially with the number of state variables</li>
</ul></li>
</ul>
</section>
</section>
<section id="reinforcement-learning" class="level2">
<h2 class="anchored" data-anchor-id="reinforcement-learning">Reinforcement Learning</h2>
<section id="reinforcement-learning-1" class="level3">
<h3 class="anchored" data-anchor-id="reinforcement-learning-1">Reinforcement Learning</h3>
<section id="dp-vs-rl" class="level4">
<h4 class="anchored" data-anchor-id="dp-vs-rl">DP vs RL</h4>
<ul>
<li>Dynamic Programming (DP)
<ul>
<li><p>Planning under <strong>model-based</strong> setting using <strong>full-backup</strong>.</p></li>
<li><p>Each iteration updates <strong>every value</strong> in the table using <strong>full backup</strong>.</p></li>
<li><p>Usually, evaluate <img src="https://latex.codecogs.com/png.latex?V(s)"> rather than <img src="https://latex.codecogs.com/png.latex?Q(s,a)">.</p></li>
<li><p>Greedy policy improvement over <img src="https://latex.codecogs.com/png.latex?V(s)"> <strong>requires known MDP</strong>.</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi_*(s)%20=%20%5Carg%5Cmax%5Climits_a%5Csum_%7Bs',r%7D%20p(s',r%20%7C%20s,a)%20%5Br%20+%20%5Cgamma%20V%5E*(s')%5D"></li>
</ul></li>
</ul></li>
</ul>
<!-- -->
<ul>
<li>Reinforcement Learning (RL)
<ul>
<li><p>Learning under <strong>model-free</strong> setting using <strong>sample backup</strong>, and approximately solving the Bellman optimality equation.</p></li>
<li><p>Monte Carlo (MC) method</p></li>
<li><p>Temporal Difference (TD) learning</p>
<ul>
<li>Sarsa</li>
<li>Q-learning</li>
</ul></li>
<li><p>Each iteration updates some values in the table from <strong>sample backup</strong>.</p></li>
<li><p>We evaluate <img src="https://latex.codecogs.com/png.latex?Q(s,a)"> instead of <img src="https://latex.codecogs.com/png.latex?V(s)">.</p></li>
<li><p>Greedy policy improvement over <img src="https://latex.codecogs.com/png.latex?Q(s,a)"> <strong>works for model-free settings:</strong></p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi'(s)=%5Carg%5Cmax%5Climits_a%20Q%5E%7B%5Cpi%7D(s,a)"></li>
</ul></li>
</ul></li>
</ul>
</section>
<section id="generalized-policy-iteration-gpi" class="level4">
<h4 class="anchored" data-anchor-id="generalized-policy-iteration-gpi">Generalized Policy Iteration (GPI)</h4>
<ul>
<li><strong>Policy Evaluation</strong> makes the value function “consistent with the current policy”</li>
<li><strong>Policy Improvement</strong> makes the policy “greedy w.r.t. the current value function” <img src="https://selianu.github.io/SelianBlog/posts/Reinforcement Learning 2/GPI.png" class="img-fluid" alt="GPI"></li>
</ul>
</section>
</section>
<section id="monte-carlo-methods" class="level3">
<h3 class="anchored" data-anchor-id="monte-carlo-methods">Monte Carlo Methods</h3>
<section id="monte-carlo-mc" class="level4">
<h4 class="anchored" data-anchor-id="monte-carlo-mc">Monte Carlo (MC)</h4>
<ul>
<li><p>Repeated random sampling to compute numerical results</p></li>
<li><p>A tabular, model-free method.</p></li>
<li><p>MC Policy Iteration adapts GPI on an episode-by-episode basis: policy evaluation estimates <img src="https://latex.codecogs.com/png.latex?Q(s,a)%5Capprox%20q_%5Cpi(s,a)">, and policy improvement is <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-greedy.</p></li>
<li><p>MC learns from the entire trajectory of sampled episodes, updating after every single episode using real experience.</p></li>
</ul>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BValue%7D%20=%20%5Ctext%7Baverage%20of%20returns%20%7D%20G_t%20%5Ctext%7B%20of%20sampled%20episodes%7D"></p>
<ul>
<li>MC focuses on a small subset of the states.</li>
<li>Because this method does not rely on successor value estimates, it suffers less when the Markov property is violated.</li>
</ul>
<p><img src="https://latex.codecogs.com/png.latex?%5CRightarrow%5Ctext%7BNo%20bootstrapping%7D"></p>
</section>
<section id="mc-prediction-policy-evaluation" class="level4">
<h4 class="anchored" data-anchor-id="mc-prediction-policy-evaluation">MC Prediction (Policy Evaluation)</h4>
<ul>
<li><p>Goal: learn <img src="https://latex.codecogs.com/png.latex?q_%5Cpi"> from entire episodes of real experience under policy <img src="https://latex.codecogs.com/png.latex?%5Cpi">.</p></li>
<li><p>MC Policy Evaluation uses <strong>empirical mean return</strong> instead of expected return.</p></li>
<li><p>To estimate <img src="https://latex.codecogs.com/png.latex?q_%5Cpi(s,a)"></p>
<ol type="1">
<li>For each time step <img src="https://latex.codecogs.com/png.latex?t"> when state <img src="https://latex.codecogs.com/png.latex?s"> is visited and action <img src="https://latex.codecogs.com/png.latex?a"> is taken:
<ul>
<li>Increment count: <img src="https://latex.codecogs.com/png.latex?n(s,a)%20%5Cgets%20n(s,a)%20+%201"></li>
<li>Increment total return: <img src="https://latex.codecogs.com/png.latex?S(s,a)%20%5Cgets%20S(s,a)%20+%20G_t"></li>
<li>Estimate mean return: <img src="https://latex.codecogs.com/png.latex?Q(s,a)%20=%20S(s,a)%20/%20n(s,a)"></li>
</ul></li>
</ol></li>
<li><p>As <img src="https://latex.codecogs.com/png.latex?n(s,a)%20%5Cto%20%5Cinfty">, <img src="https://latex.codecogs.com/png.latex?Q(s,a)%20%5Cto%20q_%5Cpi(s,a)"></p></li>
</ul>
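<p>The counting scheme above can be sketched directly in Python. This is a minimal sketch, not the post's code; <code>sample_episode</code> is a hypothetical helper that rolls out one episode under the given policy and returns a list of (state, action, reward) triples:</p>

```python
from collections import defaultdict

def mc_prediction_q(sample_episode, policy, num_episodes, gamma=1.0):
    """First-visit MC estimation of Q(s, a) as the empirical mean return."""
    n = defaultdict(int)        # visit counts n(s, a)
    total = defaultdict(float)  # accumulated returns S(s, a)
    q = defaultdict(float)      # Q(s, a) = S(s, a) / n(s, a)
    for _ in range(num_episodes):
        episode = sample_episode(policy)
        # Backward pass computes the return G_t at every step.
        g, returns = 0.0, []
        for (s, a, r) in reversed(episode):
            g = r + gamma * g
            returns.append((s, a, g))
        returns.reverse()
        # First-visit: only the first occurrence of (s, a) counts.
        seen = set()
        for (s, a, g) in returns:
            if (s, a) not in seen:
                seen.add((s, a))
                n[(s, a)] += 1
                total[(s, a)] += g
                q[(s, a)] = total[(s, a)] / n[(s, a)]
    return q
```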
</section>
<section id="incremental-mc-updates" class="level4">
<h4 class="anchored" data-anchor-id="incremental-mc-updates">Incremental MC updates</h4>
<ul>
<li>Incremental Mean
<ul>
<li>Partial mean <img src="https://latex.codecogs.com/png.latex?%5Cmu_k"> of sequence <img src="https://latex.codecogs.com/png.latex?x_1,%20x_2,%20%5Cldots"> is computed incrementally <img src="https://latex.codecogs.com/png.latex?%5Cmu_k%20=%20%5Cfrac%7B1%7D%7Bk%7D%5Csum_%7Bi=1%7D%5Ek%20x_i%20=%20%5Cmu_%7Bk-1%7D%20+%20%5Cfrac%7B1%7D%7Bk%7D(x_k%20-%20%5Cmu_%7Bk-1%7D)"></li>
</ul></li>
<li>Incremental Monte Carlo Updates
<ul>
<li>Increment count: <img src="https://latex.codecogs.com/png.latex?n(S_t,%20A_t)%20%5Cgets%20n(S_t,%20A_t)%20+%201"></li>
<li>Update rule: <img src="https://latex.codecogs.com/png.latex?Q(S_t,%20A_t)%20%5Cgets%20Q(S_t,%20A_t)%20+%20%5Cfrac%7B1%7D%7Bn(S_t,%20A_t)%7D%20%5BG_t%20-%20Q(S_t,%20A_t)%5D"></li>
</ul></li>
</ul>
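<p>As a quick check, the incremental form reproduces the batch mean exactly; the correction term is the same one used in the MC Q-update (a minimal sketch):</p>

```python
def incremental_mean(xs):
    """mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k, updated one sample at a time."""
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += (x - mu) / k  # same shape as the Q(S_t, A_t) update rule
    return mu
```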
</section>
<section id="constant-alpha-mc-policy-evaluation" class="level4">
<h4 class="anchored" data-anchor-id="constant-alpha-mc-policy-evaluation">constant <img src="https://latex.codecogs.com/png.latex?%5Calpha"> MC Policy Evaluation</h4>
<ul>
<li><p>In practice, a <strong>step-size</strong> <img src="https://latex.codecogs.com/png.latex?%5Calpha"> is used <img src="https://latex.codecogs.com/png.latex?Q(S_t,%20A_t)%20%5Cgets%20Q(S_t,%20A_t)%20+%20%5Calpha%20%5BG_t%20-%20Q(S_t,%20A_t)%5D"></p></li>
<li><p>Old episodes are exponentially forgotten due to the <img src="https://latex.codecogs.com/png.latex?(1-%5Calpha)Q(S_t,%20A_t)"> term.</p></li>
</ul>
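<p>The exponential forgetting is easy to verify numerically: rewriting the update as (1 − α)Q + αG shows the initial estimate's weight decays as (1 − α)<sup>k</sup> after k updates (a minimal sketch):</p>

```python
def constant_alpha_update(q, g, alpha):
    """Q = Q + alpha * (G - Q), equivalently (1 - alpha) * Q + alpha * G."""
    return q + alpha * (g - q)

def weight_of_initial_estimate(alpha, k):
    """After k constant-alpha updates, the initial Q keeps weight (1 - alpha)^k."""
    return (1.0 - alpha) ** k
```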
</section>
<section id="exploitation-vs-exploration" class="level4">
<h4 class="anchored" data-anchor-id="exploitation-vs-exploration">Exploitation vs Exploration</h4>
<ul>
<li><strong>Exploitation</strong>: making the best decision by using already known information
<ul>
<li>By selecting the action with the highest Q-value while sampling new episodes, we can refine our policy efficiently within an already promising region of the state-action space.</li>
</ul></li>
<li><strong>Exploration</strong>: searching for new decisions by gathering more information
<ul>
<li>By selecting a random action with probability <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> while sampling new episodes, we can discover a new and possibly more promising region of the state-action space.</li>
</ul></li>
<li>To trade-off Exploitation and Exploration, use <strong><img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-greedy policy</strong>.</li>
</ul>
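<p>The trade-off comes down to a single branch per action selection. A minimal sketch using Python's standard <code>random</code> module:</p>

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With prob. epsilon explore uniformly; otherwise exploit the argmax.

    q_values is a list of Q(s, a) for the m actions available in state s."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                         # explore
    return max(range(len(q_values)), key=q_values.__getitem__)      # exploit
```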
</section>
<section id="mc-control-epsilon-greedy-policy-improvement" class="level4">
<h4 class="anchored" data-anchor-id="mc-control-epsilon-greedy-policy-improvement">MC Control (<img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-Greedy Policy Improvement)</h4>
<ul>
<li><p>Choose the greedy action with probability <img src="https://latex.codecogs.com/png.latex?1-%5Cepsilon"> and a random action with probability <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">. <img src="https://latex.codecogs.com/png.latex?%5Cepsilon/m%20%5Ctext%7B%20for%20each%20of%20all%20%7D%20m%20%5Ctext%7B%20actions%20%7D%20%5CRightarrow%20%5Ctext%7Bstochastic%20policy%7D"></p></li>
<li><p>All actions are selected with non-zero probability for ensuring continual exploration. <img src="https://latex.codecogs.com/png.latex?%5Cpi'(a%7Cs)%20=%20%5Cbegin%7Bcases%7D1%20-%20%5Cepsilon%20+%20%5Cepsilon/m%20&amp;%20%5Ctext%7Bif%20%7D%20a=%5Carg%5Cmax_%7Ba'%7DQ%5E%5Cpi(s,a')%20%5C%5C%20%5Cepsilon/m%20&amp;%20%5Ctext%7Botherwise%20%7D%20(m-1%20%5Ctext%7Bactions%7D)%5Cend%7Bcases%7D"></p></li>
<li><p>In full backup DP, Policy Improvement uses <img src="https://latex.codecogs.com/png.latex?%5Cpi'(s)%20=%20%5Carg%5Cmax_%7Ba%7D%20Q%5E%5Cpi(s,a)%20%5CRightarrow%20%5Ctext%7Bdeterministic%20policy%7D"></p></li>
<li><p>For any <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-greedy policy <img src="https://latex.codecogs.com/png.latex?%5Cpi">, the <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-greedy policy <img src="https://latex.codecogs.com/png.latex?%5Cpi'"> with respect to <img src="https://latex.codecogs.com/png.latex?q_%5Cpi"> is an improvement: <img src="https://latex.codecogs.com/png.latex?v_%7B%5Cpi'%7D(s)%5Cge%20v_%5Cpi(s)"></p></li>
</ul>
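<p>The two-case formula for <code>π'(a|s)</code> can be written out directly as a probability vector (a minimal sketch):</p>

```python
def epsilon_greedy_probs(q_values, epsilon):
    """pi'(a|s): 1 - eps + eps/m for the greedy action, eps/m otherwise."""
    m = len(q_values)
    greedy = max(range(m), key=q_values.__getitem__)
    probs = [epsilon / m] * m      # eps/m mass on every action...
    probs[greedy] += 1.0 - epsilon # ...plus the remaining mass on the argmax
    return probs
```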
</section>
<section id="greedy-in-limit-with-infinite-exploration-glie" class="level4">
<h4 class="anchored" data-anchor-id="greedy-in-limit-with-infinite-exploration-glie">Greedy in Limit with Infinite Exploration (GLIE)</h4>
<ul>
<li><p>All state-action pairs <img src="https://latex.codecogs.com/png.latex?(s,a)"> are explored infinitely many times: <img src="https://latex.codecogs.com/png.latex?%5Clim%5Climits_%7Bk%5Cto%5Cinfty%7Dn_k(s,a)=%5Cinfty"></p></li>
<li><p>The learning policy converges to a greedy policy: <img src="https://latex.codecogs.com/png.latex?%5Clim%5Climits_%7Bk%5Cto%5Cinfty%7D%5Cpi_k(a%7Cs)%20=%201%20%5Ctext%7B%20where%20%7D%20a%20=%20%5Carg%5Cmax%5Climits_%7Ba'%7DQ_k(s,a')"></p></li>
<li><p>GLIE MC Control: <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-greedy is GLIE if <img src="https://latex.codecogs.com/png.latex?%5Cepsilon_k%5Cto%200"> (e.g., <img src="https://latex.codecogs.com/png.latex?%5Cepsilon_k=1/k">), as follows:</p>
<ul>
<li><p>In the <img src="https://latex.codecogs.com/png.latex?k">-th episode using policy <img src="https://latex.codecogs.com/png.latex?%5Cpi">, for each state-action pair <img src="https://latex.codecogs.com/png.latex?(S_t,A_t)">: <img src="https://latex.codecogs.com/png.latex?n(S_t,%20A_t)%20%5Cgets%20n(S_t,%20A_t)%20+%201">, <img src="https://latex.codecogs.com/png.latex?Q(S_t,%20A_t)%20%5Cgets%20Q(S_t,%20A_t)%20+%20%5Cfrac%7B1%7D%7Bn(S_t,%20A_t)%7D%20%5BG_t%20-%20Q(S_t,%20A_t)%5D"></p></li>
<li><p>Improve policy based on the new action-value function: <img src="https://latex.codecogs.com/png.latex?%5Cepsilon=1/k,%5Cpi%5Cgets%5Cepsilon%5Ctext%7B-greedy%7D(Q)"></p></li>
</ul></li>
<li><p>GLIE MC Control converges to the optimal: <img src="https://latex.codecogs.com/png.latex?Q(s,a)%5Cto%20q_*(s,a)"></p></li>
</ul>
<div id="algo-monte-carlo-method" class="pseudocode-container quarto-float" data-line-number-punc=":" data-indent-lines="false" data-pseudocode-number="3" data-no-end="false" data-comment-delimiter="//" data-indent-size="1.2em" data-caption-prefix="Algorithm" data-line-number="true">
<div class="pseudocode">
\begin{algorithm} \caption{Monte Carlo Method} \begin{algorithmic} \State \textbf{1. Initialization} \State $Q(s,a)$, all $S\in\mathcal{S}$, $A\in\mathcal{A}(S)$, arbitrarily and $Q(\text{terminal}, \cdot) = 0$ \State Returns $(S,A) \gets$ empty list \State $\pi \gets$ an arbitrary $\epsilon$-soft policy (all probabilities nonzero) \State \State \textbf{2. Repeat for each episode} \Repeat \State (a) Generate an episode using $\pi:S_0, A_0, R_1, S_1, \ldots, S_{T-1}, A_{T-1}, R_T$ \State $G\gets 0$ \State (b) \For{each step of episode, $t=T-1, T-2, \ldots, 0$} \State $G \gets \gamma G + R_{t+1}$ \State Unless $(S_t, A_t)$ appears in $(S_0, A_0), \ldots, (S_{t-1}, A_{t-1})$: (ignore this check for every-visit MC) \State Append $G$ to Returns $(S_t, A_t)$ \State $Q(S_t, A_t)\gets$ average(Returns$(S_t, A_t)$) \EndFor \For{each $S_t$ in the episode} \State $A^*\gets\arg\max_{a}Q(S_t,a)$ \For{all $a\in\mathcal{A}(S_t)$} \State $\pi(a|S_t) \gets \begin{cases}1-\epsilon + \epsilon/|\mathcal{A}(S_t)| &amp; \text{if } a=A^* \\ \epsilon/|\mathcal{A}(S_t)| &amp; \text{otherwise}\end{cases}$ \EndFor \EndFor \Until{$\text{forever}$} \end{algorithmic} \end{algorithm}
</div>
</div>
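<p>The whole loop can be sketched end to end on a toy problem. The three-state chain below is entirely hypothetical (not from the post); it ties together first-visit evaluation, incremental updates, and the <code>ε = 1/k</code> GLIE schedule:</p>

```python
import random
from collections import defaultdict

def run_episode(policy, rng, max_steps=20):
    """One episode in a hypothetical chain: states 0 and 1, state 2 terminal.
    Action 1 moves right, action 0 moves left (floored at 0); the reward is
    +1 on reaching the terminal state, 0 otherwise."""
    s, episode = 0, []
    for _ in range(max_steps):
        a = rng.choices([0, 1], weights=policy[s])[0]
        s_next = min(s + 1, 2) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == 2 else 0.0
        episode.append((s, a, r))
        s = s_next
        if s == 2:
            break
    return episode

def glie_mc_control(num_episodes=2000, gamma=0.9, seed=0):
    """On-policy first-visit MC control with the GLIE schedule eps_k = 1/k."""
    rng = random.Random(seed)
    q, n = defaultdict(float), defaultdict(int)
    policy = {0: [0.5, 0.5], 1: [0.5, 0.5]}  # epsilon-soft start
    for k in range(1, num_episodes + 1):
        episode = run_episode(policy, rng)
        # Backward pass for returns, then first-visit incremental updates.
        g, returns = 0.0, []
        for (s, a, r) in reversed(episode):
            g = r + gamma * g
            returns.append((s, a, g))
        returns.reverse()
        seen = set()
        for (s, a, g) in returns:
            if (s, a) not in seen:
                seen.add((s, a))
                n[(s, a)] += 1
                q[(s, a)] += (g - q[(s, a)]) / n[(s, a)]
        # Epsilon-greedy policy improvement with epsilon = 1/k.
        eps = 1.0 / k
        for s in {s for (s, _, _) in episode}:
            greedy = max((0, 1), key=lambda a: q[(s, a)])
            policy[s] = [eps / 2 + (1.0 - eps) * (a == greedy) for a in (0, 1)]
    return q, policy
```

Because reaching the terminal state ends the episode, every sampled return for (state 1, action right) is exactly 1, so its Q-value is learned immediately; the decaying ε then drives the policy toward always moving right.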


</section>
</section>
</section>

 ]]></description>
  <category>RL</category>
  <guid>https://selianu.github.io/SelianBlog/posts/Reinforcement Learning 2/</guid>
  <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://selianu.github.io/SelianBlog/posts/Reinforcement Learning 2/GPI.png" medium="image" type="image/png" height="49" width="144"/>
</item>
</channel>
</rss>
