<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>SelianBlog</title>
<link>https://selianu.github.io/SelianBlog/posts.html</link>
<atom:link href="https://selianu.github.io/SelianBlog/posts.xml" rel="self" type="application/rss+xml"/>
<description>A blog built with Quarto</description>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Tue, 14 Apr 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Reinforcement Learning 1</title>
  <dc:creator>Selian</dc:creator>
  <link>https://selianu.github.io/SelianBlog/posts/Reinforcement Learning 1/</link>
  <description><![CDATA[ 




<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p><strong>RL Applications</strong></p>
<ul>
<li>Robotics &amp; Autonomous Driving</li>
<li>Game AI</li>
<li>Finance &amp; Trading</li>
<li>Recommendation Systems</li>
<li>Optimization Systems</li>
</ul>
<section id="ml-algorithms" class="level3">
<h3 class="anchored" data-anchor-id="ml-algorithms">ML Algorithms</h3>
<p>State(<img src="https://latex.codecogs.com/png.latex?S_t">), Action(<img src="https://latex.codecogs.com/png.latex?A_t">), Environment, Reward(<img src="https://latex.codecogs.com/png.latex?R_%7Bt+1%7D">)</p>
<p><img src="https://selianu.github.io/SelianBlog/posts/Reinforcement Learning 1/RL.png" class="img-fluid"></p>
</section>
</section>
<section id="markov-decision-process" class="level2">
<h2 class="anchored" data-anchor-id="markov-decision-process">Markov Decision Process</h2>
<section id="grid-world" class="level3">
<h3 class="anchored" data-anchor-id="grid-world">Grid World</h3>
<ul>
<li><strong>Deterministic</strong> grid world <code>vs</code> <strong>Stochastic</strong> grid world</li>
</ul>
</section>
<section id="markov-property" class="level3">
<h3 class="anchored" data-anchor-id="markov-property">Markov Property</h3>
<p>Stochastic Process</p>
<ul>
<li>Discrete-time random process: <img src="https://latex.codecogs.com/png.latex?S_0,%20S_1,%20%5Ccdots,%20S_t,%20%5Ccdots"></li>
<li>Continuous-time random process: <img src="https://latex.codecogs.com/png.latex?%5C%7BS_t%7Ct%5Cge0%5C%7D"></li>
<li>Markov Property: <img src="https://latex.codecogs.com/png.latex?P(S_%7Bt+1%7D=s'%7CS_t=s)=P(S_%7Bt+1%7D=s'%7CS_0=s_0,S_1=s_1,%5Ccdots,S_t=s_t)">
<ul>
<li><img src="https://latex.codecogs.com/png.latex?S_%7Bt+1%7D=s'"> does not depend on past states.</li>
</ul></li>
</ul>
<p>Markov Process <img src="https://latex.codecogs.com/png.latex?(S,P)"></p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?S">: A finite set of states.</li>
<li><img src="https://latex.codecogs.com/png.latex?P">: A state transition probability matrix <img src="https://latex.codecogs.com/png.latex?%5BP_%7Bij%7D%5D">, <img src="https://latex.codecogs.com/png.latex?P_%7Bij%7D=P(s_j%7Cs_i)%20=%20P(S_%7Bt+1%7D=s_j%7CS_t=s_i)">
<ul>
<li>Each row sums to 1</li>
</ul></li>
</ul>
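<p>As a small numerical sketch of a Markov process (the 3-state transition matrix below is made up for illustration), each row of the transition matrix is a probability distribution, and a state distribution evolves by one matrix multiplication:</p>

```python
import numpy as np

# A tiny Markov process over 3 states with a hypothetical transition
# matrix P, where P[i, j] = P(S_{t+1} = s_j | S_t = s_i).
P = np.array([
    [0.7, 0.2, 0.1],
    [0.0, 0.5, 0.5],
    [0.3, 0.3, 0.4],
])

# Each row is a probability distribution, so it must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# One-step evolution of a state distribution, starting in state 0.
mu0 = np.array([1.0, 0.0, 0.0])
mu1 = mu0 @ P
```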
</section>
<section id="markov-decision-process-1" class="level3">
<h3 class="anchored" data-anchor-id="markov-decision-process-1">Markov Decision Process</h3>
<ul>
<li>MDP: <img src="https://latex.codecogs.com/png.latex?(S,%20A,%20P,%20R,%20%5Cgamma)">
<ul>
<li><img src="https://latex.codecogs.com/png.latex?S">: State space</li>
<li><img src="https://latex.codecogs.com/png.latex?A">: Action space</li>
<li><img src="https://latex.codecogs.com/png.latex?P">: State transition probability, <img src="https://latex.codecogs.com/png.latex?P%5E%7Ba%7D_%7Bss'%7D%20=%20p(s'%7Cs,a)%20=%20P(S_%7Bt+1%7D=s'%7CS_t=s,A_t=a)"></li>
<li><img src="https://latex.codecogs.com/png.latex?R">: Reward function <img src="https://latex.codecogs.com/png.latex?R_%7Bss'%7D%5Ea,R_s,R%5Ea_s"></li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cgamma%5Cin%5B0,1%5D">: Discount factor</li>
</ul></li>
<li>Model-based: known MDP</li>
<li>Model-free: unknown MDP</li>
</ul>
<p>In an MDP, the state space <img src="https://latex.codecogs.com/png.latex?S"> and action space <img src="https://latex.codecogs.com/png.latex?A"> may also be continuous.</p>
</section>
<section id="reward-policy" class="level3">
<h3 class="anchored" data-anchor-id="reward-policy">Reward &amp; Policy</h3>
<section id="reward" class="level4">
<h4 class="anchored" data-anchor-id="reward">Reward</h4>
<ul>
<li>Reward <img src="https://latex.codecogs.com/png.latex?R_t">: scalar feedback</li>
<li>Agent’s goal: maximize the cumulative sum of rewards</li>
</ul>
<blockquote class="blockquote">
<p>All goals can be described by the maximization of the expected value of the cumulative sum of rewards. - Reward Hypothesis</p>
</blockquote>
<ul>
<li>State Transition Probability
<ul>
<li><img src="https://latex.codecogs.com/png.latex?P_%7Bss'%7D%5Ea%20=%20p(s'%7Cs,a)%20=%20P(S_%7Bt+1%7D=s'%7CS_t=s,A_t=a)%20=%20%5Csum%5Climits_%7Br%5Cin%20R%7Dp(s',r%7Cs,a)"></li>
</ul></li>
<li>Expected Reward for State-Action Pair
<ul>
<li><img src="https://latex.codecogs.com/png.latex?R_s%5Ea%20=%20r(s,a)%20=%20E%5BR_%7Bt+1%7D%7CS_t=s,A_t=a%5D%20=%20%5Csum%5Climits_%7Br%5Cin%20R%7Dr%5Csum%5Climits_%7Bs'%5Cin%20S%7Dp(s',r%7Cs,a)"></li>
</ul></li>
<li>Expected Reward for State-Action-Next State Triple
<ul>
<li><img src="https://latex.codecogs.com/png.latex?R_%7Bss'%7D%5Ea%20=%20r(s,a,s')%20=%20E%5BR_%7Bt+1%7D%7CS_t=s,A_t=a,S_%7Bt+1%7D=s'%5D%20=%20%5Cfrac%7B%5Csum%5Climits_%7Br%5Cin%20R%7Drp(r%7Cs,a,s')%7D%7Bp(s'%7Cs,a)%7D"></li>
</ul></li>
</ul>
</section>
<section id="return" class="level4">
<h4 class="anchored" data-anchor-id="return">Return</h4>
<ul>
<li>Return <img src="https://latex.codecogs.com/png.latex?G_t">: total discounted reward</li>
<li><img src="https://latex.codecogs.com/png.latex?G_t%20=%20R_%7Bt+1%7D%20+%20%5Cgamma%20R_%7Bt+2%7D%20+%20%5Cgamma%5E2%20R_%7Bt+3%7D%20+%20%5Ccdots%20=%20%5Csum%5Climits_%7Bk=0%7D%5E%7B%5Cinfty%7D%5Cgamma%5Ek%20R_%7Bt+k+1%7D"></li>
</ul>
<p>Why MDPs use discounting: it is mathematically convenient, and it accounts for uncertainty about the future.</p>
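<p>As a small sketch, the return can be computed by folding the recursion G_t = R_(t+1) + γ·G_(t+1) backwards over a reward sequence (the rewards below are made-up numbers):</p>

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1} for a hypothetical
# reward sequence, computed via the recursion G_t = R_{t+1} + gamma * G_{t+1}.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):  # fold from the end of the episode
        g = r + gamma * g
    return g

print(discounted_return([1, 1, 1], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```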
</section>
<section id="policy" class="level4">
<h4 class="anchored" data-anchor-id="policy">Policy</h4>
<ul>
<li>A stochastic policy <img src="https://latex.codecogs.com/png.latex?%5Cpi(a%7Cs)%20=%20P(A_t=a%7CS_t=s)"></li>
<li>A deterministic policy <img src="https://latex.codecogs.com/png.latex?%5Cpi(s)%20=%20a"></li>
</ul>
<!-- -->
<ul>
<li>Under a known MDP, <img src="https://latex.codecogs.com/png.latex?%5Cpi_*(s)"> exists.</li>
<li>Under an unknown MDP, an <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-greedy policy is needed
<ul>
<li><img src="https://latex.codecogs.com/png.latex?1-%5Cepsilon">: choose the greedy action (best current estimate)</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cepsilon">: choose an action at random</li>
</ul></li>
</ul>
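<p>A minimal sketch of the ε-greedy rule just described; <code>q_values</code> is a hypothetical list of action-value estimates for one state:</p>

```python
import random

# Epsilon-greedy action selection: with probability 1 - epsilon exploit
# the greedy action, otherwise explore uniformly at random.
def epsilon_greedy(q_values, epsilon, rng=random):
    if rng.random() < epsilon:                 # explore
        return rng.randrange(len(q_values))
    return max(range(len(q_values)),           # exploit
               key=lambda a: q_values[a])
```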
</section>
<section id="summary-of-notations" class="level4">
<h4 class="anchored" data-anchor-id="summary-of-notations">Summary of Notations</h4>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi">: Policy</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi(a%7Cs)">: Stochastic policy</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi(s)">: Deterministic policy</li>
</ul>
<!-- -->
<ul>
<li><img src="https://latex.codecogs.com/png.latex?v_%7B%5Cpi%7D(s)">: State-value function</li>
<li><img src="https://latex.codecogs.com/png.latex?v_*(s)">: Optimal state-value function</li>
</ul>
<!-- -->
<ul>
<li><img src="https://latex.codecogs.com/png.latex?q_%7B%5Cpi%7D(s,a)">: Action-value function</li>
<li><img src="https://latex.codecogs.com/png.latex?q_*(s,a)">: Optimal action-value function</li>
</ul>
</section>
</section>
</section>
<section id="bellman-equation" class="level2">
<h2 class="anchored" data-anchor-id="bellman-equation">Bellman Equation</h2>
<section id="bellman-equation-1" class="level3">
<h3 class="anchored" data-anchor-id="bellman-equation-1">Bellman Equation</h3>
<section id="value-functions" class="level4">
<h4 class="anchored" data-anchor-id="value-functions">Value Functions</h4>
<p>Goodness of each state <img src="https://latex.codecogs.com/png.latex?s"> (or pair <img src="https://latex.codecogs.com/png.latex?(s,a)">) when following policy <img src="https://latex.codecogs.com/png.latex?%5Cpi">, in terms of the expectation of <img src="https://latex.codecogs.com/png.latex?G_t">.</p>
<ul>
<li>State-Value function
<ul>
<li><img src="https://latex.codecogs.com/png.latex?v_%5Cpi(s)%20=%20E_%5Cpi%5BG_t%7CS_t=s%5D%20=%20%5Csum%5Climits_a%20%5Cpi(a%7Cs)q_%5Cpi(s,a)"></li>
</ul></li>
<li>Action-Value function
<ul>
<li><img src="https://latex.codecogs.com/png.latex?q_%5Cpi(s,a)%20=%20E_%5Cpi%5BG_t%7CS_t=s,A_t=a%5D"></li>
</ul></li>
</ul>
<!-- -->
<ul>
<li><img src="https://latex.codecogs.com/png.latex?A_%5Cpi(s,a)%20=%20q_%5Cpi(s,a)%20-%20v_%5Cpi(s)"></li>
</ul>
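<p>The identity v_π(s) = Σ_a π(a|s) q_π(s,a) and the advantage A_π can be checked numerically; the policy and action values below are made-up numbers for a single state:</p>

```python
import numpy as np

# Hypothetical action values and stochastic policy at one state s.
q = np.array([1.0, 3.0, 2.0])    # q_pi(s, a) for three actions
pi = np.array([0.2, 0.5, 0.3])   # pi(a | s); sums to 1

v = float(pi @ q)                # v_pi(s) = sum_a pi(a|s) q_pi(s, a)
advantage = q - v                # A_pi(s, a) = q_pi(s, a) - v_pi(s)
```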
</section>
<section id="bellman-equation-2" class="level4">
<h4 class="anchored" data-anchor-id="bellman-equation-2">Bellman Equation</h4>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0Av_%5Cpi(s)%20&amp;=%20E_%5Cpi%5BG_t%7CS_t=s%5D%20%5C%5C%0A&amp;=%20%5Csum%5Climits_a%20E_%5Cpi%5BG_t%7CS_t=s,%20A_t=a%5D%20%5Ccdot%20P(A_t=a%7CS_t=s)%20%5C%5C%0A&amp;=%20%5Csum%5Climits_a%20%5Cpi(a%7Cs)%20E_%5Cpi%5BR_%7Bt+1%7D%20+%20%5Cgamma%20G_%7Bt+1%7D%7CS_t=s,%20A_t=a%5D%20%5C%5C%0A&amp;=%20%5Csum%5Climits_a%20%5Cpi(a%7Cs)%20%5Csum%5Climits_%7Bs',r%7D%20E_%5Cpi%5BR_%7Bt+1%7D%20+%20%5Cgamma%20G_%7Bt+1%7D%7CS_t=s,%20A_t=a,%20R_%7Bt+1%7D=r,%20S_%7Bt+1%7D=s'%5D%20%5C%5C%20&amp;%5Cqquad%5Cqquad%5Cqquad%5Ccdot%20P(S_%7Bt+1%7D=s',%20R_%7Bt+1%7D%20=%20r%7CS_t=s,%20A_t=a)%20%5C%5C%0A&amp;=%20%5Csum%5Climits_a%20%5Cpi(a%7Cs)%20%5Csum%5Climits_%7Bs',r%7D%20p(s',r%7Cs,a)%20%5Cleft%5Br%20+%20%5Cgamma%20E_%5Cpi%5BG_%7Bt+1%7D%7CS_%7Bt+1%7D=s'%5D%5Cright%5D%20%5C%5C%0A&amp;=%20%5Csum%5Climits_a%20%5Cpi(a%7Cs)%20%5Csum%5Climits_%7Bs',r%7D%5Cleft%5Brp(s',r%7Cs,a)%20+%20%5Cgamma%20E_%5Cpi%5BG_%7Bt+1%7D%7CS_%7Bt+1%7D=s'%5D%5Ccdot%20p(s',r%7Cs,a)%5Cright%5D%20%5C%5C%0A&amp;=%20%5Csum%5Climits_a%20%5Cpi(a%7Cs)%20%5Cleft%5B%20R%5Ea_s%20+%20%5Cgamma%5Csum%5Climits_%7Bs',r%7Dp(s',r%7Cs,a)v_%5Cpi(s')%20%5Cright%5D%20%5C%5C%0A&amp;=%20%5Csum%5Climits_a%20%5Cpi(a%7Cs)%20%5Cleft%5BR%5Ea_s+%5Cgamma%5Csum%5Climits_%7Bs'%7DP%5Ea_%7Bss'%7Dv_%5Cpi(s')%5Cright%5D%20%5C%5C%0A&amp;=%20%5Csum%5Climits_a%20%5Cpi(a%7Cs)%20E%5BR_%7Bt+1%7D%20+%20%5Cgamma%20v_%5Cpi(S_%7Bt+1%7D)%7CS_t=s,A_t=a%5D%20%5C%5C%0A&amp;=%20E%5BR_%7Bt+1%7D%20+%20%5Cgamma%20v_%5Cpi(S_%7Bt+1%7D)%7CS_t=s%5D%0A%5Cend%7Baligned%7D%0A"></p>
<p><br>
<!-- --></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0Aq_%5Cpi(s,a)%20&amp;=%20E_%5Cpi%5BG_t%7CS_t=s,%20A_t=a%5D%20%5C%5C%0A&amp;=%20%5Csum%5Climits_%7Bs',r%7D%20p(s',r%7Cs,a)%20%5Cleft%5Br%20+%20%5Cgamma%20E_%5Cpi%20%5BG_%7Bt+1%7D%20%7C%20S_%7Bt+1%7D=s'%5D%5Cright%5D%20%5C%5C%0A&amp;=%20%5Csum%5Climits_%7Bs',r%7D%20p(s',r%7Cs,a)%20%5Cleft%5Br+%5Cgamma%5Csum_%7Ba'%7D%5Cpi(a'%7Cs')q_%5Cpi(s',a')%5Cright%5D%20%5C%5C%0A&amp;=%20E_%5Cpi%5Cleft%5BR_%7Bt+1%7D%20+%20%5Cgamma%20v_%5Cpi(S_%7Bt+1%7D)%7CS_t=s,A_t=a%5Cright%5D%20%5C%5C%0A&amp;=%20E%5Cleft%5BR_%7Bt+1%7D%20+%20%5Cgamma%20E_%5Cpi%5Cleft%5Bq_%5Cpi(S_%7Bt+1%7D,%20A_%7Bt+1%7D)%7CS_%7Bt+1%7D=s'%5Cright%5D%7CS_t=s,%20A_t=a%5Cright%5D%20%5C%5C%0A&amp;=%20E%5Cleft%5BR_%7Bt+1%7D%20+%20%5Cgamma%20q_%5Cpi(S_%7Bt+1%7D,%20A_%7Bt+1%7D)%7CS_t=s,%20A_t=a%5Cright%5D%0A%5Cend%7Baligned%7D%0A"></p>
</section>
</section>
<section id="optimal-policy" class="level3">
<h3 class="anchored" data-anchor-id="optimal-policy">Optimal Policy</h3>
<section id="optimal-value-functions-and-policy" class="level4">
<h4 class="anchored" data-anchor-id="optimal-value-functions-and-policy">Optimal Value Functions and Policy</h4>
<ul>
<li>Optimal state-value function: <img src="https://latex.codecogs.com/png.latex?v_*(s)%20=%20%5Cmax%5Climits_%5Cpi%20v_%5Cpi(s)"></li>
<li>Optimal action-value function: <img src="https://latex.codecogs.com/png.latex?q_*(s,a)%20=%20%5Cmax%5Climits_%5Cpi%20q_%5Cpi(s,a)"></li>
</ul>
<div class="callout callout-style-default callout-important callout-titled" title="Theorem">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Important</span>Theorem
</div>
</div>
<div class="callout-body-container callout-body">
<p><img src="https://latex.codecogs.com/png.latex?%5Cpi'%20%5Cge%20%5Cpi"> means <img src="https://latex.codecogs.com/png.latex?v_%7B%5Cpi'%7D(s)%20%5Cge%20v_%5Cpi(s)"> for all <img src="https://latex.codecogs.com/png.latex?s">.</p>
<p>An optimal policy satisfies <img src="https://latex.codecogs.com/png.latex?%5Cpi_*%20%5Cge%20%5Cpi"> for all <img src="https://latex.codecogs.com/png.latex?%5Cpi">.</p>
<p><img src="https://latex.codecogs.com/png.latex?v_%7B%5Cpi_*%7D(s)%20=%20v_*(s)">, <img src="https://latex.codecogs.com/png.latex?q_%7B%5Cpi_*%7D(s,a)%20=%20q_*(s,a)"></p>
</div>
</div>
</section>
<section id="finding-an-optimal-policy" class="level4">
<h4 class="anchored" data-anchor-id="finding-an-optimal-policy">Finding an Optimal Policy</h4>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cpi_*(a%7Cs)%20=%20%5Cbegin%7Bcases%7D%201%20&amp;%20%5Ctext%7Bif%20%7D%20a%20=%20%5Carg%5Cmax%5Climits_a%20q_*(s,a)%20%5C%5C%200%20&amp;%20%5Ctext%7Botherwise%7D%20%5Cend%7Bcases%7D%0A"></p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?v_*(s)%20=%20%5Cmax%5Climits_a%20q_*(s,a)%5Cquad%20%5Cbecause%20v_%5Cpi(s)%20=%20%5Csum%5Climits_a%5Cpi(a%7Cs)q_%5Cpi(s,a)"></li>
<li><img src="https://latex.codecogs.com/png.latex?q_*(s,a)%20=%20r(s,a)%20+%20%5Cgamma%5Csum%5Climits_%7Bs'%7Dp(s'%7Cs,a)v_*(s')=R_s%5Ea%20+%20%5Cgamma%5Csum%5Climits_%7Bs'%7DP%5Ea_%7Bss'%7Dv_*(s')"></li>
</ul>
<!-- -->
<ul>
<li><img src="https://latex.codecogs.com/png.latex?v_*(s)"> can be obtained directly from <img src="https://latex.codecogs.com/png.latex?q_*(s,a)"></li>
<li><img src="https://latex.codecogs.com/png.latex?q_*(s,a)"> cannot be obtained directly from <img src="https://latex.codecogs.com/png.latex?v_*(s)">
<ul>
<li>Instead, we need to know the transition probability <img src="https://latex.codecogs.com/png.latex?p(s'%7Cs,a)"> under model-based settings.</li>
</ul></li>
</ul>
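<p>A small sketch of the two directions above: given a (hypothetical) optimal action-value table, v_* and a greedy optimal policy follow directly, whereas recovering q_* from v_* would additionally require the transition model p(s'|s,a):</p>

```python
import numpy as np

# Hypothetical optimal action-value table: 2 states x 3 actions.
q_star = np.array([
    [0.0, 1.0, 0.5],
    [2.0, 1.5, 1.9],
])

v_star = q_star.max(axis=1)      # v_*(s) = max_a q_*(s, a)
pi_star = q_star.argmax(axis=1)  # deterministic greedy optimal policy
```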


</section>
</section>
</section>

 ]]></description>
  <category>RL</category>
  <guid>https://selianu.github.io/SelianBlog/posts/Reinforcement Learning 1/</guid>
  <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://selianu.github.io/SelianBlog/posts/Reinforcement Learning 1/RL.png" medium="image" type="image/png" height="58" width="144"/>
</item>
<item>
  <title>Reinforcement Learning 2</title>
  <dc:creator>Selian</dc:creator>
  <link>https://selianu.github.io/SelianBlog/posts/Reinforcement Learning 2/</link>
  <description><![CDATA[ 




<section id="dynamic-programming" class="level2">
<h2 class="anchored" data-anchor-id="dynamic-programming">Dynamic Programming</h2>
<p>A method for solving Markov Decision Processes (MDPs)</p>
<ul>
<li>DP assumes
<ul>
<li>Model-based</li>
<li>The Markov property</li>
</ul></li>
</ul>
<!-- -->
<ul>
<li>DP uses the Bellman equation to iteratively update value functions.</li>
</ul>
<!-- -->
<ul>
<li>Goal
<ul>
<li>Compute the optimal value function</li>
<li>Derive the optimal policy</li>
</ul></li>
</ul>
<!-- -->
<p>Two main approaches</p>
<ul>
<li><strong>Value-based approach</strong>
<ul>
<li>Directly update value functions</li>
<li>Leads to <strong>Value Iteration</strong></li>
</ul></li>
<li><strong>Policy-based approach</strong>
<ul>
<li>Evaluate and improve policies</li>
<li>Leads to <strong>Policy Iteration</strong></li>
</ul></li>
</ul>
<section id="value-iteration" class="level3">
<h3 class="anchored" data-anchor-id="value-iteration">Value Iteration</h3>
<p><strong>Bellman Optimality Equation</strong> <img src="https://latex.codecogs.com/png.latex?v_*(s)%20=%20%5Cmax%5Climits_a%5Csum%5Climits_%7Bs',r%7Dp(s',r%7Cs,a)%5Br+%5Cgamma%20v_*(s')%5D"></p>
<section id="value-iteration-procedure" class="level4">
<h4 class="anchored" data-anchor-id="value-iteration-procedure">Value Iteration Procedure</h4>
<p><img src="https://latex.codecogs.com/png.latex?V_%7Bk+1%7D(s)%5Cleftarrow%5Cmax%5Climits_a%5Csum%5Climits_%7Bs',r%7D%20p(s',r%7Cs,a)%5Br+%5Cgamma%20V_k(s')%5D"></p>
<p><strong>Initialize <img src="https://latex.codecogs.com/png.latex?%5Crightarrow"> Update <img src="https://latex.codecogs.com/png.latex?%5Crightarrow"> Compute the Optimal Policy</strong></p>
<p>Disadvantages</p>
<ol type="1">
<li>The policy often converges long before the values do, so late value updates rarely change the resulting policy.</li>
<li>Slow: <img src="https://latex.codecogs.com/png.latex?O(S%5E2A)"> per iteration, and many iterations are needed to converge.</li>
</ol>
<p>Convergence criterion: stop when <img src="https://latex.codecogs.com/png.latex?%5Clvert%20V_%7Bk+1%7D(s)%20-%20V_%7Bk%7D(s)%20%5Crvert%20%3C%20%5Cepsilon"> for all <img src="https://latex.codecogs.com/png.latex?s">.</p>
</section>
<section id="value-iteration-pseudo-code" class="level4">
<h4 class="anchored" data-anchor-id="value-iteration-pseudo-code">Value Iteration Pseudo Code</h4>
<div id="algo-value-iteration" class="pseudocode-container quarto-float" data-line-number-punc=":" data-indent-lines="false" data-pseudocode-number="1" data-no-end="false" data-comment-delimiter="//" data-indent-size="1.2em" data-caption-prefix="Algorithm" data-line-number="true">
<div class="pseudocode">
\begin{algorithm} \caption{Value Iteration for estimating $\pi \approx \pi_*$} \begin{algorithmic} \State \textbf{Hyperparameter:} small threshold $\epsilon &gt; 0$ for the convergence check \State Initialize $V(s)$ arbitrarily for all $s \in \mathcal{S}$, except $V(\text{terminal}) = 0$ \Repeat \State $\Delta \gets 0$ \For{each $s \in \mathcal{S}$} \State $v \gets V(s)$ \State $V(s) \gets \max_{a} \sum_{s', r} p(s', r \mid s, a) [r + \gamma V(s')]$ \State $\Delta \gets \max(\Delta, |v - V(s)|)$ \EndFor \Until{$\Delta &lt; \epsilon$} \State \textbf{Output} a deterministic policy $\pi \approx \pi_*$ such that \State $\pi(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) [r + \gamma V(s')]$ \end{algorithmic} \end{algorithm}
</div>
</div>
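<p>The pseudocode above can be sketched in Python on a toy problem; the two-state transition model <code>P</code> below (triples of probability, next state, reward) is an assumption for illustration, not part of the original notes:</p>

```python
import numpy as np

# Value Iteration on a tiny hypothetical MDP.
# P[s][a] is a list of (prob, next_state, reward) triples, i.e. p(s', r | s, a).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
gamma, epsilon = 0.9, 1e-8
V = np.zeros(2)

while True:
    delta = 0.0
    for s in P:
        v = V[s]
        # Bellman optimality backup: V(s) <- max_a sum p(s',r|s,a)[r + gamma V(s')]
        V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                   for a in P[s])
        delta = max(delta, abs(v - V[s]))
    if delta < epsilon:
        break

# Extract the greedy policy from the converged values.
pi = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                     for p, s2, r in P[s][a]))
      for s in P}
```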
</section>
</section>
<section id="policy-iteration" class="level3">
<h3 class="anchored" data-anchor-id="policy-iteration">Policy Iteration</h3>
<section id="policy-iteration-procedure" class="level4">
<h4 class="anchored" data-anchor-id="policy-iteration-procedure">Policy Iteration Procedure</h4>
<ol type="1">
<li><strong>Policy Evaluation</strong>
<ul>
<li>Compute <img src="https://latex.codecogs.com/png.latex?V%5E%5Cpi"> from the deterministic policy <img src="https://latex.codecogs.com/png.latex?%5Cpi"></li>
<li><img src="https://latex.codecogs.com/png.latex?V_%7Bk+1%7D(s)%5Cleftarrow%5Csum%5Climits_%7Bs',r%7D%20p(s',r%7Cs,%5Cpi(s))%20%5Br+%5Cgamma%20V_k(s')%5D"></li>
</ul>
<ol type="1">
<li>Initialize <img src="https://latex.codecogs.com/png.latex?V_0(s)=0"> for all states <img src="https://latex.codecogs.com/png.latex?s">.</li>
<li>Update <img src="https://latex.codecogs.com/png.latex?V_%7Bk+1%7D(s)"> from all <img src="https://latex.codecogs.com/png.latex?V_k(s')"> (full backup) <img src="https://latex.codecogs.com/png.latex?%5Crightarrow"> until convergence to <img src="https://latex.codecogs.com/png.latex?V%5E%5Cpi(s)"></li>
</ol></li>
<li><strong>Policy Improvement</strong>
<ul>
<li>Improve policy <img src="https://latex.codecogs.com/png.latex?%5Cpi"> to <img src="https://latex.codecogs.com/png.latex?%5Cpi'"> using a greedy policy based on <img src="https://latex.codecogs.com/png.latex?V%5E%5Cpi"></li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi'(s)%20=%20%5Carg%5Cmax%5Climits_a%5Csum%5Climits_%7Bs',r%7D%20p(s',r%7Cs,a)%20%5Br+%5Cgamma%20V%5E%5Cpi(s')%5D%20=%20%5Carg%5Cmax%5Climits_a%20Q%5E%5Cpi(s,a)"></li>
</ul></li>
</ol>
</section>
<section id="value-vs-policy-iteration" class="level4">
<h4 class="anchored" data-anchor-id="value-vs-policy-iteration">Value vs Policy Iteration</h4>
<ul>
<li>In Value Iteration, <img src="https://latex.codecogs.com/png.latex?%5Cpi_*"> is computed at the end using <img src="https://latex.codecogs.com/png.latex?V%5E*">.</li>
<li>In Policy Iteration, improvement is done at every step.</li>
</ul>
<p><strong>Comparison to Value Iteration</strong></p>
<ul>
<li>Fewer iterations are needed to reach the optimal policy.</li>
<li>Faster convergence, because each value update is based on a fixed policy.</li>
</ul>
<!-- -->
<ul>
<li>Since <img src="https://latex.codecogs.com/png.latex?Q%5E%5Cpi(s,%5Cpi'(s))%5Cge%20V%5E%5Cpi(s)%20=%20%5Csum%5Climits_a%5Cpi(a%7Cs)%20Q%5E%5Cpi(s,a)">, always either
<ol type="1">
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi'"> is strictly better than <img src="https://latex.codecogs.com/png.latex?%5Cpi">, or</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi'"> is optimal when <img src="https://latex.codecogs.com/png.latex?%5Cpi=%5Cpi'"></li>
</ol></li>
</ul>
<p><img src="https://latex.codecogs.com/png.latex?%5CRightarrow"> Policy Improvement Theorem</p>
</section>
<section id="policy-improvement" class="level4">
<h4 class="anchored" data-anchor-id="policy-improvement">Policy Improvement</h4>
<div class="callout callout-style-default callout-important callout-titled" title="Policy Improvement Theorem">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Important</span>Policy Improvement Theorem
</div>
</div>
<div class="callout-body-container callout-body">
<p>Let <img src="https://latex.codecogs.com/png.latex?%5Cpi"> and <img src="https://latex.codecogs.com/png.latex?%5Cpi'"> be two policies.</p>
<p>If <img src="https://latex.codecogs.com/png.latex?Q%5E%5Cpi(s,%5Cpi'(s))%5Cge%20V%5E%5Cpi(s)"> for all <img src="https://latex.codecogs.com/png.latex?s%5Cin%20S">, <img src="https://latex.codecogs.com/png.latex?V%5E%7B%5Cpi'%7D(s)%5Cge%20V%5E%5Cpi(s)"> for all <img src="https://latex.codecogs.com/png.latex?s%5Cin%20S">.</p>
<p>This implies that <img src="https://latex.codecogs.com/png.latex?%5Cpi'"> is at least as good a policy as <img src="https://latex.codecogs.com/png.latex?%5Cpi">.</p>
</div>
</div>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0Av_%5Cpi(s)%20&amp;%5Cle%20q_%5Cpi(s,%5Cpi'(s))%5C%5C%0A&amp;=%5Cmathbb%7BE%7D%5BR_%7Bt+1%7D+%5Cgamma%20v_%5Cpi(S_%7Bt+1%7D)%7CS_t=s,A_t=%5Cpi'(s)%5D%5C%5C%0A&amp;=%5Cmathbb%7BE%7D_%7B%5Cpi'%7D%5BR_%7Bt+1%7D+%5Cgamma%20v_%5Cpi(S_%7Bt+1%7D)%7CS_t=s%5D%5C%5C%0A&amp;%5Cle%5Cmathbb%7BE%7D_%7B%5Cpi'%7D%5BR_%7Bt+1%7D+%5Cgamma%20q_%5Cpi(S_%7Bt+1%7D,%5Cpi'(S_%7Bt+1%7D))%7CS_t=s%5D%5C%5C%0A&amp;=%5Cmathbb%7BE%7D_%7B%5Cpi'%7D%5BR_%7Bt+1%7D+%5Cgamma%20%5Cmathbb%7BE%7D_%7B%5Cpi'%7D%5BR_%7Bt+2%7D+%5Cgamma%20v_%7B%5Cpi%7D(S_%7Bt+2%7D)%7CS_%7Bt+1%7D,%20A_%7Bt+1%7D=%5Cpi'(S_%7Bt+1%7D)%5D%7CS_t=s%5D%5C%5C%0A&amp;=%5Cmathbb%7BE%7D_%7B%5Cpi'%7D%5BR_%7Bt+1%7D+%5Cgamma%20R_%7Bt+2%7D+%5Cgamma%5E2v_%5Cpi(S_%7Bt+2%7D)%7CS_t=s%5D%5C%5C%0A&amp;%5C%20%5C%20%5Cvdots%5C%5C%0A&amp;%5Cle%5Cmathbb%7BE%7D_%7B%5Cpi'%7D%5BR_%7Bt+1%7D+%5Cgamma%20R_%7Bt+2%7D+%5Cgamma%5E2%20R_%7Bt+3%7D+%5Ccdots%7CS_t=s%5D%5C%5C%0A&amp;=v_%7B%5Cpi'%7D(s)%0A%5Cend%7Baligned%7D%0A"></p>
</section>
<section id="policy-iteration-1" class="level4">
<h4 class="anchored" data-anchor-id="policy-iteration-1">Policy Iteration</h4>
<p><img src="https://latex.codecogs.com/png.latex?%5Cpi_0%20%5Crightarrow%20%5Ctext%7Bpolicy%20evaluation%7D%20%5Crightarrow%20V%5E%7B%5Cpi_0%7D%20%5Crightarrow%20%5Ctext%7Bpolicy%20improvement%7D%20%5Crightarrow%20%5Cpi_1%20%5Crightarrow%20%5Ccdots%20%5Crightarrow%20%5Cpi_*%20%5Crightarrow%20V%5E*"></p>
<ul>
<li>A <strong>finite MDP</strong> has finitely many policies, so this process converges to an <strong>optimal policy</strong> and an optimal value function in <strong>finitely many iterations</strong>.</li>
<li>If <img src="https://latex.codecogs.com/png.latex?%5Cpi'"> is as good as, but not better than, <img src="https://latex.codecogs.com/png.latex?%5Cpi">, then <img src="https://latex.codecogs.com/png.latex?v_%5Cpi=v_%7B%5Cpi'%7D"> and it satisfies the Bellman optimality equation
<ul>
<li><img src="https://latex.codecogs.com/png.latex?v_%7B%5Cpi'%7D(s)=%5Cmax%5Climits_a%5Csum%5Climits_%7Bs',r%7Dp(s',r%7Cs,a)%20%5Br+%5Cgamma%20v_%7B%5Cpi'%7D(s')%5D"></li>
</ul></li>
<li>Thus, <img src="https://latex.codecogs.com/png.latex?%5Cpi'"> is optimal.</li>
</ul>
</section>
<section id="pseudo-code" class="level4">
<h4 class="anchored" data-anchor-id="pseudo-code">Pseudo Code</h4>
<div id="algo-policy-iteration" class="pseudocode-container quarto-float" data-line-number-punc=":" data-indent-lines="false" data-pseudocode-number="2" data-no-end="false" data-comment-delimiter="//" data-indent-size="1.2em" data-caption-prefix="Algorithm" data-line-number="true">
<div class="pseudocode">
\begin{algorithm} \caption{Policy Iteration for estimating $\pi \approx \pi_*$} \begin{algorithmic} \State \textbf{1. Initialization} \State $V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$ \State \State \textbf{2. Policy Evaluation} \Repeat \State $\Delta \gets 0$ \For{each $s \in \mathcal{S}$} \State $v \gets V(s)$ \State $V(s) \gets \sum_{s', r} p(s', r \mid s, \pi(s)) [r + \gamma V(s')]$ \State $\Delta \gets \max(\Delta, |v - V(s)|)$ \EndFor \Until{$\Delta &lt; \epsilon$} \State \State \textbf{3. Policy Improvement} \State $policy-stable \gets \text{true}$ \For{each $s \in \mathcal{S}$} \State $old-action \gets \pi(s)$ \State $\pi(s) \gets \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) [r + \gamma V(s')]$ \If{$old-action \neq \pi(s)$} \State $policy-stable \gets \text{false}$ \EndIf \EndFor \State \If{$policy-stable$} \State \textbf{stop} and return $V \approx v_*$ and $\pi \approx \pi_*$ \Else \State \textbf{go to 2} \EndIf \end{algorithmic} \end{algorithm}
</div>
</div>
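<p>Similarly, the policy-iteration pseudocode can be sketched on a toy MDP; the transition model and rewards below are made up for illustration:</p>

```python
import numpy as np

# Policy Iteration on a tiny hypothetical MDP.
# P[s][a] is a list of (prob, next_state, reward) triples, i.e. p(s', r | s, a).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
gamma, epsilon = 0.9, 1e-8
pi = {s: 0 for s in P}        # arbitrary initial deterministic policy
V = np.zeros(len(P))

while True:
    # Policy Evaluation: iterate V to convergence under the fixed policy pi.
    while True:
        delta = 0.0
        for s in P:
            v = V[s]
            V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][pi[s]])
            delta = max(delta, abs(v - V[s]))
        if delta < epsilon:
            break
    # Policy Improvement: act greedily with respect to V.
    stable = True
    for s in P:
        old = pi[s]
        pi[s] = max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                            for p, s2, r in P[s][a]))
        if pi[s] != old:
            stable = False
    if stable:
        break
```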
</section>
</section>
<section id="summary" class="level3">
<h3 class="anchored" data-anchor-id="summary">Summary</h3>
<p>Finding an optimal policy by solving Bellman optimality equation requires</p>
<ul>
<li><strong>Markov property</strong></li>
<li><strong>Accurate knowledge</strong> of environment dynamics (known MDP)</li>
<li><strong>Enough space and time</strong> to do the computation</li>
</ul>
<p><strong>Dynamic Programming</strong></p>
<ul>
<li>Under <strong>model-based</strong>, each iteration updates <strong>every value function</strong> in the table using <strong>full backup</strong> <img src="https://latex.codecogs.com/png.latex?%5Crightarrow"> effective for <strong>medium-sized</strong> problems</li>
<li>Usually evaluate <img src="https://latex.codecogs.com/png.latex?V(s)"> instead of <img src="https://latex.codecogs.com/png.latex?Q(s,a)"> because <img src="https://latex.codecogs.com/png.latex?%7CS%7C%20%5Cll%20%7CS%5Ctimes%20A%7C"></li>
<li>For large problems, DP suffers from the <strong>curse of dimensionality</strong>:
<ul>
<li>The number of states grows exponentially with the number of state variables</li>
</ul></li>
</ul>
</section>
</section>
<section id="reinforcement-learning" class="level2">
<h2 class="anchored" data-anchor-id="reinforcement-learning">Reinforcement Learning</h2>
<section id="reinforcement-learning-1" class="level3">
<h3 class="anchored" data-anchor-id="reinforcement-learning-1">Reinforcement Learning</h3>
<section id="dp-vs-rl" class="level4">
<h4 class="anchored" data-anchor-id="dp-vs-rl">DP vs RL</h4>
<ul>
<li>Dynamic Programming (DP)
<ul>
<li><p>Planning under <strong>model-based</strong> setting using <strong>full-backup</strong>.</p></li>
<li><p>Each iteration updates <strong>every value</strong> in the table using <strong>full backup</strong>.</p></li>
<li><p>Usually, evaluate <img src="https://latex.codecogs.com/png.latex?V(s)"> rather than <img src="https://latex.codecogs.com/png.latex?Q(s,a)">.</p></li>
<li><p>Greedy policy improvement over <img src="https://latex.codecogs.com/png.latex?V(s)"> <strong>requires known MDP</strong>.</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi_*(s)%20=%20%5Carg%5Cmax%5Climits_a%5Csum_%7Bs',r%7D%20p(s',r%20%7C%20s,a)%20%5Br%20+%20%5Cgamma%20V%5E*(s')%5D"></li>
</ul></li>
</ul></li>
</ul>
<!-- -->
<ul>
<li>Reinforcement Learning (RL)
<ul>
<li><p>Learning under <strong>model-free</strong> setting using <strong>sample backup</strong>, and approximately solving the Bellman optimality equation.</p></li>
<li><p>Monte Carlo (MC) method</p></li>
<li><p>Temporal Difference (TD) learning</p>
<ul>
<li>Sarsa</li>
<li>Q-learning</li>
</ul></li>
<li><p>Each iteration updates some values in the table from <strong>sample backup</strong>.</p></li>
<li><p>We evaluate <img src="https://latex.codecogs.com/png.latex?Q(s,a)"> instead of <img src="https://latex.codecogs.com/png.latex?V(s)">.</p></li>
<li><p>Greedy policy improvement over <img src="https://latex.codecogs.com/png.latex?Q(s,a)"> <strong>works for model-free settings:</strong></p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi'(s)=%5Carg%5Cmax%5Climits_a%20Q%5E%7B%5Cpi%7D(s,a)"></li>
</ul></li>
</ul></li>
</ul>
</section>
<section id="generalized-policy-iteration-gpi" class="level4">
<h4 class="anchored" data-anchor-id="generalized-policy-iteration-gpi">Generalized Policy Iteration (GPI)</h4>
<ul>
<li><strong>Policy Evaluation</strong> makes the value function “consistent with the current policy”</li>
<li><strong>Policy Improvement</strong> makes the policy “greedy w.r.t. the current value function” <img src="https://selianu.github.io/SelianBlog/posts/Reinforcement Learning 2/GPI.png" class="img-fluid" alt="GPI"></li>
</ul>
</section>
</section>
<section id="monte-carlo-methods" class="level3">
<h3 class="anchored" data-anchor-id="monte-carlo-methods">Monte Carlo Methods</h3>
<section id="monte-carlo-mc" class="level4">
<h4 class="anchored" data-anchor-id="monte-carlo-mc">Monte Carlo (MC)</h4>
<ul>
<li><p>Repeated random sampling to compute numerical results</p></li>
<li><p>A tabular, model-free method.</p></li>
<li><p>MC Policy Iteration adapts GPI on an episode-by-episode basis: policy evaluation estimates <img src="https://latex.codecogs.com/png.latex?Q(s,a)%5Capprox%20q_%5Cpi(s,a)">, and policy improvement is <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-greedy.</p></li>
<li><p>MC learns from the entire trajectory of sampled episodes, updating after every single episode using real experience.</p></li>
</ul>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BValue%7D%20=%20%5Ctext%7Baverage%20of%20returns%20%7D%20G_t%20%5Ctext%7B%20of%20sampled%20episodes%7D"></p>
<ul>
<li>MC focuses on a small subset of the states.</li>
<li>Because this method does not rely on successor value estimates, it suffers less when the Markov property is violated.</li>
</ul>
<p><img src="https://latex.codecogs.com/png.latex?%5CRightarrow%5Ctext%7BNo%20bootstrapping%7D"></p>
</section>
<section id="mc-prediction-policy-evaluation" class="level4">
<h4 class="anchored" data-anchor-id="mc-prediction-policy-evaluation">MC Prediction (Policy Evaluation)</h4>
<ul>
<li><p>Goal: learn <img src="https://latex.codecogs.com/png.latex?q_%5Cpi"> from entire episodes of real experience under policy <img src="https://latex.codecogs.com/png.latex?%5Cpi">.</p></li>
<li><p>MC Policy Evaluation uses <strong>empirical mean return</strong> instead of expected return.</p></li>
<li><p>To estimate <img src="https://latex.codecogs.com/png.latex?q_%5Cpi(s,a)"></p>
<ol type="1">
<li>For each time step <img src="https://latex.codecogs.com/png.latex?t"> when state <img src="https://latex.codecogs.com/png.latex?s"> is visited and action <img src="https://latex.codecogs.com/png.latex?a"> is taken:
<ul>
<li>Increment count: <img src="https://latex.codecogs.com/png.latex?n(s,a)%20%5Cgets%20n(s,a)%20+%201"></li>
<li>Increment total return: <img src="https://latex.codecogs.com/png.latex?S(s,a)%20%5Cgets%20S(s,a)%20+%20G_t"></li>
<li>Estimate mean return: <img src="https://latex.codecogs.com/png.latex?Q(s,a)%20=%20S(s,a)%20/%20n(s,a)"></li>
</ul></li>
</ol></li>
<li><p>As <img src="https://latex.codecogs.com/png.latex?n(s,a)%20%5Cto%20%5Cinfty">, <img src="https://latex.codecogs.com/png.latex?Q(s,a)%20%5Cto%20q_%5Cpi(s,a)"></p></li>
</ul>
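<p>The counting scheme above can be sketched directly in Python. This is a minimal sketch, not the post's code; <code>sample_episode</code> is a hypothetical helper that rolls out one episode under the given policy and returns a list of (state, action, reward) triples:</p>

```python
from collections import defaultdict

def mc_prediction_q(sample_episode, policy, num_episodes, gamma=1.0):
    """First-visit MC estimation of Q(s, a) as the empirical mean return."""
    n = defaultdict(int)        # visit counts n(s, a)
    total = defaultdict(float)  # accumulated returns S(s, a)
    q = defaultdict(float)      # Q(s, a) = S(s, a) / n(s, a)
    for _ in range(num_episodes):
        episode = sample_episode(policy)
        # Backward pass computes the return G_t at every step.
        g, returns = 0.0, []
        for (s, a, r) in reversed(episode):
            g = r + gamma * g
            returns.append((s, a, g))
        returns.reverse()
        # First-visit: only the first occurrence of (s, a) counts.
        seen = set()
        for (s, a, g) in returns:
            if (s, a) not in seen:
                seen.add((s, a))
                n[(s, a)] += 1
                total[(s, a)] += g
                q[(s, a)] = total[(s, a)] / n[(s, a)]
    return q
```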
</section>
<section id="incremental-mc-updates" class="level4">
<h4 class="anchored" data-anchor-id="incremental-mc-updates">Incremental MC updates</h4>
<ul>
<li>Incremental Mean
<ul>
<li>Partial mean <img src="https://latex.codecogs.com/png.latex?%5Cmu_k"> of sequence <img src="https://latex.codecogs.com/png.latex?x_1,%20x_2,%20%5Cldots"> is computed incrementally <img src="https://latex.codecogs.com/png.latex?%5Cmu_k%20=%20%5Cfrac%7B1%7D%7Bk%7D%5Csum_%7Bi=1%7D%5Ek%20x_i%20=%20%5Cmu_%7Bk-1%7D%20+%20%5Cfrac%7B1%7D%7Bk%7D(x_k%20-%20%5Cmu_%7Bk-1%7D)"></li>
</ul></li>
<li>Incremental Monte Carlo Updates
<ul>
<li>Increment count: <img src="https://latex.codecogs.com/png.latex?n(S_t,%20A_t)%20%5Cgets%20n(S_t,%20A_t)%20+%201"></li>
<li>Update rule: <img src="https://latex.codecogs.com/png.latex?Q(S_t,%20A_t)%20%5Cgets%20Q(S_t,%20A_t)%20+%20%5Cfrac%7B1%7D%7Bn(S_t,%20A_t)%7D%20%5BG_t%20-%20Q(S_t,%20A_t)%5D"></li>
</ul></li>
</ul>
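<p>As a quick check, the incremental form reproduces the batch mean exactly; the correction term is the same one used in the MC Q-update (a minimal sketch):</p>

```python
def incremental_mean(xs):
    """mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k, updated one sample at a time."""
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += (x - mu) / k  # same shape as the Q(S_t, A_t) update rule
    return mu
```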
</section>
<section id="constant-alpha-mc-policy-evaluation" class="level4">
<h4 class="anchored" data-anchor-id="constant-alpha-mc-policy-evaluation">constant <img src="https://latex.codecogs.com/png.latex?%5Calpha"> MC Policy Evaluation</h4>
<ul>
<li><p>In practice, a <strong>step-size</strong> <img src="https://latex.codecogs.com/png.latex?%5Calpha"> is used <img src="https://latex.codecogs.com/png.latex?Q(S_t,%20A_t)%20%5Cgets%20Q(S_t,%20A_t)%20+%20%5Calpha%20%5BG_t%20-%20Q(S_t,%20A_t)%5D"></p></li>
<li><p>Old episodes are exponentially forgotten due to the <img src="https://latex.codecogs.com/png.latex?(1-%5Calpha)Q(S_t,%20A_t)"> term.</p></li>
</ul>
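<p>The exponential forgetting is easy to verify numerically: rewriting the update as (1 − α)Q + αG shows the initial estimate's weight decays as (1 − α)<sup>k</sup> after k updates (a minimal sketch):</p>

```python
def constant_alpha_update(q, g, alpha):
    """Q = Q + alpha * (G - Q), equivalently (1 - alpha) * Q + alpha * G."""
    return q + alpha * (g - q)

def weight_of_initial_estimate(alpha, k):
    """After k constant-alpha updates, the initial Q keeps weight (1 - alpha)^k."""
    return (1.0 - alpha) ** k
```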
</section>
<section id="exploitation-vs-exploration" class="level4">
<h4 class="anchored" data-anchor-id="exploitation-vs-exploration">Exploitation vs Exploration</h4>
<ul>
<li><strong>Exploitation</strong>: making the best decision by using already known information
<ul>
<li>By selecting the action with the highest Q-value while sampling new episodes, we can refine our policy efficiently within an already promising region of the state-action space.</li>
</ul></li>
<li><strong>Exploration</strong>: searching for new decisions by gathering more information
<ul>
<li>By selecting a random action with probability <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> while sampling new episodes, we can discover a new and possibly more promising region of the state-action space.</li>
</ul></li>
<li>To trade-off Exploitation and Exploration, use <strong><img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-greedy policy</strong>.</li>
</ul>
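<p>The trade-off comes down to a single branch per action selection. A minimal sketch using Python's standard <code>random</code> module:</p>

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With prob. epsilon explore uniformly; otherwise exploit the argmax.

    q_values is a list of Q(s, a) for the m actions available in state s."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                         # explore
    return max(range(len(q_values)), key=q_values.__getitem__)      # exploit
```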
</section>
<section id="mc-control-epsilon-greedy-policy-improvement" class="level4">
<h4 class="anchored" data-anchor-id="mc-control-epsilon-greedy-policy-improvement">MC Control (<img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-Greedy Policy Improvement)</h4>
<ul>
<li><p>Choose the greedy action with probability <img src="https://latex.codecogs.com/png.latex?1-%5Cepsilon"> and a random action with probability <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">. <img src="https://latex.codecogs.com/png.latex?%5Cepsilon/m%20%5Ctext%7B%20for%20each%20of%20all%20%7D%20m%20%5Ctext%7B%20actions%20%7D%20%5CRightarrow%20%5Ctext%7Bstochastic%20policy%7D"></p></li>
<li><p>All actions are selected with non-zero probability for ensuring continual exploration. <img src="https://latex.codecogs.com/png.latex?%5Cpi'(a%7Cs)%20=%20%5Cbegin%7Bcases%7D1%20-%20%5Cepsilon%20+%20%5Cepsilon/m%20&amp;%20%5Ctext%7Bif%20%7D%20a=%5Carg%5Cmax_%7Ba'%7DQ%5E%5Cpi(s,a')%20%5C%5C%20%5Cepsilon/m%20&amp;%20%5Ctext%7Botherwise%20%7D%20(m-1%20%5Ctext%7Bactions%7D)%5Cend%7Bcases%7D"></p></li>
<li><p>In full backup DP, Policy Improvement uses <img src="https://latex.codecogs.com/png.latex?%5Cpi'(s)%20=%20%5Carg%5Cmax_%7Ba%7D%20Q%5E%5Cpi(s,a)%20%5CRightarrow%20%5Ctext%7Bdeterministic%20policy%7D"></p></li>
<li><p>For any <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-greedy policy <img src="https://latex.codecogs.com/png.latex?%5Cpi">, the <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-greedy policy <img src="https://latex.codecogs.com/png.latex?%5Cpi'"> with respect to <img src="https://latex.codecogs.com/png.latex?q_%5Cpi"> is an improvement: <img src="https://latex.codecogs.com/png.latex?v_%7B%5Cpi'%7D(s)%5Cge%20v_%5Cpi(s)"></p></li>
</ul>
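<p>The two-case formula for <code>π'(a|s)</code> can be written out directly as a probability vector (a minimal sketch):</p>

```python
def epsilon_greedy_probs(q_values, epsilon):
    """pi'(a|s): 1 - eps + eps/m for the greedy action, eps/m otherwise."""
    m = len(q_values)
    greedy = max(range(m), key=q_values.__getitem__)
    probs = [epsilon / m] * m      # eps/m mass on every action...
    probs[greedy] += 1.0 - epsilon # ...plus the remaining mass on the argmax
    return probs
```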
</section>
<section id="greedy-in-limit-with-infinite-exploration-glie" class="level4">
<h4 class="anchored" data-anchor-id="greedy-in-limit-with-infinite-exploration-glie">Greedy in Limit with Infinite Exploration (GLIE)</h4>
<ul>
<li><p>All state-action pairs <img src="https://latex.codecogs.com/png.latex?(s,a)"> are explored infinitely many times: <img src="https://latex.codecogs.com/png.latex?%5Clim%5Climits_%7Bk%5Cto%5Cinfty%7Dn_k(s,a)=%5Cinfty"></p></li>
<li><p>The learning policy converges to a greedy policy: <img src="https://latex.codecogs.com/png.latex?%5Clim%5Climits_%7Bk%5Cto%5Cinfty%7D%5Cpi_k(a%7Cs)%20=%201%20%5Ctext%7B%20where%20%7D%20a%20=%20%5Carg%5Cmax%5Climits_%7Ba'%7DQ_k(s,a')"></p></li>
<li><p>GLIE MC Control: <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-greedy is GLIE if <img src="https://latex.codecogs.com/png.latex?%5Cepsilon_k%5Cto%200"> (e.g., <img src="https://latex.codecogs.com/png.latex?%5Cepsilon_k=1/k">), as follows:</p>
<ul>
<li><p>In the <img src="https://latex.codecogs.com/png.latex?k">-th episode using policy <img src="https://latex.codecogs.com/png.latex?%5Cpi">, for each state-action pair <img src="https://latex.codecogs.com/png.latex?(S_t,A_t)">: <img src="https://latex.codecogs.com/png.latex?n(S_t,%20A_t)%20%5Cgets%20n(S_t,%20A_t)%20+%201">, <img src="https://latex.codecogs.com/png.latex?Q(S_t,%20A_t)%20%5Cgets%20Q(S_t,%20A_t)%20+%20%5Cfrac%7B1%7D%7Bn(S_t,%20A_t)%7D%20%5BG_t%20-%20Q(S_t,%20A_t)%5D"></p></li>
<li><p>Improve policy based on the new action-value function: <img src="https://latex.codecogs.com/png.latex?%5Cepsilon=1/k,%5Cpi%5Cgets%5Cepsilon%5Ctext%7B-greedy%7D(Q)"></p></li>
</ul></li>
<li><p>GLIE MC Control converges to the optimal: <img src="https://latex.codecogs.com/png.latex?Q(s,a)%5Cto%20q_*(s,a)"></p></li>
</ul>
<div id="algo-monte-carlo-method" class="pseudocode-container quarto-float" data-line-number-punc=":" data-indent-lines="false" data-pseudocode-number="3" data-no-end="false" data-comment-delimiter="//" data-indent-size="1.2em" data-caption-prefix="Algorithm" data-line-number="true">
<div class="pseudocode">
\begin{algorithm} \caption{Monte Carlo Method} \begin{algorithmic} \State \textbf{1. Initialization} \State $Q(s,a)$, all $S\in\mathcal{S}$, $A\in\mathcal{A}(S)$, arbitrarily and $Q(\text{terminal}, \cdot) = 0$ \State Returns $(S,A) \gets$ empty list \State $\pi \gets$ an arbitrary $\epsilon$-soft policy (all probabilities nonzero) \State \State \textbf{2. Repeat for each episode} \Repeat \State (a) Generate an episode using $\pi:S_0, A_0, R_1, S_1, \ldots, S_{T-1}, A_{T-1}, R_T$ \State $G\gets 0$ \State (b) \For{each step of episode, $t=T-1, T-2, \ldots, 0$} \State $G \gets \gamma G + R_{t+1}$ \State Unless $(S_t, A_t)$ appears in $(S_0, A_0), \ldots, (S_{t-1}, A_{t-1})$: (ignore this check for every-visit MC) \State Append $G$ to Returns $(S_t, A_t)$ \State $Q(S_t, A_t)\gets$ average(Returns$(S_t, A_t)$) \EndFor \For{each $S_t$ in the episode} \State $A^*\gets\arg\max_{a}Q(S_t,a)$ \For{all $a\in\mathcal{A}(S_t)$} \State $\pi(a|S_t) \gets \begin{cases}1-\epsilon + \epsilon/|\mathcal{A}(S_t)| &amp; \text{if } a=A^* \\ \epsilon/|\mathcal{A}(S_t)| &amp; \text{otherwise}\end{cases}$ \EndFor \EndFor \Until{$\text{forever}$} \end{algorithmic} \end{algorithm}
</div>
</div>
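<p>The whole loop can be sketched end to end on a toy problem. The three-state chain below is entirely hypothetical (not from the post); it ties together first-visit evaluation, incremental updates, and the <code>ε = 1/k</code> GLIE schedule:</p>

```python
import random
from collections import defaultdict

def run_episode(policy, rng, max_steps=20):
    """One episode in a hypothetical chain: states 0 and 1, state 2 terminal.
    Action 1 moves right, action 0 moves left (floored at 0); the reward is
    +1 on reaching the terminal state, 0 otherwise."""
    s, episode = 0, []
    for _ in range(max_steps):
        a = rng.choices([0, 1], weights=policy[s])[0]
        s_next = min(s + 1, 2) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == 2 else 0.0
        episode.append((s, a, r))
        s = s_next
        if s == 2:
            break
    return episode

def glie_mc_control(num_episodes=2000, gamma=0.9, seed=0):
    """On-policy first-visit MC control with the GLIE schedule eps_k = 1/k."""
    rng = random.Random(seed)
    q, n = defaultdict(float), defaultdict(int)
    policy = {0: [0.5, 0.5], 1: [0.5, 0.5]}  # epsilon-soft start
    for k in range(1, num_episodes + 1):
        episode = run_episode(policy, rng)
        # Backward pass for returns, then first-visit incremental updates.
        g, returns = 0.0, []
        for (s, a, r) in reversed(episode):
            g = r + gamma * g
            returns.append((s, a, g))
        returns.reverse()
        seen = set()
        for (s, a, g) in returns:
            if (s, a) not in seen:
                seen.add((s, a))
                n[(s, a)] += 1
                q[(s, a)] += (g - q[(s, a)]) / n[(s, a)]
        # Epsilon-greedy policy improvement with epsilon = 1/k.
        eps = 1.0 / k
        for s in {s for (s, _, _) in episode}:
            greedy = max((0, 1), key=lambda a: q[(s, a)])
            policy[s] = [eps / 2 + (1.0 - eps) * (a == greedy) for a in (0, 1)]
    return q, policy
```

Because reaching the terminal state ends the episode, every sampled return for (state 1, action right) is exactly 1, so its Q-value is learned immediately; the decaying ε then drives the policy toward always moving right.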


</section>
</section>
</section>

 ]]></description>
  <category>RL</category>
  <guid>https://selianu.github.io/SelianBlog/posts/Reinforcement Learning 2/</guid>
  <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://selianu.github.io/SelianBlog/posts/Reinforcement Learning 2/GPI.png" medium="image" type="image/png" height="49" width="144"/>
</item>
</channel>
</rss>
