
Reinforcement Learning Week 4 Course Notes

2015-09-06

Three things this week: watch the lecture videos, read Sutton (1988), and do Homework 3 (HW3).

Below are the video screenshots and notes:

Temporal Difference Learning

Read Sutton 1988 first

  • Read Sutton, read Sutton, read Sutton, because the final project is based on it!

Three families of RL algorithms

  1. Model based
  2. Model free
  3. Policy search
  • From 1 → 3: more direct learning
  • From 3 → 1: more supervised

TD-lambda


Quiz 1: TD-lambda Example

  • In this case the model is known, so the calculation is easy.

Quiz 2: Estimating from Data

  • Remember from the previous lecture: we need to compute the return from each episode and average over the episodes, as sketched below.
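
To make "average over episodes" concrete, here is the batch estimate in standard notation (the symbols R_t(s) and T are my shorthand, not from the slide): after T episodes, the value estimate for state s is

```latex
V_T(s) \;=\; \frac{1}{T}\sum_{t=1}^{T} R_t(s)
```

where R_t(s) is the return observed from state s in episode t.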

Computing Estimates Incrementally

  • Rewriting the average incrementally makes the formula look a lot like neural-network learning, and a learning rate α is introduced (see the reconstruction after the quiz below).

Quiz 3: which choices of α make the learning converge (tip: if the exponent i is greater than 1, then Σ_T 1/T^i is bounded).
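
A reconstruction of the incremental update and the convergence conditions from the slide, in standard notation (α_T = 1/T recovers the plain average):

```latex
V_T(s) = V_{T-1}(s) + \alpha_T \big( R_T(s) - V_{T-1}(s) \big),
\qquad
\sum_{T} \alpha_T = \infty, \quad \sum_{T} \alpha_T^2 < \infty
```

The two conditions on the learning rate are what guarantee convergence: α_T = 1/T satisfies both, α_T = 1/T² fails the first, and a constant α fails the second.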

TD(1) Rule

TD(1) with and without repeated states

  • When there are no repeated states, TD(1) is the same as outcome-based updating (see all the rewards along the episode, then update the weights accordingly).
  • When there are repeated states, extra learning happens; see the sketch after this list.
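
A minimal Python sketch of one TD(1) episode update with eligibility traces, just to make the rule concrete. The episode format, the function name, and the fixed α and γ are my own assumptions, not from the lecture:

```python
# Sketch: one episode of TD(1) learning with accumulating eligibility traces.
# Assumptions: `episode` is a list of (state, reward, next_state) tuples, and
# V is a dict that includes every state (terminal states with value 0).
def td1_episode_update(V, episode, alpha=0.1, gamma=1.0):
    eligibility = {s: 0.0 for s in V}          # e(s), reset at the start of the episode
    for s, r, s_next in episode:
        eligibility[s] += 1.0                  # repeated states accumulate extra credit
        delta = r + gamma * V[s_next] - V[s]   # temporal-difference error
        for state in V:                        # every eligible state is updated
            V[state] += alpha * delta * eligibility[state]
            eligibility[state] *= gamma        # with lambda = 1, traces decay only by gamma
    return V
```

When a state appears more than once in an episode its trace exceeds 1, which is exactly the "extra learning" mentioned above.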

Why TD(1) is "Wrong"

  • In the case of the TD(1) rule, V(s2) is estimated by averaging over episodes. We only see s2 once, with a return of 12, so V(s2) = 12.
  • In the case of maximum-likelihood estimates, we instead learn the transition model from the data. For example, in the first 5 episodes we saw s3 → s4 three times and s3 → s5 twice, so the transition probabilities extracted from the data are 0.6 and 0.4 respectively (see the counting sketch after this list).
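
The maximum-likelihood transition estimate is just counting and normalizing. Here is a small sketch (the data format and function name are assumptions of mine):

```python
from collections import Counter, defaultdict

# Sketch: estimate transition probabilities by counting observed transitions.
# `episodes` is assumed to be a list of episodes, each a list of (s, s_next) pairs.
# With 3 of 5 episodes going s3 -> s4 and 2 of 5 going s3 -> s5, this
# returns P(s4 | s3) = 0.6 and P(s5 | s3) = 0.4.
def ml_transition_probs(episodes):
    counts = defaultdict(Counter)
    for episode in episodes:
        for s, s_next in episode:
            counts[s][s_next] += 1
    return {s: {s2: n / sum(c.values()) for s2, n in c.items()}
            for s, c in counts.items()}
```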

TD(0) Rule

  • First of all, if we have infinite data, TD(1) will also do the right thing.
  • When we have finite data, we can repeatedly (in the limit, infinitely often) resample it to recover the maximum-likelihood estimate; this is effectively what TD(0) does. The standard form of the rule is below.
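
For reference, the TD(0) update in its usual form (standard notation, not copied from the slide):

```latex
V(s_t) \;\leftarrow\; V(s_t) + \alpha \big( r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big)
```

Each step bootstraps from the current estimate of the next state instead of waiting for the full return.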

Connecting TD(0) and TD(1)

K-Step Estimators


  • E1 is the one-step estimator (one-step lookahead), i.e. TD(0).
  • E2 is the two-step estimator, and Ek is the k-step estimator.
  • When k goes to infinity, we get TD(1). The family is written out below.
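
My reconstruction of the k-step family in standard notation (the indexing conventions are mine):

```latex
\begin{aligned}
E_1:&\; V(s_t) \leftarrow V(s_t) + \alpha\big(r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big) &&\text{TD(0)}\\
E_k:&\; V(s_t) \leftarrow V(s_t) + \alpha\big(r_{t+1} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^{k} V(s_{t+k}) - V(s_t)\big)\\
E_\infty:&\; V(s_t) \leftarrow V(s_t) + \alpha\big(r_{t+1} + \gamma r_{t+2} + \cdots - V(s_t)\big) &&\text{TD(1), outcome-based}
\end{aligned}
```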

K-step Estimators and TD-lambda

TD(λ) can be seen as a weighted combination of the k-step estimators; the weight on the k-step estimator Ek is (1 − λ)λ^(k−1), as written out below.
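
The usual way to write this weighted combination is the λ-return (standard notation, not from the slide):

```latex
G_t^{\lambda} \;=\; (1-\lambda)\sum_{k=1}^{\infty} \lambda^{k-1} G_t^{(k)},
\qquad (1-\lambda)\sum_{k=1}^{\infty} \lambda^{k-1} = 1
```

where G_t^(k) is the k-step return used by Ek; the geometric weights sum to 1, λ = 0 recovers TD(0), and λ = 1 recovers TD(1).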

Why use TD-lambda?

The best-performing λ is typically not 0 (TD(0)) or 1 (TD(1)), but some value of λ between 0 and 1.

Summary

2015-09-05 first draft
2015-12-03 reviewed and revised up to the "Connecting TD(0) and TD(1)" slide
Original post: https://conge.livingwithfcs.org/2015/09/06/Reinforcement-Learning-di-si-zhou-ke-cheng-bi-ji/