Reinforcement Learning Week 8 Course Notes

2015-10-13

This week

  • Watch Exploration.
  • The readings were Fong (1995) and Li, Littman, Walsh (2008).

Exploration: Specific to RL

Sub-topics of Exploration

| Type | State transitions | Stochastic | Solution |
| --- | --- | --- | --- |
| Bandits | ✘ | ✔ | Hoeffding bound for stochastic decision making |
| Deterministic MDPs | ✔ | ✘ | Rmax for sequential decision making |
| Stochastic MDPs | ✔ | ✔ | Both the Hoeffding bound and Rmax |

K-armed Bandit Problem

  • Each arm (slot machine) has some probability of giving a payoff, and otherwise pays nothing.
  • The payoffs and their associated probabilities are unknown.
  • What is the best thing to do? Exploration. (A minimal simulation sketch follows this list.)
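
As a concrete reference, here is a minimal sketch (not from the lecture; the class name and payoff probabilities are made up) of a k-armed Bernoulli bandit, where pulling arm i pays 1 with an unknown probability p[i] and 0 otherwise:

```python
import random

class BernoulliBandit:
    """k slot machines; pulling arm i pays 1 with hidden probability p[i], else 0."""

    def __init__(self, payoff_probs, seed=0):
        self.p = list(payoff_probs)      # unknown to the learner
        self.rng = random.Random(seed)

    @property
    def k(self):
        return len(self.p)

    def pull(self, arm):
        """Return a payoff of 1 or 0 for the chosen arm."""
        return 1 if self.rng.random() < self.p[arm] else 0

# Example with 4 arms and made-up hidden payoff probabilities.
bandit = BernoulliBandit([0.2, 0.5, 0.7, 0.4])
rewards = [bandit.pull(2) for _ in range(10)]   # pull arm 2 ten times
```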

Confidence-based Exploration

![Quiz 1: K-Armed Bandits. Given the observed data, which arm gives the highest expected payoff, and which arm are we most confident about?](/assets/images/计算机科学/118382-6578df35f8c00073.png)

  • Arm d gives the highest expected payoff: it paid off once in the smallest number of tries.
  • Arm g gives the most confident estimate, since it has been pulled the most and still shows a good payoff.

Confidence-based Exploration

  • Maximum likelihood: ① given the current information, pick the action with the best estimated expected payoff -> ② act and collect new information -> repeat ① ②. This will always choose b.
  • Maximum confidence: ① given the current information, pick the action whose estimate we are most confident about -> ② act and collect new information -> repeat ① ②. This will also choose only b.
  • Minimum confidence: ① given the current information, pick the action whose estimate we are least confident about -> ② act and collect new information -> repeat ① ②. This alternates between a and b, but it never favors the better arm.
  • Exploration-exploitation dilemma: combine maximum likelihood with minimum confidence. The latter lets us try different actions, and the former lets us exploit the best-looking one (see the sketch after this list).
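
A minimal sketch of the selection rules above, reusing the BernoulliBandit from the earlier sketch. Here "confidence" is approximated simply by the pull count, which is my own stand-in rather than the lecture's exact definition:

```python
def maximum_likelihood_arm(means, counts):
    # Pick the arm with the best estimated payoff; on its own it never explores,
    # so it can lock onto one arm forever (the "always chooses b" problem).
    return max(range(len(means)), key=lambda a: means[a])

def minimum_confidence_arm(means, counts):
    # Pick the arm we know least about (fewest pulls); explores but never exploits.
    return min(range(len(counts)), key=lambda a: counts[a])

def run(bandit, choose_arm, steps=1000):
    counts = [0] * bandit.k
    means = [0.0] * bandit.k
    total = 0
    for _ in range(steps):
        arm = choose_arm(means, counts)
        reward = bandit.pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # incremental mean
        total += reward
    return total, means, counts

# A crude combination of the two rules: mostly exploit (maximum likelihood),
# occasionally explore (minimum confidence), echoing the dilemma above.
def combined_arm(means, counts, explore_every=10):
    if sum(counts) % explore_every == 0:
        return minimum_confidence_arm(means, counts)
    return maximum_likelihood_arm(means, counts)
```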

Metrics for Bandits

  1. Identify the optimal arm in the limit: pure exploration.
  2. Maximize (discounted) expected reward: exploration + exploitation, but computationally expensive.
    • Gittins index: a real scalar value associated with the state of a stochastic process that has a reward function and a probability of termination.
    • The Gittins index works well for bandits but does not generalize to other RL settings.
  3. Maximize expected reward over a finite horizon: similar to value iteration.
  4. Identify a near-optimal arm (within ε) with high probability (1 - δ) in time τ(k, ε, δ), polynomial in k, 1/ε, and 1/δ. This "get there" goal is to find a near-optimal arm; it is similar to PAC learning (probably approximately correct).
  5. Nearly maximize reward (within ε) with high probability (1 - δ) in time τ'(k, ε, δ), polynomial in k, 1/ε, and 1/δ. This "how to get there" goal is to accumulate as much reward as possible within the finite time τ'.
  6. Pull a non-near-optimal arm (worse by more than ε) no more than τ''(k, ε, δ) times with high probability (1 - δ). This "control error" goal is to minimize mistakes.

Find Best Implies Few Mistakes

  • τ'' is the total number of mistakes made while finding an ε'-close arm.
  • τ'' is bounded by τ, the time needed to find that arm.

Few Mistakes Implies Do Well

  • Is it true that the smaller m is, the smaller n is? Not always.
  • But it is true that the smaller m is, the smaller n can be.
  • "Do well" means the derived algorithm C can get ε'-close to the optimal arm, where ε' < ε.

Do Well Implies Find Best

  • If an algorithm does well (collects near-optimal reward), then the arms it pulls most often must be near-optimal, so watching which arm it favors identifies a near-optimal arm.

Putting it together

  • Given an algorithm for one of the goals, we can derive algorithms that achieve the other two.
  • So there is no need to pick just one.
  • Keep in mind, though, that the three goals are not always solvable (so in practice you might still have to pick one).

Hoeffding Bounds

  • 0 <= δ <= 1. The more confident we want to be, the smaller δ -> the larger Zδ -> the wider the confidence interval.
  • The larger n is, the narrower the confidence interval (see the bound written out below).
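
For reference, a standard form of the Hoeffding bound for n i.i.d. rewards in [0, 1]; the exact definition of Zδ in the lecture may differ by a constant:

```latex
% Empirical mean \hat{\mu}_n of n i.i.d. samples in [0,1] with true mean \mu:
\Pr\bigl(|\hat{\mu}_n - \mu| \ge \epsilon\bigr) \;\le\; 2 e^{-2 n \epsilon^2}.
% Setting the right-hand side to \delta and solving for \epsilon: with
% probability at least 1 - \delta,
|\hat{\mu}_n - \mu| \;\le\; \sqrt{\frac{\ln(2/\delta)}{2n}}
  \;=\; \frac{z_\delta}{\sqrt{n}},
\qquad z_\delta = \sqrt{\tfrac{1}{2}\ln\tfrac{2}{\delta}}.
% Smaller \delta (more confidence) makes z_\delta larger and the interval wider;
% larger n makes the interval narrower.
```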

Combining Arm Info

  • The length of the bar, ε, is the size of the confidence interval around μk.

How Many Samples?

  • C depends more strongly on ε than on δ, since δ appears only inside a logarithm (a short derivation follows this list).
  • These are PAC-style bounds for bandits.
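
A minimal sketch of where such a bound on C (samples per arm) comes from, assuming rewards in [0, 1] and using the Hoeffding bound above together with a union bound over the k arms:

```latex
% Ask each arm's estimate (after C pulls) to be within \epsilon of its true
% mean with failure probability at most \delta / k, so that a union bound
% over all k arms keeps the total failure probability at most \delta:
2 e^{-2 C \epsilon^2} \;\le\; \frac{\delta}{k}
\quad\Longrightarrow\quad
C \;\ge\; \frac{1}{2\epsilon^2} \ln\frac{2k}{\delta}.
% C grows like 1/\epsilon^2 but only logarithmically in k and 1/\delta,
% which is why it depends more on \epsilon and less on \delta.
```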

Exploring Deterministic MDPs

MDP Optimization Criteria

  • In an MDP, we can only choose actions, not states.
  • We can treat action selection as a bandit problem (e.g., choosing among k actions is a k-armed bandit).

Exploring MDP

  • Rmax: assume any state-action pair we have not fully explored (an unknown pair) has a self-loop with reward Rmax.
  • This way, the learner is encouraged to explore the whole MDP (a sketch of the idea follows this list).
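
A minimal sketch of the optimistic-model idea for a deterministic MDP, where observed transitions are stored in a dict and every untried (state, action) pair is modeled as a self-loop paying Rmax. The function and variable names here are illustrative, not from the lecture:

```python
R_MAX = 1.0   # assumed upper bound on any one-step reward

def optimistic_model(observed, r_max=R_MAX):
    """observed maps (state, action) -> (next_state, reward) for pairs we have tried.

    Returns next_state and reward functions for the optimistic "Rmax" MDP:
    untried pairs self-loop and pay r_max, so planning in this model pushes
    the agent toward unexplored parts of the real MDP.
    """
    def next_state(s, a):
        if (s, a) in observed:
            return observed[(s, a)][0]
        return s                        # unknown pair: pretend it self-loops

    def reward(s, a):
        if (s, a) in observed:
            return observed[(s, a)][1]
        return r_max                    # unknown pair: optimistic reward

    return next_state, reward
```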

Rmax Analysis

  • The example shows that, even with Rmax, there is a possibility that the learner stops visiting unknown states.
  • The discount factor is important in determining whether the agent will explore an unknown edge.

Lower bound using a special kind of MDP: sequential combination lock

  • We have an upper bound saying the algorithm can solve the problem within about n²k steps.
  • Now we are proving that no other algorithm can take fewer than on the order of n²k steps.
  • The sequential combination lock has n states, and in each state the learner can take one of k actions.
  • At the i-th state, one action sends the learner to the next state (i + 1); every other action sends the learner back to state 0, and getting back to state i takes at least i steps (this is the worst-case deterministic MDP).
  • So in the worst case it may take on the order of kn² steps to solve a deterministic MDP before the learner stops making mistakes (a small sketch of the construction follows this list).
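
A small sketch (with made-up names) of the combination-lock construction, which makes the worst-case argument concrete:

```python
import random

def make_combination_lock(n, k, seed=0):
    """n states (0..n-1), k actions; in state i exactly one secret action
    advances to state i+1, every other action resets the learner to state 0."""
    rng = random.Random(seed)
    secret = [rng.randrange(k) for _ in range(n)]   # the hidden "combination"

    def step(state, action):
        if action == secret[state]:
            return min(state + 1, n - 1)            # advance (stay once at the end)
        return 0                                    # wrong action: back to the start

    return step

# Worst-case intuition: discovering the secret action in state i can take up
# to k tries, and each wrong try costs about i steps to walk back to state i,
# so the total exploration cost is on the order of k * n^2 steps.
```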

General Rmax

For stochastic payoffs, we apply the Hoeffding bound; for sequential deterministic MDPs, Rmax solves the problem. Applying both together, we can solve stochastic MDPs.

  • In the general Rmax algorithm, define an unknown (s, a) pair as one that has been tried fewer than C times; for known pairs, use the maximum-likelihood estimates of the transitions and rewards (a sketch follows this list).
  • The threshold C comes from the Hoeffding bound.
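
A minimal sketch of the "known" test and the maximum-likelihood estimates for the stochastic case; the class and attribute names are illustrative, and C would come from the Hoeffding/union-bound argument above:

```python
from collections import defaultdict

class GeneralRmaxModel:
    def __init__(self, C, r_max=1.0):
        self.C = C                              # visits needed before (s, a) counts as known
        self.r_max = r_max
        self.visits = defaultdict(int)          # (s, a) -> number of times tried
        self.reward_sum = defaultdict(float)    # (s, a) -> total reward observed
        self.next_counts = defaultdict(int)     # (s, a, s') -> observed transitions

    def update(self, s, a, reward, s_next):
        self.visits[(s, a)] += 1
        self.reward_sum[(s, a)] += reward
        self.next_counts[(s, a, s_next)] += 1

    def known(self, s, a):
        return self.visits[(s, a)] >= self.C

    def reward(self, s, a):
        if not self.known(s, a):
            return self.r_max                   # optimism for unknown pairs
        return self.reward_sum[(s, a)] / self.visits[(s, a)]   # ML mean reward

    def transition_prob(self, s, a, s_next):
        if not self.known(s, a):
            return 1.0 if s_next == s else 0.0  # optimistic self-loop
        return self.next_counts[(s, a, s_next)] / self.visits[(s, a)]  # ML estimate
```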

Two key ideas make this work: the Simulation Lemma and the Explore-or-Exploit Lemma.

Simulation Lemma

  • If the estimated MDP is close to the real MDP, then a policy that optimizes reward in the estimate will be very close to an optimal policy for the real MDP.
  • α is the difference between the estimated and the real MDP.
  • α could mean different things if transition probabilities and rewards are not on the same scale, but we can always rescale so that α is between 0 and 1.

The difference between the estimated and real MDP (α) is polynomially related to the error in the value of actions (ε):

  • α is related to ε (a sketch of why follows this list).
  • α is used to determine C.
  • Use the Hoeffding bound and the union bound to find a C big enough that the estimated MDP is close enough to the real MDP.
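
A hedged sketch of why the relationship is polynomial (the constants here are not the lecture's exact ones). Assume the estimated rewards are within α of the true rewards, each estimated transition distribution is within α of the true one in total variation, and let Vmax = Rmax / (1 - γ):

```latex
% Compare one application of the Bellman equation for a fixed policy \pi in
% the true MDP M and the estimated MDP \hat{M}:
|\hat{V}^{\pi}(s) - V^{\pi}(s)|
  \;\le\; \underbrace{\alpha}_{\text{reward error}}
        + \gamma \underbrace{\alpha V_{\max}}_{\text{transition error}}
        + \gamma \max_{s'} |\hat{V}^{\pi}(s') - V^{\pi}(s')|.
% Solving for the worst-case gap \Delta = \max_s |\hat{V}^{\pi}(s) - V^{\pi}(s)|:
\Delta \;\le\; \frac{\alpha (1 + \gamma V_{\max})}{1 - \gamma},
% so the value error \epsilon grows only linearly (hence polynomially) in \alpha,
% and inverting the bound tells us how small \alpha (and hence how large C) must be.
```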

Explore-or-Exploit Lemma

  • The lemma says that if we follow the optimal policy of the optimistic (Rmax) model, then either we already collect near-optimal reward in the real MDP, or we quickly reach an unknown state-action pair and learn something new.

What Have We Learned?

  • The Hoeffding bound and the union bound tell us how certain we are about the environment.
  • Rmax makes sure we have "visited" things enough times to get accurate estimates.
2015-10-06 first draft, up to Quiz 1
2015-10-07 updated through Exploring Deterministic MDPs
2015-12-04 reviewed and revised
Original post: https://conge.livingwithfcs.org/2015/10/13/Reinforcement-Learning-di-ba-zhou-ke-cheng-bi-ji/