
Reinforcement Learning Week 14 Course Notes

2015-11-19

This week 

  • Watch the CCC (Coordinating, Communicating, and Coaching) lecture.
  • The readings are: Ziebart et al. (2008), Babes et al. (2011), Griffith et al. (2013), Cederborg et al. (2015), Roberts (2006), Bhat (2007).

CCC

Coordinating, Communicating, and Coaching

The decentralized partially observable Markov decision process (Dec-POMDP)

  • A Dec-POMDP combines perspectives from game theory and MDPs.
  • Multiple agents work toward a common reward. (If the rewards are separate for each agent, the model is instead a POSG, a partially observable stochastic game.) A sketch of the model's components follows.
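As a point of reference, the Dec-POMDP is commonly written as a tuple ⟨I, S, {A_i}, T, R, {Ω_i}, O⟩. A minimal sketch of those components in Python; the field names here are illustrative, not from the lecture:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Sketch of the Dec-POMDP tuple <I, S, {A_i}, T, R, {Omega_i}, O>.
# Field names are illustrative, not from the lecture.
@dataclass
class DecPOMDP:
    agents: List[str]                    # I: the set of agents
    states: List[str]                    # S: world states
    actions: Dict[str, List[str]]        # A_i: each agent's action set
    observations: Dict[str, List[str]]   # Omega_i: each agent's observation set
    # T(s, joint_action, s') -> probability of reaching state s'
    transition: Callable[[str, Tuple[str, ...], str], float]
    # R(s, joint_action) -> ONE shared reward (separate rewards => POSG)
    reward: Callable[[str, Tuple[str, ...]], float]
    # O(s', joint_action, joint_obs) -> probability of the joint observation
    observe: Callable[[str, Tuple[str, ...], Tuple[str, ...]], float]
```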

DEC-POMDPs properties

DEC-POMDPs example

  • Two agents each know where they are but not where the other is; when the two are in the same room, they win.
  • Strategy: go to a shared room. But each agent's knowledge of its own position could be wrong (partially observable world).

Communicating and Coaching

  • Agent 1 wants to set up some kind of reward function that moves another agent to do something (e.g., "get the apple for me").

Inverse Reinforcement Learning


  • Inverse Reinforcement Learning: the agent observes the environment and a set of behaviors (demonstrations), then infers a reward function that explains those inputs.

MLIRL: Maximum Likelihood Inverse Reinforcement Learning. A sketch of the idea follows.
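Roughly, MLIRL does gradient ascent on reward parameters to maximize the likelihood of the demonstrations under a Boltzmann (softmax) policy. A minimal sketch, assuming a linear reward r(s) = θ·φ(s); the soft value iteration and the numerical gradient are my simplifications, not the exact method of Babes et al. (2011):

```python
import numpy as np

def soft_q(theta, phi, P, gamma=0.95, beta=5.0, iters=100):
    """Soft value iteration: Q[s, a] under the reward r(s) = phi(s) . theta."""
    r = phi @ theta                      # per-state reward, shape (n_states,)
    Q = np.zeros(P.shape[:2])            # P has shape (n_states, n_actions, n_states)
    for _ in range(iters):
        V = np.log(np.exp(beta * Q).sum(axis=1)) / beta   # soft max over actions
        Q = r[:, None] + gamma * (P @ V)
    return Q

def log_likelihood(theta, phi, P, demos, beta=5.0):
    """Log-probability of demonstrated (state, action) pairs under the Boltzmann policy."""
    Q = soft_q(theta, phi, P, beta=beta)
    log_pi = beta * Q - np.log(np.exp(beta * Q).sum(axis=1, keepdims=True))
    return sum(log_pi[s, a] for s, a in demos)

def mlirl(phi, P, demos, steps=200, lr=0.1, eps=1e-4):
    """Gradient ascent on the demo likelihood (numerical gradient for brevity)."""
    theta = np.zeros(phi.shape[1])
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for k in range(len(theta)):
            e = np.zeros_like(theta)
            e[k] = eps
            grad[k] = (log_likelihood(theta + e, phi, P, demos)
                       - log_likelihood(theta - e, phi, P, demos)) / (2 * eps)
        theta += lr * grad
    return theta                         # reward weights that explain the demos
```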

MLIRL result


Policy Shaping


  • If a human gives feedback (commentary) on whether the agent's action is good or bad, s/he is doing policy shaping.
  • Could policy shaping be realized by reward shaping, i.e., by replacing an action's reward with a new reward?
  • The agent needs a mechanism for learning from both the environment and the commentary when deciding on a policy (not just following the commentary, because the commentary might not always be right).

Quiz 1: Policy Shaping

  • If the human is always correct, then given the feedback, what is the probability that each action (x, y, or z) is optimal?
  • Answers are in the slides above.

Quiz 2: Policy Shaping

  • What if the human is right only with probability 0.8?
  • Counting method:
    • Saying x is optimal is like saying y and z are not optimal.
    • Since the human is correct with probability 0.8, the unnormalized probabilities of x, y, and z being optimal are 0.8, 0.2, and 0.2.
    • Normalizing those numbers gives 2/3, 1/6, 1/6.

Policy Shaping probability calculation

  • Δa comes from the feedback data on action a (da): the number of positive minus negative labels. C is the probability that the person giving commentary is correct.
  • The formula on the slide computes the probability that action a is optimal; it has the form P(a is optimal | da) = C^Δa / (C^Δa + (1−C)^Δa).
  • Note: the final probability needs to be normalized against the probabilities of the other actions (see the sketch below).
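A small sketch of that calculation, which also reproduces the quiz 2 numbers:

```python
# Policy shaping probability from feedback counts:
# P(a optimal) ~ C**delta_a / (C**delta_a + (1 - C)**delta_a),
# where delta_a = (#positive - #negative) feedback labels for action a.

def shaped_probs(deltas, C):
    raw = {a: C**d / (C**d + (1 - C)**d) for a, d in deltas.items()}
    Z = sum(raw.values())                 # normalize across actions
    return {a: p / Z for a, p in raw.items()}

# Quiz 2: the human labels x optimal (implicitly, y and z are not optimal)
# and is right with probability C = 0.8.
print(shaped_probs({"x": 1, "y": -1, "z": -1}, C=0.8))
# -> x: 0.666..., y: 0.166..., z: 0.166...  (i.e., 2/3, 1/6, 1/6)
```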

Quiz 3: How do we combine info from multiple sources in policy shaping?

  • In the policy shaping setting, information comes from multiple sources.
  • E.g., πA and πH are policy information from the agent exploring the world and from the human giving feedback, respectively.
  • Some algorithms decrease the importance of πH as time goes on. Note that πH already incorporates the information about the human's uncertainty (C).
  • The way to combine the two sources is to calculate the probability that the two policies agree on each action: aopt = argmax_a p(a|π1) × p(a|π2) (see the sketch after this list).
    • In the quiz, x: 1/15, y: 1/60, z: 2/15, so we should choose z as optimal.
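A minimal sketch of the combination rule. The per-source probabilities below are hypothetical values chosen so the products match the quiz numbers; the actual inputs are on the slides:

```python
from math import prod

def combine(*sources):
    """Multiply per-action probabilities across independent sources."""
    return {a: prod(src[a] for src in sources) for a in sources[0]}

# Hypothetical p(a|pi_A) and p(a|pi_H); only their products come from the quiz.
pi_A = {"x": 1/5, "y": 1/10, "z": 2/5}
pi_H = {"x": 1/3, "y": 1/6, "z": 1/3}

scores = combine(pi_A, pi_H)        # x: 1/15, y: 1/60, z: 2/15
best = max(scores, key=scores.get)  # -> 'z'
print(scores, best)
```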

Drama Management

Drama management world

  • Ways a human can communicate with an agent:
    • demonstration: show the agent the correct actions (inverse RL)
    • reward shaping: give rewards for the agent's actions
    • policy shaping: comment on the agent's actions
  • In drama management, the author conveys his or her intent to the agent so the agent can shape the player's experience on the author's behalf.

Drama Management: what's a story

  • A story can be defined as a trajectory through plot points.

Trajectories as MDP

  • Above is a mapping of MDP elements onto trajectory-MDP elements.
  • Problems
    • The number of possible state sequences is huge (hyper-exponential).
    • Since an MDP maximizes reward, treating the story as an MDP will only make the author happy and force the player to experience the story the author wants.

TTD-MDP: Targeted Trajectory Distribution MDPs

  • P(t′ | t, a) is the probability that a player at trajectory t, after action a, ends up in trajectory t′. P(T) is a target distribution over trajectories.
  • The actions are not the player's actions but story actions.
  • The optimal policy is the one whose induced trajectory distribution matches the target distribution P(T).
  • The computation time is linear in the length of the story (see the sketch below).
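A toy sketch of why this is cheap. Assuming deterministic story actions, the trajectory space is a tree, and the policy at each node just splits probability in proportion to the target mass of each child's subtree; the tree and target below are hypothetical:

```python
# With deterministic story actions, the trajectory space is a tree. Matching
# the target distribution P(T) then takes one linear pass: at each partial
# trajectory, extend toward each child with probability proportional to the
# target mass of that child's subtree. Tree and target here are hypothetical.

tree = {"": ["A", "B"], "A": ["AC", "AD"], "B": [], "AC": [], "AD": []}
target = {"B": 0.5, "AC": 0.25, "AD": 0.25}   # P(T) over complete stories

def subtree_mass(node):
    kids = tree[node]
    if not kids:                               # leaf: a complete trajectory
        return target.get(node, 0.0)
    return sum(subtree_mass(k) for k in kids)

def policy(node):
    """Probability of extending `node` to each child trajectory."""
    total = subtree_mass(node)
    return {k: subtree_mass(k) / total for k in tree[node]}

print(policy(""))    # {'A': 0.5, 'B': 0.5}
print(policy("A"))   # {'AC': 0.5, 'AD': 0.5}
# Sampling with this policy yields exactly P(T): P(B)=0.5, P(AC)=P(AD)=0.25.
```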

What have we learned

Recap

2015-11-18: first draft completed
Original post: https://conge.livingwithfcs.org/2015/11/19/Reinforcement-Learning-di-shi-si-zhou-ke-cheng-bi-ji/