A proper policy for an MDP is defined as one that is guaranteed to reach a terminal state. Show that it is possible for a passive ADP agent to learn a transition model for which its policy π is improper even if π is proper for the true MDP; with such models, the value determination step may fail if γ = 1. Show that this problem cannot arise if value determination is applied to the learned model only at the end of a trial.
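For context on why γ = 1 causes trouble: the value-determination step for a fixed policy solves the linear system (I − γP)U = R over the non-terminal states of the learned model. If the learned model is improper, P contains a recurrent set of states with no path to a terminal state, so I − P is singular and the system has no bounded solution. Below is a minimal sketch in Python; the two-state loop model and the −0.04 step cost are illustrative assumptions chosen to mimic the situation described in the question, not details taken from it.

```python
import numpy as np

# Value determination for a fixed policy solves
#     (I - gamma * P) U = R
# where P holds the learned transition probabilities among
# non-terminal states under the policy, and R is the reward vector.
#
# Hypothetical learned model: the observed transitions so far cycle
# s0 -> s1 -> s0 and never reach a terminal state, so the agent's
# policy is improper *in the learned model* even if the true MDP
# would eventually lead to a terminal state.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
R = np.array([-0.04, -0.04])  # assumed small per-step cost

def value_determination(P, R, gamma):
    """Solve (I - gamma * P) U = R for the utility vector U."""
    A = np.eye(len(R)) - gamma * P
    return np.linalg.solve(A, R)

print(value_determination(P, R, 0.9))   # gamma < 1: nonsingular, solvable
try:
    print(value_determination(P, R, 1.0))
except np.linalg.LinAlgError as e:
    # With gamma = 1 and an improper learned model, I - P is singular:
    # expected total reward diverges and value determination fails.
    print("value determination failed:", e)
```

This also suggests why waiting until the end of a trial helps: a trial ends only when a terminal state is reached, so every state visited during the trial has an observed path to a terminal state in the learned model, and no such inescapable loop can appear in P.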