A proper policy for an MDP is defined as one that is guaranteed to reach a terminal state. Show that it is possible for a passive ADP agent to learn a transition model for which its policy π is improper even if π is proper for the true MDP; with such models, the value determination step may fail if γ = 1. Show that this problem cannot arise if value determination is applied to the learned model only at the end of a trial.
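For context on why γ = 1 causes trouble: the value-determination step for a fixed policy solves the linear system (I − γP)U = R over the non-terminal states of the learned model. If the learned model is improper, P contains a recurrent set of states with no path to a terminal state, so I − P is singular and the system has no bounded solution. Below is a minimal sketch in Python; the two-state loop model and the −0.04 step cost are illustrative assumptions chosen to mimic the situation described in the question, not details taken from it.

```python
import numpy as np

# Value determination for a fixed policy solves
#     (I - gamma * P) U = R
# where P holds the learned transition probabilities among
# non-terminal states under the policy, and R is the reward vector.
#
# Hypothetical learned model: the observed transitions so far cycle
# s0 -> s1 -> s0 and never reach a terminal state, so the agent's
# policy is improper *in the learned model* even if the true MDP
# would eventually lead to a terminal state.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
R = np.array([-0.04, -0.04])  # assumed small per-step cost

def value_determination(P, R, gamma):
    """Solve (I - gamma * P) U = R for the utility vector U."""
    A = np.eye(len(R)) - gamma * P
    return np.linalg.solve(A, R)

print(value_determination(P, R, 0.9))   # gamma < 1: nonsingular, solvable
try:
    print(value_determination(P, R, 1.0))
except np.linalg.LinAlgError as e:
    # With gamma = 1 and an improper learned model, I - P is singular:
    # expected total reward diverges and value determination fails.
    print("value determination failed:", e)
```

This also suggests why waiting until the end of a trial helps: a trial ends only when a terminal state is reached, so every state visited during the trial has an observed path to a terminal state in the learned model, and no such inescapable loop can appear in P.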