Select a specific member of the set of policies that are optimal for $R(s)>0$ as shown in Figure sequential-decision-policies-figure(b), and calculate the fraction of time the agent spends in each state, in the limit, if the policy is executed forever. (Hint: Construct the state-to-state transition probability matrix corresponding to the policy and see Exercise markov-convergence-exercise.)
Answer
Improve This Solution
View Answer