The direct utility estimation method in Section passive-rl-section uses distinguished terminal states to indicate the end of a trial. How could it be modified for environments with discounted rewards and no terminal states?
      Answer
      Improve This Solution
    
    
  View Answer