Description
Part-A
(Stochastic Maze) [10 Marks] The stochastic maze environment is shown in Figure 1.
There are a total of 12 states in the environment, represented by the indices {0, 1, ⋯ , 11}. The
agent starts in the initial state 0. Four actions are possible in each state: left, right, up, and
down. The environment is stochastic: the intended action is taken only with probability
𝑝 = 0.8, and an orthogonal action is taken with probability 0.2. If the agent collides with the
edge of the environment or with the wall present in state 5, it remains in the same state. A
transition into the goal state 3 yields a reward of +1, and a transition into the hole state 7 yields
a reward of -1. All other transitions have a reward of -0.01 associated with them.
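For intuition, the stochastic action model above can be sketched as follows. This is only an illustrative sketch, not the template notebook's implementation; the action encoding, and the assumption that the 0.2 orthogonal probability is split evenly (0.1 each) between the two perpendicular directions, are mine.

```python
import random

# Actions encoded as (d_row, d_col); this encoding is illustrative and
# the template notebook's own encoding may differ.
ACTIONS = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}

def sample_action(intended, rng=random):
    """Return the executed move: the intended action with probability 0.8,
    otherwise one of the two orthogonal actions (assumed 0.1 each)."""
    dr, dc = ACTIONS[intended]
    orth = [(dc, dr), (-dc, -dr)]   # the two perpendicular directions
    r = rng.random()
    if r < 0.8:
        return (dr, dc)
    return orth[0] if r < 0.9 else orth[1]
```

After sampling the executed move, the environment would clip any step off the grid or into the wall at state 5 back to the current state, as described above.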
Given an implementation of the environment in the template notebook, answer the following
questions:
Figure 1: Stochastic Maze Environment
1. Find an optimal policy to navigate the given environment using Policy Iteration (PI) [10
Marks]
2. (Bonus) Find an optimal policy to navigate the given environment using Value Iteration
(VI) [3 Marks]
3. (Bonus) Compare PI and VI in terms of convergence. Is the policy obtained by both
same? [2 Marks]
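For questions 1 and 2, the standard tabular forms of Policy Iteration and Value Iteration are sketched below. The sketch assumes the model is available as a transition tensor P[s, a, s'] and an expected-reward matrix R[s, a] with a discount factor gamma < 1; the template notebook's environment class may expose the model differently, so treat this as a reference for the algorithms, not the required interface.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99):
    """Tabular policy iteration for an MDP given as
       P[s, a, s'] : transition probabilities,
       R[s, a]     : expected immediate reward.
    (This model interface is an assumption, not the notebook's API.)"""
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[np.arange(n_states), pi]       # (S, S)
        R_pi = R[np.arange(n_states), pi]       # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy w.r.t. the one-step lookahead.
        Q = R + gamma * P @ V                   # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, V
        pi = new_pi

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Tabular value iteration on the same (P, R) model."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V                   # one-step Bellman backup
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new
        V = V_new
```

Comparing the two on the same model (question 3) typically shows both converging to the same greedy policy, with PI taking few but expensive iterations and VI taking many cheap sweeps.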
Assignment 2
EE675: Introduction to Reinforcement Learning
Part-B
(Cliff Walking) [10 Marks] Through this grid-world exercise we will compare the SARSA and
Q-learning algorithms, highlighting the difference between them. Consider the grid world
shown in the figure below.
This is a standard undiscounted, episodic task, with start and goal
states, and the usual actions causing movement up, down, right, and left. The reward is -1 on all
transitions except those into the region marked “The Cliff.” Stepping into this region incurs
a reward of -100 and sends the agent instantly back to the start. The episode ends when the
agent reaches the goal state.
Given the template notebook skeleton code, implement and answer the following questions in
the notebook:
1. Implement and compare the SARSA and Q-Learning methods [6+6 Marks]
2. Why is Q-learning considered an off-policy control method? [2 Marks]
3. Which algorithm takes a safer path? Why? [1 Mark]
Figure 1: Cliff Walking Environment
4. (Bonus) Suppose that action selection during learning is greedy. Is Q-learning then exactly
the same algorithm as SARSA? Will they make exactly the same action selections and
weight updates? [2 Marks]
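The contrast that questions 2 and 4 probe is the one-step update each method performs. A minimal sketch is given below; the tabular Q-value representation and the ε-greedy helper are my assumptions, not the template's skeleton code.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps, rng=random):
    """Pick a random action with probability eps, else a greedy one."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    # On-policy: bootstrap from the action a2 actually selected in s2
    # by the behaviour policy.
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    # Off-policy: bootstrap from the greedy action in s2, regardless
    # of which action the behaviour policy will actually take there.
    best = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
```

Note that the two updates coincide exactly when the action chosen in s2 happens to be the greedy one, which is the situation question 4 asks about.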