You are encouraged to work on this homework assignment in groups
of up to three students. If you work in a group, you only need to turn in one shared
solution. All students in the group will receive the same grade on the
assignment. If you choose to work in a group, you must actually
produce group solutions, not have each member
work independently on one part of the assignment, then submit the
collection of independent solutions. To acknowledge that you have
worked as intended, you must include the
following statement at the top of the submitted assignment:
"We, <students' names>, worked equally as a group
on this assignment, and each of us fully understands and acknowledges
the group solutions that we are submitting. We understand that we will
receive a common grade for this assignment."
Please be sure to include all group members' names on the assignment!
Outlook | Temp (F) | Humidity (%) | Windy? | Class |
sunny | 75 | 70 | true | Play |
sunny | 80 | 90 | true | Don't Play |
sunny | 85 | 85 | false | Don't Play |
sunny | 72 | 95 | false | Don't Play |
sunny | 69 | 70 | false | Play |
overcast | 72 | 90 | true | Play |
overcast | 83 | 78 | false | Play |
overcast | 64 | 65 | true | Play |
overcast | 81 | 75 | false | Play |
rain | 71 | 80 | true | Don't Play |
rain | 65 | 70 | true | Don't Play |
rain | 75 | 80 | false | Play |
rain | 68 | 80 | false | Play |
rain | 70 | 96 | false | Play |
The diagram below shows a gridworld domain in which the agent starts at the upper left location. The upper and lower rows are both "one-way streets," since only the actions shown by arrows are available.
Actions that attempt to move the agent
into a wall (the outer borders, or the thick black wall
that separates the top and bottom rows everywhere except
the leftmost column) leave the agent in the same state
with probability 1, and have reward -2.
If the agent tries to move to the right from the upper right
or lower right locations, with probability 1,
it is teleported to the far left end
of the corresponding row, with reward as marked (-10 and +20, respectively).
All other actions have
the expected effect (move up, down, left, or right)
with probability 0.9, and leave the agent in the same state
with probability 0.1. These actions all have reward -1,
except for the transitions that are marked in the upper left and
lower left cells. (Note that a
marked transition gives the indicated reward only if the action
succeeds in moving the agent in that direction; a failed attempt
has the usual reward of -1.)
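The three kinds of action outcome described above can be sketched as small helper functions. This is only an illustrative encoding, not part of the assignment: the function names and the `(next_state, probability, reward)` tuple layout are assumptions made here, and states are written as `(row, column)` pairs.

```python
STAY_REWARD = -1  # reward for an ordinary action whose move fails

def wall_bump(s):
    """Moving into a wall: stay in s with probability 1, reward -2."""
    return [(s, 1.0, -2)]

def teleport(row_start, reward):
    """Moving right from the right end of a row: jump to that row's far left cell.

    The reward is whatever the diagram marks for that row (-10 or +20).
    """
    return [(row_start, 1.0, reward)]

def ordinary_move(s, target, success_reward=-1):
    """Any other action: reach target with probability 0.9, else stay in s.

    success_reward defaults to -1; pass the marked value for a marked transition.
    """
    return [(target, 0.9, success_reward), (s, 0.1, STAY_REWARD)]

# Example: moving Right from the upper left cell, whose marked reward is +20.
print(ordinary_move((1, 1), (1, 2), success_reward=20))
```

Which helper applies to a given state and action still has to be read off the diagram; the sketch only fixes the probabilities and rewards stated in the text.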
(a) MDP (10 pts) Give the MDP for this domain,
restricted to the state transitions that start in each of
the states in the top row, by filling in a
state-action-state
transition table (showing only the state transitions
with non-zero probability). Refer to each state
by its row and column index, so the upper left state is [1,1]
and the lower right state is [2,4].
To get you started, here are the first few lines of the table:
State s | Action a | New state s' | p(s'|s,a) | r(s,a,s') |
[1,1] | Up | [1,1] | 1.0 | -2 |
[1,1] | Right | [1,1] | 0.1 | -1 |
[1,1] | Right | [1,2] | 0.9 | +20 |
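One way to sanity-check a finished table is to confirm that, for every state-action pair, the listed probabilities sum to 1. The sketch below holds only the three starter rows above; the tuple layout and function names are assumptions made for illustration.

```python
from collections import defaultdict

# Starter rows copied from the table: (s, a, s', p(s'|s,a), r(s,a,s')).
rows = [
    ((1, 1), "Up",    (1, 1), 1.0, -2),
    ((1, 1), "Right", (1, 1), 0.1, -1),
    ((1, 1), "Right", (1, 2), 0.9, 20),
]

def probability_totals(rows):
    """Sum p(s'|s,a) over s' for every (s, a) pair that appears in the table."""
    totals = defaultdict(float)
    for s, a, _s_next, p, _r in rows:
        totals[(s, a)] += p
    return dict(totals)

def incomplete_pairs(rows, tol=1e-9):
    """Return the (s, a) pairs whose outgoing probabilities do not sum to 1."""
    return [sa for sa, total in probability_totals(rows).items()
            if abs(total - 1.0) > tol]

print(incomplete_pairs(rows))
```

A non-empty result points at a state-action pair with a missing or mistyped row.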
(b) Value function (10 pts)
Suppose the agent follows a randomized policy π (where each
available action in any given state has equal probability)
and uses a discount factor of γ = 0.85.
Given the partial value function (Vπ; Uπ in
Russell & Norvig's terminology) shown below,
fill in the missing Vπ values. Show and explain your work.
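For reference, the relation that the missing values must satisfy can be written in the notation of the transition table from part (a). This is the standard Bellman expectation equation, specialized under the assumption that π chooses uniformly among the available actions; A(s), the set of actions available in state s, is notation introduced here for illustration.

```latex
V^{\pi}(s) \;=\; \sum_{a \in A(s)} \frac{1}{|A(s)|}
  \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, V^{\pi}(s') \,\bigr],
  \qquad \gamma = 0.85
```

Each state with an unknown value contributes one such linear equation, which can be solved once the known values are substituted.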
(c) Policy (10 pts)
Given the value function Vπ computed in (b),
what new policy π' would policy iteration produce at
the next iteration? Show your answer as a diagram (arrows
on the grid) or as a state-action table.
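The improvement step can be sketched as a one-step lookahead over the transition table, assuming the usual greedy rule from policy iteration. The function names are illustrative, the rows are only the three starter rows from part (a), and the values in `V_placeholder` are placeholders, not the values from the assignment's figure.

```python
GAMMA = 0.85  # discount factor given in part (b)

# Starter rows from part (a): (s, a, s', p, r). The full table has more rows.
rows = [
    ((1, 1), "Up",    (1, 1), 1.0, -2),
    ((1, 1), "Right", (1, 1), 0.1, -1),
    ((1, 1), "Right", (1, 2), 0.9, 20),
]

def q_value(rows, V, s, a, gamma=GAMMA):
    """One-step lookahead: sum over s' of p * (r + gamma * V[s'])."""
    return sum(p * (r + gamma * V[s_next])
               for (s0, a0, s_next, p, r) in rows
               if s0 == s and a0 == a)

def greedy_action(rows, V, s, gamma=GAMMA):
    """Pick the action listed for s with the largest lookahead value.

    Ties go to the alphabetically first action (an arbitrary choice).
    """
    actions = sorted({a0 for (s0, a0, *_rest) in rows if s0 == s})
    return max(actions, key=lambda a: q_value(rows, V, s, a, gamma))

# Placeholder values only, NOT the values shown in the assignment.
V_placeholder = {(1, 1): 0.0, (1, 2): 0.0}
print(greedy_action(rows, V_placeholder, (1, 1)))
```

With the complete table and the value function from (b) substituted for the placeholders, calling `greedy_action` once per state yields a state-action table in the form the question asks for.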
(a) Iterated Prisoner's Dilemma (5 pts)
 | C | D |
C | 3,3 | 0,5 |
D | 5,0 | 1,1 |
(b) Rock-Paper-Scissors (5 pts)
In this game, also called "Roshambo," each player chooses to present one of three objects: rock, paper, or scissors. Rock breaks (beats) scissors; paper covers (beats) rock; scissors cuts (beats) paper. Nobody wins (it's a tie) if both players pick the same object.
 | R | P | S |
R | 0,0 | -1,+1 | +1,-1 |
P | +1,-1 | 0,0 | -1,+1 |
S | -1,+1 | +1,-1 | 0,0 |
(c) Chicken (5 pts)
Two drivers are headed for a one-lane bridge. If they both swerve out of the way, they "tie" (nobody scores). If one swerves and the other drives straight on, the "chicken" loses a point and the gutsy driver gains a point. If neither swerves, they both lose big.
 | Straight | Swerve |
Straight | -10,-10 | +1,-1 |
Swerve | -1,+1 | 0,0 |
State the Nash equilibria of your game and explain why they are the equilibria. Also indicate what the social welfare maximizing strategy sets are for your game. Will rational players maximize social welfare in your game?
Do you think that the "tit-for-tat" strategy, which has been used successfully with the IPD, would work well for a player in your game? Why or why not?
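A hand analysis of the equilibria can be cross-checked with a short script. The sketch below tests every pair of pure strategies for profitable unilateral deviations and also lists the pairs with the largest total payoff. It assumes the first number in each cell is the row player's payoff, and it does not find mixed-strategy equilibria, so an empty result does not mean a game has no equilibrium. The function and variable names are illustrative.

```python
from itertools import product

def pure_nash_equilibria(payoffs):
    """Return action pairs where neither player gains by deviating alone.

    payoffs maps (row_action, col_action) -> (row_payoff, col_payoff).
    """
    row_actions = sorted({r for r, _ in payoffs})
    col_actions = sorted({c for _, c in payoffs})
    equilibria = []
    for r, c in product(row_actions, col_actions):
        row_pay, col_pay = payoffs[(r, c)]
        row_ok = all(payoffs[(r2, c)][0] <= row_pay for r2 in row_actions)
        col_ok = all(payoffs[(r, c2)][1] <= col_pay for c2 in col_actions)
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria

def max_welfare_profiles(payoffs):
    """Return the action pairs with the largest sum of the two payoffs."""
    best = max(sum(v) for v in payoffs.values())
    return sorted(k for k, v in payoffs.items() if sum(v) == best)

# Payoff tables transcribed from the three games above.
prisoners_dilemma = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
rock_paper_scissors = {
    ("R", "R"): (0, 0),  ("R", "P"): (-1, 1), ("R", "S"): (1, -1),
    ("P", "R"): (1, -1), ("P", "P"): (0, 0),  ("P", "S"): (-1, 1),
    ("S", "R"): (-1, 1), ("S", "P"): (1, -1), ("S", "S"): (0, 0),
}
chicken = {
    ("Straight", "Straight"): (-10, -10), ("Straight", "Swerve"): (1, -1),
    ("Swerve", "Straight"): (-1, 1),      ("Swerve", "Swerve"): (0, 0),
}

for name, game in [("IPD stage game", prisoners_dilemma),
                   ("Rock-Paper-Scissors", rock_paper_scissors),
                   ("Chicken", chicken)]:
    print(name, pure_nash_equilibria(game), max_welfare_profiles(game))
```

The script only reports which strategy pairs pass the check; the explanation of why they are equilibria still has to come from the deviation argument itself.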