Markov Decision Processes
- Due: 11:00pm, Friday, March 4, 2022
Description
This assignment is a modified version of an assignment from Stanford's CS221 course (source).
Markov decision processes (MDPs) can be used to formalize uncertain situations. In this homework, you will implement algorithms to find the optimal policy in these situations. You will then formalize a modified version of Blackjack as an MDP, and apply your algorithm to find the optimal policy.
Problem 1: Value Iteration
In this problem, you will perform the value iteration updates manually on a very basic game just to solidify your intuitions about solving MDPs. The set of possible states in this game is $\mathcal{S} = \{-2, -1, 0, +1, +2\}$ and the set of possible actions is $\mathcal{A} = \{a_1, a_2\}$. The initial state is 0 and there are two terminal states, $-2$ and $+2$. Recall that the transition function $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ encodes the probability of transitioning to a next state $s'$ after being in state $s$ and taking action $a$ as $\mathcal{T}(s' \mid s, a)$. In this MDP, the transition dynamics are given as follows:
$\forall i \in \{-1, 0, 1\} \subset \mathcal{S}$,
- $\mathcal{T}(i-1 \mid i, a_1) = 0.8$ and $\mathcal{T}(i+1 \mid i, a_1) = 0.2$
- $\mathcal{T}(i-1 \mid i, a_2) = 0.7$ and $\mathcal{T}(i+1 \mid i, a_2) = 0.3$
Think of this MDP as a chain formed by states $\{-2, -1, 0, +1, +2\}$. In words, action $a_1$ has an 80% chance of moving the agent backwards in the chain and a 20% chance of moving the agent forward. Similarly, action $a_2$ has a 70% chance of sending the agent backwards and a 30% chance of moving the agent forward. We will use a discount factor $\gamma = 1$. The reward function for this MDP is
$$\mathcal{R}(s,a,s') =
\begin{cases} 20 & s' = -2 \\
100 & s' = +2 \\
-5 & \text{otherwise}
\end{cases}$$
Part a (written)
Give the value of $V_i(s)$ for each state in $\mathcal{S}$ after iterations $i = 0, 1, 2$ of value iteration. Recall that $\forall s \in \mathcal{S}$, $V_0(s) = 0$ and, for any terminal state $s_{\text{terminal}}$, $V(s_{\text{terminal}}) = 0$. In words, all values are initialized to 0 at iteration 0, and terminal states (for which the optimal policy is not defined) always have a value of 0.
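For reference, each iteration of value iteration applies the standard Bellman optimality backup to every non-terminal state:
$$V_{i+1}(s) = \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \mathcal{T}(s' \mid s, a)\left[\mathcal{R}(s, a, s') + \gamma V_i(s')\right]$$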
Part b (written)
Using $V_2(\cdot)$, what is the corresponding optimal policy $\pi^*$ for all non-terminal states?
Problem 2: Peeking Blackjack
Now that we have gotten a bit of practice with general-purpose MDP algorithms, let's use them to play (a modified version of) Blackjack. For this problem, you will be creating an MDP to describe states, actions, and rewards in this game. More specifically, after reading through the description of the state representation and actions of our Blackjack game below, you will implement the transition and reward function of the Blackjack MDP inside succAndProbReward().
For our version of Blackjack, the deck can contain an arbitrary collection of cards with different face values. At the start of the game, the deck contains the same number of cards of each face value; we call this number the "multiplicity". For example, a standard deck of 52 cards would have face values [1, 2, ..., 13] and multiplicity 4. You could also have a deck with face values [1, 5, 20]; if we used multiplicity 10 in this case, there would be 30 cards in total (10 each of 1s, 5s, and 20s). The deck is shuffled, meaning that each permutation of the cards is equally likely.
The game occurs in a sequence of rounds. In each round, the player has three actions available to her:
- $a_{\text{take}}$: Take the next card from the top of the deck.
- $a_{\text{peek}}$: Peek at the next card on the top of the deck.
- $a_{\text{stop}}$: Stop taking any more cards.
In this problem, your state s will be represented as a 3-element tuple:
(totalCardValueInHand, nextCardIndexIfPeeked, deckCardCounts)
As an example, assume the deck has card values [1, 2, 3] with multiplicity 1, and the threshold is 4. Initially, the player has no cards, so her total is 0; this corresponds to state (0, None, (1, 1, 1)).
For $a_{\text{take}}$, the three possible successor states (each with equal probability 1/3) are:
(1, None, (0, 1, 1)), (2, None, (1, 0, 1)), (3, None, (1, 1, 0))
In words, a random card that is available in the deck is drawn and its corresponding count in the deck is then decremented. Remember that succAndProbReward() will expect you to return all three of the successor states shown above. Note that $\mathcal{R}(s, a_{\text{take}}, s') = 0$ for all $s, s' \in \mathcal{S}$. Even though the agent now has a card in her hand for which she may receive a reward at the end of the game, the reward is not actually granted until the game ends (see termination conditions below).

For $a_{\text{peek}}$, the three possible successor states are:
(0, 0, (1, 1, 1)), (0, 1, (1, 1, 1)), (0, 2, (1, 1, 1))
Note that it is not possible to peek twice in a row; if the player peeks twice in a row, then succAndProbReward() should return []. Additionally, $\mathcal{R}(s, a_{\text{peek}}, s') = -\text{peekCost}$ for all $s, s' \in \mathcal{S}$. That is, the agent will receive an immediate reward of -peekCost for reaching any of these states.

Things to remember about the states after taking $a_{\text{peek}}$:
- From (0, 0, (1, 1, 1)), taking a card will lead to the state (1, None, (0, 1, 1)) deterministically (that is, with probability 1.0).
- The second element of the state tuple is not the face value of the card that will be drawn next, but the index into the deck (the third element of the state tuple) of the card that will be drawn next. In other words, the second element will always be between 0 and len(deckCardCounts)-1, inclusive.
For $a_{\text{stop}}$, the resulting state will be (0, None, None). (Remember that setting the deck to None signifies the end of the game.)
The game continues until one of the following termination conditions becomes true:
- The player chooses $a_{\text{stop}}$, in which case her reward is the sum of the face values of the cards in her hand.
- The player chooses $a_{\text{take}}$ and "goes bust". This means that the sum of the face values of the cards in her hand is strictly greater than the threshold specified at the start of the game. If this happens, her reward is 0.
- The deck runs out of cards, in which case it is as if she selects $a_{\text{stop}}$, and she gets a reward which is the sum of the face values of the cards in her hand. Make sure that if you take the last card and go bust, the reward is 0, not the sum of the card values.
As another example, suppose the current state is (3, None, (1, 1, 0)) and the threshold remains 4.
- For $a_{\text{stop}}$, the successor state will be (3, None, None).
- For $a_{\text{take}}$, the successor states are (3 + 1, None, (0, 1, 0)) or (3 + 2, None, None).

Note that in the second successor state, the deck is set to None to signify that the game ended with a bust. You should also set the deck to None if the deck runs out of cards.
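To make the expected output concrete, here is a minimal sketch (not the assignment solution) of the successor lists for the example deck above. It assumes, based on the starter code's MDP interface, that succAndProbReward() returns a list of (newState, probability, reward) triples; the action names in the comments are informal placeholders.

```python
# Minimal sketch of the expected return format, assuming succAndProbReward()
# yields (newState, probability, reward) triples as in the starter code.

# Example deck [1, 2, 3] with multiplicity 1, current state (0, None, (1, 1, 1)).
state = (0, None, (1, 1, 1))
peekCost = 1  # placeholder value for illustration

# Taking: each remaining card is equally likely, its count is decremented,
# and no reward is granted yet.
take_successors = [
    ((1, None, (0, 1, 1)), 1.0 / 3, 0),
    ((2, None, (1, 0, 1)), 1.0 / 3, 0),
    ((3, None, (1, 1, 0)), 1.0 / 3, 0),
]

# Peeking: the peeked index is recorded, the deck is unchanged, and the
# immediate reward is -peekCost.
peek_successors = [
    ((0, 0, (1, 1, 1)), 1.0 / 3, -peekCost),
    ((0, 1, (1, 1, 1)), 1.0 / 3, -peekCost),
    ((0, 2, (1, 1, 1)), 1.0 / 3, -peekCost),
]
```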
Part a (code)
Implement the game of Blackjack as an MDP by filling out the succAndProbReward() function of the class BlackjackMDP.
Part b (code)
Let's say you're running a casino, and you're trying to design a deck to make people peek a lot. Assuming a fixed threshold of 20 and a peek cost of 1, design a deck where, for at least 10% of states, the optimal policy is to peek. Fill out the function peekingMDP() to return an instance of BlackjackMDP where the optimal action is to peek in at least 10% of states. Hint: before randomly assigning values, think about the case in which you would really want to peek instead of blindly taking a card.
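As a rough starting point, a skeleton along these lines may help; it assumes the constructor signature BlackjackMDP(cardValues, multiplicity, threshold, peekCost) from the starter code, and the card values shown are illustrative placeholders rather than a verified answer.

```python
# Sketch only, not a verified answer. Assumes the starter-code constructor
# signature BlackjackMDP(cardValues, multiplicity, threshold, peekCost).
def peekingMDP():
    # Threshold (20) and peek cost (1) are fixed by the problem statement.
    # The card values and multiplicity are placeholders; pick values so that
    # knowing the next card frequently changes the optimal decision.
    return BlackjackMDP(cardValues=[1, 5, 20], multiplicity=2,
                        threshold=20, peekCost=1)
```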
Problem 3: Learning to Play Blackjack
So far, we've seen how MDP algorithms can take an MDP which describes the full dynamics of the game and return an optimal policy. But suppose you go into a casino, and no one tells you the rewards or the transitions. We will see how reinforcement learning can allow you to play the game and learn its rules & strategy at the same time!
Part a (code)
You will first implement a generic Q-learning algorithm QLearningAlgorithm
, which is an instance of an RLAlgorithm
. As discussed in class, reinforcement learning algorithms are capable of executing a policy while simultaneously improving that policy. Look at simulate() in util.py to see how the RLAlgorithm
will be used. In short, your QLearningAlgorithm
will be run in a simulation of the MDP, and will alternately be asked for an action to perform in a given state (QLearningAlgorithm.getAction
), and then be informed of the result of that action (QLearningAlgorithm.incorporateFeedback
), so that it may learn better actions to perform in the future.
We are using Q-learning with function approximation, which means $\hat{Q}(s, a) = \mathbf{w} \cdot \phi(s, a)$, where in code, $\mathbf{w}$ is self.weights, $\phi$ is the featureExtractor function, and $\hat{Q}$ is self.getQ.
We have implemented QLearningAlgorithm.getAction as a simple $\epsilon$-greedy policy. Your job is to implement QLearningAlgorithm.incorporateFeedback(), which should take an $(s, a, r, s')$ tuple and update self.weights according to the standard Q-learning update:
$$\hat{Q}(s, a) \leftarrow (1 - \eta)\hat{Q}(s, a) + \eta\left(r + \gamma \hat{V}(s')\right)$$
where $r$ is the reward, $\eta$ is the step size, and
$$\hat{V}(s') = \max_{a' \in \text{Actions}(s')} \hat{Q}(s', a')$$
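Because $\hat{Q}$ is linear in the weights, the update above is applied to self.weights rather than to a table entry. One standard, equivalent way to write the weight update under linear function approximation (a sketch of how you might implement incorporateFeedback, not a prescription from the starter code) is:
$$\mathbf{w} \leftarrow \mathbf{w} - \eta\left[\hat{Q}(s, a) - \left(r + \gamma \hat{V}(s')\right)\right]\phi(s, a)$$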
Part b (written)
Now let's apply Q-learning to an MDP and see how well it performs in comparison with value iteration. First, call simulate
using your Q-learning code and the identityFeatureExtractor()
on the MDP smallMDP
(defined for you in p3.py
), with 30,000 trials. How does the Q-learning policy compare with a policy learned by value iteration (i.e., for how many states do they produce a different action)? (Don't forget to set the explorationProb of your Q-learning algorithm to 0 after learning the policy.) Now run simulate()
on largeMDP
, again with 30,000 trials. How does the policy learned in this case compare to the policy learned by value iteration? What went wrong?
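A hedged sketch of one way to set up this comparison is shown below; it assumes the starter code provides a ValueIteration solver (with solve() and a resulting pi dictionary), the simulate() helper in util.py referenced above, and the usual mdp.actions / mdp.discount() accessors. Adjust names to match your p3.py and util.py.

```python
# Sketch only: compare a Q-learning policy against value iteration on smallMDP.
# Assumes util.ValueIteration, util.simulate, and the constructor
# QLearningAlgorithm(actions, discount, featureExtractor) match your starter code.
import util
from p3 import smallMDP, QLearningAlgorithm, identityFeatureExtractor  # assumed locations

vi = util.ValueIteration()
vi.solve(smallMDP)                      # optimal policy via value iteration (vi.pi)

rl = QLearningAlgorithm(smallMDP.actions, smallMDP.discount(),
                        identityFeatureExtractor)
util.simulate(smallMDP, rl, numTrials=30000)

rl.explorationProb = 0                  # act greedily when reading off the learned policy
differing = [s for s in vi.pi if rl.getAction(s) != vi.pi[s]]
print(f"{len(differing)} of {len(vi.pi)} states differ")
```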
Part c (code)
To address the problems explored in the previous exercise, letโs incorporate some domain knowledge to improve generalization. This way, the algorithm can use what it has learned about some states to improve its prediction performance on other states. Implement blackjackFeatureExtractor
as described in the code comments. Using this feature extractor, you should be able to get pretty close to the optimum on the largeMDP
.
Part d (written)
Sometimes, we might reasonably wonder how an optimal policy learned for one MDP might perform if applied to another MDP with similar structure but slightly different characteristics. For example, imagine that you created an MDP to choose an optimal strategy for playing "traditional" blackjack, with a standard card deck and a threshold of 21. You're living it up in Vegas every weekend, but the casinos get wise to your approach and decide to make a change to the game to disrupt your strategy: going forward, the threshold for the blackjack tables is 17 instead of 21. If you continued playing the modified game with your original policy, how well would you do? (This is just a hypothetical example; we won't look specifically at the blackjack game in this problem.)
To explore this scenario, letโs take a brief look at how a policy learned using value iteration responds to a change in the rules of the MDP. For all subsequent parts, make sure to use 30,000 trials.
- First, run value iteration on the originalMDP (defined for you in p3.py) to compute an optimal policy for that MDP.
- Next, simulate your policy on newThresholdMDP (also defined for you in p3.py) by calling simulate with an instance of FixedRLAlgorithm that has been instantiated using the policy you computed with value iteration. What is the expected reward from this simulation? Hint: read the documentation (comments) for the simulate function in util.py, and look specifically at the format of the function's return value.
- Now try simulating Q-learning directly on newThresholdMDP with blackjackFeatureExtractor and the default exploration probability. What is your expected reward under the new Q-learning policy? Provide some explanation for how the rewards compare, and why they are different.
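One possible shape for the first two steps is sketched below; it assumes simulate() returns a list of per-trial total rewards and that FixedRLAlgorithm is constructed from a state-to-action policy dictionary, so check the comments in util.py before relying on the exact signatures.

```python
# Sketch only: replay the value-iteration policy on the new-threshold MDP.
# Assumes util.simulate returns one total reward per trial and that
# util.FixedRLAlgorithm wraps a state -> action dictionary.
import util
from p3 import originalMDP, newThresholdMDP  # assumed locations

vi = util.ValueIteration()
vi.solve(originalMDP)                        # optimal policy for the original threshold

fixed_rl = util.FixedRLAlgorithm(vi.pi)      # replays that policy without learning
rewards = util.simulate(newThresholdMDP, fixed_rl, numTrials=30000)
print("expected reward:", sum(rewards) / len(rewards))
```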
Turning in the Assignment
For this assignment, you must turn in a zip file of a directory named project3
containing the following files:
- p3.py
- p3.pdf (your answers to the written parts)
Submit the zip file to the appropriate folder on D2L.