Commentary on "On the Usefulness of Opponent Modeling: the Kuhn
Poker case study"
by Nick Christenson
Originally published in the June 2010 issue of
Two Plus Two Magazine.
Background
Once again, I examine a paper that covers opponent modeling (OM). As I
have explained previously in this series, opponent modeling is the art
of reasoning out an opponent's strategy, such that we might develop an
exploitive counter-strategy.
The paper covered in this article is titled "On the Usefulness of
Opponent Modeling: the Kuhn Poker case study". It was written by
Lazaric, Quarelsimale, and Restelli and published in the Proceedings
of the 7th International Conference on Autonomous Agents and Multiagent
Systems (AAMAS) in May, 2008. It is
available
online from the conference web site.
Introduction
In their short paper, the authors want to examine the use of learning
techniques to improve performance in a certain class of games. The
games are known as POSG, which stands for "partially observable stochastic
games". These are games where no player has complete information about
the game, and the game has a random element. Hey! You know what? I
just noticed that poker meets this definition! What an amazing coincidence.
Most of the work in applying learning techniques to improve poker
performance has gone in to determining how accurately these techniques
model an opponent's actual behavior. This is different than knowing
how much EV we gain from creating such a model and altering our
strategy to attempt to gain from it.
This paper explores how well these techniques can improve performance.
It also examines circumstances where these methods can actually
hurt performance. If we know what "policies" (what poker players would
call strategy) our opponent will follow we can modify our strategy to
take advantage of opponent deviations from optimal play. The problem is
that we can't ever completely know our opponent's strategy, we can only
approximate it based on our observations. If our estimate of our opponent's
strategy is flawed, our exploitive strategy won't work as well as we might
hope, and may actually lower our EV.
Terminology
Before attacking the analysis, I want to take a moment to explain
some terminology used in the paper. The authors use some different
words for some concepts that should be familiar to poker players.
I've already mentioned "POSG", but they use the abbreviation "RL"
quite a bit. This refers to "reinforcement learning" (algorithm).
Basically, reinforcement learning is a way by which an algorithm
improves its performance by incorporating data based on situations
it encounters. It's an attempt to improve a computer program using
the same sorts of observation and modification techniques that any
person uses to learn.
The title of the paper mentions "Kuhn poker". This refers to a
classic paper in game theory written by Harold Kuhn in 1950, titled
"A Simplified Two-Person Poker". As far as I can determine, Kuhn's
paper was the second time game theory had ever been applied to the
game of poker.
A description of Kuhn
poker can be found at Wikipedia, but for those who have read
Chen and Ankenman's book, The Mathematics of Poker, it's
a variant of the AKQ game they discuss in the book, albeit a variant
that predates MoP by over 50 years.
Kuhn poker is played as follows: Each of two players antes some
amount, then each player is dealt a card from a deck consisting of
three cards of different rank. The third card remains unseen by either
player. A bet and one raise is permitted. There is only one betting
round.
Kuhn poker is one of the simplest possible games that still preserves
the essential nature of poker. This coupled with the fact that it has
been part of the game theory literature for almost as long as game
theory has been around makes it a natural venue for researchers looking
at new facets of game theory research.
The paper also relies upon the concept of "fixed opponents". A "fixed
opponent" is one that doesn't change his strategy over the course of
play. That doesn't mean that the opponent always plays exactly the same
way each time he encounters a specific situation. It is entirely possible
to add some random element to a fixed strategy. For example, a fixed
opponent might raise 80% of the time in a given situation and call 20%
of the time. All "fixed" means is that the percentage chance that each
play is made doesn't change over the course of several iterations (hands).
Of course, it may seem strange to talk about playing a fixed strategy
in a paper on opponent modeling. Specifying a fixed strategy is something
we do for our opponent, whereas our "hero" (the algorithm being updated
by the opponent modeling process) is free to modify its strategy.
The paper defines a concept the authors call, "mutual information".
This refers to how much our opponent's play reveals about his strategy.
For example, if an opponent of moderate skill calls on the end after we
bet, then before the showdown we know that they don't have a hand
that can't beat a bluff. If they had a hand that couldn't beat a bluff
they might either fold or raise, but not call. Therefore, we've restricted
the range of possible hands they might hold, so that play provides a
high level of mutual information. On the other hand, if a player is making
plays without looking at his hole card(s), then such a play provides no
mutual information.
Finally, the paper discusses something called a "Partially Observable
Markov Decision Process" (POMDP). Again,
a
mathematically technical definition exists in Wikipedia. If we can
reduce a game to a POMDP, then we can use some well understood techniques
to approximate solutions. The downside is that in order to use these
techniques we have to have a fixed opponent. This is a reason why the
authors stipulate one.
What is a POMDP? Well, massively dumbed down, whenever you hear "Markov
process" you can think of a flow chart or tree where we move from node
to node. At each stage of the tree there is a payoff, which can be thought
of as gains or losses in EV. A decision process means that someone chooses
the path we follow through this tree. "Partially observable" means that
we don't always know everything about the payoffs as we move from node
to node. I expect that mathematicians are cringing at this description,
but it's the best I can do in four sentences using no mathematical terms.
In any case, when using fixed strategies poker games meet the definition
of a POMDP.
Analysis
So, now the authors have their algorithm play fixed opponents
applying a series of non-optimal strategies. As our hero accumulates
data, we're basically creating a correlation between how our opponents
act and what that tells us about their hole cards. From this data we
form a model about what we think our opponent's strategy is, and then
we modify our strategy in an attempt to exploit our opponents' errors.
The authors note that our opponent may pursue a strategy based only
on their hole card. Note that the optimal strategy for the game is
a strategy that fits this definition. At the other extreme, our opponent
might be pursuing a strategy based entirely on some random function
and not at all on his hole card. If this is the case, then from
the previous section the "mutual information" provided is zero, and
we gain no benefit from opponent modeling. Basically, by mathematics
the authors have formally expressed the maxim, "I can't put you on a
hand if you don't look at your cards."
At this point the authors provide some graphs showing how effective
adaptive strategies can be given a certain amount of mutual information
and how good is our opponent's choice of fixed strategy. The graphs
can be hard to decipher, especially since we're provided little detail
on what form the strategies themselves take. In any case, what they
provide is all very abstract and difficult to relate to more familiar
poker situations.
As we'd expect, if our opponent is playing mostly randomly, we gain
very little benefit from an opponent model. If our opponent is playing
mostly optimally, then some deviations in strategy that we can model
will produce minor gains, while some will produce substantial gains.
This is hardly surprising.
The most important conclusion from the paper is what happens if our
opponent model is incorrect. This is an issue that can easily arise
since our hero is forming an opponent model based on limited information.
The graph in Figure 4 shows a situation where our opponent model is
close to how our opponent plays, but is just a little bit off, and we
attempt to use it to exploit our opponent. In some cases (represented
by the blue arrows) the fact that our model is incorrect doesn't hurt
us much. We still see a benefit to using the model, even if it has flaws.
However, in many cases (represented by the red arrows), we do worse
against this opponent by using our imperfect opponent model.
Of course, we can't translate these results directly to real poker
games, but this does serve as a worthwhile cautionary tale. In some
poker games, if we attempt to exploit our opponents' play using an
imperfect model, we run the very real risk of lowering our EV
against them, even if all the data are collected and interpreted
correctly. Further, the results of the paper represent a situation
where we're playing poker against a fixed opponent. If we were to
add the possibility that our opposition might be "changing gears",
we can only conclude that an opponent model could be a much bigger
hinderance than benefit.
Learning Experiments
Finally, the authors examine a situation where our hero is a learning
algorithm playing a session against a fixed opponent. There are three
scenarios they examine:
- Our hero only learns about our opponent based on his own card.
That is, he might exploit his opponent based on things such as
calling to folding ratios, but not based on the specific hands revealed
at show down.
- Our hero uses the private card information and aggressively
forms an opponent model based on all the concepts discussed in this
paper.
- Our hero realizes that in the initial stages its model is
probably somewhat flawed due to a dearth of information, so he introduces
the influence of the opponent model slowly on top of the strategy
from scenario 1.
As one can see from Figure 5, the hole card only model gets better
over time and is a consistent if modest winner. The aggressive
opponent model starts out getting it's backside royally kicked
before eventually learning enough to be a significant winner. The
conservative approach to slow adoption of an opponent model seems to
represent the best of both worlds. It starts out performing only slightly
worse than the hole card only model, and quickly matches and then exceeds
its performance. Further, even when the aggressive opponent model is
up and running at its peak, the reliable opponent model does almost as
well.
As a poker player not writing to an audience of professional
mathematicians, my observations on this result are quite different
from what is reflected in the paper. First, opponent modeling with
too little data is horrific. This shouldn't be too surprising.
If I'm basing my OM on the results of one hand, I'm probably going to
end up concluding that my opponent is either hyper aggressive or
hyper passive. No human being plays a strategy similar to the aggressive
OM. Players never build our opponent models from scratch. Even
with a small number of hands we probably put a new opponent in one
of several broad categories and modify our opinion from there rather
than assume what we have observed is sufficient to understand his
entire game plan.
If we accumulate enough data, eventually our model becomes good enough
to completely supplant any a priori assumptions we might have made.
In fact, the paper demonstrates that we should be concerned that at some
point the rough framework that we started with becomes more of a
hinderance than a benefit.
It actually takes a quite a few hands to produce a good model.
In the paper, it takes about 500 hands for the aggressive opponent model
to have the same edge over our opponent as the algorithm that used no
opponent modeling, and that doesn't count the losses that occurred before
the model matures. Further, this is a game with a three card deck and
one betting round. While we don't know how much additional information
would be necessary to build up an opponent model of similar quality for
a real poker game, given the additional degrees of freedom such a game
represents, I think it's fair to assume that the number of hands
necessary would go up, likely way up.
Heck, in this example it takes 100 hands for the conservative opponent
model to demonstrably exceed the EV of the learning algorithm without
an OM. If I'm playing cash games in Las Vegas, I may never play more
than a hundred or so hands in my life against most of the opponents I
face. Of course, it's also possible that even though the complexity of a
real poker opponent model go way up, the benefits of forming such a
model go up even faster, making it far more beneficial in real poker
than it would appear in this particular simplified case. Until someone
applies these methods to more complex poker games, we'll never know.
Conclusions
Opponent modeling is not easy. We need a lot of data to really do
it right, probably more than we might at first think. Even when we
do this process well, there are many ways our opponents might deviate
from optimal play where we won't see large gains against such an
opponent. Moreover, when we create a bad opponent model, doing so
can easily hurt our EV, even if our opponent makes no attempt to
exploit our deviation from optimal play. My recommendation to players
attempting to exploit their opponents to err on the side of caution,
especially if there's a chance that those opponents might be capable
of switching gears.
I've examined a bunch of articles on poker research in this series,
and I hope people have benefited from them. If there are folks out
there who are looking at other poker research that's publicly available,
or know of research they'd like to see scrutinized, email me, PM me,
or post suggestions on the magazine forum.
|