Commentary on "Using Counterfactual Regret Minimization to Create
Competitive Multiplayer Poker Agents"
by Nick Christenson
This article was previously published in the March 2011 issue of
TwoPlusTwo Online Magazine.
Motivation
In this article I'm looking at research that's heading in a new direction
from previous columns. All the research I've covered so far has been
regarding algorithms for two-player poker. This time I'm discussing
research that tackles three-player games. It might seem that this would
be a pretty straightforward extension of heads-up play. It's not. From
a mathematical and algorithmic standpoint things get really hairy. This
paper should give a sense of some of these issues.
The paper I'm writing about this week is titled Using Counterfactual
Regret Minimization to Create Competitive Multiplayer Poker Agents. The
authors are Nick Abou Risk and Duane Szafron from the Computer Poker
Research Group (CPRG) at the University of Alberta. It was published in
the Proceedings of the 9th International Conference on Autonomous
Agents and Multiagent Systems, held in Toronto in May 2010.
Introduction
The opening section of the paper sets the stage for what the CPRG folks
are up to these days. They start by providing background and defining
terms. This includes discussion of extensive games and of modeling games
as mathematical trees. Readers of this series have seen this stuff before,
so I'll pass over it as well. However, at the end of the first paragraph,
they float the notion of treating the chance events in the directed graph
representation of the game (cards dealt in the poker case) as the decision
nodes for an additional, separate player. I don't recall anyone doing
this before. It's not clear how this influences their analysis, but it
caught my attention.
Abou Risk and Szafron move on to discuss several techniques that are useful in
designing poker playing algorithms. The rest of this section reads like
a synopsis of this article series. They discuss the use of linear
programming to find exact solutions to poker games, but state that this
technique is currently limited to simplified games. This is a topic I
covered in a couple of articles, including the one published in the
April 2010 issue of this magazine. They also discuss abstraction
methods, which I discussed in the June 2009 article in this series.
What's new this time is that they're dealing with three-player limit
Texas hold'em, and they're using a new technique called "Counterfactual
Regret Minimization" (CFR).
Background
Section 2.1 describes the intricacies of poker. As usual, we'll skip over
this since readers of this publication are quite familiar with how the game
is played. Section 2.2 defines some terms that we'll consider here.
information set
This is the set of all possible situations a player may face in the game.
For example, in heads-up limit hold'em, a player is dealt A♠ K♣,
raises preflop from the SB/button, and his opponent calls. The flop comes
A♣ T♦ 4♥ and the opponent checks. This is one situation.
Assemble every possible situation, and you have the information set for a
game. These situations can be simplified using methods such as eliminating
suitedness issues, but even so, for real poker games the information set
is intractably large.
Note that the information set is different from the game set. The game
set consists of each player's cards, the board, and the betting action
up to that point. Because this includes more than one player's cards,
the game set is much larger than the information set. This difference
becomes important later on.
strategy profile
At each decision point in a poker game, players may take one of three
actions. They may fold, check/call, or bet/raise. At each situation in
the information set, a player will pursue each of these courses of
action with some probability. Obviously, the sum of these three
probabilities must equal 1. The list of all of these probabilities for
each response in each situation in the information set for a particular
game is the strategy profile.
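As a concrete illustration, a strategy profile can be thought of as a table mapping each situation in the information set to a (fold, call, raise) probability triple. The situations and numbers below are invented for illustration, not taken from the paper:

```python
# Hypothetical sketch of a strategy profile: a mapping from information
# sets (encoded here as simple string keys) to (fold, call, raise)
# probabilities. The situations and numbers are made up.

strategy_profile = {
    # AsKc, raised pre and got called, flop Ac Td 4h, opponent checks:
    # never fold, mostly bet.
    "AsKc|raise-call|AcTd4h|check": (0.0, 0.2, 0.8),
    # A weak holding facing a bet: mostly fold.
    "7h2d|raise-call|AcTd4h|bet":   (0.9, 0.1, 0.0),
}

def is_valid(profile):
    """Each action distribution must sum to 1 (within rounding)."""
    return all(abs(sum(probs) - 1.0) < 1e-9 for probs in profile.values())

print(is_valid(strategy_profile))  # True
```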
Nash equilibrium
This is a term the reader may have heard bandied about by folks talking
about game theory. The technical definition of a Nash equilibrium is
when each player has adopted a strategy profile such that no player may
increase his EV solely by changing that strategy profile. In two-person
zero-sum games, Nash equilibria always exist, and this is basically
another name for the optimal strategy. In non-zero-sum games or
multiplayer games, there is no guarantee that a player following such
a strategy will actually maximize his EV. This contributes to making
the study of these games difficult.
ε-Nash equilibrium
Because computing an exact Nash equilibrium for complex games, such as
real poker games, is intractable, we're pretty happy if we can find a
strategy profile such that if our opponent changes his strategy profile,
his best case is that any gains are bounded and small. By "bounded and
small" we mean that we can specify a worst-case scenario. Presumably,
by increasing the computing power we expend, we can also reduce the
magnitude of that worst-case scenario as far as we would like. Usually,
when mathematicians talk about some unspecified and selectable small
value, they use the Greek letter epsilon (ε) to represent that
value, as in this terminology.
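The ε-Nash condition is concrete enough to check by hand for a tiny game. Here's a toy sketch (my own, not from the paper) that measures how much either player of a two-player zero-sum matrix game could gain by deviating to a pure best response; a strategy pair is an ε-Nash equilibrium when that gain is at most ε:

```python
# Toy ε-Nash check for a two-player zero-sum matrix game.
# payoffs[i][j] is the row player's payoff; the column player gets the
# negative. The game and strategies below are for illustration only.

def deviation_gain(payoffs, row_strategy, col_strategy):
    """Max either player could gain by a unilateral pure deviation."""
    rows, cols = len(payoffs), len(payoffs[0])
    value = sum(payoffs[i][j] * row_strategy[i] * col_strategy[j]
                for i in range(rows) for j in range(cols))
    # Row player's best pure response to the column strategy.
    best_row = max(sum(payoffs[i][j] * col_strategy[j] for j in range(cols))
                   for i in range(rows))
    # Column player (the minimizer) best responds by minimizing this sum.
    best_col = min(sum(payoffs[i][j] * row_strategy[i] for i in range(rows))
                   for j in range(cols))
    return max(best_row - value, value - best_col)

# Matching pennies: uniform play is an exact Nash equilibrium (ε = 0).
pennies = [[1, -1], [-1, 1]]
print(deviation_gain(pennies, [0.5, 0.5], [0.5, 0.5]))  # 0.0
```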
card abstraction
This is the idea of grouping similar game situations into a few different
categories, called buckets in the paper, and applying the same strategy
to each of these different, but supposedly similar, situations. Again,
this was discussed at some length in the June 2009 article in this
series.
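A toy version of bucketing, assuming we already have a hand-strength number in [0, 1] (real abstractions use far more sophisticated metrics than equal-width cutoffs):

```python
# Toy card abstraction: map a hand-strength value in [0, 1] to one of
# n_buckets equal-width buckets. Illustrative only; real bucketing
# schemes are much more careful about how "similar" hands are grouped.

def bucket(hand_strength, n_buckets=16):
    """Return the bucket index for a strength in [0, 1]."""
    return min(int(hand_strength * n_buckets), n_buckets - 1)

# With only 2 buckets, every hand is simply "below average" or
# "above average":
print(bucket(0.35, n_buckets=2))  # 0
print(bucket(0.90, n_buckets=2))  # 1
```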
perfect and imperfect recall
Perfect recall means that our decisions on later poker streets are
influenced by the decisions made on earlier streets, rather than just
our hand strength, the pot size, and opponent actions on that betting
round. Imperfect recall means that we're not considering all of what
has gone before in this hand when we make our decisions. Obviously,
perfect recall is better. The problem is that the number of situations
in the strategy profile explodes in complex games.
Counterfactual Regret Minimization (CFR)
Finally we get to those mysterious words in the title of the paper. Let's
define them individually. "Regret" is a formal definition of how different
the best possible outcome would be for a player from the actual outcome.
If the actual outcome isn't as good as how well things could have gone,
then that difference is how much the player "regrets" the decision.
Qualitatively, it's the lost opportunity cost. "Counterfactual" literally
means contrary to fact. In this case, it refers to the comparison of
the best outcome to the actual outcome. "Minimization" in this context
is pretty obvious. The CFR algorithm is designed to minimize overall
regret.
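The core update that drives CFR is regret matching: play each action with probability proportional to its accumulated positive regret. A minimal sketch of that rule (illustrative only; the actual algorithm tracks regrets per information set over many iterations):

```python
# Minimal regret-matching sketch, the update rule at the heart of CFR.
# Given cumulative regrets for each action, the next strategy plays
# each action in proportion to its positive regret.

def regret_matching(cumulative_regret):
    """Turn a list of cumulative regrets into action probabilities."""
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    if total > 0:
        return [r / total for r in positive]
    # No action is regretted positively: fall back to uniform play.
    n = len(cumulative_regret)
    return [1.0 / n] * n

# Regrets for (fold, call, raise): raising has been the biggest missed
# opportunity, so the new strategy raises most of the time.
print(regret_matching([-2.0, 1.0, 3.0]))  # [0.0, 0.25, 0.75]
```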
As it turns out, the CFR algorithm leads to ε-Nash equilibria.
What's exciting is that we can evaluate CFR based on the information
set of one player rather than the entire game set, as we'd need to
find a true Nash equilibrium. As was mentioned earlier, because one
player's information set is much smaller than the game set, this makes
finding these ε-Nash equilibria much easier. The authors claim
that this can reduce the amount of data storage required by a factor
of 10,000. That's a huge deal, and it's why the authors are so excited
about applying CFR to poker problems.
Section 2.4 covers some of the problems with Nash equilibria in
multiplayer games. The example given is fascinating and demonstrates
these issues. I recommend folks read this section carefully if they
are interested in examples of why three-player games are so radically
different from two-player games.
CFR for Multiplayer Games
The problem with using CFR here is that there's no guarantee that
CFR will find an ε-Nash equilibrium strategy in a multiplayer game.
Moreover, even if one is found, there's no guarantee that it will actually
end up being a winning strategy, as that depends on how our opponents play.
So, the authors elect to test this algorithm using toy three-person poker
games.
The first one they trot out is an extended form of the game called
Kuhn poker. This is similar to the AKQ game covered in The Mathematics
of Poker. Kuhn poker was discussed in the June 2010 article from this
series.
The second game is one the CPRG folks developed called Leduc hold'em.
It's a very simplified version of Texas hold'em that nonetheless preserves
some of the real game's flavor. Leduc hold'em was discussed in the
February 2010 article from this series.
The authors discover that there exists an ε-Nash equilibrium for
Kuhn poker, but they couldn't find one for Leduc hold'em. Presumably,
this means they won't be able to come up with one for more complex games
either, so they'll have to determine if the strategy that CFR comes up with
is any good by actually playing it out. Since they have to do this anyway,
they may as well just proceed using real Texas hold'em rather than a
simplified poker game.
CFR for Multiplayer Hold 'em
At this point the authors devote a great deal of computational time to
examining these strategies. First, they need some opponents. Their old
standby Poki is available from Poker
Academy Pro, so they use that. They also use three simplistic
opponents they categorize as "chumps". Specifically they're "always call"
(AC), "always raise" (AR), and "probe". The strategies adopted by
"always call" and "always raise" are easy to imagine. "Probe" is just
a variant of these. At each decision point it will call half the time
and raise half the time.
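These three chump strategies are simple enough to sketch directly (the function names are mine, not the paper's):

```python
# The three "chump" opponents, sketched as decision functions that
# ignore the game situation entirely.
import random

def always_call(_situation):
    """AC: check/call at every decision point."""
    return "call"

def always_raise(_situation):
    """AR: bet/raise at every decision point."""
    return "raise"

def probe(_situation):
    """Probe: call half the time, raise half the time, never fold."""
    return random.choice(["call", "raise"])
```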
Then they need some algorithms to test. They create a "perfect recall"
model they call PR2. Earlier in this article, a "perfect recall" algorithm
was defined as one that retains all the information on what has gone before
in this hand, and that this gets expensive in terms of memory use. So,
the authors are forced to use a severe hand abstraction, one with only two
buckets! How good do you think your play would be if you only grouped the
situations you encountered in poker as above and below average?
They also implement an imperfect recall model. Because such an
algorithm requires less memory, they can use 16 buckets for situation
categorization on each betting round. They iterated this model for
20 million hands, and label the resulting strategy as IR16S. A second
instance of the same model they iterated over 43 million hands, and
they called that resulting strategy IR16L.
Next they played all these algorithms against each other in each seating
permutation. They played two types of games: "bankroll" games, which
are just like the cash games poker players are used to, where winning
and losing is defined by how much you're up or down after some number
of hands; and "elimination" games, where the opponents play for a
while, the worst player is determined from the results and eliminated,
wins and losses against that player are zeroed out, and the rankings
are re-evaluated iteratively until only the top three remain. This
method is designed to emphasize being the "least exploitable" player
rather than the most profitable one. Since we're looking for
near-optimal strategies, this unorthodox evaluation system makes some
sense.
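The elimination procedure can be sketched as follows, using a simplified per-pair net-winnings table in place of the paper's actual per-match, per-seating data:

```python
# Sketch of "elimination" scoring: repeatedly drop the worst-scoring
# player, discarding all results from hands involving an eliminated
# player. The data layout is my own simplification.

def eliminate(results, keep=3):
    """results: {(player, opponent): player's net winnings vs opponent}.
    Returns the set of surviving players."""
    players = {p for pair in results for p in pair}
    while len(players) > keep:
        totals = {p: 0.0 for p in players}
        for (p, q), amount in results.items():
            # Only count results where both participants are still in.
            if p in players and q in players:
                totals[p] += amount
        worst = min(totals, key=totals.get)
        players.discard(worst)
    return players
```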
The results are shown in Table 3 of the paper. In bankroll events,
PR2 was the winner followed by IR16S and IR16L respectively. In
elimination events the order among the top three was reversed.
All three algorithms using CFR beat up on Poki, and everyone beat up
on the chumps.
Here's an interesting side note. Look at the twelfth row of the table,
where the opponents are AR and Probe. Everyone except PR2 loses to
these two. Why did these two chump programs do so well in this one
situation? I believe this is an example of how a three-player game
changes things so radically. Note that both opponents are loose and
rank somewhere between very and hyper-aggressive.
Certainly, always raise is the most aggressive player in the competition,
and probe is almost certainly second. So, we have a lot of raising and
every hand going to the river, many of them threeway. Basically, AR
and Probe are whipsawing the third opponent on every hand. For those
who have read The Mathematics of Poker, this situation is
very similar to the one the authors of the book label Example 29.1.
Basically, without any form of communication, AR and Probe are implicitly
colluding with each other to their opponent's detriment. It's not that
such a strategy can't be beat, but it's not as simple as one might imagine
at first.
Using HeadsUp Experts
Of course, in many three-person poker hands one player will fold, and
then we're left with heads-up poker. We know there are better ways of
tackling two-player games than our multiplayer strategies. Once the
third player is out, we can guarantee an ε-Nash equilibrium, we have
simplified information sets, and we can use more situation buckets on
each betting round; everything is easier. All the authors need to do is
combine these two sets of strategies together.
They call their two-player algorithms "heads-up experts", or HUEs.
Naturally, knitting these two strategies together is not so easy. There's
extra money in the pot that changes the equity in the situation, we
have some extra situations in our information set because of the
additional positional possibilities, there are potential card-removal
considerations, and we have to deal with a mismatch in the previous
betting round's information set. For example, what is our two-person
algorithm to make of a preflop betting sequence that goes "call, raise,
fold, call"?
The authors provide some information on how they deal with this, but
they leave out a lot of details in this paper. They can get around the
betting sequence problems by using an "imperfect recall" model, but
hand bucketing in a multiplayer situation is just going to be different.
I'd like to hear more on this, but we don't get that information in this
paper.
The algorithms play again, this time with no chumps in the mix, and the
results are presented in Table 5. The winner was IR16L with the addition
of a uniformly seeded HUE, followed by Ex16, which was based on IR16L
with an expert-seeded HUE. The rankings were the same in "elimination"
matches as they were in "bankroll" matches. It looks like combining the
three-player and two-player algorithms can work pretty well.
Conclusion
The IR16L and Ex16 algorithms were entered in the 2009 Annual Computer
Poker Competition. They finished first and second out of seven
competitors entered in the three-person limit Texas hold'em
competition. That's an excellent result, of course.
What did we learn from this? Even if CFR can't find optimal strategies
to threeplayer poker, apparently it can provide pretty good ones. We
also know that CFR can be used to find ε-Nash equilibria for
two-person games. With this algorithm we can create perfect and
imperfect recall poker agents, although with current limitations on the
amount of computer memory available to the researchers, 16-bucket
imperfect recall agents seem to perform better than 2-bucket perfect
recall agents.
We also see that it can be fruitful to knit in HUEs to take over play
once a hand has gotten heads-up. Not too shabby.
The state of the art for computer poker play continues to advance at an
alarming rate, moving from heads-up limit poker to big-bet poker, and
from two-player to multiplayer games. The mathematical toolkit,
programming techniques, and computing power all combine to make computer
poker opponents increasingly formidable.
