Commentary on "Using Counterfactual Regret Minimization to Create Competitive Multiplayer Poker Agents"

by Nick Christenson

This article was previously published in the March 2011 issue of TwoPlusTwo Online Magazine.

Motivation

In this article I'm looking at research that's heading in a new direction from previous columns. All the research I've covered so far has been regarding algorithms for two-player poker. This time I'm discussing research that tackles three-player games. It might seem that this would be a pretty straightforward extension of heads-up play. It's not. From a mathematical and algorithmic standpoint things get really hairy. This paper should give a sense of some of these issues.

The paper I'm writing about this week is titled Using Counterfactual Regret Minimization to Create Competitive Multiplayer Poker Agents. The authors are Nick Abou Risk and Duane Szafron from the Computer Poker Research Group (CPRG) at the University of Alberta. It was published in the Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, held in Toronto in May 2010.

Introduction

The opening section of the paper sets the stage for what the CPRG folks are up to these days. They start by providing background and defining terms. This includes discussion of extensive games and of modeling games as mathematical trees. Readers of this series have seen this stuff before, so I'll pass over it as well. However, at the end of the first paragraph, they float the notion of treating the chance events in the directed graph representation of the game (cards dealt in the poker case) as the decision nodes for an additional, separate player. I don't recall anyone doing this before. It's not clear how this influences their analysis, but it caught my attention.

Abou Risk and Szafron move on to discuss several techniques that are useful in designing poker-playing algorithms. The rest of this section reads like a synopsis of this article series. They discuss the use of linear programming to find exact solutions to poker games, but state that this technique is currently limited to simplified games. This is a topic I covered in a couple of articles, including the one published in the April 2010 issue of this magazine. They also discuss abstraction methods, which I discussed in the June 2009 article in this series. What's new this time is that they're dealing with three-player limit Texas hold'em, and they're using a new technique called "Counterfactual Regret Minimization" (CFR).

Background

Section 2.1 describes the intricacies of poker. As usual, we'll skip over this since readers of this publication are quite familiar with how the game is played. Section 2.2 defines some terms that we'll consider here.

information set
Formally, an information set is a collection of game states that a player can't tell apart, but it's easiest to think of one as a situation the player may face in the game. For example, in heads-up limit hold'em, a player is dealt A♠ K♣, raises pre-flop from the small blind (which is the button, heads-up), and his opponent calls. The flop comes A♣ T♦ 4♥ and the opponent checks. This is one situation. Assemble every possible situation, and you have all the information sets for a game. These situations can be simplified using methods such as eliminating suit distinctions, but even so, for real poker games the number of information sets is intractably large.

Note that the information set is different from the game set. The game set consists of every possible full game state: each player's cards, the board, and the betting action up to that point. Because a game state includes more than one player's cards, the game set is much larger than any one player's collection of information sets. This difference becomes important later on.

strategy profile
At each decision point in a poker game, a player may take one of three actions: fold, check/call, or bet/raise. In each situation in his information sets, a player will pursue each of these courses of action with some probability, and obviously the sum of the three probabilities must equal 1. The list of all of these probabilities for every situation is that player's strategy; collect one strategy for each player in the game, and you have a strategy profile.
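As a concrete illustration (the data structures here are my own invention, not the paper's), one player's strategy can be stored as a table mapping each situation to a fold/call/raise probability triple; one such table per player makes up the strategy profile:

    import random

    # Toy strategy table: each key identifies a situation (hole cards,
    # board, position, betting so far) and maps to (fold, check/call,
    # bet/raise) probabilities summing to 1. All numbers are invented.
    strategy = {
        ("AsKc", "", "SB", ""): (0.0, 0.2, 0.8),
        ("AsKc", "AcTd4h", "SB", "rc/x"): (0.0, 0.3, 0.7),
    }

    def sample_action(probs, rng):
        """Pick an action according to its probability."""
        r = rng.random()
        if r < probs[0]:
            return "fold"
        if r < probs[0] + probs[1]:
            return "check/call"
        return "bet/raise"

    rng = random.Random(42)
    print(sample_action(strategy[("AsKc", "", "SB", "")], rng))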

Nash equilibrium
This is a term the reader may have heard bandied about by folks talking about game theory. Technically, a Nash equilibrium is a strategy profile in which no player can increase his EV by unilaterally changing his own strategy. Every finite game has at least one, but in two-person zero-sum games an equilibrium strategy also guarantees the value of the game, so there it's basically another name for the optimal strategy. In non-zero-sum or multi-player games there is no such guarantee: a player following an equilibrium strategy may not actually maximize his EV against the opponents he faces. This contributes to making the study of these games difficult.

ε-Nash equilibrium
Because computing an exact Nash equilibrium for complex games, such as real poker games, is intractable, we're pretty happy if we can find a strategy profile such that no player can improve his EV by more than some small, bounded amount by deviating from it. By "bounded and small" we mean that we can specify a worst-case scenario, and presumably, by increasing the computing power we expend, we can shrink that worst case as far as we would like. Mathematicians conventionally use the Greek letter epsilon (ε) for such an arbitrarily small, selectable value, hence the terminology.
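Written out (my reconstruction of the standard definition, with u_i denoting player i's EV under strategy profile σ), the ε-Nash condition is:

    \forall i: \quad u_i(\sigma) \;\ge\; \max_{\sigma_i'} u_i(\sigma_i', \sigma_{-i}) - \varepsilon

where σ_{-i} means everyone else's strategies held fixed. Setting ε = 0 recovers the exact Nash equilibrium of the previous definition.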

card abstraction
This is the idea of grouping similar game situations into a few different categories, called buckets in the paper, and applying the same strategy to each of these different, but supposedly similar, situations. Again, this was discussed at some length in the June 2009 article in this series.
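As a toy sketch of the idea (assuming we already have some hand-strength estimate between 0 and 1; the paper's actual bucketing is based on hand strength metrics and is more involved):

    def bucket(hand_strength, num_buckets=16):
        """Map a hand-strength estimate in [0, 1] to one of num_buckets
        percentile buckets; all hands in a bucket share one strategy."""
        return min(int(hand_strength * num_buckets), num_buckets - 1)

    # With only 2 buckets, every hand is just "below average" (0)
    # or "above average" (1):
    assert bucket(0.37, num_buckets=2) == 0
    assert bucket(0.91, num_buckets=2) == 1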

perfect and imperfect recall
Perfect recall means that at every decision a player still "remembers" everything he has seen and done earlier in the hand, so decisions on later streets are influenced by the full history rather than just hand strength, the pot size, and opponent actions on the current betting round. Imperfect recall means we deliberately forget some of what has gone before in the hand when we make our decisions. Obviously, perfect recall is better in principle. The problem is that the number of situations the strategy must cover explodes in complex games, as the sketch below illustrates.
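Here is that sketch (the key construction is mine, not the paper's). A perfect recall model's situation key drags the full history along, so the number of distinct situations multiplies street by street, while an imperfect recall key forgets earlier streets:

    def perfect_recall_key(bucket_history, betting_history):
        # Remembers every bucket and action from earlier streets, so with
        # B buckets and four streets there are on the order of B**4 card
        # histories alone, before counting betting sequences.
        return (tuple(bucket_history), tuple(betting_history))

    def imperfect_recall_key(current_bucket, current_round_betting):
        # Forgets earlier streets: only the current bucket and the betting
        # on this round matter, so each street contributes only B card
        # situations and far fewer keys are needed.
        return (current_bucket, tuple(current_round_betting))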

Counterfactual Regret Minimization (CFR)
Finally we get to those mysterious words in the title of the paper. Let's define them individually. "Regret" is a formal measure of how much better the best possible outcome would have been for a player than the actual outcome. If the actual outcome isn't as good as how well things could have gone, that difference is how much the player "regrets" the decision; qualitatively, it's the lost opportunity cost. "Counterfactual" literally means contrary to fact. In this case, it refers to comparing the outcomes that didn't happen against the one that did. "Minimization" in this context is pretty obvious. The CFR algorithm is designed to minimize overall regret.
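The full CFR algorithm walks the entire game tree, but the engine inside it is simple "regret matching": keep a running total of how much you regret not having taken each action, and next time play each action in proportion to its positive regret. Here's a minimal sketch of regret matching on rock-paper-scissors self-play, which is my toy example, not the paper's poker implementation:

    import random

    ACTIONS = ["rock", "paper", "scissors"]
    # PAYOFF[a][b] = row player's payoff when playing a against b.
    PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

    def regret_matching(cum_regret):
        """Play each action in proportion to its positive cumulative regret."""
        positive = [max(r, 0.0) for r in cum_regret]
        total = sum(positive)
        if total == 0:
            return [1.0 / len(cum_regret)] * len(cum_regret)
        return [p / total for p in positive]

    def train(iterations=100_000, rng=random.Random(0)):
        cum_regret = [0.0, 0.0, 0.0]
        cum_strategy = [0.0, 0.0, 0.0]
        for _ in range(iterations):
            strat = regret_matching(cum_regret)
            for i in range(3):
                cum_strategy[i] += strat[i]
            # The opponent plays the same adaptive strategy (self-play).
            my = rng.choices(range(3), weights=strat)[0]
            opp = rng.choices(range(3), weights=strat)[0]
            # Regret = what each action would have earned minus what we got.
            for a in range(3):
                cum_regret[a] += PAYOFF[a][opp] - PAYOFF[my][opp]
        total = sum(cum_strategy)
        return [s / total for s in cum_strategy]  # average strategy

    print(train())  # approaches the equilibrium [1/3, 1/3, 1/3]

The key theoretical fact is that the average strategy over all iterations, not the final one, is what converges toward equilibrium.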

As it turns out, the CFR algorithm leads to ε-Nash equilibria. What's exciting is that CFR's storage needs scale with one player's information sets rather than with the entire game set, which is what we'd need to examine to find a true Nash equilibrium. As was mentioned earlier, because one player's information sets are much smaller than the game set, this makes finding these ε-Nash equilibria much easier. The authors claim that this can reduce the amount of data storage required by a factor of 10,000. That's a huge deal, and it's why the authors are so excited about applying CFR to poker problems.

Section 2.4 covers some of the problems with Nash equilibria in multiplayer games. The example given is fascinating and demonstrates these issues. I recommend folks read this section carefully if they are interested in examples of why three-player games are so radically different from two-player games.

CFR for Multiplayer Games

The problem with using CFR here is that there's no guarantee that CFR will find an ε-Nash equilibrium strategy in a multiplayer game. Moreover, even if one is found, there's no guarantee that it will actually end up being a winning strategy, as that depends on how our opponents play. So, the authors elect to test this algorithm using toy three-person poker games.

The first one they trot out is a three-player extension of the game called Kuhn poker. This is similar to the AKQ game covered in The Mathematics of Poker. Kuhn poker was discussed in the June 2010 article from this series.

The second game is one the CPRG folks developed called Leduc hold'em. It's a very simplified version of Texas hold'em that nonetheless preserves some of the real game's flavor. Leduc hold'em was discussed in the February 2010 article from this series.

The authors find that CFR converges to an ε-Nash equilibrium for three-player Kuhn poker, but they couldn't get it to find one for Leduc hold'em. Presumably, this means they won't be able to come up with one for more complex games either, so they'll have to determine whether the strategy CFR comes up with is any good by actually playing it out. Since they have to do this anyway, they may as well proceed using real Texas hold'em rather than a simplified poker game.

CFR for Multiplayer Hold'em

At this point the authors devote a great deal of computational time to examining these strategies. First, they need some opponents. Their old standby Poki is available from Poker Academy Pro, so they use that. They also use three simplistic opponents they categorize as "chumps". Specifically, they're "always call" (AC), "always raise" (AR), and "probe". The strategies adopted by "always call" and "always raise" are easy to imagine. "Probe" is a variant of these: at each decision point it calls half the time and raises half the time.
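These three are simple enough to write down in full; the interface here is my own invention, since the paper doesn't specify one:

    import random

    def always_call(state, rng):
        return "call"

    def always_raise(state, rng):
        # In a capped limit pot, "raise" would have to degrade to a call.
        return "raise"

    def probe(state, rng):
        # Never folds; calls half the time and raises half the time.
        return "call" if rng.random() < 0.5 else "raise"

    print(probe(None, random.Random(7)))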

Then they need some algorithms to test. They create a "perfect recall" model they call PR2. Earlier in this article, a "perfect recall" algorithm was defined as one that retains all the information about what has gone before in the hand, and we noted that this gets expensive in terms of memory use. So the authors are forced to use a severe card abstraction, one with only two buckets! How good do you think your play would be if you could only group the situations you encounter in poker as above and below average?

They also implement an imperfect recall model. Because such an algorithm requires less memory, they can use 16 buckets for situation categorization on each betting round. They iterated this model for 20 million hands and label the resulting strategy IR16S. A second instance of the same model was iterated over 43 million hands, yielding the strategy they call IR16L.

Next they played all these algorithms against each other in each seating permutation. They played two types of games. "Bankroll" games are just like the cash games poker players are used to: winning and losing is defined by how much you're up or down after some number of hands. In "elimination" games, the opponents play for a while, the worst player based on the results is eliminated, all wins and losses against that player are zeroed out, and the standings are re-evaluated, iterating until only the top three remain. This method is designed to emphasize being the "least exploitable" player rather than the most profitable one. Since we're looking for near-optimal strategies, this unorthodox evaluation system makes some sense.
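The elimination bookkeeping is the fiddliest part of that description, so here's a sketch of my understanding of it, simplified to pairwise results (the real matches are three-handed, and the data structures are mine):

    def elimination_ranking(agents, winnings, keep=3):
        """winnings[(a, b)] = chips a won from b over the match.
        Repeatedly drop the worst net performer, discarding all results
        involving eliminated players, until `keep` agents remain."""
        alive = set(agents)
        while len(alive) > keep:
            def net(a):
                return sum(winnings[(a, b)] - winnings[(b, a)]
                           for b in alive if b != a)
            alive.discard(min(alive, key=net))
        return alive

    # Hypothetical tallies: the chump loses to everyone and goes first,
    # even though it inflated the others' bankroll totals.
    agents = ["A", "B", "C", "chump"]
    winnings = {(a, b): 0 for a in agents for b in agents}
    winnings[("A", "chump")] = 300
    winnings[("B", "chump")] = 200
    winnings[("C", "chump")] = 100
    winnings[("C", "A")] = 50
    print(elimination_ranking(agents, winnings))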

The results are shown in Table 3 of the paper. In bankroll events, PR2 was the winner, followed by IR16S and IR16L, in that order. In elimination events the order among the top three was reversed. All three CFR-based algorithms beat up on Poki, and everyone beat up on the chumps.

Here's an interesting side note. Look at the twelfth row of the table, where AR and Probe are playing against opponents. Everyone except PR2 loses to these two. Why did these two chump programs do so well in this one situation? I believe this is an example of how a three-player game changes things so radically. Note that both opponents are loose and rank somewhere between very and hyper-aggressive. Certainly "always raise" is the most aggressive player in the competition, and Probe is almost certainly second. So we have a lot of raising and every hand going to the river, many of them three-way. Basically, AR and Probe are whipsawing the third player on every hand. For those who have read The Mathematics of Poker, this situation is very similar to the one the book's authors label Example 29.1. Without any form of communication, AR and Probe are implicitly colluding to their opponent's detriment. It's not that such a strategy can't be beaten, but it's not as simple as one might imagine at first.

Using Heads-Up Experts

Of course, in many three-person poker hands one player will fold, and then we're left with heads-up poker, and we know there are better ways of tackling two-player games than our multiplayer strategies. Once the third player is out, we can guarantee an ε-Nash equilibrium, we have smaller information sets, and we can use more situation buckets on each betting round; everything is easier. All the authors need to do is combine these two sets of strategies. They call their two-player algorithms "heads-up experts", or HUEs.

Naturally, knitting the two strategies together is not so easy. There's extra money in the pot that changes the equity of the situation, there are extra situations in our information sets because of the additional positional possibilities, there are potential card removal considerations, and we have to deal with a mismatch in the previous betting round's information set. For example, what is our two-player algorithm to make of a pre-flop betting sequence that goes "call, raise, fold, call"?
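In outline, the stitching might look like the sketch below. This is purely my guess at the dispatch logic, with a hypothetical translation helper; the translation step is exactly where the paper is thin on details.

    from dataclasses import dataclass, field

    @dataclass
    class State:
        players_remaining: int
        history: list = field(default_factory=list)  # e.g. ["call", "raise", "fold", "call"]

    def translate_to_heads_up(state):
        # Hypothetical translation: drop the folder's actions so the rest
        # of the sequence looks like a two-player betting line. With an
        # imperfect recall model only the current round matters, which
        # sidesteps most of the history mismatch.
        return State(2, [a for a in state.history if a != "fold"])

    def choose_action(state, three_handed, hue):
        """three_handed and hue are strategies: callables State -> action."""
        if state.players_remaining == 3:
            return three_handed(state)
        return hue(translate_to_heads_up(state))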

The authors provide some information on how they deal with this, but they leave out a lot of details in this paper. They can get around the betting sequence problems by using an "imperfect recall" model, but hand bucketing in a multiplayer situation is just going to be different. I'd like to hear more about this, but we don't get that information in this paper.

The algorithms play again, this time with no chumps in the mix, and the results are presented in Table 5. The winner was IR16L with the addition of a uniformly seeded HUE, followed by Ex16, which was based on IR16L with an expert-seeded HUE. The rankings were the same in "elimination" matches as in "bankroll" matches. It looks like combining the three-player and two-player algorithms can work pretty well.

Conclusion

The IR16L and Ex16 algorithms were entered in the 2009 Annual Computer Poker Competition. They finished first and second out of seven competitors entered in the three-person limit Texas hold'em competition. That's an excellent result, of course.

What did we learn from this? Even if CFR can't find optimal strategies for three-player poker, apparently it can provide pretty good ones. We also know that CFR can be used to find ε-Nash equilibria for two-person games. With this algorithm we can create both perfect and imperfect recall poker agents, although with the current limitations on the computer memory available to the researchers, 16-bucket imperfect recall agents seem to perform better than 2-bucket perfect recall agents. We also see that it can be fruitful to knit in HUEs to take over play once a hand has gotten heads-up. Not too shabby.

The state of the art for computer poker play continues to advance at an alarming rate, moving from heads-up limit poker to big bet poker, and from two-player to multiplayer games. The mathematical toolkit, programming techniques, and computing power all combine to make computer poker opponents increasingly formidable.