LV Revealed
 
 


Commentary on "On the Usefulness of Opponent Modeling: the Kuhn Poker case study"

by Nick Christenson

Originally published in the June 2010 issue of Two Plus Two Magazine.

Background

Once again, I examine a paper that covers opponent modeling (OM). As I have explained previously in this series, opponent modeling is the art of reasoning out an opponent's strategy, such that we might develop an exploitive counter-strategy.

The paper covered in this article is titled "On the Usefulness of Opponent Modeling: the Kuhn Poker case study". It was written by Lazaric, Quarelsimale, and Restelli and published in the Proceedings of the 7th International Conference on Autonomous Agents and Multiagent Systems (AAMAS) in May, 2008. It is available online from the conference web site.

Introduction

In their short paper, the authors want to examine the use of learning techniques to improve performance in a certain class of games. The games are known as POSG, which stands for "partially observable stochastic games". These are games where no player has complete information about the game, and the game has a random element. Hey! You know what? I just noticed that poker meets this definition! What an amazing coincidence.

Most of the work in applying learning techniques to improve poker performance has gone in to determining how accurately these techniques model an opponent's actual behavior. This is different than knowing how much EV we gain from creating such a model and altering our strategy to attempt to gain from it.

This paper explores how well these techniques can improve performance. It also examines circumstances where these methods can actually hurt performance. If we know what "policies" (what poker players would call strategy) our opponent will follow we can modify our strategy to take advantage of opponent deviations from optimal play. The problem is that we can't ever completely know our opponent's strategy, we can only approximate it based on our observations. If our estimate of our opponent's strategy is flawed, our exploitive strategy won't work as well as we might hope, and may actually lower our EV.

Terminology

Before attacking the analysis, I want to take a moment to explain some terminology used in the paper. The authors use some different words for some concepts that should be familiar to poker players.

I've already mentioned "POSG", but they use the abbreviation "RL" quite a bit. This refers to "reinforcement learning" (algorithm). Basically, reinforcement learning is a way by which an algorithm improves its performance by incorporating data based on situations it encounters. It's an attempt to improve a computer program using the same sorts of observation and modification techniques that any person uses to learn.

The title of the paper mentions "Kuhn poker". This refers to a classic paper in game theory written by Harold Kuhn in 1950, titled "A Simplified Two-Person Poker". As far as I can determine, Kuhn's paper was the second time game theory had ever been applied to the game of poker.

A description of Kuhn poker can be found at Wikipedia, but for those who have read Chen and Ankenman's book, The Mathematics of Poker, it's a variant of the AKQ game they discuss in the book, albeit a variant that predates MoP by over 50 years.

Kuhn poker is played as follows: Each of two players antes some amount, then each player is dealt a card from a deck consisting of three cards of different rank. The third card remains unseen by either player. A bet and one raise is permitted. There is only one betting round.

Kuhn poker is one of the simplest possible games that still preserves the essential nature of poker. This coupled with the fact that it has been part of the game theory literature for almost as long as game theory has been around makes it a natural venue for researchers looking at new facets of game theory research.

The paper also relies upon the concept of "fixed opponents". A "fixed opponent" is one that doesn't change his strategy over the course of play. That doesn't mean that the opponent always plays exactly the same way each time he encounters a specific situation. It is entirely possible to add some random element to a fixed strategy. For example, a fixed opponent might raise 80% of the time in a given situation and call 20% of the time. All "fixed" means is that the percentage chance that each play is made doesn't change over the course of several iterations (hands).

Of course, it may seem strange to talk about playing a fixed strategy in a paper on opponent modeling. Specifying a fixed strategy is something we do for our opponent, whereas our "hero" (the algorithm being updated by the opponent modeling process) is free to modify its strategy.

The paper defines a concept the authors call, "mutual information". This refers to how much our opponent's play reveals about his strategy. For example, if an opponent of moderate skill calls on the end after we bet, then before the showdown we know that they don't have a hand that can't beat a bluff. If they had a hand that couldn't beat a bluff they might either fold or raise, but not call. Therefore, we've restricted the range of possible hands they might hold, so that play provides a high level of mutual information. On the other hand, if a player is making plays without looking at his hole card(s), then such a play provides no mutual information.

Finally, the paper discusses something called a "Partially Observable Markov Decision Process" (POMDP). Again, a mathematically technical definition exists in Wikipedia. If we can reduce a game to a POMDP, then we can use some well understood techniques to approximate solutions. The downside is that in order to use these techniques we have to have a fixed opponent. This is a reason why the authors stipulate one.

What is a POMDP? Well, massively dumbed down, whenever you hear "Markov process" you can think of a flow chart or tree where we move from node to node. At each stage of the tree there is a payoff, which can be thought of as gains or losses in EV. A decision process means that someone chooses the path we follow through this tree. "Partially observable" means that we don't always know everything about the payoffs as we move from node to node. I expect that mathematicians are cringing at this description, but it's the best I can do in four sentences using no mathematical terms. In any case, when using fixed strategies poker games meet the definition of a POMDP.

Analysis

So, now the authors have their algorithm play fixed opponents applying a series of non-optimal strategies. As our hero accumulates data, we're basically creating a correlation between how our opponents act and what that tells us about their hole cards. From this data we form a model about what we think our opponent's strategy is, and then we modify our strategy in an attempt to exploit our opponents' errors.

The authors note that our opponent may pursue a strategy based only on their hole card. Note that the optimal strategy for the game is a strategy that fits this definition. At the other extreme, our opponent might be pursuing a strategy based entirely on some random function and not at all on his hole card. If this is the case, then from the previous section the "mutual information" provided is zero, and we gain no benefit from opponent modeling. Basically, by mathematics the authors have formally expressed the maxim, "I can't put you on a hand if you don't look at your cards."

At this point the authors provide some graphs showing how effective adaptive strategies can be given a certain amount of mutual information and how good is our opponent's choice of fixed strategy. The graphs can be hard to decipher, especially since we're provided little detail on what form the strategies themselves take. In any case, what they provide is all very abstract and difficult to relate to more familiar poker situations.

As we'd expect, if our opponent is playing mostly randomly, we gain very little benefit from an opponent model. If our opponent is playing mostly optimally, then some deviations in strategy that we can model will produce minor gains, while some will produce substantial gains. This is hardly surprising.

The most important conclusion from the paper is what happens if our opponent model is incorrect. This is an issue that can easily arise since our hero is forming an opponent model based on limited information. The graph in Figure 4 shows a situation where our opponent model is close to how our opponent plays, but is just a little bit off, and we attempt to use it to exploit our opponent. In some cases (represented by the blue arrows) the fact that our model is incorrect doesn't hurt us much. We still see a benefit to using the model, even if it has flaws. However, in many cases (represented by the red arrows), we do worse against this opponent by using our imperfect opponent model.

Of course, we can't translate these results directly to real poker games, but this does serve as a worthwhile cautionary tale. In some poker games, if we attempt to exploit our opponents' play using an imperfect model, we run the very real risk of lowering our EV against them, even if all the data are collected and interpreted correctly. Further, the results of the paper represent a situation where we're playing poker against a fixed opponent. If we were to add the possibility that our opposition might be "changing gears", we can only conclude that an opponent model could be a much bigger hinderance than benefit.

Learning Experiments

Finally, the authors examine a situation where our hero is a learning algorithm playing a session against a fixed opponent. There are three scenarios they examine:

  1. Our hero only learns about our opponent based on his own card. That is, he might exploit his opponent based on things such as calling to folding ratios, but not based on the specific hands revealed at show down.
  2. Our hero uses the private card information and aggressively forms an opponent model based on all the concepts discussed in this paper.
  3. Our hero realizes that in the initial stages its model is probably somewhat flawed due to a dearth of information, so he introduces the influence of the opponent model slowly on top of the strategy from scenario 1.

As one can see from Figure 5, the hole card only model gets better over time and is a consistent if modest winner. The aggressive opponent model starts out getting it's backside royally kicked before eventually learning enough to be a significant winner. The conservative approach to slow adoption of an opponent model seems to represent the best of both worlds. It starts out performing only slightly worse than the hole card only model, and quickly matches and then exceeds its performance. Further, even when the aggressive opponent model is up and running at its peak, the reliable opponent model does almost as well.

As a poker player not writing to an audience of professional mathematicians, my observations on this result are quite different from what is reflected in the paper. First, opponent modeling with too little data is horrific. This shouldn't be too surprising. If I'm basing my OM on the results of one hand, I'm probably going to end up concluding that my opponent is either hyper aggressive or hyper passive. No human being plays a strategy similar to the aggressive OM. Players never build our opponent models from scratch. Even with a small number of hands we probably put a new opponent in one of several broad categories and modify our opinion from there rather than assume what we have observed is sufficient to understand his entire game plan.

If we accumulate enough data, eventually our model becomes good enough to completely supplant any a priori assumptions we might have made. In fact, the paper demonstrates that we should be concerned that at some point the rough framework that we started with becomes more of a hinderance than a benefit.

It actually takes a quite a few hands to produce a good model. In the paper, it takes about 500 hands for the aggressive opponent model to have the same edge over our opponent as the algorithm that used no opponent modeling, and that doesn't count the losses that occurred before the model matures. Further, this is a game with a three card deck and one betting round. While we don't know how much additional information would be necessary to build up an opponent model of similar quality for a real poker game, given the additional degrees of freedom such a game represents, I think it's fair to assume that the number of hands necessary would go up, likely way up.

Heck, in this example it takes 100 hands for the conservative opponent model to demonstrably exceed the EV of the learning algorithm without an OM. If I'm playing cash games in Las Vegas, I may never play more than a hundred or so hands in my life against most of the opponents I face. Of course, it's also possible that even though the complexity of a real poker opponent model go way up, the benefits of forming such a model go up even faster, making it far more beneficial in real poker than it would appear in this particular simplified case. Until someone applies these methods to more complex poker games, we'll never know.

Conclusions

Opponent modeling is not easy. We need a lot of data to really do it right, probably more than we might at first think. Even when we do this process well, there are many ways our opponents might deviate from optimal play where we won't see large gains against such an opponent. Moreover, when we create a bad opponent model, doing so can easily hurt our EV, even if our opponent makes no attempt to exploit our deviation from optimal play. My recommendation to players attempting to exploit their opponents to err on the side of caution, especially if there's a chance that those opponents might be capable of switching gears.

I've examined a bunch of articles on poker research in this series, and I hope people have benefited from them. If there are folks out there who are looking at other poker research that's publicly available, or know of research they'd like to see scrutinized, email me, PM me, or post suggestions on the magazine forum.