As a longtime chess player, I immediately thought of the Elo rating system as one possible way to quantify the strength of different game-playing AIs. Elo is used by many prominent chess organizations, and it's also used for many other games, including some competitive online games.

When I first started this project, I hadn’t read anything specific about the actual Elo algorithm, but I figured it certainly would meet my needs in measuring AI strength. And honestly, it probably would. However, as I read up on the system, it became clear that Elo was designed for different needs than mine.

For example, a human chess player’s strength will change over time. Elo has a parameter called the K-factor that controls a rating’s sensitivity to change. Different organizations use different K-factors, and different players may be given different K-factors based on current rating or number of games played.
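For reference, the core of the standard Elo update looks something like this (a minimal sketch; the 400 scaling constant is conventional, and K=32 here is just one common choice, not a fixed part of the algorithm):

```python
def elo_update(rating, opponent_rating, score, k=32):
    """Return the player's new Elo rating after one game.

    score is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    k (the K-factor) controls how far a single result can move the rating.
    """
    expected = 1 / (1 + 10 ** ((opponent_rating - rating) / 400))
    return rating + k * (score - expected)

# A 1500-rated player beating a 1600-rated player with K=32 gains about 20 points.
print(elo_update(1500, 1600, 1.0))  # -> ~1520.5
```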

However, a game-playing AI’s strength should remain constant, given a particular build, a set number of iterations, and no learning systems. (Of course, a learning system would cause an AI’s strength to change, but Picket currently has no learning component, and the experiments I have immediately in mind study other parameters, so a learning system would only confound the data.)

Elo systems also have to deal with matters such as rating inflation over time and how to combat players selectively choosing opponents to help preserve their own rating. These aren’t issues we have to worry about with game-playing AIs.

So, if Elo isn’t the best fit, what system might work better? Let’s consider the unique features of our domain.

First, to reiterate, our goal is to assign each player a numerical score that represents its relative playing strength. The number we actually calculate is necessarily an approximation, since we can only derive it from a finite number of games.

Implicit in this goal is the assumption that players’ strengths are transitive: if A normally beats B and C normally beats A, then C will normally beat B. This is not necessarily true; for example, C might be a poor player that just happens to use a strategy A is weak against. However, for the time being we’ll accept this assumption in the general case.
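If we ever want to sanity-check that assumption against actual results, something like the following sketch could flag violating triples. It's purely hypothetical; it assumes game results are stored as simple (winner, loser) pairs.

```python
from collections import Counter
from itertools import permutations

def transitivity_violations(games):
    """Return (a, b, c) triples where a usually beats b and b usually
    beats c, yet c usually beats a, judged by head-to-head win counts.

    games: a list of (winner, loser) pairs.
    """
    games = list(games)
    wins = Counter(games)

    def usually_beats(x, y):
        return wins[(x, y)] > wins[(y, x)]

    players = {p for pair in games for p in pair}
    return [
        (a, b, c)
        for a, b, c in permutations(players, 3)
        if usually_beats(a, b) and usually_beats(b, c) and usually_beats(c, a)
    ]
```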

As already mentioned, we can assume player strength will not change over time.

We will also have a corpus of all rated games played. When a player wins a game, instead of only being able to calculate a new rating based on the player’s current rating and the opponent’s rating, we can treat that game as a data point alongside all previous games and calculate a new rating in a fuller context.

This combines well with the previous assumption: since a player’s strength doesn’t change over time, the very first game a player plays tells us as much about its strength as the most recent one.
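To make that concrete, here's a rough sketch of how the corpus might be kept and consulted. Everything here is hypothetical (the names, and especially the placeholder win-fraction "rating"); the point is only the shape: every new result is appended to the full history, and ratings are recomputed from all of it at once.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GameRecord:
    winner: str
    loser: str

def rate_all_players(games: list[GameRecord]) -> dict[str, float]:
    """Stand-in rating: each player's overall win fraction.
    A real system would do something smarter, but it would still
    look at every game ever played, not just the most recent one."""
    wins: dict[str, int] = {}
    played: dict[str, int] = {}
    for g in games:
        wins[g.winner] = wins.get(g.winner, 0) + 1
        played[g.winner] = played.get(g.winner, 0) + 1
        played[g.loser] = played.get(g.loser, 0) + 1
    return {p: wins.get(p, 0) / played[p] for p in played}

corpus: list[GameRecord] = []

def record_game(winner: str, loser: str) -> dict[str, float]:
    """Add one result to the corpus and recompute every rating
    from the complete history, rather than nudging two numbers."""
    corpus.append(GameRecord(winner, loser))
    return rate_all_players(corpus)
```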

Keeping these unique features of our requirements and domain in mind will help us formulate a system that fits our needs.