Early Work on Rating Systems

November 9, 2022

Every capstone AI season I intend to implement a proper rating system to more rigorously assess the playing strength of my engines (and us human testers), and every time I fail to get around to it. I figure it’s time to get ahead of the game.

Naturally I started with the well-known Elo rating system. The Wikipedia article led me to Mark Glickman’s site. Glickman created his own Glicko rating system, which offers several advantages over Elo (notably, a measure of rating uncertainty in addition to strength), and he has also published multiple papers on the broader subject. I found his paper “A Comprehensive Guide to Chess Ratings” extremely helpful in getting oriented.
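For reference, here is a minimal sketch of the basic Elo calculation: an expected score from the rating gap, then an update proportional to how far the actual result landed from that expectation. The function names and the K-factor of 32 are my own assumptions for illustration, not values taken from Glickman's paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (0..1) of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k is an assumed K-factor; real systems tune it.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

Two equally rated players each expect to score 0.5, so a win moves each rating by half of K in opposite directions.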

Here are a few unorganized notes on what I’ve read so far:

Engines are simpler

Chess rating systems dedicate a lot of effort to accounting for drift in player skill. The system needs to account for player ability growing and waning over time. Recent games should be treated as more relevant data on player ability than games played a long time ago. There’s an assumption that some players, such as children and beginners, may grow in ability faster than well-established players. Players also widely vary in how often they play rated games.

But I don’t have to work with human players! A specific build of an engine configured with specific parameters (treating any learning models as a type of parameter) should in theory play at a specific strength. This means I should be able to make a number of simplifications.
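One simplification this might allow: if an engine's strength really is fixed, the results of a whole batch of games can be pooled and converted into a rating difference in one step, instead of being replayed through game-by-game updates. The sketch below assumes the Elo logistic curve and simply inverts it; `rating_difference` is a hypothetical helper name, and it says nothing about how uncertain the estimate is.

```python
import math


def rating_difference(wins: int, losses: int, draws: int) -> float:
    """Estimate the Elo gap implied by a pooled match result.

    Assumes both engines have fixed strength, so game order does not matter.
    A positive result means the first engine is the stronger one.
    """
    games = wins + losses + draws
    if games == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / games
    if score <= 0.0 or score >= 1.0:
        # The logistic curve never reaches 0 or 1, so the gap is unbounded.
        raise ValueError("rating difference is undefined for a 0% or 100% score")
    # Invert score = 1 / (1 + 10 ** (-diff / 400)).
    return -400.0 * math.log10(1.0 / score - 1.0)
```

Under these assumptions a 75% score works out to a gap of roughly 191 points, and an even score to no gap at all.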

Plus, AI players show up to tournaments whenever you want more data from them.

Engines can anchor rating systems

Not only do player abilities drift; the rating system itself drifts. Depending on the mechanics of the system and the behavior of its players, its ratings may inflate or deflate. This means that, over time, a particular numerical rating would reflect a different level of player ability.

Glickman notes that, since AI engines play consistently, they could be used to anchor a rating system. There are a few implementation difficulties, like making sure humans play enough games against the AI players, but it is an interesting idea.

Accounting for turn order advantages

Glickman points out that rating systems typically assign a single number to gauge player strength, but many games give an advantage to a particular player (such as white’s first-move advantage in chess). He discusses ways of dealing with this issue, such as adding a constant to the ratings update calculation that functionally treats the white player as having a higher rating.
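As a rough sketch of the constant-offset idea, the expected score can be computed as if white's rating were a fixed number of points higher. This uses the Elo logistic curve; the function name and the default of 35 points are placeholders I made up, not figures from Glickman.

```python
def expected_score_white(
    white_rating: float,
    black_rating: float,
    white_advantage: float = 35.0,  # assumed placeholder, not a measured value
) -> float:
    """Expected score for white, treating white as `white_advantage` points stronger."""
    effective_white = white_rating + white_advantage
    return 1.0 / (1.0 + 10.0 ** ((black_rating - effective_white) / 400.0))
```

With an advantage of zero this reduces to the ordinary Elo expectation, so equally rated players would each expect 0.5; any positive constant pushes white's expectation above that.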

For my purposes, I wonder if I can just calculate color-specific ratings in addition to a standard rating. Each color-specific rating would only be updated when the AI player plays that color. It seems like this sort of approach would also be useful in determining the advantage constant for the above approach.
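Here is one way the bookkeeping for that could look. It is only a sketch under several assumptions of mine: plain Elo updates, a K-factor of 32, a starting rating of 1500, and the choice to compare white's as-white rating against black's as-black rating. All of the names are hypothetical.

```python
from dataclasses import dataclass

K = 32.0  # assumed K-factor


def expected_score(rating_a: float, rating_b: float) -> float:
    """Elo expected score of A against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


@dataclass
class EngineRatings:
    overall: float = 1500.0
    as_white: float = 1500.0
    as_black: float = 1500.0


def record_game(white: EngineRatings, black: EngineRatings, white_score: float) -> None:
    """Update both engines in place; white_score is 1.0, 0.5, or 0.0."""
    # The standard rating is updated after every game.
    delta = K * (white_score - expected_score(white.overall, black.overall))
    white.overall += delta
    black.overall -= delta

    # Color-specific ratings only move for the color each engine just played.
    color_delta = K * (white_score - expected_score(white.as_white, black.as_black))
    white.as_white += color_delta
    black.as_black -= color_delta
```

If white really does have an edge, the as-white ratings across the pool should drift above the as-black ratings, and the size of that gap might serve as a first guess at the advantage constant. I have not verified how well that estimate behaves.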

I really should look at Fishtest

The goal for this research is to better assess the playing strength of AI engines, so I can tell if a particular change results in an improvement. In the chess world, there is already a well-used system for doing this: Stockfish’s Fishtest system.

I believe strongly in learning the fundamentals of a problem and trying your own approach instead of immediately jumping to established solutions. On the other hand, doing that too much risks reinventing the wheel, and likely reinventing it poorly. At some point (probably some point soon), I should dig into Fishtest to see how it solves many of these same problems.