schneiderbox


Glicko Testbeds

November 13, 2022

I read Mark Glickman’s quite accessible paper on his Glicko rating system. To better understand how it works, I built a couple of testbeds to play around with. They simulate games between players whose strengths have known statistical properties, so I can measure how accurately the calculated ratings reflect actual player strength. This also gave me an excuse to write an implementation of the nontrivial rating calculation (the derivation of which is explained in a different paper that I have not yet read and that I’d bet is a tad more involved).
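For a sense of what that rating calculation involves, here is a minimal sketch of the Glicko update for one player over one rating period, following the formulas in Glickman’s paper as I understand them. It is not the code from my notebook, and the function names (`g`, `expected_score`, `glicko_update`) are just labels I picked for illustration.

```python
import math

Q = math.log(10) / 400  # the constant q from the paper


def g(rd):
    """Discount factor that shrinks the impact of an uncertain opponent."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)


def expected_score(r, r_opp, rd_opp):
    """Expected score against an opponent with rating r_opp and deviation rd_opp."""
    return 1 / (1 + 10 ** (-g(rd_opp) * (r - r_opp) / 400))


def glicko_update(r, rd, results):
    """Return (new_rating, new_rd) after one rating period.

    results is a list of (opponent_rating, opponent_rd, score) tuples,
    where score is 1 for a win, 0.5 for a draw, and 0 for a loss.
    """
    inv_d2 = 0.0  # accumulates 1 / d^2
    delta = 0.0   # accumulates sum of g * (score - expected)
    for r_opp, rd_opp, score in results:
        e = expected_score(r, r_opp, rd_opp)
        inv_d2 += Q**2 * g(rd_opp) ** 2 * e * (1 - e)
        delta += g(rd_opp) * (score - e)
    variance = 1 / (1 / rd**2 + inv_d2)
    return r + Q * variance * delta, math.sqrt(variance)


# A player rated 1500 (RD 200) beats a 1400 (RD 30) player, then loses
# to a 1550 (RD 100) player and a 1700 (RD 300) player.
new_r, new_rd = glicko_update(
    1500, 200, [(1400, 30, 1), (1550, 100, 0), (1700, 300, 0)]
)
```

Note that the new deviation depends only on who was played, not on the results, and it can only shrink within a period.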

The code is rough and could easily contain errors, but if you’re curious, here’s the notebook and its HTML export.

The paper also considers how player ratings change over time. The rating system administrator defines a rating period of a certain length, and after each period elapses, the measure of uncertainty of each player’s rating (a special feature of the Glicko system) increases. The paper suggests some ways to determine the ideal parameters for these operations.

However, as discussed in the last post, I don’t need to worry about that, so I can ignore that part of the system. (In fact, if you follow the suggestions for calculating c under “Implementation issues” with the assumption that player ratings never become more uncertain over time, you end up with c = 0, which makes the over-time update of the ratings deviation a no-op.)
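To make the no-op concrete, here is a small sketch of the over-time step as I read it from the paper: the deviation grows as sqrt(RD² + c²t) and is capped at the deviation of an unrated player (350). The helper names `inflate_rd` and `solve_c` are hypothetical, and the numbers in the `solve_c` call are only illustrative.

```python
import math


def inflate_rd(rd, c, t, max_rd=350.0):
    """Grow a ratings deviation after t idle rating periods, capped at max_rd."""
    return min(math.sqrt(rd**2 + c**2 * t), max_rd)


def solve_c(typical_rd, unrated_rd, periods):
    """Pick c so that typical_rd grows to unrated_rd after `periods` periods."""
    return math.sqrt((unrated_rd**2 - typical_rd**2) / periods)


# With c = 0 the deviation never changes, however many periods pass.
unchanged = inflate_rd(50, 0, 1000)

# Illustrative case: an RD of 50 should decay to 350 over 30 periods.
c = solve_c(50, 350, 30)
after_30 = inflate_rd(50, c, 30)
```

As `periods` grows without bound, `solve_c` heads toward 0, which is one way to see why “never more uncertain” lands on c = 0.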

The results from the testbeds seem pretty straightforward. Both the simple and more complicated cases follow the same rough trajectory of error reduction. Comparing the number of games to the error, the two-player version seems to reduce error a little faster than the multiple-player version. That seems to make intuitive sense; with multiple players there are more parameters to sort out, and more data is needed to get solid estimates of them.
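As a rough idea of what the two-player case looks like, here is a stripped-down sketch rather than the notebook itself. It assumes, for simplicity, that each game is its own rating period, that there are no draws, that both players start at 1500 with an RD of 350, and that the “true” win probability follows the usual logistic curve on the true rating difference. With only two players, only the rating difference is identifiable, so that is what the error tracks. All names here are hypothetical.

```python
import math
import random

Q = math.log(10) / 400


def g(rd):
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)


def update_one_game(r, rd, r_opp, rd_opp, score):
    """Glicko update for a rating period containing a single game."""
    e = 1 / (1 + 10 ** (-g(rd_opp) * (r - r_opp) / 400))
    inv_d2 = Q**2 * g(rd_opp) ** 2 * e * (1 - e)
    variance = 1 / (1 / rd**2 + inv_d2)
    return r + Q * variance * g(rd_opp) * (score - e), math.sqrt(variance)


def simulate(true_a, true_b, n_games, seed=0):
    """Play n_games between two players and track the rating-difference error."""
    rng = random.Random(seed)
    p_a_wins = 1 / (1 + 10 ** (-(true_a - true_b) / 400))
    a = b = (1500.0, 350.0)  # (rating, RD) for each player
    errors = []
    for _ in range(n_games):
        score_a = 1.0 if rng.random() < p_a_wins else 0.0
        # Both updates use the pre-game values of the other player.
        new_a = update_one_game(a[0], a[1], b[0], b[1], score_a)
        new_b = update_one_game(b[0], b[1], a[0], a[1], 1.0 - score_a)
        a, b = new_a, new_b
        errors.append(abs((a[0] - b[0]) - (true_a - true_b)))
    return a, b, errors


a, b, errors = simulate(1600, 1400, 200)
```

Plotting `errors` against the game index gives the kind of error-versus-games curve I’m describing, though a single run is noisy and averaging over many seeds is more informative.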

The results also suggest that relatively few games per player are enough to get a ballpark estimate (in the aggregate). It will be interesting to see how this plays out with real engines (which may not follow all the theoretical assumptions) and in individual cases.

While I’m tempted to do more in-depth experiments with these testbeds, I probably should move on with my research. Glickman has also published an improvement of the Glicko system, the Glicko-2 system. I’ve intentionally delayed reading about it until I had a sense of the original system, but I’m curious to see how it is different.