AlphaZero Revisited

It’s been over a year since Google’s DeepMind first made a splash with AlphaZero, its reinforcement-learning-based chess engine. The first anniversary of AlphaZero’s release also coincided with the publication of the peer-reviewed paper.

To go with the peer-reviewed paper, DeepMind has released a further 200 games played between AlphaZero and the conventional chess engine Stockfish. The set is again heavily loaded in favour of wins for AlphaZero, but also contains six games that AlphaZero lost. I’ve been following these games on GM Daniel King’s excellent Powerplaychess channel, and want to revise my opinion on AlphaZero.

Back then, I had looked at AlphaZero’s play through my favourite studs and fighters framework, which in hindsight doesn’t do full justice to AlphaZero. From the games I’ve seen of this newly released set, AlphaZero’s play hasn’t exactly been “stud”. It’s just that it’s much more “human”. And the reason AlphaZero’s play seems more human is possibly the way it “learns”.

Conventional chess engines evaluate a position by considering all possible paths (OK, not really: they use a clever method called alpha-beta pruning to limit the size of their search), and then play the move that leads to the best position at the end of the search. These engines use “pre-learnt human concepts”, such as point counts for different pieces, to evaluate positions. And this leads to a certain kind of play.
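To make that concrete, here’s a minimal sketch of what such a search might look like. The Position interface and the point-count evaluation here are made up for illustration; real engines are far more sophisticated.

```python
# A minimal sketch of alpha-beta search over a hand-crafted point-count
# evaluation. The Position interface (pieces, legal_moves, make,
# is_terminal) is hypothetical, used only to show the shape of the idea.

PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}

def evaluate(position):
    """A 'pre-learnt human concept': material count from the side to move."""
    score = 0
    for piece in position.pieces():            # hypothetical accessor
        value = PIECE_VALUES.get(piece.kind, 0)
        score += value if piece.is_ours else -value
    return score

def alpha_beta(position, depth, alpha=float("-inf"), beta=float("inf")):
    """Search all paths down to `depth`, pruning branches that cannot
    change the result (negamax formulation)."""
    if depth == 0 or position.is_terminal():
        return evaluate(position)
    for move in position.legal_moves():
        score = -alpha_beta(position.make(move), depth - 1, -beta, -alpha)
        alpha = max(alpha, score)
        if alpha >= beta:    # the opponent would never allow this line
            break            # prune: stop searching this branch
    return alpha
```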

AlphaZero’s learning process, however, involves playing zillions of games against itself (since I wrote that previous post, I’ve come back up to speed with reinforcement learning). Based on the results of these games, it evaluates, in hindsight, the positions it reached in the course of play. On top of this, it builds a deep learning model to identify the goodness of positions.
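In code, I imagine the loop looking something like the sketch below. Everything here (the network’s choose_move and fit, the game interface) is a placeholder for illustration, and not DeepMind’s actual implementation.

```python
import random

def self_play_game(net, new_game):
    """Play one game against itself, recording every position visited."""
    game = new_game()
    positions = []
    while not game.is_over():
        positions.append(game.state())
        game.play(net.choose_move(game))  # hypothetical: the net guides move choice
    return positions, game.result()       # +1 for a win, 0 for a draw, -1 for a loss

def train(net, new_game, num_games):
    """Label each visited position with the eventual outcome (in hindsight),
    then fit the network to predict outcome from position."""
    examples = []
    for _ in range(num_games):
        positions, outcome = self_play_game(net, new_game)
        # NOTE: flipping the sign of `outcome` for the side to move at each
        # position is elided here to keep the sketch short.
        examples.extend((pos, outcome) for pos in positions)
    random.shuffle(examples)              # decorrelate consecutive positions
    net.fit(examples)                     # hypothetical supervised update
    return net
```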

Given my limited knowledge of how deep learning works, my understanding is that this process involves AlphaZero learning “features” of games that have more often than not enabled it to win. So somewhere in the network there will be a node that represents “control of centre”. Another node deep in the network might represent “safety of king”. Yet another might perhaps represent an “open a-file”.

Of course, none of these features was pre-specified to AlphaZero. It has simply learnt them by training its neural network on the zillions of games it has played against itself. And while deep learning is hard to “explain”, it is likely that the features of the game that AlphaZero has learnt are remarkably similar to the “features” of the game that human players have learnt over the centuries. And it is because of the commonality in these features that we find AlphaZero’s play so “human”.

Another way to look at it is through the concept of “10000 hours” that Malcolm Gladwell spoke about in his book Outliers. As I had written in my review of the book, the concept of 10000 hours can be thought of as “putting fight until you get enough intuition to become stud”. AlphaZero, thanks to its large number of processors, has effectively spent much more than “10000 hours” playing against itself, with its neural network constantly “learning” from the positions faced and the outcomes reached. In this way, it has “gained intuition” over features of the game that lead to wins, giving it an air of “studness”.

The interesting thing to me about AlphaZero’s play is that thanks to its “independent development” (in a way like the finches of the Galapagos), it has not been burdened by human intuition on what is good or bad, and has learnt its own heuristics. And along the way, it has come up with a bunch of heuristics that have not commonly been used by human players.

Keeping bishops on the back rank (once the rooks have been connected), for example. A stronger preference for bishops over knights than humans show. Suddenly simplifying from a terrifying-looking attack into a winning endgame (machines are generally good at endgames, so this is not that surprising). Temporary pawn and piece sacrifices. And all that.

Thanks to engines such as LeelaZero, we will soon see the results of these learnings being applied to human chess as well. And human chess can only become better!

AlphaZero defeats Stockfish: Quick thoughts

The big news of the day, as far as I’m concerned, is the victory of Google DeepMind’s AlphaZero over Stockfish, currently the highest-rated chess engine. This comes just months after DeepMind’s AlphaGo Zero had bested the earlier avatar of AlphaGo in the game of Go.

Like its Go version, the AlphaZero chess-playing machine learnt using reinforcement learning (I remember doing a term paper on the concept back in 2003 but have mostly forgotten it). Basically, it wasn’t given any “training data”; the machine trained itself by continuously playing against itself, with the feedback from each stage of learning helping it learn better.

After only about four hours of “training” (basically playing against itself and discovering moves), AlphaZero managed to record this victory in a 100-game match, winning 28 and losing none (the rest of the games were drawn).

There’s a sample game here on the Chess.com website, and while this might be a biased sample (it’s likely that the AlphaZero engineers included the most spectacular games in their paper, from which this is taken), the way AlphaZero plays is vastly different from the way engines such as Stockfish play.

I’m not that much of a chess expert (I “retired” from my playing career back in 1994), but the striking things for me from this game were

  • the move 7. d5 against the Queen’s Indian
  • the piece sacrifice a few moves later that was hard to see
  • AlphaZero’s consistent attempts until late in the game to avoid trading queens
  • the move Qh1 somewhere in the middle of the game

In a way (and consistent with some of the themes of this blog), AlphaZero can be described as a “stud” chess machine, having taught itself to play based on feedback from games it has already played. (The way reinforcement learning broadly works is that actions that led to “good rewards” are incentivised in the next iteration, while those that led to “poor rewards” are penalised. The challenge in this case is to set up chess in a way that is conducive to a reinforcement learning system.)
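To illustrate that incentivise-and-penalise loop, here’s a toy reward-weighted update on a fake two-move “game”. This is the generic textbook mechanism with made-up names, not AlphaZero’s actual algorithm.

```python
import math
import random

def softmax_sample(prefs):
    """Pick an action with probability proportional to exp(preference)."""
    actions = list(prefs)
    weights = [math.exp(prefs[a]) for a in actions]
    return random.choices(actions, weights=weights)[0]

def update(prefs, actions_taken, reward, lr=0.1):
    """The feedback rule: preferences for actions that preceded a good
    reward go up; those that preceded a poor reward go down."""
    for action in actions_taken:
        prefs[action] += lr * reward
    return prefs

# Toy usage: two moves, one secretly better. After many episodes the
# preference for "good_move" dominates, so it is sampled almost always.
prefs = {"good_move": 0.0, "bad_move": 0.0}
for _ in range(500):
    action = softmax_sample(prefs)
    reward = 1.0 if action == "good_move" else -1.0  # stand-in for a game result
    update(prefs, [action], reward)
```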

Engines such as Stockfish, on the other hand, are absolute “fighters”. They get their “power” through brute force, by going down nearly all possible paths in the game several moves deep. This is supplemented by analysis of millions of existing games of various levels, from which the engine “learns”; among other things, it learns how to prune and prioritise the paths it searches. Stockfish is also fed a database of chess openings, which it remembers and tries to play.
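The opening-book part, at least, is easy to picture: a lookup that is consulted before any search happens. Here’s a toy version, with made-up book lines and interface:

```python
# A toy opening book: before searching, check whether the moves played so
# far match a stored book line and, if so, play the remembered reply.

OPENING_BOOK = {
    (): "e4",
    ("e4", "e5"): "Nf3",
    ("e4", "e5", "Nf3", "Nc6"): "Bb5",   # 3.Bb5, the Ruy Lopez
}

def choose_move(moves_so_far, search):
    """Play from the book while it applies; fall back to full search after."""
    book_move = OPENING_BOOK.get(tuple(moves_so_far))
    return book_move if book_move is not None else search(moves_so_far)
```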

What is interesting is that AlphaZero has “discovered” some popular chess openings through the course of its self-learning. Some popular openings such as the King’s Indian or the French find little favour with this engine, while others such as the Queen’s Gambit or the Queen’s Indian find favour. This is a fascinating development for opening theory itself.

Frequency of openings over time employed by AlphaZero in its “learning” phase. Image sourced from AlphaZero research paper.

In any case, my immediate concern from this development is how it will affect human chess. Over the last decade or two, engines such as Stockfish have played a profound role in the development of chess, with current top players such as Magnus Carlsen or Sergey Karjakin having trained extensively with these engines.

The way top grandmasters play has seen a steady change over these years as they have ingested ideas from engines such as Stockfish. The game has become far more quiet and positional, with players seeking to gain small advantages that they steadily improve over the course of (long) games. This is consistent with the way the engines that players learn from play.

Based on the evidence of the one game I’ve seen, AlphaZero plays very differently from the existing engines. So it will be interesting to see how human players who train with AlphaZero-based engines (or their clones) will change their game.

Maybe chess will turn back to being a bit more tactical than it’s been in the last decade? It’s hard to say right now!