I recently started dipping my toes into Reinforcement learning. I followed some tutorials and developed some basic applications like a self driving car simulation, AI agent that plays Doom, and Breakout.

After getting a basic idea on how all of this works, I decided that it is time to take up a project of my own. I wanted to make two AIs fight each other in a game of chess, and learn from each other. This is in no way anything groundbreaking. Chess has been a subject of interest in the Artificial Intelligence community for many decades. DeepMind’s AlphaZero also uses Deep Reinforcement Learning, and plays millions of games against itself to develop the best strategies.

Instead of mimicking AlphaZero, I wanted to develop a simple pair of Neural Networks that learns to play the game by competing against each other. However, as it turns out, Chess is nothing like a self-driving car simulation or any Atari games. It is a very complex game with so many possible outcomes. As a result, what I thought would be a simple and fun personal project turned out to be an extremely challenging task that took me a very very long time.

Before we begin, let me tell you what my plans for this project were. I want to develop a chess environment where:

The two agents (white and black) know nothing about the game beyond the basic rules.

At first, each agent play seemingly random moves, but learn to be better at the game over time.

That’s it. Those are the only things that I need. Seems simple enough right? WRONG! Let me tell you why.

Since the agents learn by playing against themselves through trial and error, it is clear that the system needs to use reinforcement learning. In any reinforcement learning project, there are three different areas we need to think about.

State

Action

Reward

We look at the state of our environment (in this case, our chess board), and decide to perform an action. Thereafter, we evaluate how good of an action we performed based on a reward we get. Let me take you through these three areas in order of difficulty.

## Evaluating the reward

The process of evaluating the reward for an action is very simple in chess. Most of the work is already done for us.

(source: chesskid.com)

Each piece in chess is assigned a value based on its importance. So, as an example, if the white agent takes black’s Queen, it gets a reward of +9, while black gets a reward of -9.

You would also notice that the King does not have a value. That’s because we technically can’t assign a value to it because the King is priceless. But since we need to give a reward for taking the king, I assigned a value of 255 to the king, which is much higher than all of the pieces combined, and I think that this value signifies its importance.

Another small change that I made was that if an agent makes a move that does not take any of the opponent’s pieces, it gets a small negative reward of -0.1. This is because I noticed that the agent kept moving the same piece back and forth without doing anything, especially in the early stages of training.

As you can see, the Black Knight and White Queen is repeating the same moves over and over again until we run out of moves. We can discourage such moves, and encourage the agent to take the opponent pieces by assigning a small negative reward when it does not capture an opponent piece.

## Representing the State

The state of a chess environment is represented by the board, and the pieces in each square. A standard chess board is an 8x8 grid-like structure.

Each unit in this grid can be empty or have one of six different types of pieces. How do we represent this?

We could represent the state of the board by giving text values to each piece in the board (eg: ‘wp’ for white pawn). However, Deep Learning algorithms do not play well with text values. We need numbers representing each piece.

We can give each piece a numeric value (eg: -1 for a black pawn, +1 for white pawn, and a similar numbering scheme for other pieces). Squares with no pieces are represented with a value of 0. Is this the best solution? Almost! Can you think of any problems with this?

Now that we are representing each square with numeric values, we have unintentionally given weights to each type of piece. That is, a white queen (5) has a numeric representation that is five times as much as a white pawn (1). Therefore, any Neural Networks that we train may be inclined to give a higher priority to move the queen than the pawn.

How do we solve this? An interesting solution proposed in the paper “Training a Convolutional Neural Network to Evaluate Chess Positions” is to use one-hot encoding. Since piece information such as pawn, rook, queen, etc. are categorical values, these categorical data can be replaced by binary values, one for each category. Each category of the input is set to 1, and all others are set to zero.

Did that make sense? No? Let me explain it this way. The board can have 6 possible pieces of each colour. Therefore, each piece can be denoted by a 1x6 matrix, where the index corresponding to each piece is denoted by 1.

As you can see, each piece is represented by a 1x6 matrix. The first index (index 0) corresponds to pawns. If the first index has a value of 1, we can deduce that the square has a white pawn. If it has a value of -1, it has a black pawn. A similar structure is followed for other pieces. If all indexes are 0, there are no pieces in the square.

As a result, each square in the 8x8 grid is represented by a 1x6 matrix. You can see where we’re going with this. Our final result is that the state of the environment is represented by a 3-dimensional matrix of the size (8, 8, 6).

## Selecting an action

Now comes the hard part. It is already clear that we will be using a deep neural network for our reinforcement learning model. The task of this network is to select an action based on the state of the environment. This is where things start to get a lot more complicated.

At any given state, there are many possible actions that can be taken. Here are the issues I came across while trying to come up with a solution for this:

The number of possible actions depend on the state of the environment.

How is an action going to be represented?

If I want to move a piece, how do I decide which piece to move (eg: which of the 8 pawns do I choose)?

Keep in mind that the state of our environment is represented by an (8, 8, 6) matrix. We can think of this as an 8 x 8 image. Each 6 layers along the piece dimension of the matrix can be considered as a separate colour channel. Going by this analogy of comparing the state of the chess board to an image, I decided to use a 2D convolutional neural network to analyze the input state.

I went on a journey scouring the ends of the Earth to find the best possible solution for selecting an action. There weren't many solutions to the kind of actions that could be taken in a reinforcement learning model. In the end, the best possible solution that I came across was proposed in the paper “Predicting Moves in Chess using Convolutional Neural Networks”.

This paper has developed a model that predicts the next move taken by a particular user given the current state, by training against a dataset of over 20,000 games played by skilled players. This is not exactly what I was planning to do, but the underlying concept of generating an action was useful to me. This is where things become interesting, so strap in your seat belts.

The paper defines the task of selecting an action as a two-part classification problem. The first part is training a convolutional neural network (CNN) to select a piece to be moved. They call this the “piece selector CNN”. The second part is training 6 other CNNs (yes, six, you read that right) to encode which coordinates would be advantageous to put each of the six possible pieces on. We call these the “move selector CNN”.

The piece selector network decides which coordinate a piece needs to be moved out of. According to the authors of the paper, this captures the notion of escape when a piece is under attack (ex: king needs to move to avoid being taken).

The network takes the board’s state as input and then outputs a value indicating how desirable it is to move a piece out of a given square of the grid. According to the diagram above, the network produces an output with 64 different values. These are the values of all squares in the grid.

All values of the squares with no pieces, or pieces from the opponent team are clipped to 0. Every other value denotes how desirable it is to move a piece in this square.

The move selector networks take the state of the environment as their inputs as well. Each move selector network determines how desirable it is to have a piece of the corresponding network in the given square of the grid.

If we take the Bishop network as an example, it produces an output of a distribution of values with respect to each square of the board on how desirable it is to have a bishop in the corresponding square. All of the squares the bishop cannot move to are clipped to 0.

After all of these values are taken, we determine the optimal action by composing the piece selector network (pCNN) with all move selector networks (mCNN) by multiplying the values in the pCNN with the highest value of the corresponding mCNN, and taking the largest value obtained over the entire composition.

This results in an output of the position of the piece in the grid, and the position that piece must be moved to. This output is what we use to perform the action.

Phew! That was a handful. So what you need to keep in mind is that each AI agent has a piece selector network and 6 move selector network. That’s 7 neural networks in total. Given that we have two agents playing against each other, we will have a grand total of 14 neural networks! This should be fun.

## Convolutional Neural Network Architecture

Now that we have come up with a way to select an action given an input state, let’s talk about how we designed the architecture for all of these networks.

All seven networks take the same input, which is the representation of the state of the board, and provide an output which indicates the desirability of a position in the board. As mentioned earlier, since the input state is very similar to the structure of an image, we will be using convolutional neural networks for this.

This uses a three layer CNN in the form of 32, 128, and 32 filters. After the input state is convolved, it gets flattened into 128 nodes, which thereafter produces an output of 64 nodes.

### Pooling

Since the input state is an 8x8 grid with 6 channels, the values indicated by each square on the grid contains a great deal of information. As a result, the neural network did not use any pooling so as to preserve as much data as possible.

### Weight Initialization

The weights were adjusted to very low values to match the data made up of -1, 0, and 1 in the input layer. The authors of the paper have cross validated the order of the magnitude of the initializations and determined that 10^(-7) is the optimal initialization for parameters in the first layer, and using larger initializations in the deeper layers of 10^(-6) when the data is less sparse and sensitive to bad initial forward propagations.

### Loss Function

Since the neural networks are trained with reinforcement learning, calculating the loss function is a bit more complicated. The first step is to calculate the Q-value of the action that was taken. The Q-value is calculated with the following formula.

Q(s,a) = R(s,a) + gamma * max Q(s',a')

Q(s,a): Q-value obtained by performing action ‘a’ while in state ‘s’

R(s,a): Reward obtained by performing action ‘a’ while in state ‘s’

gamma: discount factor (set to 0.99)

max Q(s',a'): maximum Q-value after performing the action and going into the state ‘s`’

Calculating the reward

Therefore, in order to calculate the q-value, we must first calculate the reward obtained from the action. One important thing to keep in mind is that we cannot complete the reward calculation immediately after moving a piece. That is because we need to evaluate how the opponent responded to our action.

For example, if you perform an action where you take the opponent’s pawn, you get a reward of +1. However, if the opponent responds to it by taking your queen, you get a reward of -9. Therefore, the action we performed has a cumulative reward of -8. We need to take the opponent’s response into account when calculating the reward because:

If not, out actions will only have positive rewards

Our AI agent needs to know that the action it performed left an opening for its queen to be captured.

Calculating the maximum Q value of the new state

Once the agent performs an action, it moves into a new state. Keep in mind that our agent is dealing with 7 neural networks, so calculating the maximum q-value is a bit tricky. This is calculated by first identifying the action the agent would take in the new state.

Then we return the respective piece values and move values of the action that was taken. Thereafter, the q-value for the piece selector network is calculated by using this piece value on the above formula.

A similar approach is followed for the move selector network. However, this is only done for the piece that was moved in the action.

Once these q-values are identified, the neural networks are backpropagated by calculating the loss function between the piece/move value with their respective q-values. The Mean Squared Error (MSE) loss function is used to calculate this loss value, along with ADAM for the optimizer.

Now, our AI agents are ready to battle against each other.

## Results

I first ran the training locally, but then came to the conclusion that training 14 neural networks is extremely computationally intensive, and it is going to take me ages. In fact, I wrote everything in this post up to this point while waiting for the model to train.

After nearly a whole day, I couldn’t even get through training 10,000 games. To give some context, AlphaZero was trained on millions of games. So I was not going to get anywhere close. It was then that I decided to get a GPU server on AWS and train the agents there.

If you’re wondering if this was able to do just as well as AlphaZero, the answer is no. Obviously. AlphaZero is an extremely well optimized network developed by hundreds of the world’s best AI researchers. Also, AlphaZero was trained for about 9 hours, which I assume would be pretty powerful computer hardware.

It took me a few hours, but I managed to train the networks for around 100,000 games on my new hardware. Which otherwise would have taken me weeks. As for the outcome, I think the network did a decent job, but it could be a lot better.

In the first 1,000 games, the moves made by the agents were completely random.

The agents would move their pieces to positions where they can be taken with no consequences.

The kings move around the board completely open to be captured.

Opponent pieces move next to the king without capturing him.

Between 1,000 - 10,000 games, both agents occasionally try to capture the opponent pieces.

When trying to capture opponent pieces, the agents often leave their own pieces open to be captured.

Between 10,000 - 50,000 games, the agents try to capture opponent pieces more aggressively. However, they still leave their own pieces open to be captured.

Between 50,000 - 100,000 games, the agents begin to understand the concept of “escape”, where they sometimes learn to move the pieces to avoid getting captured.

If you're interested in the project implementation, you can find the code on GitHub!