We explored some game-playing algorithms in the previous post, and we'll continue with...
Monte Carlo (Tree Search AI)
Monte Carlo
Monte Carlo methods are a class of algorithms that use repeated random sampling to obtain numerical results. A beautiful example of this is calculating π. We start with the following problem: what is the ratio between the area of the blue circle and the area of the white square (including the covered parts) in this diagram?
We could calculate it. The area of the blue circle is
πr²
and the area of the white square is
4r²
so the ratio between the areas is
π : 4.
Let's say we now choose a few random points (a sample) in this graph and divide them into points inside the circle and points outside the circle.
In this case we get 36 points, with 25 points inside the circle. The ratio of points inside the circle to all points should approximate the ratio between π and 4. So let's calculate the value of π!
π ≈ (25 / 36) * 4 ≈ 2.77
Not very accurate. But we can increase the number of samples.
This time we have 97 points, with 77 points inside the circle.
π ≈ (77 / 97) * 4 ≈ 3.17
Much better!
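To make the sampling idea concrete, here is a minimal Python sketch of the same experiment. It is not tied to any library, and since the points are random, the counts will differ from run to run, unlike the fixed diagrams above.

import random

def estimate_pi(samples: int) -> float:
    """Estimate pi by sampling random points in a square and counting
    how many fall inside the inscribed circle."""
    inside = 0
    for _ in range(samples):
        # Point in the unit square; the inscribed circle has radius 0.5
        # and is centred at (0.5, 0.5).
        x, y = random.random(), random.random()
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25:
            inside += 1
    return (inside / samples) * 4

print(estimate_pi(36))         # rough, like the first diagram
print(estimate_pi(1_000_000))  # much closer to 3.14159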
We can also use this method to calculate the area of the blue circle, without knowing the value of π. The area of the blue circle is the ratio between the points inside the circle and all points, multiplied by the area of the white square:
(PointsInside / AllPoints) * 4 * r²
There are two points to note here. First, this method returns an approximation regardless of the sample size; it becomes more accurate with a larger sample, but the sampling can be stopped at any time. Second, our area calculation function is now general and is no longer tied to the shape of a circle. It will work for any shape.
(35 / 97) * 4 * r² ≈ 1.44 * r²
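To illustrate how general this is, here is a small sketch where the shape is supplied as an inside(x, y) test; the circle is just one possible choice. The names estimate_area and inside are made up for the example, not part of any library.

import random
from typing import Callable

def estimate_area(inside: Callable[[float, float], bool],
                  samples: int, r: float = 1.0) -> float:
    """Estimate the area of any shape that fits in a 2r x 2r square,
    given only a predicate telling whether a point lies inside it."""
    hits = 0
    for _ in range(samples):
        x = random.uniform(-r, r)
        y = random.uniform(-r, r)
        if inside(x, y):
            hits += 1
    return (hits / samples) * 4 * r ** 2

# The circle is just one shape; any predicate works.
print(estimate_area(lambda x, y: x * x + y * y <= 1.0, 100_000))  # about 3.14 for r = 1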
Monte Carlo Tree Search AI
There are four distinct phases in MCTS: selection, expansion, simulation, and back-propagation, but we will introduce them slightly out of order.
Simulation
In game-playing algorithms, we're also always calculating ratios. We're mostly interested in the ratio between games played and games won. Our next question is then: how do we randomly sample the results of a game position? The only way to do this generally is to play the game to the end. In MCTS terminology we call that a playout or a rollout. Technically, we don't want a random sample but a representative one. Hence some algorithms use light (random moves) or heavy (heuristically guided choice of moves) playouts.
The result of a playout is simple: a win or a loss. The result of a number of playouts is a ratio of wins to losses. So this is in essence a very general heuristic that is applicable to all games. The win/loss ratio of playouts represents the score of the position. Applying this heuristic to a position is called a simulation.
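As an illustration, here is a minimal sketch of a light playout. Annexation itself isn't shown; a made-up toy game (take 1 to 3 stones from a pile, whoever takes the last stone wins) stands in for the real rules.

import random
from dataclasses import dataclass

@dataclass(frozen=True)
class NimState:
    """Stand-in game: players alternately take 1-3 stones from a pile;
    whoever takes the last stone wins."""
    stones: int
    player: int  # player to move: 0 or 1

    def is_over(self) -> bool:
        return self.stones == 0

    def legal_moves(self):
        return [n for n in (1, 2, 3) if n <= self.stones]

    def play(self, move: int) -> "NimState":
        return NimState(self.stones - move, 1 - self.player)

def random_playout(state: NimState, player: int) -> float:
    """Light playout: play uniformly random moves to the end and return
    1.0 if `player` won, 0.0 otherwise."""
    while not state.is_over():
        state = state.play(random.choice(state.legal_moves()))
    winner = 1 - state.player  # the player who made the last move
    return 1.0 if winner == player else 0.0

# Win ratio for the first player, estimated from 1000 random playouts.
results = [random_playout(NimState(10, 0), player=0) for _ in range(1000)]
print(sum(results) / len(results))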
Back-propagation
The results of playouts are back-propagated to the root of the game tree. At each node, the number of simulations and the number of wins for each player are updated.
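Here is a minimal sketch of this step, using a hypothetical Node structure (not the actual Annexation code); the later sketches reuse it. A common simplification is to keep a single win counter per node, from the viewpoint of the player who made the move leading to it, and flip the result at every level instead of storing separate counts per player.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    state: object = None                  # game position this node represents
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0                       # number of playouts through this node
    wins: float = 0.0                     # wins for the player who moved into this node

def backpropagate(node, result):
    """Walk from the simulated node back up to the root, updating visit
    and win counts. `result` is 1.0 (win) or 0.0 (loss) for the player
    who moved into `node`; it is flipped at each level up."""
    while node is not None:
        node.visits += 1
        node.wins += result
        result = 1.0 - result
        node = node.parent

# Tiny usage example: one playout result propagated through two nodes.
root = Node()
child = Node(parent=root)
root.children.append(child)
backpropagate(child, 1.0)
print(child.visits, child.wins, root.visits, root.wins)  # 1 1.0 1 0.0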
Selection
We could use this heuristic in a normal MinMax algorithm, and it would work, yet MCTS works a bit differently. While MinMax visits all legal moves indiscriminately, MCTS considers which moves are worth exploring. The decision-making formula is an interesting heuristic called UCB-1 (upper confidence bound, version 1). The formula combines two opposing ideas on which moves are worth considering:
Exploration
This force pulls towards trying out as many different moves as possible (breadth-first), even those that seem bad (like sacrificing a piece or playing a seemingly slow move). The formula for this part is:
exploration_parameter * sqrt(ln(number_of_all_playouts_from_position) / number_of_playouts_for_move)
What is important to understand is that it grows the more a move is neglected.
Exploitation
This force pulls towards exploring the most promising line of play in depth. It is simply the ratio of wins to all playouts for a move:
number_of_wins_for_move / number_of_playouts_for_move
What is important to understand is that it grows the better the move is.
The final formula simply adds the two parts together. This decision mechanism ensures that the most promising (win/loss) moves will be explored as much as possible, while even non-promising moves will be visited from time to time. If at some depth a non-promising move finds a good variation, it will jump in value and UCB-1 will direct more resources toward that move.
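As code, the combined score for a single move might be sketched like this. The exploration parameter is a tunable constant, commonly set around the square root of 2.

import math

def ucb1(wins, playouts, parent_playouts, exploration=math.sqrt(2)):
    """UCB-1 score for one move: exploitation + exploration."""
    if playouts == 0:
        return math.inf  # an untried move is always worth one look
    exploitation = wins / playouts
    exploration_term = exploration * math.sqrt(math.log(parent_playouts) / playouts)
    return exploitation + exploration_term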
So, the MCTS algorithm always starts at the root node of the game tree and uses this formula to decide which move to choose. It continues to do so until a leaf node (a node with no child nodes) is reached.
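A sketch of that descent, reusing the hypothetical Node and the ucb1 function from the sketches above:

def select(node):
    """Selection: starting at the root, repeatedly move to the child with
    the highest UCB-1 score until a leaf (a node with no children) is reached."""
    while node.children:
        node = max(
            node.children,
            key=lambda child: ucb1(child.wins, child.visits, node.visits),
        )
    return node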
Expansion
Finally, when the algorithm reaches a move which has not yet been visited, it creates a new node.
This results in an asymmetric graph, where some variations are considered in depth, while others are barely visited.
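A sketch of the expansion step, again using the hypothetical Node above and a game state with legal_moves() and play(move) methods (like the NimState toy from the simulation sketch). One simple variant, shown here, creates children for all legal moves at once and picks one of them to simulate next; other implementations add a single child per iteration.

import random

def expand(node):
    """Expansion: create child nodes for the legal moves of a newly
    reached leaf position and return one of them for simulation.
    Assumes the position is not already finished."""
    for move in node.state.legal_moves():
        child = Node(state=node.state.play(move), parent=node)
        node.children.append(child)
    return random.choice(node.children)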
These are the strengths of MCTS. Promising moves are exploited. All sibling moves are explored. The position evaluation heuristic is general (it works for all games) and requires only implementing the winning condition. The algorithm can be stopped at any time to get an approximate result.
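To tie the four phases together, here is one possible top-level loop, assuming the select, expand, random_playout and backpropagate sketches above, with the NimState toy standing in for a real game. Because each iteration is independent, the loop can be stopped at any moment and the most-visited move so far returned.

import time

def mcts_best_move(root, seconds):
    """Run MCTS iterations until the time budget runs out, then return
    the most-visited child of the root as the chosen move."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        leaf = select(root)                         # selection
        if not leaf.state.is_over():
            leaf = expand(leaf)                     # expansion
        # Simulation: result from the viewpoint of the player who moved
        # into `leaf` (the opponent of the player to move there).
        result = random_playout(leaf.state, player=1 - leaf.state.player)
        backpropagate(leaf, result)                 # back-propagation
    return max(root.children, key=lambda child: child.visits)

best = mcts_best_move(Node(state=NimState(10, 0)), seconds=0.1)
print(best.state)  # the position after the move MCTS likes best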
Performance
However, as with the π calculation example, a lot of sampling is required to get a good approximation. There are many factors that can determine the speed of playout calculations. On the largest boards, the Annexation AI calculates about 70,000 playouts (games played out from the current position to the end) per second.
The actual calculation of playouts will be the subject of the next post.