AlphaZero modifications
I had to make a couple of changes to AlphaZero to apply it to games with randomness and more than two players.
More than two players
Handling more than two players is the easy one. AlphaZero uses values in the range 0-1 to represent state values for the current player; the value for the other player is 1 - value. With more than two players, state values can't be reduced to a single number in the same way, so I use a map from player to value covering every player in the episode. One point of value is contested between each pair of players: 0.5 points each on a tie, otherwise 1 to the player who finished ahead and 0 to the player who finished behind.
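The pairwise scheme above can be sketched as follows. This is a minimal illustration, assuming players' finishing positions are given as integer ranks (lower is better); the function name is hypothetical.

```python
from itertools import combinations

def episode_values(ranks: dict[str, int]) -> dict[str, float]:
    """Map each player to a value: one point is contested per pair of players."""
    values = {player: 0.0 for player in ranks}
    for a, b in combinations(ranks, 2):
        if ranks[a] == ranks[b]:      # tied: split the point
            values[a] += 0.5
            values[b] += 0.5
        elif ranks[a] < ranks[b]:     # a finished ahead of b
            values[a] += 1.0
        else:
            values[b] += 1.0
    return values

print(episode_values({"p1": 1, "p2": 2, "p3": 2}))
# p1 beats both opponents; p2 and p3 split their point
```

Note that each player's value lies in the range 0 to N-1 for an N-player game, which reduces to the familiar 0-1 range in the two-player case.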
A consequence of this approach is that the model doesn't care about the margin of victory: when it thinks a win is guaranteed, it has no preference among its possible moves and can play in ways that look surprising to a human. One improvement would be to allocate a portion of each pairwise point based on the margin between the two players. That change would also add some smoothness, reducing the impact of value noise for actions that are visited infrequently during tree search.
Another possible change would be to normalize these values such that their sum is 1.
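A sketch of that normalization, applied to the pairwise scores from the scheme above. For N players the scores always sum to N(N-1)/2, so dividing by the total (or equivalently by that constant) puts the values on a shared 0-1 scale; the function name is hypothetical.

```python
def normalized(values: dict[str, float]) -> dict[str, float]:
    """Rescale pairwise scores so they sum to 1."""
    total = sum(values.values())
    return {player: v / total for player, v in values.items()}

print(normalized({"p1": 2.0, "p2": 0.5, "p3": 0.5}))
```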
Randomness
To handle randomness I introduced “chance nodes” in the search tree. Chance nodes are children of action nodes and parents of state nodes. Game implementations produce both a new state and a “chance key” when applying an action to a state. The chance key’s function is to uniquely identify the associated new state among all of the possible states resulting from the same action. For example, if the random event is a single d6 roll, the chance key could be the number of pips rolled. The key is essentially an optimization that avoids comparing entire states for equality.
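The chance-node layer might look like the following sketch. The node types and the die-rolling action are hypothetical stand-ins; a real game implementation would return (new_state, chance_key) from its own logic.

```python
import random
from dataclasses import dataclass, field

@dataclass
class StateNode:
    state: object
    visits: int = 0

@dataclass
class ChanceNode:
    children: dict = field(default_factory=dict)  # chance key -> StateNode

    def child_for(self, state, key):
        # The chance key identifies the outcome, so we never need to
        # compare entire states for equality.
        if key not in self.children:
            self.children[key] = StateNode(state)
        return self.children[key]

def apply_roll(state):
    pips = random.randint(1, 6)   # each simulation samples a fresh outcome
    return (state, pips), pips    # the pip count doubles as the chance key
```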
Action nodes retain a configurable maximum number of chance nodes, expunging the least-visited nodes when the cap is exceeded.
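A minimal sketch of that retention policy, assuming each child tracks its own visit count; `retain_child` is a hypothetical helper name.

```python
from dataclasses import dataclass

@dataclass
class StateNode:
    visits: int = 0

def retain_child(children: dict, key, max_children: int) -> StateNode:
    """Fetch or add a child node, expunging the least-visited one at the cap."""
    if key in children:
        return children[key]
    if len(children) >= max_children:
        least_visited = min(children, key=lambda k: children[k].visits)
        del children[least_visited]
    children[key] = StateNode()
    return children[key]
```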
Game logic is not responsible for knowing or producing the full distribution of chance outcomes. Each simulation generates a new random outcome and the resulting distribution of values should converge to the true chance-weighted distribution as the number of simulations increases.
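To illustrate the convergence claim with the simplest case, a single fair d6: sampling a fresh outcome per simulation, the empirical mean approaches the true chance-weighted mean of 3.5 as the number of samples grows.

```python
import random

random.seed(0)
samples = [random.randint(1, 6) for _ in range(100_000)]
print(sum(samples) / len(samples))  # close to 3.5
```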
A potentially serious issue with this approach is that, depending on the game, chance nodes may result in an enormous branching factor that limits the depth of tree search. For example, in Kingdomino the second draw of tiles has 44×43×42×41 = 3,258,024 possible outcomes. Each of those outcomes will be visited at most once in practice, so search depth beyond that point will be one. Simulations will still back-propagate a reasonable distribution of values from the model value function following these random events, but AlphaZero’s ability to deeply search promising regions of the tree will be crippled. I hope that this scheme can still generate strong play, but that remains to be proven.