Batch MCTS, Part 2
The previous post described a batch-inference MCTS implementation for self-play, where each batch consisted of one state from each of N concurrent episodes. Unfortunately that approach doesn’t work for playing a competitive episode, because in that case there is only one episode we care about.
This paper describes a multi-threaded, shared-data parallel MCTS implementation that uses the concept of a “virtual loss” to discourage additional threads from exploring a subtree that one thread is currently exploring.
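For context, one common form of the PUCT selection rule (shown here as background, not necessarily the exact formula from that paper) scores a child action a of state s as

$$U(s,a) \;=\; Q(s,a) \;+\; c_{\text{puct}}\, P(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}$$

where Q is the mean value, P the prior, and N the visit count. Inflating N(s,a) for a move that a thread is already exploring (and, in the full virtual-loss formulation, also temporarily lowering Q(s,a) as if the pending simulations had lost) pushes other threads toward different children until the real result arrives.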
I decided to implement a similar single-episode batch MCTS approach for competitive play, which consists of two phases (sketched in code below):
Expand: sequentially begin N simulations from the root node. During each simulation, increment the visit count of every visited node, so that subsequent simulations assign those nodes lower upper confidence bound values and are therefore less likely to select them (the same effect as a “virtual loss”). At each leaf node, submit an inference request and return a Promise that will be resolved with the inference result. Intermediate nodes use the returned Promises to schedule updates of their own values.
Backpropagate: perform all of the requested inferences as a single batch and use each result to fulfill the associated Promise. Those fulfillments trigger a cascade of Promise continuations which backpropagate the new leaf values to the root.
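To make the two phases concrete, here is a minimal TypeScript sketch. The `Node`, `InferenceResult`, `InferenceQueue`, `expand`, and `searchStep` names, the PUCT-style scoring constant, and the single backpropagation continuation per leaf are my own illustrative assumptions rather than the actual implementation; creating children from the returned priors and alternating value signs for two-player games are omitted to keep it short.

```typescript
// Illustrative types; the real node representation and network interface differ.
interface InferenceResult {
  value: number;      // leaf value estimate from the network
  priors: number[];   // policy priors over child actions (unused in this sketch)
}

class Node {
  visitCount = 0;
  valueSum = 0;
  prior = 1.0;
  children: Node[] = [];

  // PUCT-style upper confidence bound. A visit count inflated by earlier
  // simulations in the same batch lowers this score, discouraging later
  // simulations from selecting the same subtree (the "virtual loss" effect).
  ucb(parentVisits: number, c = 1.5): number {
    const q = this.visitCount > 0 ? this.valueSum / this.visitCount : 0;
    return q + c * this.prior * Math.sqrt(parentVisits) / (1 + this.visitCount);
  }
}

// Collects inference requests during the expand phase and fulfills them all
// at once during the backpropagate phase.
class InferenceQueue {
  private pending: { leaf: Node; resolve: (r: InferenceResult) => void }[] = [];

  request(leaf: Node): Promise<InferenceResult> {
    return new Promise<InferenceResult>(resolve => {
      this.pending.push({ leaf, resolve });
    });
  }

  // Run the whole batch through the network, then resolve every Promise,
  // triggering the backpropagation continuations registered during expand.
  async flush(runBatch: (leaves: Node[]) => Promise<InferenceResult[]>): Promise<void> {
    const results = await runBatch(this.pending.map(p => p.leaf));
    this.pending.forEach((p, i) => p.resolve(results[i]));
    this.pending = [];
  }
}

// Expand phase: run numSimulations selections sequentially, incrementing visit
// counts on the way down and registering one pending backpropagation per leaf.
// (Child expansion from priors and two-player sign alternation are omitted.)
function expand(root: Node, numSimulations: number, queue: InferenceQueue): Promise<void>[] {
  const backprops: Promise<void>[] = [];
  for (let i = 0; i < numSimulations; i++) {
    const path: Node[] = [root];
    let node = root;
    node.visitCount++;
    while (node.children.length > 0) {
      const parent = node;
      node = parent.children.reduce((best, child) =>
        child.ucb(parent.visitCount) > best.ucb(parent.visitCount) ? child : best);
      node.visitCount++;
      path.push(node);
    }
    // Backpropagation is deferred until the batched inference resolves.
    backprops.push(queue.request(node).then(result => {
      for (const visited of path) visited.valueSum += result.value;
    }));
  }
  return backprops;
}

// One search step: expand, then backpropagate by flushing the batch.
async function searchStep(
  root: Node,
  numSimulations: number,
  runBatch: (leaves: Node[]) => Promise<InferenceResult[]>,
): Promise<void> {
  const queue = new InferenceQueue();
  const backprops = expand(root, numSimulations, queue);
  await queue.flush(runBatch);  // backpropagate phase
  await Promise.all(backprops);
}
```

In this sketch the Promises are resolved only inside `flush`, so the expand phase stays purely synchronous: all N selections see the inflated visit counts, but no node values change until the whole batch comes back.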
This approach does not behave exactly like sequential MCTS with the same number of simulations, since action selection in later simulations of a batch doesn’t take advantage of the node value updates produced by earlier simulations in the same batch. I hypothesize that the batching is still worthwhile because:
Each individual simulation should have a smaller and smaller effect on node values as the total number of simulations grows, so the stale values seen within a batch matter less.
Batched inference allows so many more simulations in the same amount of time that the additional simulations seem likely to add more value than is lost to less optimal action selection within a batch.
For self-play, the concurrent-episode implementation described in the previous post is probably still better, since it gains batching without compromising action selection.