they also train a mapping from the board state to the probability that a particular move will result in winning the game (the value of that move).
How is this calculated?
When some termination criterion is met
Were these criteria learned automatically, or coded/tweaked manually?
1. The value network is trained with gradient descent to minimize the difference between the predicted outcome of a given board position and the final outcome of the game. They actually generate the training data for this using the refined policy network, but the original policy turns out to perform better during simulation (they conjecture this is because it contains more creative moves, which get averaged out in the refined one). I'm wondering why the value network can be better trained with the refined policy network.
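A minimal sketch of that training objective, with a toy tanh-linear model standing in for the actual value network; all names, shapes, and the synthetic data here are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a value network: a linear model over board features,
# trained by gradient descent to minimize the squared difference between
# the predicted value v(s) and the final game outcome z in [-1, 1].
n_features, n_positions = 16, 1000
boards = rng.normal(size=(n_positions, n_features))  # fake feature vectors
true_w = rng.normal(size=n_features)
outcomes = np.tanh(boards @ true_w)                  # synthetic final outcomes

w = np.zeros(n_features)
lr = 0.1
for epoch in range(500):
    pred = np.tanh(boards @ w)                       # predicted value v(s)
    err = pred - outcomes
    # Gradient of mean squared error w.r.t. w (tanh' = 1 - tanh^2)
    grad = boards.T @ (err * (1 - pred**2)) / n_positions
    w -= lr * grad

mse = np.mean((np.tanh(boards @ w) - outcomes) ** 2)
print(f"final MSE: {mse:.4f}")
```

The real network is a deep convolutional model, but the loss and update rule have this same shape.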
2. They just run a certain number of simulations, i.e., they compute n different branches all the way to the end of the game using various heuristics.