they also train a mapping from the board state to the probability that a particular move will result in winning the game (the value of that move).
How is this calculated?
When some termination criterion is met
Were these criteria learned automatically, or coded/tweaked manually?
1. The value network is trained with gradient descent to minimize the difference between the predicted outcome of a given board position and the final outcome of the game. They actually generate the training data for this using the refined policy network, but the original policy turns out to perform better during simulation (they conjecture this is because it contains more creative moves, which get averaged out in the refined one). I'm wondering why the value network can be better trained with the refined policy network.
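A minimal sketch of that training objective, with a toy tanh-linear model standing in for the actual value network; all names, shapes, and the synthetic data here are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a value network: a linear model over board features,
# trained by gradient descent to minimize the squared difference between
# the predicted value v(s) and the final game outcome z in [-1, 1].
n_features, n_positions = 16, 1000
boards = rng.normal(size=(n_positions, n_features))  # fake feature vectors
true_w = rng.normal(size=n_features)
outcomes = np.tanh(boards @ true_w)                  # synthetic final outcomes

w = np.zeros(n_features)
lr = 0.1
for epoch in range(500):
    pred = np.tanh(boards @ w)                       # predicted value v(s)
    err = pred - outcomes
    # Gradient of mean squared error w.r.t. w (tanh' = 1 - tanh^2)
    grad = boards.T @ (err * (1 - pred**2)) / n_positions
    w -= lr * grad

mse = np.mean((np.tanh(boards @ w) - outcomes) ** 2)
print(f"final MSE: {mse:.4f}")
```

The real network is a deep convolutional model, but the loss and update rule have this same shape.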
2. They just run a certain number of simulations, i.e., they compute n different branches all the way to the end of the game using various heuristics.