The network is trained to predict the MCTS move probabilities and game outcome in p and v. Positions and their associated data are sampled from the self-play buffer containing the previous 1 million positions. At most, 30 positions are sampled from a game. Stochastic gradient descent steps are taken with a batch size of 4,096. After every 1,000 training steps (1k for brevity; the notation "k" denotes "thousand," e.g., 32k is 32,000), the networks that are used to generate self-play games are refreshed, meaning that training data in the buffer are frequently generated by a different network than the network being updated.

Encoding of Human Conceptual Knowledge

For a given concept, we can train and evaluate a separate probe for every network depth d. Plotting the change in test set scores of a concept probe against network depth generates a profile of where (if anywhere) the concept is being computed. Comparing across multiple training steps t allows us to track the evolution of these profiles. We combine these regression scores into a single plot per concept that we refer to as a what–when–where plot because it visualizes what concept is being computed, where in the network this computation happens, and when during network training this concept emerged. What–when–where plots for a selection of concepts are given in Fig. What–when–where plots for the full set of concepts are given in SI Appendix; Figs. S21–S32 there show plots for AlphaZero trained from a different seed. By using a wide range of concepts, we can build up a detailed picture of the emergence of human concepts over the course of AlphaZero's training. Notably, Stockfish 8's threats function and AlphaZero's representation thereof, as detectable by a linear probe, become more and finally less correlated as AlphaZero becomes stronger (Fig.). The outliers highlight potential differences between AlphaZero's search and Stockfish's concept calculation.
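As an illustration of the depth-probing methodology, the following toy sketch fits a closed-form ridge-regression probe at each of several mock network depths and reports a test-set R² profile over depth. The activations, the scalar concept, the layer width, and the assumption that deeper layers encode the concept more strongly are all synthetic; this is not the paper's actual data or probe code.

```python
import numpy as np

def fit_ridge_probe(X, y, lam=1e-3):
    """Closed-form ridge regression probe: w = (X^T X + lam*I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def r2_score(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n_train, n_test, width = 512, 128, 32
concept = rng.normal(size=n_train + n_test)  # one scalar concept value per position

profile = []  # test-set probe score at each mock depth: the "where" profile
for depth in range(4):
    # Toy assumption: the concept signal grows with depth.
    signal = (depth + 1) / 4.0
    acts = rng.normal(size=(n_train + n_test, width))  # fake layer activations
    acts[:, 0] = signal * concept + (1 - signal) * rng.normal(size=n_train + n_test)
    w = fit_ridge_probe(acts[:n_train], concept[:n_train])
    profile.append(r2_score(concept[n_train:], acts[n_train:] @ w))
```

Repeating this loop at several training checkpoints t, and stacking the resulting depth profiles, yields the grid of scores that a what–when–where plot visualizes.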
We leverage the existence of a broad range of human chess concepts in conventional chess engines, such as Stockfish, to annotate positions with concept data. We have made a curated set of key positions with both human and AlphaZero play data available online.

Summary of Results

Many Human Concepts Can Be Found in the AlphaZero Network.

The AlphaZero neural network computes a probability distribution p for a next move and the expected outcome v of the game from a state z0. Its Monte Carlo tree search (MCTS) component uses the neural network to repeatedly evaluate states and update its action selection rule. A "state" consists of a current chess board position and a history of preceding positions, along with ancillary information such as castling rights, and it is represented as a real-valued vector z0. The outputs p and v are computed by the "policy head" and the "value head" of the AlphaZero network in Fig. The output p is typically referred to as AlphaZero's "prior," as it is a distribution over moves that is updated by the MCTS procedure. In this work, we investigate only the neural network component of AlphaZero, so we use the prior p directly rather than the move distribution following MCTS. The prior is not necessarily the move that AlphaZero will play when search is enabled, but training does update p toward the distribution after MCTS has been applied (SI Appendix, section 1 has details). Starting with a neural network with randomly initialized parameters θ, the AlphaZero network is trained from data that are generated as the system repeatedly plays against itself. Our experimental setup updates the parameters θ of the AlphaZero network over 1 million gradient descent training steps, an arbitrary training time slightly longer than that of AlphaZero in ref. Training data for each gradient step consist of input positions with their MCTS move probability vectors and final game outcomes.
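Training p toward the MCTS move probabilities and v toward the game outcome is typically written, following the original AlphaZero papers, as the per-position loss l = (z − v)² − π⊤log p + c‖θ‖². A minimal numpy sketch of that loss on toy inputs (the regularization constant c and all array shapes here are illustrative assumptions, not values from this paper):

```python
import numpy as np

def alphazero_loss(p_logits, v, mcts_pi, outcome, theta, c=1e-4):
    """AlphaZero-style per-position training loss: squared error between
    predicted value v and game outcome z, cross-entropy between the MCTS
    visit distribution pi and the prior p, plus L2 regularization."""
    log_p = p_logits - np.log(np.sum(np.exp(p_logits)))  # plain log-softmax
    value_loss = (outcome - v) ** 2
    policy_loss = -float(np.dot(mcts_pi, log_p))
    reg = c * float(np.sum(theta ** 2))
    return value_loss + policy_loss + reg

# Toy check: a uniform prior over 3 moves against a uniform MCTS visit
# distribution, with a perfect value prediction and theta set to zero,
# leaves only the cross-entropy term log(3).
loss = alphazero_loss(np.zeros(3), 1.0, np.ones(3) / 3, 1.0, np.zeros(4))
```

Because the policy term is a cross-entropy against the post-search visit distribution, minimizing it pulls the prior p toward the distribution after MCTS has been applied, which is why the prior alone is a meaningful object of study.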
We take a quantitative and qualitative approach to interpreting AlphaZero. Quantitatively, we apply linear probes to assess whether the network is representing concepts familiar to chess players. Behavioral analysis of AlphaZero, meanwhile, presents an obvious difficulty, since its game play is so far beyond that of a typical player. We address this issue by using behavioral analyses from a former world chess champion, V.K.* With his unique perspective, we analyze qualitative aspects of AlphaZero, especially with regard to opening play. Thanks to databases such as ChessBase, data on human games are plentiful, so we can compare the evolution of AlphaZero's play during training to the evolution of move choices in top-level human chess.
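One simple way to make such a play-evolution comparison concrete is to measure the divergence between AlphaZero's distribution over first moves at a given training step and the move frequencies observed in human master games. This is a hedged sketch only: the paper does not specify this metric, and every frequency below is invented for illustration.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two move-frequency
    distributions (symmetric, and bounded above by log 2)."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # 0 * log(0) is treated as 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical frequencies over the same three first moves (e4, d4, c4):
human = [0.45, 0.35, 0.20]     # invented human master-play frequencies
az_early = [0.90, 0.05, 0.05]  # invented early-training prior
az_late = [0.50, 0.30, 0.20]   # invented late-training prior

# In this toy example, the divergence from human play shrinks over training.
drift = js_divergence(human, az_early) - js_divergence(human, az_late)
```

Tracking such a divergence across checkpoints gives a single curve summarizing how closely the network's opening preferences track top-level human practice over training.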