chess-and-rl
Response to "Chess engines do weird stuff" (has quite a wild home page layout): "AlphaZero, RL, and SPSA", Cosmo Bobak
I argue forcefully against the notion that the self-play loop is in some sense “not necessary”, or even “only necessary one time”. Distillation from a fixed oracle has a ceiling: the student can approach, but never exceed, the quality of the teacher’s data. To surpass that ceiling, you must search-amplify the new network, generating better data than the old oracle could, and distill again — and this is precisely the self-play loop. The distance from random play to superhuman play is not crossed in one leap.
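To make the ceiling argument concrete, here is a deliberately crude toy sketch in Python. It is not taken from the post or from any real engine: playing strength is collapsed to a single number, and the names `distill`, `search_amplify`, `fixed_oracle`, and `self_play_loop`, along with the `rate`, `steps`, and `gain` values, are all hypothetical choices made for illustration. The only assumptions it encodes are the two from the argument above: distillation pulls the student toward the quality of its training data without overshooting it, and search on top of a network produces better data than the network alone.

```python
def distill(student: float, data_quality: float,
            rate: float = 0.5, steps: int = 20) -> float:
    """Pull the student toward the quality of its training data.

    Assumption: each step closes a fixed fraction of the remaining gap,
    so the student approaches the data quality but does not pass it.
    """
    for _ in range(steps):
        student += rate * (data_quality - student)
    return student


def search_amplify(net: float, gain: float = 100.0) -> float:
    """Assumption: search adds a fixed strength bonus on top of the raw network."""
    return net + gain


def fixed_oracle(oracle_quality: float, rounds: int) -> float:
    """Distill repeatedly from the same frozen teacher data."""
    student = 0.0
    for _ in range(rounds):
        student = distill(student, oracle_quality)
    return student


def self_play_loop(rounds: int) -> float:
    """Regenerate data with the current network plus search, then distill again."""
    net = 0.0
    for _ in range(rounds):
        data_quality = search_amplify(net)  # better data than the net alone
        net = distill(net, data_quality)
    return net


if __name__ == "__main__":
    print("fixed oracle, 10 rounds:", fixed_oracle(100.0, 10))
    print("self-play loop, 10 rounds:", self_play_loop(10))
```

In this toy, more rounds against the frozen oracle only bring the student closer to the oracle's level, while each pass through the loop raises the target the next distillation aims at. Real search gains are of course not a constant bonus; the constant is only there to keep the sketch short.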
This post is licensed under CC BY 4.0 by the author.