Recent years have demonstrated the potential of deep multi-agent reinforcement learning (MARL) to train groups of AI agents that can collaborate to solve complex tasks. For instance, AlphaStar achieved professional-level performance in the StarCraft II video game, and OpenAI Five defeated the world champions in Dota 2. These successes, however, were powered by enormous computational resources: tens of thousands of CPUs, hundreds of GPUs, and even TPUs were used to collect and train on a large volume of data. This has motivated the academic MARL community to develop MARL methods that train more efficiently.
DeepMind’s AlphaStar attained professional-level performance in StarCraft II but required enormous amounts of computational power to train.
Research on more efficient and effective MARL algorithms has focused on off-policy methods (which store and reuse data for multiple policy updates) rather than on-policy algorithms, which use newly collected training data before each update to the agents’ policies. This is largely due to the common belief that off-policy algorithms