Reinforcement learning (RL) has been used successfully to solve tasks that have a well-defined reward function – think AlphaZero for Go, OpenAI Five for Dota, or AlphaStar for StarCraft. However, in many practical situations you don’t have a well-defined reward function. Even a task as seemingly straightforward as cleaning a room has many subtle cases: should a business card with a piece of gum stuck to it be thrown away as trash, or might it have sentimental value? Should the clothes on the floor be washed, or returned to the closet? Where are notebooks supposed to be stored? Even once these aspects of the task have been clarified, translating them into a reward function is non-trivial: if you provide a reward every time the agent sweeps up trash, it might learn to dump the trash back out so that it can sweep it up again.
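To make this failure mode concrete, here is a toy sketch (an illustration for this discussion, not from the original post): a two-state MDP where the reward is misspecified as “+1 every time trash is swept up.” Running value iteration shows that the best policy under this reward is to sweep the trash and then dump it back out, forever.

```python
import numpy as np

# Toy illustration of reward hacking in a tiny "cleaning" MDP.
# States: 0 = trash on the floor, 1 = trash in the bin.
# Actions: 0 = sweep, 1 = dump trash back out, 2 = do nothing.
n_states, n_actions, gamma = 2, 3, 0.95

# Deterministic transitions: next_state[state, action]
next_state = np.array([
    [1, 0, 0],   # trash on floor: sweep -> bin, dump -> floor, wait -> floor
    [1, 0, 1],   # trash in bin:   sweep -> stays in bin, dump -> floor, wait -> bin
])

# Misspecified reward: +1 whenever the agent sweeps while the floor is dirty.
reward = np.zeros((n_states, n_actions))
reward[0, 0] = 1.0

# Value iteration to find the optimal policy under this reward.
V = np.zeros(n_states)
for _ in range(1000):
    Q = reward + gamma * V[next_state]
    V = Q.max(axis=1)
policy = Q.argmax(axis=1)

print(policy)  # [0 1]: sweep when dirty, then dump the trash back out, forever.
```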
Alternatively, we can try to learn a reward function from human feedback about the behavior of the agent. For example, Deep RL from Human Preferences learns a reward function from pairwise comparisons of video clips of the agent’s behavior. Unfortunately, this approach can be very costly: training a MuJoCo Cheetah to run forward requires a human to provide 750 comparisons.
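For intuition, here is a minimal sketch of the reward-learning step behind that approach (a paraphrase under assumptions, not the authors’ code): a small reward network is trained so that the clip the human preferred gets the higher predicted return, via a Bradley-Terry (logistic) model over the two clips’ summed rewards. The observation size, network architecture, and clip length below are illustrative.

```python
import torch
import torch.nn as nn

obs_dim = 17  # e.g. a MuJoCo Cheetah observation (illustrative)
reward_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def preference_loss(clip_a, clip_b, human_prefers_a):
    """clip_a, clip_b: tensors of shape (clip_len, obs_dim);
    human_prefers_a: 1.0 if the human picked clip A, else 0.0."""
    return_a = reward_net(clip_a).sum()   # predicted return of clip A
    return_b = reward_net(clip_b).sum()   # predicted return of clip B
    # Bradley-Terry model: P(A preferred) = exp(R_A) / (exp(R_A) + exp(R_B))
    log_probs = torch.log_softmax(torch.stack([return_a, return_b]), dim=0)
    target = torch.tensor(human_prefers_a)
    return -(target * log_probs[0] + (1.0 - target) * log_probs[1])

# Each human comparison becomes one gradient step on the reward model;
# the agent is then trained with RL against reward_net's predictions.
clip_a, clip_b = torch.randn(25, obs_dim), torch.randn(25, obs_dim)
loss = preference_loss(clip_a, clip_b, human_prefers_a=1.0)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```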
Instead, we propose an algorithm that can learn a policy without any human supervision or reward function, by using information implicitly available in the state of the world. For example, we learn a policy that balances the Cheetah on its front leg from a single state in which it is balancing.
Humans seem to infer what other people want without needing much explicit supervision. How do we do it? As a concrete example, let’s imagine that you walk into a room, and you see this:
You’re probably going to immediately be a lot more careful, to ensure you don’t knock anything over.