BASALT: A Benchmark for Learning from Human Feedback

TL;DR: We are launching a NeurIPS competition and benchmark called BASALT: a set of Minecraft environments and a human evaluation protocol that we hope will stimulate research into solving tasks with no pre-specified reward function, where the goal of an agent must be communicated through demonstrations, preferences, or some other form of human feedback. Sign up to participate in the competition!

Deep reinforcement learning takes a reward function as input and learns to maximize the expected total reward. An obvious question is: where did this reward come from? How do we know it captures what we want? Often it doesn't: many recent examples show the provided specification leading the agent to behave in unintended ways.
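To make the setup concrete, the standard RL objective (standard notation, not anything BASALT-specific) is to find a policy $\pi$ that maximizes expected discounted return, where the reward function $r$ is assumed to be given. That assumption is exactly what this post questions.

```latex
J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right]
```

Here $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory sampled by running $\pi$ in the environment, and $\gamma \in [0, 1]$ is the discount factor. Everything downstream of this formula inherits whatever flaws $r$ has.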

Our existing algorithms have a problem: they implicitly assume access to a perfect specification, as though one has been handed down by God. Of course, in reality, tasks don’t come pre-packaged with rewards; those rewards come from imperfect human reward designers.

For example, consider the task of summarizing articles. Should the agent focus more on the key claims, or on the supporting evidence? Should it always use a dry, analytic tone, or should it copy the tone of the source material? If the article contains toxic content, should the agent summarize it faithfully, mention that toxic content exists but not summarize it, or ignore it completely? How should the agent deal with claims that it knows or suspects to be false? A human designer likely won’t be able to capture all of these considerations in a reward function on their first try, and, even if they did manage to have a complete set of considerations in mind, it might be quite difficult to translate these conceptual preferences into a reward function the environment can directly calculate.

Since we


The post BASALT: A Benchmark for Learning from Human Feedback appeared first on The Berkeley Artificial Intelligence Research Blog.
