Many experimental works have observed that generalization in deep RL appears to be difficult: although RL agents can learn to perform very complex tasks, they don’t seem to generalize over diverse task distributions as well as the excellent generalization of supervised deep nets might lead us to expect. In this blog post, we will aim to explain why generalization in RL is fundamentally harder, and indeed more difficult even in theory.
We will show that attempting to generalize in RL induces implicit partial observability, even when the RL problem we are trying to solve is a standard fully-observed MDP. This induced partial observability can significantly complicate the types of policies needed to generalize well, potentially requiring counterintuitive strategies like information-gathering actions, recurrent non-Markovian behavior, or randomized strategies. Ordinarily, this is not necessary in fully observed MDPs but surprisingly becomes necessary when we consider generalization from a finite training set in a fully observed MDP. This blog post will walk through why partial observability can implicitly arise, what it means for the generalization performance of RL algorithms, and how methods can account for partial observability to generalize well.
Learning By Example
Before formally analyzing generalization in RL, let’s begin by walking through two examples that illustrate what can make generalizing well in RL problems difficult.
The Image Guessing Game: In this game, an RL agent is shown an image each episode, and must guess its label as quickly as possible (Figure 1). Each timestep, the agent makes a guess; if the agent is correct, then the episode ends, but if incorrect, the agent receives a negative reward, and must
This article is purposely trimmed, please visit the source to read the full article.