On the surface, reinforcement learning (RL) seems like a great method for solving dialogue tasks. We can easily model the problem as a POMDP where the partially observed state represents the user’s intent. During each turn, the dialogue agent must make a decision about how to respond. This action space is represented as either a series of tokens or simplified even further into a single dialogue act. Lastly, task-oriented dialogue offers a natural reward — whether or not the dialogue succeeded. And yet, we don’t see (m)any deployed dialogue systems trained with RL. Why is that?

## RL as Supervised Learning in Sheep’s Clothing

In order to apply RL in realistic environments, a number of adjustments are often made to make the training tractable. While classic RL operates in an online setting, training an agent in such a manner quickly becomes impractical (e.g., customer service), costly (e.g., robotics), or even downright unethical (e.g., healthcare). So as a simplification, trajectories from a human agent are stored in an experience replay buffer for offline training.1 Since the target policy $$\pi_g$$ is obviously not the same as the behavioral policy $$\pi_b$$ used to collect the data, off-policy evaluation methods are also applied.2 Conversations within even a narrow domain are hard to model completely, so now we must adopt model-free RL.3 Finally, since the state and action spaces of conversations are unbounded, modern RL systems do away with a table of Q-values, and instead use neural networks as function approximators. In fact, why even calculate Q-values when directly optimizing the policy with REINFORCE works just as well?4 At the end of the day, we’re left with an RL agent trained on mini-batches of examples with a loss function that takes into account neither world models nor discount factors. And since the underlying implementation in both cases is just a Transformer,5 this version of RL starts to look practically indistinguishable from regular ol’ supervised learning (SL).
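To make the resemblance concrete, here is a minimal sketch in plain Python. The function names and the toy batch are made up for illustration, and the "return" attached to each example is assumed to be a single scalar per logged turn. Under those assumptions, the REINFORCE surrogate loss is just the supervised negative log-likelihood with each example weighted by its return.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def supervised_loss(batch):
    """Behavior cloning: mean negative log-likelihood of the logged action."""
    return -sum(log_softmax(logits)[action] for logits, action, _ in batch) / len(batch)

def reinforce_loss(batch):
    """REINFORCE surrogate: the same log-likelihood, weighted by the return."""
    return -sum(ret * log_softmax(logits)[action] for logits, action, ret in batch) / len(batch)

# Each item is (policy logits over actions, logged action index, return).
# These numbers are illustrative only, not taken from any dataset.
batch = [
    ([2.0, 0.5, -1.0], 0, 1.0),
    ([0.1, 0.2, 0.3], 2, 1.0),
]
```

When every logged turn carries a return of 1.0, as in the toy batch, the two losses coincide. The only thing RL adds in this stripped-down setting is the per-example weight.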

It turns out, even folks working in reinforcement learning will readily admit that RL can be simply viewed as SL with a twist. Specifically, the tweak is that the RL algorithms select good slices of data before applying behavior cloning (a.k.a. supervised learning) to master the task.6 For example, hindsight relabeling provides better data by replacing the agent’s original goal with the goal that was actually achieved.7 However, this technique of labeling the data after it was collected also has its supervised learning equivalent. In fact, labeling the user intents from a large pool of unannotated conversations is precisely the most common method of providing supervision for the intent detection task.8 More critically, isn’t data manipulation just some hack to improve training stability rather than a core part of reinforcement learning methodology? The algorithmic part of the equation is behavior cloning, which seems to be identical to supervised learning. So does RL really boil down to just SL wrapped with some fancy math?
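As a rough illustration of hindsight relabeling, the sketch below uses a hypothetical `Episode` record and a binary success reward. Both are simplifying assumptions of mine rather than a particular library's API. The relabeling step rewrites the goal after the fact, so that a failed episode becomes a successful demonstration of a different goal.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Episode:
    goal: str            # what the agent was asked to achieve
    achieved: str        # what it actually ended up achieving
    actions: tuple       # the dialogue acts it took along the way
    reward: float = 0.0  # assumed binary: 1.0 on success, 0.0 otherwise

def hindsight_relabel(episode):
    """Treat the achieved outcome as if it had been the goal all along."""
    return replace(episode, goal=episode.achieved, reward=1.0)

# A made-up failed episode: the user wanted a booking, the agent looked up a fee.
failed = Episode(
    goal="book_flight",
    achieved="check_baggage_fee",
    actions=("ask_airline", "lookup_fee"),
)
relabeled = hindsight_relabel(failed)
```

The actions are untouched. Only the label changes, which is why the step looks so much like annotating data after collection.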

## Exploring Sub-optimal Trajectories

The key benefit which allows RL to potentially outshine SL is that it can learn from non-optimal trajectories.9 In other words, the student can overcome the teacher by avoiding (or even learning from) the mistakes that the teacher has made. Based on my understanding so far, this concept can manifest itself in at least three different ways: maximizing good outcomes, minimizing bad outcomes, and changing bad outcomes into good ones.

First off, RL encourages exploration, which allows for finding better reward regions.10 Since RL operates with sparse rewards, sometimes going to a bad part of the environment is ok as long as the agent eventually learns to return to the good part. This exploration occasionally pays off when the agent discovers areas of high reward that mindless imitation of past behavior could never visit. Secondly, trajectories that end in a bad outcome are not as damaging because the reward signal will automatically lead the model to ignore such experiences.11 This useful mechanism is not found in supervised learning, as evidenced by the pervasive issue of generating boring, non-committal responses (e.g., “I don’t know”) when such dialogue is clearly not ideal. The SL agent simply copies what it sees most, whereas the RL agent is able to downplay such utterances. Lastly, an RL agent can take advantage of imperfect outcomes by stitching together the good parts of bad trajectories.12 It can also use poor outcomes as examples of what not to do. Incorporating a penalty into training is trivial: just add a negative reward. While modifying the loss function within supervised learning is certainly also possible, figuring out the exact formulation is not as straightforward. Lest we forget, our previously mentioned hindsight relabeling is another way to turn dirt into gold.13

What we saw in the previous section was not a single, isolated case of RL providing better data, but actually part of a larger trend where an RL model achieves superior performance by directly tackling the data problem along with the modeling (policy). How realistic are these advantages though? While these three cases allow RL to theoretically outperform SL, does this play out in practice? To start, we’ve already discussed the catastrophic consequences of deploying an RL agent to learn from real users as it explores unsuccessful paths. So while the model won’t be harmed too much by sub-optimal trajectories, the product surrounding the model will suffer. To get around this, one could consider developing a user simulator to mimic customer behavior.14 Much like the gaming environment of Atari allowed RL to achieve super-human performance on video games, an accurate user simulator should allow an agent to take full advantage of what RL can offer.

## Ideal Data Generators for RL

If the key to unlocking reinforcement learning is to design a robust user simulator, how does one go about doing that exactly? Well, a reinforcement learning environment is expected to take in the current state along with an agent action to produce the next state along with its associated reward.11 This implies that a proper user simulator contains two components, a representative model of the user to output next states and an accurate method of evaluation to output well-calibrated rewards.
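The two components can be sketched as a toy environment. Everything here is hypothetical: the `ToyUserSimulator` class, its slot-filling user model, and the +1/-1 booking reward are illustrative assumptions, not a real simulator.

```python
class ToyUserSimulator:
    """A user model produces the next state; an evaluator produces the reward."""

    def __init__(self, user_goal):
        self.user_goal = dict(user_goal)  # hidden from the agent
        self.state = {}                   # slots the agent has collected so far

    def user_model(self, action):
        """Simulated user reply: reveal a slot value when the agent requests it."""
        kind, slot = action
        if kind == "request" and slot in self.user_goal:
            return {slot: self.user_goal[slot]}
        return {}

    def evaluate(self, action):
        """Reward only arrives at booking time and depends on the hidden goal."""
        kind, value = action
        if kind == "book":
            return 1.0 if value == self.user_goal else -1.0
        return 0.0

    def step(self, action):
        """Take an agent action and return (next_state, reward, done)."""
        self.state = {**self.state, **self.user_model(action)}
        reward = self.evaluate(action)
        done = action[0] == "book"
        return dict(self.state), reward, done
```

A short interaction might look like this:

```python
sim = ToyUserSimulator({"cuisine": "italian", "party_size": 2})
state, reward, done = sim.step(("request", "cuisine"))
state, reward, done = sim.step(("request", "party_size"))
state, reward, done = sim.step(("book", state))
```

Note that `evaluate` can only score the booking because the simulator was handed the user's goal up front, which is exactly the information a real deployment lacks.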

Touching upon the user model first, we argue this is essentially intractable since there is no feasible way to predict how a user would react in any given situation. If we knew what the user wanted, we wouldn’t need agent interaction in the first place, and could just fulfill the customer request immediately. In the context of games, the goal of an RL agent is obviously to maximize the score. But humans don’t come with a universal scoring function. Even for the same task, different people might want completely different things. For example, two groups want to book a dinner reservation at the same Italian restaurant, but one is a large party with complex dietary restrictions, while the other is a romantic dinner for two. Frankly, even for the same person, the ideal outcome differs over time. Consider someone buying a movie ticket one week, but wanting to watch a different movie the next week. Building a truly accurate model would entail constructing a new environment for every user. Clearly, we must relax the assumptions about users and instead assume that there exist patterns among different users even if their individual circumstances differ slightly. We can operationalize the idea that two different users want the same things by offering the agent identical rewards when performing some desired action that equally benefits both users.

What we’ve learned so far is that reinforcement learning is beneficial only insofar as we can build useful user simulators for producing agent experiences, and that the only aspect of user simulators we can realistically control is their ability to generate appropriate reward signals. However, real life has no intrinsic rewards, so how do we determine dialogue success? Open domain chat includes subjective measures such as user satisfaction and dialogue fluency15, but defining success in task-oriented dialogue is also imprecise.16 Suppose the user is booking a flight and the goal is to select the correct flight from a set of options in a KB. Forget the extra details of dealing with discount codes or fees for checked bags. Let’s assume the agent is given the straightforward task of collecting information on a finite set of constraints, such as desired price range, departure and destination locations, and number of seats. Then suppose the agent successfully books the correct flight that meets all such criteria and deserves a (discounted) reward for all steps taken. Where does this reward score come from? We don’t know a priori what the user wanted, so a human would need to go in after the fact to mark the conversation as successful. To the extent that we are just labeling data, it doesn’t seem like reinforcement learning is any better (or any worse) than training under a supervised learning paradigm.
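For the flight example, the mechanics of that (discounted) reward are simple once the label exists. The helper below is a hypothetical sketch: it assumes a single binary success label assigned by a human annotator and a discount factor `gamma` chosen arbitrarily for illustration.

```python
def discounted_returns(num_turns, success, gamma=0.95):
    """Spread one end-of-dialogue success label back over every agent turn."""
    final_reward = 1.0 if success else 0.0
    # The last turn receives the full reward; each earlier turn is discounted once more.
    return [final_reward * gamma ** (num_turns - 1 - t) for t in range(num_turns)]

# A three-turn dialogue that a human annotator marked as successful.
returns = discounted_returns(3, success=True, gamma=0.5)
```

Every value in `returns` is derived from the one human-provided label, so the annotation cost is the same as it would be for a supervised dataset.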

Ultimately, RL algorithms are great, but the real limitation is the ability to quickly and scalably label collected conversations. To take it a step further, one might say the real problem comes down to getting the right data. Of course, that was always the key to begin with, since good data is the answer to everything on this blog ;)