On the surface, reinforcement learning (RL) seems like a great method for solving dialogue tasks. We can easily model the problem as a partially observable Markov decision process (POMDP) where the partially observed state represents the user’s intent. During each turn, the dialogue agent must make a decision about how to respond. The action space is either a sequence of tokens or, simplified even further, a single dialogue act. Lastly, task-oriented dialogue offers a natural reward: whether or not the dialogue succeeded. And yet, we don’t see (m)any deployed dialogue systems trained with RL. Why is that?
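To make the framing above concrete, here is a minimal sketch of a dialogue task cast as a POMDP. Everything in it is an assumption made for illustration: the intent names, the dialogue acts, the `ToyDialoguePOMDP` class and the noise model are all hypothetical, and a real system would be far richer.

```python
import random

# Hypothetical hidden user intents and agent dialogue acts (illustration only).
INTENTS = ["book_flight", "check_weather"]
ACTIONS = ["ask_intent"] + ["confirm_" + intent for intent in INTENTS]


class ToyDialoguePOMDP:
    """Toy environment: the user's intent is hidden, observations are noisy."""

    def __init__(self, seed=0, noise=0.2, max_turns=5):
        self.rng = random.Random(seed)
        self.noise = noise          # chance an utterance signals the wrong intent
        self.max_turns = max_turns
        self.intent = None
        self.turn = 0

    def reset(self):
        # Sample the hidden state; the agent never sees it directly.
        self.intent = self.rng.choice(INTENTS)
        self.turn = 0
        return "hello"

    def step(self, action):
        """Apply one dialogue act and return (observation, reward, done)."""
        if action not in ACTIONS:
            raise ValueError(f"unknown dialogue act: {action}")
        self.turn += 1
        if action == "ask_intent":
            # The user's answer is a noisy observation of the hidden intent.
            if self.rng.random() < self.noise:
                observation = self.rng.choice(
                    [i for i in INTENTS if i != self.intent]
                )
            else:
                observation = self.intent
            return observation, 0.0, self.turn >= self.max_turns
        # Any confirm act ends the dialogue; reward is task success.
        success = action == "confirm_" + self.intent
        return "bye", 1.0 if success else 0.0, True
```

Note that the only learning signal is the sparse success reward at the end of the dialogue, and that the agent has to act on noisy evidence about the intent rather than on the intent itself.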
Conversational AI was all the rage a few years back, when people were shouting from the rooftops that chatbots were going to take over the world. But for all the fanfare and hullabaloo, the trumpeting of a new era has lately given way to a low, dull roar. Depending on who you ask, either AGI is right around the corner, or all this noise is simply over-hyped technology soon to float away like the vaporware of the past. I believe the truth lies somewhere in the middle: there will be a revolution, but it won’t happen overnight. Instead, changes will start out incremental as the technology is rolled out, and users will slowly adopt new social norms around dealing with virtual assistants. I don’t claim to know when this will happen or exactly what it will look like, but there are certainly some clues.
Training dialogue agents for real-life use cases is immensely difficult since manual data annotation quickly hits scaling issues. One way around this is to build a user simulator which can theoretically then generate tons of examples for the agent to learn from. However, to build a system representing the user, you would need a model that understands how to react and respond to agents. But to train such a user model you would then need some dialogue system that acts as an agent. So we have a chicken-and-egg problem, right? Well, not quite. There is at least one key distinction between a user simulator and a dialogue agent.
Data augmentation methods are a staple when training computer vision models, with methods like flipping, resizing, cropping and blurring used so ubiquitously that they are a foregone conclusion in most systems. These methods help improve model robustness so that any way you change the image of a cat, the model still recognizes the item in the picture as a cat. This is relatively straightforward since all of the aforementioned techniques keep the main object the same: a cat remains a cat, and does not somehow magically morph into a dog. But does this work for NLP as well?
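For reference, the label-preserving image transformations mentioned above can be sketched in a few lines. This is a toy version under stated assumptions: images are plain nested lists of pixel values, and the helper names (`horizontal_flip`, `random_crop`, `augment`) are hypothetical rather than taken from any particular library.

```python
import random


def horizontal_flip(image):
    """Mirror each row of the image left to right."""
    return [row[::-1] for row in image]


def random_crop(image, height, width, rng):
    """Cut a height x width window from a random position in the image."""
    top = rng.randint(0, len(image) - height)
    left = rng.randint(0, len(image[0]) - width)
    return [row[left:left + width] for row in image[top:top + height]]


def augment(image, label, rng):
    """Return a perturbed copy of the image with its label left unchanged."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    # Crop one pixel off each dimension (assumes the image is at least 2x2).
    image = random_crop(image, len(image) - 1, len(image[0]) - 1, rng)
    return image, label
```

The key property is in the last line of `augment`: the pixels change but the label is passed through untouched, which is exactly the assumption that is harder to guarantee once the input is text.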
In order for a virtual assistant to be useful, the agent should do more than just information retrieval and basic chit-chat. Rather than pattern recognition on the response level, the agent should be able to perform pattern recognition on the discourse level so it can mimic human reasoning (even as true understanding remains an elusive goal). If a model is to reason about an utterance, it must have been trained to do so. Furthermore, we argue that such training must be explicitly performed through (weakly) supervised learning, rather than implicitly extracted from a large pre-trained LM (e.g. through careful prompting).