Conversational AI was all the rage a few years back, when people were shouting from the rooftops that chatbots were going to take over the world. But for all the fanfare and hullabaloo, the trumpeting of a new era has lately given way to a low, dull roar. Depending on whom you ask, either AGI is right around the corner, or all this noise is over-hyped technology that will soon float away like the vaporware of the past. I believe the true answer lies somewhere in the middle: there will be a revolution, but it won’t happen overnight. Instead, changes will start out incremental as the technology is rolled out, and users will slowly adopt new social norms around dealing with virtual assistants. I don’t claim to know when this will happen or exactly what it will look like, but there are certainly some clues.
From the technological perspective, it seems there is a set of seven criteria that will usher in the Conversational AI revolution. Is this a hot take? Perhaps a little, but anyone who disagrees would have a hard time making a convincing case otherwise. Without further ado, the list goes as follows:
Long-term Context
Going back ten years, being able to handle context from multiple turns back in the dialogue seemed like an impossibility. The best dependency parsers would struggle to perform proper co-reference resolution even within the same utterance, much less across utterances. With the explosion in performance provided by Transformers though, this has completely changed. Models now routinely handle documents with many paragraphs and conversations with many turns. While this problem is largely solved now, we should recognize that it certainly is a prerequisite for usable and useful conversational AI systems.
Syntax and Coherency
If you take a Transformer and pre-train it with masked language modeling (MLM) or just regular language modeling (LM), you get the encoder-only BERT and the decoder-only GPT, respectively. These models have shown tremendous gains in producing fluent and coherent text across a wide variety of applications, including dialogue. You can also combine the encoder and decoder components to get a seq2seq model, such as T5 or BART, which can theoretically handle all of the above. Perhaps most critically, scaling these models to hundreds of billions of parameters and training them on giant amounts of data leads to performance that matches or exceeds human baselines on a number of NLP benchmarks. Once again, although the LaMDAs and Chinchillas have largely solved the syntax problem, this wasn’t the case a decade ago, and the progress should be recognized.
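To make the difference between the two pre-training objectives concrete, here is a minimal sketch in plain Python. The function names and the fixed mask positions are my own illustrative choices, not anything from the BERT or GPT codebases; real MLM pre-training picks positions at random and works on subword IDs rather than whole words.

```python
from typing import Dict, List, Set, Tuple


def mlm_example(tokens: List[str], mask_positions: Set[int],
                mask_token: str = "[MASK]") -> Tuple[List[str], Dict[int, str]]:
    """Masked LM: hide some tokens, predict them from context on BOTH sides."""
    inputs = [mask_token if i in mask_positions else tok
              for i, tok in enumerate(tokens)]
    # Targets are only defined at the masked positions.
    targets = {i: tokens[i] for i in sorted(mask_positions)}
    return inputs, targets


def lm_examples(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Regular (causal) LM: predict each token from the tokens to its LEFT only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


sentence = ["the", "cat", "sat", "down"]

# Hypothetical choice of which positions to mask (assumption for illustration).
masked_inputs, masked_targets = mlm_example(sentence, {1, 3})
causal_pairs = lm_examples(sentence)

print(masked_inputs, masked_targets)
for prefix, next_token in causal_pairs:
    print(prefix, "->", next_token)
```

The masked version sees the whole sentence minus the holes, which suits understanding tasks, while the causal version only ever looks left, which is what makes it a natural generator of dialogue responses.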
Consistent Persona
Moving beyond long-term syntactic control, a proper virtual assistant also requires semantic control. Not too long ago, asking a dialogue agent about its profession (“What do you do?” “I am a 3rd grade teacher.”) would yield one answer at the beginning of a conversation, but a different answer (“I commute to the hospital each day as a doctor.”) just a few turns later. A working system should avoid such contradictions and maintain a consistent personality. In addition to the power of large pre-trained language models (PLMs), work on unlikelihood training has done a reasonable job of making sure the model remembers what it said about itself earlier in the conversation.
Memory Extensions
What about remembering what the other person said? While we’ve already discussed the ability of modern models to capture simple details within their latent state, a truly useful system should also be endowed with the ability to track discrete states. This serves at least two practical purposes. First, the main goal of any task-oriented dialogue system is to extract the proper slot-values from an utterance for policy decision-making. Predicted slot-values must be exact in order to execute API calls and need to be tracked precisely over time. Even in the open-domain scenario, having a discrete state to represent the user’s preferences would be immensely valuable. Second, a discrete state can be inspected and modified, which improves control over the model. If done before deployment, initializing the model’s memory can serve as a way of injecting some commonsense reasoning as a prior. A long line of work from FAIR has focused on this area, culminating in the release of BlenderBot 3, which can not only remember discrete facts about both speakers, but can also augment that knowledge by retrieving facts from the Internet.
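As a rough illustration of what a discrete state buys you, here is a minimal sketch of a slot-value memory. The class, the slot names, and the flight-booking scenario are all hypothetical, and the per-turn predictions are hard-coded where a real system would call a slot-filling model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DialogueState:
    """Discrete slot-value memory, kept separate from the model's latent state."""
    slots: Dict[str, str] = field(default_factory=dict)

    def update(self, predicted: Dict[str, str]) -> None:
        # A later turn overwrites an earlier value for the same slot.
        self.slots.update(predicted)

    def missing(self, required: List[str]) -> List[str]:
        # Slots still needed before an API call can be executed.
        return [name for name in required if name not in self.slots]


# Memory initialized before deployment, acting as a prior (assumed value).
state = DialogueState(slots={"currency": "USD"})

# Hypothetical slot predictions from three consecutive user turns.
state.update({"destination": "Boston"})
state.update({"date": "2024-05-01"})
state.update({"destination": "Chicago"})  # the user changed their mind

required = ["destination", "date", "passengers"]
print(state.slots)              # the state can be inspected directly
print(state.missing(required))  # what the agent still has to ask for
```

Because the state is a plain dictionary rather than a hidden vector, it can be read, corrected, or pre-filled by hand, which is exactly the kind of control the latent state does not offer.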
Interactive Learning
Learning new skills after deployment is also an absolute requirement for working conversational agents, since no matter how much a model knows or remembers when it is trained, that knowledge will become outdated. Being able to adapt, either through human feedback or through other reward signals, will be critical to staying relevant over time. The difference between the two is that human feedback provides an explicit signal through direct interaction, such as clicking a thumbs up / thumbs down button at the end of a chat. The much more scalable (but also noisier) option is to follow an implicit signal, such as measuring user satisfaction (with a separately trained model) or keeping track of user engagement (number of dialogue turns).
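One simple way to combine the two kinds of signal is to trust the explicit one whenever it exists and fall back to the implicit one otherwise. The sketch below assumes a hypothetical reward function with made-up scaling; the cap of 20 turns and the reward ranges are arbitrary choices, not values from any published system.

```python
from typing import Optional


def reward(thumbs_up: Optional[bool], num_turns: int, max_turns: int = 20) -> float:
    """Reward for one finished chat.

    thumbs_up: explicit feedback, or None if the user never clicked a button.
    num_turns: implicit engagement proxy (number of dialogue turns).
    """
    if thumbs_up is not None:
        # Explicit signal: sparse but reliable, so it takes priority.
        return 1.0 if thumbs_up else -1.0
    # Implicit signal: plentiful but noisy, so it is capped and scaled to [0, 1].
    return min(num_turns, max_turns) / max_turns


print(reward(True, 3))    # explicit thumbs up
print(reward(False, 15))  # explicit thumbs down overrides a long chat
print(reward(None, 10))   # no button click, fall back to engagement
```

Note that the implicit branch never goes negative here, which encodes the assumption that a short chat is weak evidence rather than a clear failure.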
Ambiguity and Uncertainty
By satisfying all the criteria above, we end up with a model that is not only amazingly powerful in the short term, but arguably gets stronger as time goes on. With that said, no dialogue system is ever perfect, so one more requirement is the ability to ask for clarification when it does not know something. As the famous Donald Rumsfeld quote goes, “There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don’t know we don’t know.” Ergo, a working system must learn how to convert unknown unknowns into known unknowns.
Recognizing unknown or out-of-distribution topics spans multiple fields, including abstention, uncertainty calibration, open set recognition, and out-of-scope detection. While tremendous progress has been made on this problem in general, not much of it has transferred over into the dialogue domain yet. In some sense, this may never be fully solved since, on occasion, conversations are just inherently ambiguous. As a result, a dialogue system should not aim for the impossible task of resolving all ambiguity, but should instead learn how to advance a conversation despite the uncertainty.
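The simplest form of abstention is to ask for clarification whenever the model’s top prediction is not confident enough. The sketch below uses the maximum softmax probability with a fixed threshold; the intent names, the scores, and the 0.7 cutoff are assumptions for illustration, and this baseline is known to be poorly calibrated without further work.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    # Subtract the max for numerical stability before exponentiating.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def respond_or_clarify(intent_scores: Dict[str, float], threshold: float = 0.7) -> str:
    """Return the predicted intent, or "clarify" if confidence is too low."""
    labels = list(intent_scores)
    probs = softmax([intent_scores[label] for label in labels])
    best = max(range(len(labels)), key=lambda i: probs[i])
    if probs[best] < threshold:
        return "clarify"  # turn a silent guess into a known unknown
    return labels[best]


# Hypothetical raw scores from an intent classifier.
confident = respond_or_clarify({"book_flight": 5.0, "cancel_booking": 0.0})
ambiguous = respond_or_clarify({"book_flight": 1.0, "cancel_booking": 0.9})
print(confident)
print(ambiguous)
```

A threshold like this only catches uncertainty the model can express in its own scores; inputs far from the training distribution can still receive confident wrong predictions, which is why the dedicated out-of-scope methods mentioned above exist.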
Low Data Regime
Last but not least, the raison d’être of the blog: dealing with limited and noisy data. Even when the dialogue system can do everything we imagine, there is no revolution if training such a model requires an inordinate amount of high-quality data. Pre-training certainly helps, but even for large PLMs, there is evidence that data is the bottleneck. Data augmentation and synthetic text generation certainly have a role to play too. As previous articles have hinted, though, I believe we need to find dialogue-specific methods of data generation to solve this problem. Overall, I am quite hopeful that this last issue will also be solved well enough to bring about some real change.
Since all these components are at least somewhat satisfied, does that mean the revolution is right around the corner? Well, not quite. While the technology is mostly there, that doesn’t imply that you can build a business around it overnight. The ability to execute and adapt to a changing environment will play a big role in determining who ends up pulling this off, and when. That time seems near, though.