A key sub-issue in designing conversational agents is being able to reliably calculate uncertainty over the model’s beliefs. With that ability, the model could recognize when it does not understand something and appropriately ask for clarification. Thus, we can imagine the output of an uncertainty model feeding into a dialogue policy manager, which then decides either to retrieve an answer from the knowledge base when it is fairly certain it knows what the customer wants, or to ask a follow-up question when it is unsure. From an information-theory point of view, this can be seen as a model that asks questions to reduce entropy until it falls below a certain threshold, at which point it returns an answer. Beyond improving model behavior, measuring uncertainty also gives a view into how the model is thinking, which helps with debugging and interpretability.
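As a rough illustration of that policy, here is a minimal sketch of an entropy-threshold rule; the threshold value and the function names are made up for the example, not part of any particular system.

```python
import numpy as np

def entropy(belief):
    """Shannon entropy (in nats) of a belief distribution over intents."""
    belief = np.asarray(belief, dtype=float)
    belief = belief / belief.sum()
    return -np.sum(belief * np.log(belief + 1e-12))

def next_action(belief, threshold=0.5):
    """Answer only when the belief is sharp enough; otherwise ask for clarification."""
    return "answer" if entropy(belief) < threshold else "clarify"

# A confident belief triggers an answer; a flat one triggers a follow-up question.
print(next_action([0.9, 0.05, 0.05]))   # -> answer
print(next_action([0.4, 0.35, 0.25]))   # -> clarify
```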
A sensible way to approach such a problem is to lean on Bayesian methods, which naturally offer a distribution over posterior beliefs. As such, it might make sense to tackle this subject with Gaussian Processes. A Gaussian Process is a probability distribution over the possible functions that fit a set of points. In simplified terms, any point from your input X is mapped to a corresponding label Y, which is its own Gaussian random variable. For points x that come from your dataset, the variance of y is relatively low, since you know the actual values y takes on. For values x̂ that are far away from your dataset, you must extrapolate to calculate ŷ, which leads to Gaussians with high variance. For values x* that lie in between the points found in your dataset, you would expect the uncertainty of y* to be somewhere in the middle, since you are merely interpolating. Overall, each of these points requires its own Gaussian, and thus you end up with a model composed of a multivariate Gaussian with infinitely many dimensions.
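To make the “distribution over functions” idea concrete, here is a small sketch that draws sample functions from a GP prior with an RBF (squared-exponential) kernel; the kernel choice and its hyper-parameters are illustrative assumptions, not a recommendation.

```python
import numpy as np

def rbf_kernel(xa, xb, sigma=1.0, length=1.0):
    """Squared-exponential kernel between two 1-D sets of points."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return sigma ** 2 * np.exp(-0.5 * sq_dist / length ** 2)

x = np.linspace(-5, 5, 100)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability

# Each sample is one plausible function, evaluated at the grid points.
samples = np.random.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=3)
```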
Of course, you can’t perform inference over an infinite number of Gaussians, so we use the kernel trick to describe the covariance matrix. In slightly more detail, recall that a multivariate Gaussian can be fully described by an m-dimensional vector of means and an m × m covariance matrix. As m goes towards infinity, this matrix can instead be described by a kernel K(x_i, x_j). (For more details, a simple search will return many results; my favorites are the Distill publication, the ICL lecture video, and the Stanford lecture notes.) Next, to perform inference, you use the standard equations to condition on the data you know, typically via a Cholesky decomposition, and then make predictions about the unknown points. Unfortunately, this calculation requires factorizing a matrix the size of your dataset, which costs O(n^3). That can work for small problems, but quickly becomes intractable for larger datasets containing hundreds of thousands of examples. In addition to being slow, GPs also have hyper-parameters to tune (namely σ and l) that require some domain knowledge. There is also the problem of choosing the right kernel in the first place to serve as the prior. Finally, Gaussian Processes require a bit of manipulation to get them to work for classification and RL problems.
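For reference, here is a sketch of that inference step using the standard Cholesky-based equations (in the style of Rasmussen & Williams, Algorithm 2.1), reusing the rbf_kernel from the previous snippet; the noise level and training data are placeholders.

```python
import numpy as np

def gp_posterior(x_train, y_train, x_test, noise=1e-2):
    """Predictive mean and standard deviation of a GP with an RBF kernel."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)

    L = np.linalg.cholesky(K)                        # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                             # predictive mean
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                             # predictive covariance
    return mean, np.sqrt(np.clip(np.diag(cov), 0, None))

x_train = np.array([-3.0, -1.0, 0.0, 2.0])
y_train = np.sin(x_train)
x_test = np.linspace(-5, 5, 50)
mean, std = gp_posterior(x_train, y_train, x_test)
# std is small near the training points and grows as we extrapolate.
```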
In any case, the research community has shifted towards modeling the world through deep learning, which requires new ideas for calculating uncertainty because neural networks are trained with gradient descent. Luckily, the commonplace tool of dropout can be easily adapted to fit our needs. More concretely, suppose we have an existing model that has already been trained to convergence. Then, at test time, we simply keep perturbing the model with dropout for each input to generate random samples of the model. In this sense, we get an ensemble of models whose predictions can be averaged to give a higher-confidence prediction. More importantly, in addition to a mean, the predictions from this set of models have a variance that can be empirically calculated. I think the key insight here is that we are not perturbing the inputs to generate noise, but rather sampling from a distribution of models. Intuitively, this is equivalent to the distribution of functions a GP uses to fit the data. Mathematically, the equivalence is shown in the derivations of the paper by Gal and Ghahramani: https://arxiv.org/abs/1506.02142.
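Here is a minimal sketch of that test-time procedure in PyTorch, in the spirit of Gal and Ghahramani’s MC dropout; the architecture, dropout rate, and number of samples are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

# A toy classifier with a dropout layer; any trained model with dropout works the same way.
model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(128, 10),
)

def mc_dropout_predict(model, x, n_samples=100):
    """Run repeated stochastic forward passes with dropout left on."""
    model.train()  # keep dropout active so each pass samples a different sub-network
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )
    return probs.mean(dim=0), probs.var(dim=0)   # predictive mean and variance

x = torch.randn(1, 64)
mean, var = mc_dropout_predict(model, x)
```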
Even with this method, though, we still face a number of considerable complications before we can have dialogue models that reason about uncertainty. To start, recall that our predicted belief represents a distribution over user intents, so we must assume a finite ontology of intents already exists that can properly approximate the meaning of utterances. This is a non-trivial assumption, but it is outside the scope of this blog post. Even assuming the previous conditions are met, the dropout method still might not be sufficient, because we must run a full inference pass just to get one sample. Next, to get an accurate estimate of the variance for one class, we might need thousands of samples. Then, getting an accurate estimate for all n classes would need on the order of n-thousand samples, where for real-life problems n itself might be around a thousand. If we view our semantic space as continuous, then this sampling method isn’t even tractable. Perhaps we could instead measure uncertainty as the tightness of the bounds of the samples, and treat anything beyond a certain range as “low certainty”.
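As a hypothetical follow-up to the dropout snippet above, that “tightness of the bounds” idea might look something like this; the spread metric and threshold are assumptions for illustration only.

```python
import torch

def is_uncertain(probs, max_spread=0.2):
    """probs: (n_samples, n_classes) tensor of sampled class probabilities.

    Flags 'low certainty' when the empirical spread of the sampled
    probabilities for the predicted class exceeds a fixed bound.
    """
    top_class = probs.mean(dim=0).argmax()
    spread = probs[:, top_class].max() - probs[:, top_class].min()
    return bool(spread > max_spread)
```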
With that said, note that what we really want is a tool for measuring uncertainty over user intents in the semantic space. More specifically, we don’t just need a system whose predictions are calibrated to match the true likelihoods; we want a system whose predictions carry semantic meaning. In other words, we have been viewing uncertainty as a single number assigned to each class, but perhaps we should be viewing certainty as a point within an embedding space. For example, in the restaurant domain, when the model assigns high probability to Japanese food, it should also raise the probability of Chinese and Korean food, because these items are grouped closely together in the “Asian food” cluster. At the same time, if the entity that triggered the prediction was “fish”, then maybe Japanese (sashimi) and Mexican (fish tacos) should increase concurrently, while the prediction shifts away from the Korean node in the embedding space. Notice that this immediately rules out Bayesian Neural Network or Bayes by Backprop approaches, which capture uncertainty over the weights of the network rather than uncertainty over the understanding.
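To make the intuition a bit more concrete, here is a purely illustrative toy in which the belief is a point in an intent embedding space and each intent’s score falls off with distance from that point, so nearby intents rise and fall together; the embeddings, intents, and similarity function below are all made up for the example.

```python
import numpy as np

intent_embeddings = {
    "japanese": np.array([0.9, 0.8]),
    "korean":   np.array([0.8, 0.9]),
    "chinese":  np.array([0.7, 0.8]),
    "mexican":  np.array([-0.8, 0.2]),
}

def belief_over_intents(belief_point, scale=0.5):
    """Turn a point in the embedding space into correlated intent probabilities."""
    names = list(intent_embeddings)
    dists = np.array([np.linalg.norm(belief_point - intent_embeddings[n]) for n in names])
    scores = np.exp(-0.5 * (dists / scale) ** 2)   # RBF-style similarity
    probs = scores / scores.sum()
    return dict(zip(names, probs.round(3)))

# A belief point near the "Asian food" cluster spreads probability across the whole cluster.
print(belief_over_intents(np.array([0.85, 0.85])))
```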
In this sense, rather than attaching an uncertainty score to each intent in the ontology, what we need is an embedding space of intents that still offers a measure of uncertainty. Then, we also need a way to calculate it.