Some notes to remember when building intelligent task-oriented dialogue agents:
- Modularity is important. While end-to-end (E2E) response generation is good, interpretability is better.
- There is a draw to simply use a Seq2Seq approach with an encoder for reading user input and a decoder for generating output. However, the model then behaves more like a language model and fails to perform reasoning.
- Moreover, the hidden state is not human readable. Instead, this should be broken down into Intent Tracking, Policy Management and Text Generation, which allows for better interpretability.
- Additionally, a modular system is easier to maintain, and its duties are easier to delegate, when operating in a realistic industry setting.
- Important not to forget about optimizing auxiliary components
- Intent tracker should include context embedder in addition to utterance embedder
- Intent tracker includes memory cells, likely formulated as a Recurrent Entity Network or Neural Process Network
- Policy manager includes soft knowledge base query mechanism
- Evaluation and data processing should be optimized
- Intent tracking is a set of binary predictors
- Multi-intent utterances occur often. Even multiple slots of the same dialogue act are fairly common, such as “I would like to eat Chinese or Korean food.”
- The model actually works better since each task is now much simpler (each predictor watches for whether the user wants Chinese food, rather than for what the user wants)
- This has been shown to work well in practice (from start-up)
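
A minimal sketch of the binary-predictor idea, assuming a toy keyword scorer in place of a trained model. The candidate list, weights, bias, and the `predict_intents` name are all hypothetical; the only point is that each candidate intent is scored independently, so any combination can fire for a single utterance.

```python
import math

# Hypothetical candidate intents; each one gets its own yes/no predictor.
CANDIDATES = ["inform(food=chinese)", "inform(food=korean)", "request(address)"]

# Toy keyword weights per candidate (assumed values, not learned).
WEIGHTS = {
    "inform(food=chinese)": {"chinese": 4.0},
    "inform(food=korean)": {"korean": 4.0},
    "request(address)": {"address": 4.0, "where": 3.0},
}
BIAS = -2.0  # assumed prior: every intent is "off" unless evidence appears


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_intents(utterance, threshold=0.5):
    """Run every binary predictor and return the intents that fire."""
    cleaned = utterance.lower().replace(".", " ").replace(",", " ")
    tokens = set(cleaned.split())
    active = []
    for candidate in CANDIDATES:
        evidence = sum(w for tok, w in WEIGHTS[candidate].items() if tok in tokens)
        if sigmoid(BIAS + evidence) >= threshold:
            active.append(candidate)
    return active
```

With these toy weights, the example utterance from the notes activates both food candidates while the address request stays off.
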
- Intent output should be act(slot-relation-value):
- for example: inform(food = korean), request(address = the_missing_sock), inform(rating > 3), accept(offer = the_missing_sock), inform(date > today), answer(confirm = yes)
- a range between two values (such as a price range) can be written as inform(price > 3) and inform(price < 6)
- this is all possible because the binary predictors allow for arbitrary combinations
- semantic parsing is overly complex (hard for machines to perform and hard for people to interpret), and it does not necessarily give better information to the policy manager
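
To make the act(slot-relation-value) format concrete, here is a small parsing sketch. The `Intent` tuple, the regular expression, and the `parse_intent` and `satisfies` helpers are hypothetical names chosen for illustration, and the sketch assumes only the three relations `=`, `<` and `>` that appear in the examples above.

```python
import re
from typing import NamedTuple


class Intent(NamedTuple):
    act: str
    slot: str
    relation: str
    value: str


# act(slot relation value), with optional whitespace around each part.
_PATTERN = re.compile(r"^\s*(\w+)\(\s*(\w+)\s*(=|<|>)\s*(\w+)\s*\)\s*$")


def parse_intent(text):
    """Parse a string such as 'inform(rating > 3)' into an Intent."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not in act(slot relation value) form: {text!r}")
    return Intent(*match.groups())


def satisfies(intents, slot, candidate):
    """Check a numeric candidate against every constraint on `slot`."""
    for intent in intents:
        if intent.slot != slot:
            continue
        bound = float(intent.value)
        if intent.relation == ">" and not candidate > bound:
            return False
        if intent.relation == "<" and not candidate < bound:
            return False
        if intent.relation == "=" and candidate != bound:
            return False
    return True
```

A price range then becomes two independent intents, for example `[parse_intent("inform(price > 3)"), parse_intent("inform(price < 6)")]`, and `satisfies` accepts a candidate only if both hold.
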
- Dialogue Acts are five pairs of items which together constitute a MECE set
- request/inform
- open/close
- accept/reject
- question/answer
- acknow/confuse
- MECE = mutually exclusive, collectively exhaustive
- Full Dialogue State (to be fed into RL agent) includes
- Five items:
- previous agent actions
- current user intent
- full frame of possible slots-value pairs
- turn count
- KB results
- Context vector is stored for Intent Tracker, but not for Policy Manager
- Markov property: previous information, such as the order of past “informs”, is not needed
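
One way to picture the five-item state is as a plain record that gets flattened into a feature vector for the RL agent. This is a sketch under assumptions: the field names, the `to_features` method, the vocabularies and the `max_turns` cap are all hypothetical, and a real system would likely use learned embeddings rather than indicator features. Note that no ordered history of past informs is kept, in line with the Markov property.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DialogueState:
    previous_agent_actions: List[str]
    current_user_intent: List[str]
    frame: Dict[str, Optional[str]]  # slot -> value, None if unfilled
    turn_count: int
    kb_results: List[dict] = field(default_factory=list)

    def to_features(self, action_vocab, intent_vocab, slot_names, max_turns=20):
        """Flatten the state into a fixed-length list of floats."""
        feats = [1.0 if a in self.previous_agent_actions else 0.0
                 for a in action_vocab]
        feats += [1.0 if i in self.current_user_intent else 0.0
                  for i in intent_vocab]
        # Only record whether each slot is filled, not its value.
        feats += [1.0 if self.frame.get(s) is not None else 0.0
                  for s in slot_names]
        feats.append(min(self.turn_count, max_turns) / max_turns)
        feats.append(float(len(self.kb_results)))
        return feats
```
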
- In order to measure uncertainty, a distributed soft approximation of the dialogue state is necessary
- memory stored as neural embedding
- a pure softmax has been shown to be overly confident; more research is needed on how to better measure “uncertainty”
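
The sketch below illustrates one reason softmax confidence is hard to trust: scaling the logits leaves the ranking unchanged but makes the top probability arbitrarily close to 1. The `temperature` parameter and the use of Shannon entropy as an uncertainty score are assumptions added for illustration; they are not offered as an answer to the open question above.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; higher temperature flattens the output."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; 0 means the distribution is fully certain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Same ranking of values, but the second set of logits is 5x larger.
mild = softmax([2.0, 1.0, 0.0])
sharp = softmax([10.0, 5.0, 0.0])
```

Here `sharp` puts far more mass on the first value than `mild` does, even though both encode the same preference order, and `entropy(sharp)` is correspondingly lower.
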
- In order to increase accuracy, model should ask for clarification:
- conventional clarification request (question paraphrase) - what did you want?
- partial clarification requests (ask for relevant knowledge) - what was the area you mentioned?
- confirmation through mention of alternatives (knowledge verification) - did you say the north part of town?
- reformulation of information (question verification) - so basically you want asian food, right?
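
As a rough illustration of choosing between clarification types, the sketch below picks one based on how confident the tracker is about a slot. Everything here is an assumption: the `choose_clarification` name, the belief dictionary, and both thresholds are hypothetical, and only three of the four types are covered (reformulation would need a summary of the whole request).

```python
def choose_clarification(slot, belief, confident=0.8, plausible=0.4):
    """Return (kind, question), or None when no clarification is needed.

    `belief` maps candidate values for `slot` to probabilities.
    The thresholds are illustrative assumptions, not tuned values.
    """
    if not belief:
        # Nothing was understood: ask the user to restate the request.
        return ("conventional", "What did you want?")
    value, prob = max(belief.items(), key=lambda kv: kv[1])
    if prob >= confident:
        return None  # confident enough to proceed without asking
    if prob >= plausible:
        # A leading candidate exists: verify it with the user.
        return ("confirmation", f"Did you say {value}?")
    # The slot was mentioned but no value stands out: ask about the slot.
    return ("partial", f"What was the {slot} you mentioned?")
```

Returning `None` for confident beliefs is what keeps the agent from asking irrelevant questions.
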
- Good dialogue models have the following attributes
- works across multiple turns, which distinguishes it from QA bots
- works with a knowledge base, which distinguishes it from chatbots
- knows whether to clarify and what type of clarification to employ using an expected entropy maximization objective (i.e., it does not ask irrelevant questions and annoy the user)
- covers the majority of real-world scenarios through use of a user simulator capable of generating novel examples
- user simulator allows for fast training, since real users are expensive in time and money
- user simulator should be dynamic, meaning it should be trainable itself
- user simulator should output realistic user utterances through use of a GAN which discriminates against model-generated text
- user simulator should be smart about switching between offering real text vs generated text as training progresses
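
The switching behaviour in the last point could be sketched as a simple schedule. This assumes, without support from the notes, that training starts on real text and moves linearly toward generated text; the function names, the `start` and `end` probabilities, and the `generate` callback are all hypothetical.

```python
import random


def generated_text_probability(step, total_steps, start=0.0, end=0.9):
    """Linearly interpolate the chance of serving generated text."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def pick_utterance(step, total_steps, real_pool, generate, rng):
    """Serve a generated utterance with the scheduled probability,
    otherwise sample a real one from the pool."""
    if rng.random() < generated_text_probability(step, total_steps):
        return generate()
    return rng.choice(real_pool)


# Example usage with a stand-in generator (a real one would be the simulator).
rng = random.Random(0)
real_pool = ["I would like Korean food.", "What is the address?"]
first = pick_utterance(0, 1000, real_pool, lambda: "<generated>", rng)
```

At step 0 the scheduled probability is 0, so `first` is always drawn from the real pool. A learned or performance-based schedule would be a natural replacement for the linear ramp.
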