As I tinker with dialog systems at the Allen Institute for Artificial Intelligence, primarily by prototyping Alexa skills, I often wonder what AI is still lacking to build good conversational systems, punting the social challenge to another day. This post is my take on where AI has a good chance to improve and consequently, what we can expect from the next wave of conversational systems.
It won’t be an easy march, though, once we get to the nitty-gritty details. For example, I heard through the grapevine that when Starbucks looked at the voice data they collected from customer orders, they found that there are a few million unique ways to order. (For those in the field, I’m talking about unique user utterances.) This is to be expected given the wild combinations of latte vs mocha, dairy vs soy, grande vs trenta, extra-hot vs iced, room vs no-room, for here vs to-go, snack variety, spoken accent diversity, etc. The AI practitioner will soon curse all these dimensions before taking a deep learning breath and getting to work. I feel, though, that given practically unlimited data, deep learning is now good enough to overcome this problem, and it is only a matter of a couple of years until we see these TODA solutions deployed. One technique to watch is Generative Adversarial Nets (GANs). Roughly speaking, a GAN plays an iterative game with itself: it counterfeits real data, gets caught by the police neural network, improves its counterfeiting skill, and repeats until, given enough data and iterations, it can pass as your Starbucks order-taker.
Sentiment analysis in machine learning uses language analytics to determine the attitude or emotional state of the person a chatbot is speaking to in any given situation. This has proven difficult for even the most advanced chatbots, which often cannot interpret certain questions and comments from context. Developers are building these bots to automate a wider range of processes in an increasingly human-like way, and to keep learning and improving over time.
This is a lot less complicated than it appears. Given a set of sentences, each belonging to a class, and a new input sentence, we can count the occurrences of each input word in each class, account for its commonality, and assign each class a score. Factoring in commonality is important: matching the word “it” is considerably less meaningful than matching the word “cheese”. The class with the highest score is the one the input sentence most likely belongs to. This is a slight oversimplification, as words need to be reduced to their stems, but you get the basic idea.
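The scoring idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production classifier: the training sentences, the class names, and the helper names (`train`, `score`, `classify`) are all made up for the example, the commonality weighting is assumed to be a simple 1 / (corpus count) per matched word, and stemming is skipped entirely.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: (sentence, class) pairs.
TRAINING = [
    ("how is it going today", "greeting"),
    ("hello is it a good day", "greeting"),
    ("make it a cheese sandwich", "sandwich"),
    ("can you make lunch", "sandwich"),
]

def tokenize(sentence):
    # Lowercase and split on whitespace; a real system would also stem.
    return sentence.lower().split()

def train(examples):
    # Count how often each word appears per class and in the whole corpus.
    class_words = defaultdict(Counter)
    corpus_words = Counter()
    for sentence, label in examples:
        for word in tokenize(sentence):
            class_words[label][word] += 1
            corpus_words[word] += 1
    return class_words, corpus_words

def score(sentence, label, class_words, corpus_words):
    # Each matched word contributes more the rarer it is in the corpus,
    # so "cheese" (seen once) outweighs "it" (seen three times).
    total = 0.0
    for word in tokenize(sentence):
        if word in class_words[label]:
            total += 1.0 / corpus_words[word]
    return total

def classify(sentence, class_words, corpus_words):
    # Score every known class and return the best one with all scores.
    scores = {
        label: score(sentence, label, class_words, corpus_words)
        for label in list(class_words)
    }
    best = max(scores, key=scores.get)
    return best, scores

class_words, corpus_words = train(TRAINING)
label, scores = classify("it has cheese", class_words, corpus_words)
print(label, scores)
```

In this toy corpus the word "it" matches both classes but adds little to either score, while "cheese" matches only one class and dominates the result, which is the commonality effect described above.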
Derived from “chat robot”, "chatbots" allow for highly engaging conversational experiences, through voice and text, that can be customized and used on mobile devices, in web browsers, and on popular chat platforms such as Facebook Messenger or Slack. With the advent of deep learning technologies such as text-to-speech, automatic speech recognition, and natural language processing, chatbots that simulate human conversation and dialogue can now be found in call center and customer service workflows, in DevOps management, and as personal assistants.
A chatbot works in one of two ways: with set guidelines or with machine learning. A chatbot that functions with a set of guidelines in place is limited in its conversation. It can only respond to a set number of requests and a fixed vocabulary, and is only as intelligent as its programming code. An example of a limited bot is an automated banking bot that asks the caller some questions to understand what the caller wants done. The bot would issue a prompt like “Please tell me what I can do for you by saying account balances, account transfer, or bill payment.” If the customer responds with "credit card balance," the bot would not understand the request and would either repeat the prompt or transfer the caller to a human assistant.
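To make the guideline-based behavior concrete, here is a minimal sketch of such a banking bot in Python. The menu replies, the `respond` function, and the retry limit of two are assumptions made for illustration; only the prompt wording comes from the example above.

```python
# Hypothetical fixed menu: the only requests this bot understands.
MENU = {
    "account balances": "Reading your account balances.",
    "account transfer": "Starting an account transfer.",
    "bill payment": "Starting a bill payment.",
}

PROMPT = (
    "Please tell me what I can do for you by saying "
    "account balances, account transfer, or bill payment."
)

# Assumed policy: hand off to a human after two failed attempts.
MAX_FAILED_ATTEMPTS = 2

def respond(utterance, failed_attempts=0):
    """Return (reply, updated_failed_attempts) for one caller utterance."""
    key = utterance.strip().lower()
    if key in MENU:
        # Exact match against the menu: handle it and reset the counter.
        return MENU[key], 0
    if failed_attempts + 1 >= MAX_FAILED_ATTEMPTS:
        # Too many misses: transfer the caller to a human assistant.
        return "Transferring you to a human assistant.", 0
    # Otherwise repeat the prompt and remember the failed attempt.
    return PROMPT, failed_attempts + 1

reply, misses = respond("credit card balance")
print(reply)   # the bot repeats its prompt
reply, misses = respond("credit card balance", misses)
print(reply)   # the bot gives up and transfers the caller
```

Because the bot only matches exact menu phrases, "credit card balance" falls through even though it is close to "account balances", which is exactly the limitation described above.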
There are situations where chatbots work well, however, if you are able to recognize the limitations of chatbot technology. The real value of chatbots comes from limited workflows, such as simple question-and-answer or trigger-and-action functionality, and that’s where the technology is really shining. People tend to want to find answers without the need to talk to a real person, so organizations are enabling their customers to seek help however they please. Mastercard allows users to check in on their accounts by messaging its bot. Whole Foods uses a chatbot to help its customers easily surface recipes, and Staples partnered with IBM to create a chatbot that answers general customer inquiries about orders, products and more.
3. Now, since ours is a conversational AI bot, we need to keep track of the conversation so far in order to predict an appropriate response. For this purpose, we need a persistent dictionary object that holds the current intent, the current entities, the information the user provided in response to the bot’s previous questions, the bot’s previous action, and the results of the API call (if any). This information constitutes our input X, the feature vector. The target y that the dialogue model is trained on is ‘next_action’ (the next_action can simply be a one-hot encoded vector over the actions that we define in our training data).
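One way to picture this step is the sketch below, which turns a dialogue-state dictionary into a feature vector X and a one-hot target y. Everything specific here is an assumption for illustration: the intent, entity, and action vocabularies, the dictionary keys, and the `featurize` helper are hypothetical, and a real bot would derive them from its own training data.

```python
# Hypothetical vocabularies; a real bot would take these from its training data.
INTENTS = ["greet", "book_table", "goodbye"]
ENTITIES = ["cuisine", "location"]
ACTIONS = ["utter_greet", "ask_cuisine", "ask_location", "call_api", "utter_goodbye"]

def one_hot(value, vocabulary):
    # 1 at the position of `value`, 0 everywhere else.
    return [1 if item == value else 0 for item in vocabulary]

def featurize(state):
    """Flatten the dialogue-state dictionary into a feature vector X."""
    x = one_hot(state["intent"], INTENTS)                          # current intent
    x += [1 if e in state["entities"] else 0 for e in ENTITIES]    # entities in this turn
    x += [1 if e in state["slots"] else 0 for e in ENTITIES]       # info persisted from earlier turns
    x += one_hot(state["previous_action"], ACTIONS)                # bot's previous action
    x.append(1 if state["api_result"] is not None else 0)          # did an API call return anything?
    return x

# Example state after the user answers the bot's question about cuisine.
state = {
    "intent": "book_table",
    "entities": {"cuisine": "thai"},
    "slots": {"cuisine": "thai"},
    "previous_action": "ask_cuisine",
    "api_result": None,
}

X = featurize(state)                   # input feature vector
y = one_hot("ask_location", ACTIONS)   # target: the next action to take
print(X)
print(y)
```

Each (X, y) pair built this way is one training example for the dialogue model; the state dictionary itself is what gets persisted between turns.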
In 1950, Alan Turing's famous article "Computing Machinery and Intelligence" was published,[7] which proposed what is now called the Turing test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably, on the basis of the conversational content alone, between the program and a real human. The notoriety of Turing's proposed test stimulated great interest in Joseph Weizenbaum's program ELIZA, published in 1966, which seemed to be able to fool users into believing that they were conversing with a real human. However, Weizenbaum himself did not claim that ELIZA was genuinely intelligent, and the introduction to his paper presented it more as a debunking exercise: