What does it take for mobile personal assistants to “understand” us?

Discussions of computers understanding human language always seem to evoke “artificial intelligence,” along with science-fiction images of powerful computers challenging humans. Apple’s Siri interprets speech to try to understand the goal of the mobile phone’s owner, and does a remarkable job in some contexts. By interpreting speech, rather than just displaying the words spoken, it goes beyond pure speech recognition. What part of the system does the understanding? And how far can this technology evolve?

A December analyst briefing by Nuance Communications (which provides the speech recognition part of Siri) included a figure (below) that shows potential pieces of technology that can contribute to a system that responds intelligently to “natural language,” whether spoken or typed. The limitations in interpreting natural language are suggested by the boxes in the figure labeled “Statistical Training” and “Knowledge Representation.” Statistical training uses examples of speech or text whose correct interpretation is known because humans have identified it: both the speech-to-text transcription and the text interpretation use examples of the “right answer” to extrapolate to new cases. The methodology does not attempt to emulate the way our minds work, but delivers what might be called “embedded intelligence,” extrapolating from examples provided by humans. Doing this effectively requires many examples and a lot of computing power.
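The core idea of extrapolating from human-labeled examples can be illustrated with a deliberately tiny sketch. This is not Nuance’s pipeline, and real systems train on millions of utterances with far more sophisticated statistics; the data, labels, and functions below are all hypothetical, just enough to show how labeled examples generalize to an utterance the system has never seen.

```python
from collections import Counter

# Toy training data: utterances paired with human-identified "right answers."
# (Illustrative only; production systems use vast labeled corpora.)
LABELED = [
    ("what is the weather today", "weather"),
    ("will it rain tomorrow", "weather"),
    ("set an alarm for seven", "alarm"),
    ("wake me up at six thirty", "alarm"),
]

def train(examples):
    """The 'statistical training' step: count word frequencies per label."""
    model = {}
    for text, label in examples:
        model.setdefault(label, Counter()).update(text.split())
    return model

def classify(model, text):
    """Extrapolate to a new utterance by scoring word overlap per label."""
    words = text.split()
    def score(label):
        return sum(model[label][w] for w in words)
    return max(model, key=score)

model = train(LABELED)
print(classify(model, "is it going to rain today"))  # -> weather
```

Even though “is it going to rain today” never appears in the training data, the statistics of the examples carry it to the right interpretation, which is the extrapolation the briefing’s figure labels “Statistical Training.”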

Computer processing can thus interpret human language effectively in domains where it has sufficient examples. Computer intelligence is certainly not as flexible as human intelligence, and a computer doesn’t “understand” what it is interpreting in the sense humans do. Nuance uses the term “Application Intelligence,” perhaps a repositioning of “AI.” Vlad Sejnoha, Nuance’s Chief Technology Officer, noted in the December briefing that natural language understanding is not just an input technology; it also improves output (communicating with the user) through better composition of the message in dialog and through more natural-sounding text-to-speech synthesis.

Understanding what we want is only half the battle. “Knowledge representation” in the figure refers to computers helping us with the task of converting raw information into insights. Representing knowledge in a compact and accessible form is critical to getting us the answer we seek, particularly in a world where the quantity of information (and misinformation) is exploding.

Any given human may have deep knowledge and understanding of specific subjects based on his or her interests, vocation, and education. But the collective intelligence of humans is much too broad for any individual to know it all. Inventions such as the World Wide Web and search engines allow us to use computers as a tool to expand our ability to access this valuable information. But, as these information sources grow, it becomes a significant task to extract answers from raw information.

The particular approach Nuance is incorporating (in an upcoming service it calls Prodigy) is based in part on work the company is doing with IBM, using some of the technology behind IBM’s Watson. Watson gained wide attention when the computer competed on the TV quiz show Jeopardy and beat two champion contestants. The key capability of the Watson technology is analyzing multiple text databases so that the information they contain can be organized, integrated, annotated by context, and accessed more directly and quickly than by searching text for keywords. The resulting representation of the knowledge in that data can then be matched against the result of natural language processing of questions. There is, of course, a bridge between input interpretation and knowledge representation that restates the inquiry in a form the knowledge representation component can use. Nuance’s diagram indicates the many components that can contribute to this process.
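The “organize up front, match questions later” pattern can be sketched with a toy retrieval model. To be clear, Watson’s actual technology is far richer (deep question analysis, evidence scoring, confidence estimation); the three-passage corpus and the simple TF-IDF weighting below are my own illustrative stand-ins, showing only why an organized index beats raw keyword lookup.

```python
import math
from collections import Counter

# A toy corpus standing in for the text databases Watson analyzes.
PASSAGES = [
    "a deficiency of gnrh is associated with kallmann syndrome",
    "aspirin is commonly used to reduce fever and inflammation",
    "the eiffel tower is located in paris france",
]

def build_index(passages):
    """Organize the corpus in advance: per-passage term counts plus
    document frequencies, so questions can be matched quickly later."""
    docs = [Counter(p.split()) for p in passages]
    df = Counter(w for d in docs for w in d)
    return docs, df

def answer(question, docs, df, passages):
    """Match the processed question against the index, weighting rare
    (informative) words more heavily than common ones (TF-IDF)."""
    n = len(passages)
    q = [w.strip("?.,!").lower() for w in question.split()]
    def score(i):
        return sum(docs[i][w] * math.log(n / df[w]) for w in q if w in df)
    best = max(range(n), key=score)
    return passages[best]

docs, df = build_index(PASSAGES)
print(answer("Is a hormone deficiency associated with Kallmann syndrome?", docs, df, PASSAGES))
```

Note how common words like “is” contribute nothing (they appear in every passage), while “kallmann” and “deficiency” pull the question straight to the relevant passage; that weighting by context is one small piece of what “organized, integrated, annotated” buys over keyword search.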

On the knowledge representation side, IBM has created the Strategic IP Insight Platform, or SIIP, for a healthcare application. SIIP scans pharmaceutical patents and biomedical journals to discover and analyze information pertaining to drug discovery. IBM has cataloged 2.5 million chemical compounds from 4.7 million patents and 11 million journal articles published between 1976 and 2000. The technology goes well beyond classical keyword search, using context and natural language interpretation to create focused results.

Nuance CEO Paul Ricci emphasized this broad view of natural language processing—both text and speech—as levers for the company’s growth in 2012, with a new generation of natural language solutions, virtual assistants like Siri in mobile (e.g., Nuance’s Dragon Go!), and cloud computing (performing the most advanced of these capabilities as a network-based service). Ricci characterized the emphasis as “moving from recognition to outcomes.”

Sejnoha, in the same briefing, claimed that natural language understanding has become powerful enough that a user can now speak a single utterance and have the software find the appropriate application (e.g., web search), launch it, enter the required text into it, and return the application’s response. An example he gave was medical: “Is a hormone deficiency associated with Kallmann’s syndrome?” was answered, “Yes. A deficiency of GnRH is associated with Kallmann’s syndrome” (with source evidence listed). Another example is included in the figure above. I will summarize this one-step goal of natural language processing as Direct-To-Content (DTC).
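The one-utterance flow Sejnoha describes can be sketched as a small router: interpret the utterance, pick an application, hand it the extracted text, and return that application’s response in a single step. The handlers and patterns below are hypothetical placeholders (a real DTC system would use statistical intent classification and launch actual applications, not regular expressions and stub functions).

```python
import re

# Hypothetical application handlers -- stand-ins for launching a real
# app, entering the query into it, and returning its response.
def web_search(query):
    return f"Searching the web for: {query}"

def set_reminder(query):
    return f"Reminder created: {query}"

# Pattern -> handler table. Illustrative only: a production system would
# replace these hand-written patterns with a trained intent classifier.
ROUTES = [
    (re.compile(r"remind me to (?P<q>.+)", re.I), set_reminder),
    (re.compile(r"(?:search for|look up) (?P<q>.+)", re.I), web_search),
]

def direct_to_content(utterance):
    """One utterance in, one answer out: pick the app, pass it the
    extracted text, and return the app's response."""
    for pattern, handler in ROUTES:
        match = pattern.search(utterance)
        if match:
            return handler(match.group("q"))
    return web_search(utterance)  # fall back to a general search

print(direct_to_content("Remind me to call the pharmacist"))
```

The point of the sketch is the shape of DTC, not the mechanism: the user never chooses an application, types into it, or reads through a results page; the system collapses all of that into one request and one response.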

While Nuance is the company most vocal about this evolving goal of natural language understanding, Apple’s Siri takes the same approach, trying to answer a spoken request as directly and succinctly as possible. DTC is most evident with Siri when the content is in an Apple application such as the built-in reminder/calendar program, but it applies to Web searches as well.

Google recognizes this challenge. Google executive chairman Eric Schmidt, in a letter to the Senate Subcommittee on Antitrust, Competition Policy and Consumer Rights, said Apple’s Siri is a threat to his company’s search business, especially on mobile. He writes: “Apple’s Siri is a significant development–a voice-activated means of accessing answers through iPhones that demonstrates the innovations in search…History shows that popular technology is often supplanted by entirely new models.” He referred to Siri as a “search and task-completion service.”

And Microsoft has both the speech and natural language technology in-house to expand its search activity into the DTC model. I expect we will hear more from Microsoft about its cloud-based Tellme operation this month.

Sejnoha emphasized that these algorithms are increasingly statistical. The techniques can automatically discover patterns and become more accurate over time. Historically, he noted, natural language understanding systems were mostly hand-built, with engineers and linguists sitting down and writing grammar rules, resulting in systems that didn’t handle the less common cases well. Machine learning and statistical pattern recognition techniques, he said, have substantial advantages when adequate data is available, as is the case today.
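The brittleness Sejnoha attributes to hand-built systems is easy to demonstrate. In this toy contrast (my own illustration, with made-up rules and data), a hand-written grammar rule handles exactly the phrasing its author anticipated, while even a crude statistical match from labeled vocabulary survives a less common paraphrase.

```python
import re

# A hand-written grammar rule: precise on the phrasing its author
# anticipated, brittle on everything else.
RULE = re.compile(r"^what is the weather in (\w+)$")

print(bool(RULE.match("what is the weather in boston")))     # True
print(bool(RULE.match("how's boston looking weather-wise")))  # False

# The statistical alternative: score utterances against vocabulary
# drawn from labeled examples. (Toy data, toy model.)
EXAMPLES = {
    "weather": "what is the weather forecast rain sunny temperature",
    "music":   "play song music artist album track",
}

def guess_intent(utterance):
    words = utterance.lower().replace("-", " ").split()
    vocab = {label: set(text.split()) for label, text in EXAMPLES.items()}
    return max(vocab, key=lambda label: sum(w in vocab[label] for w in words))

print(guess_intent("how's boston looking weather-wise"))  # -> weather
```

The paraphrase that the grammar rule rejects outright still lands on the right intent statistically, which is why, given enough data, learned systems treat the less common cases better than rule catalogs do.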

So, on the one hand, today’s speech recognition and natural language processing are not comparable to humans in terms of true understanding, and the methods used do not attempt to mimic the way human brains operate. On the other hand, computer-based methods can take advantage of computer technology’s almost unlimited capacity to store and analyze large bodies of information, providing a tool that extends human capabilities. A mobile personal assistant with access to that power will become something we don’t want to do without.


3 thoughts on “What does it take for mobile personal assistants to “understand” us?”

  1. Thanks for the Siri review.

    As we all know, science advances by finding the failures of our models, rather than in celebrating our successes. It would be interesting to review and catalog the failures of Siri, and to examine the model’s “statistical” solutions to see how far it can be pushed. The question is really whether Siri is a smart search toy, or an advance in our understanding.


  2. Bill,

    It is indeed interesting to see how speech recognition for input is becoming logically separated from output, which can also be voice or visual (text, graphics, etc.). The rapid adoption of multi-modal mobile devices for both communication and information is driving the need for having such choices in user interfaces, depending upon the individual end user’s environmental circumstances or personal preferences.
