Network-based speech recognition accessed through an API
(Originally published in Speech Strategy News, August 2012)
AT&T Research (formerly Bell Labs) has been involved in speech technology research for many decades; for example, it developed a continuous digit recognizer in the 1950s. (See the interview with Mazin Gilbert, AVP of the Intelligent Systems Organization, AT&T Research, SSN, May 2012, p. 15.) The company’s speech recognition technology has found a home in some deployed applications, including the Vlingo voice assistant that is part of the new Samsung Galaxy S III, where the assistant is called S-Voice (SSN, July 2012, p. 1). (Vlingo is now part of Nuance, so the technology used in the Samsung phone lines may eventually evolve to Nuance technology.) Among other applications, AT&T speech technology has been used within AT&T for IVR customers for over 20 years.
The AT&T Watson speech technology has now been made available to developers as a network-based service accessed through an Application Programming Interface (API) that AT&T recently released. Gilbert summarized in a note to Speech Strategy News: “By exposing the speech APIs, we are lowering the barrier to entry for developers to empower their applications with speech. The responses we have received so far have been overwhelming. Our plans don’t stop here. We will continue to expose additional APIs and innovations to enable developers to create more advanced and personalized mobile applications ranging from virtual assistants to interactive gaming. Stay tuned!”
There is a registration charge of $99 for developers, Gilbert indicated, which will allow developers to use all AT&T APIs, including speech, as they become available, without a per-transaction charge through 2012. Gilbert said that AT&T is working on pricing beyond 2012, and current projections have pricing at about one cent for most “small transcriptions.” He said AT&T will review pricing as 2013 approaches, but he does not anticipate pricing “going anywhere but down.” (More detailed pricing information is available online. AT&T also sent out an eblast with a discount code that waives the $99 fee through August.)
AT&T Watson is a network-based engine that integrates a variety of speech capabilities, including speaker-independent speech recognition, AT&T Natural Voices text-to-speech, speaker verification, natural language understanding, LLAMA-based machine learning, search, translation, and dialog management. AT&T says that the Watson speech engine continuously improves accuracy by learning different accents and speech patterns. Watson can combine speech with other modalities, such as a touch-screen tap (“show me the closest coffee shop to here”) or other gesture (see figure). AT&T said in advertising material that it has accumulated more than 600 patents on the AT&T Watson technology.
Watson uses a plugin architecture where each subtask is contained in its own plugin. Depending on the task to be performed, Watson selects the right plugins at run time, assembles them into a working engine, and coordinates the information exchange between the plugins. It also handles communication with the end device.
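The plugin approach described above can be illustrated with a minimal Python sketch. This is not AT&T's implementation: the plugin names, the task-to-plugin mapping, and the placeholder outputs are all hypothetical, chosen only to show how an engine might be assembled at run time from the plugins a task requires.

```python
from typing import Callable, Dict, List

# A plugin takes the data gathered so far and returns it with its own result added.
Plugin = Callable[[dict], dict]

def recognize(data: dict) -> dict:
    # Stand-in for speech recognition (hypothetical fixed result).
    return {**data, "text": "closest coffee shop"}

def understand(data: dict) -> dict:
    # Stand-in for natural language understanding (hypothetical fixed result).
    return {**data, "intent": "local_search"}

def synthesize(data: dict) -> dict:
    # Stand-in for text-to-speech: wraps the text in a fake audio marker.
    return {**data, "audio_out": f"<speech:{data['text']}>"}

# Hypothetical registry of available plugins.
PLUGINS: Dict[str, Plugin] = {"asr": recognize, "nlu": understand, "tts": synthesize}

# Hypothetical mapping from a task to the ordered plugins it needs.
TASKS: Dict[str, List[str]] = {
    "transcribe": ["asr"],
    "voice_search": ["asr", "nlu"],
    "read_aloud": ["tts"],
}

def build_engine(task: str) -> Plugin:
    """Select the plugins for a task at run time and chain them into one engine."""
    stages = [PLUGINS[name] for name in TASKS[task]]

    def engine(data: dict) -> dict:
        for stage in stages:
            data = stage(data)  # each plugin's output feeds the next plugin
        return data

    return engine

if __name__ == "__main__":
    print(build_engine("voice_search")({"audio": b"..."}))
```

In this sketch the engine only passes a dictionary from one plugin to the next; a real system would also have to manage the exchange with the end device, which the article says Watson handles.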
However, only speech recognition (speech-to-text transcription with Statistical Language Models that are tuned for specific “contexts”) is available initially with the current API. The API allows sending audio and receiving back text. AT&T indicated that native and HTML5-based Software Development Kits (SDKs) would be available “soon.”
The contexts make the speech recognition more accurate and also support specialized vocabularies. The available contexts include:
- Generic speech-to-text (general dictation, automatically detects English or Spanish, and returns the appropriate text transcription);
- Web Search speech-to-text;
- Local Business Search speech-to-text;
- SMS (text message) speech-to-text;
- Question Transcription (converts questions to text); and
- TV speech-to-text (for AT&T’s U-verse video programming guide).
The contexts are language models built, maintained, and tuned by AT&T.
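To make the "send audio, receive text" model concrete, the following Python sketch shows how a client might package audio together with a chosen context. It makes no network call and does not reproduce AT&T's actual interface: the URL, the header names, the context identifier strings, and the access token are placeholders assumed for illustration, and the real values would have to come from AT&T's developer documentation.

```python
from dataclasses import dataclass
from typing import Dict

# Context identifiers loosely based on the list above; the exact strings are assumed.
CONTEXTS = {"Generic", "WebSearch", "BusinessSearch", "SMS", "QuestionAndAnswer", "TV"}

@dataclass
class SpeechRequest:
    url: str
    headers: Dict[str, str]
    body: bytes

def build_request(audio: bytes, context: str = "Generic",
                  token: str = "ACCESS_TOKEN") -> SpeechRequest:
    """Describe an HTTP POST carrying audio plus the context to recognize it in."""
    if context not in CONTEXTS:
        raise ValueError(f"unknown speech context: {context}")
    return SpeechRequest(
        url="https://api.example.com/speech-to-text",  # placeholder endpoint
        headers={
            "Authorization": f"Bearer {token}",  # placeholder credential
            "Content-Type": "audio/wav",         # assumed audio format
            "X-Speech-Context": context,         # hypothetical header name
        },
        body=audio,
    )

if __name__ == "__main__":
    req = build_request(b"RIFF....WAVE", context="SMS")
    print(req.url, req.headers["X-Speech-Context"], len(req.body))
```

Choosing the context per request, rather than once per application, would let a single app use, say, the SMS model for dictating messages and the local business search model for finding a nearby store.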
AT&T is also offering the AT&T Application Resource Optimizer (ARO) as open source code. ARO is a free diagnostic tool that helps to optimize a mobile app’s performance, speed, network impact, and battery utilization.