Inside Siri's brain: The challenges of extending Apple's virtual assistant
Siri is, by far, my favorite recent addition to iOS. In an age when electronic devices keep getting smaller, faster, and thinner, the humble keyboard feels increasingly like a relic of a bygone era—an era when computers were designed to occupy entire rooms instead of the palm of a hand. Apple’s virtual assistant has changed the way I interact with my mobile devices in a small but significant way.
Alas, Siri is frustratingly limited. It’s integrated well enough with Apple’s own apps, as well as with services that the company decides to support, but it’s otherwise nearly impossible to use with third-party software of any kind. That’s disappointing, because third-party integration is exactly where Siri could be a game changer—especially for people who have difficulty interacting with a normal keyboard because of a disability.
In an ideal world, Siri would be the primary way for me to interact with many aspects of my iPhone and iPad, and the keyboard would be available as a backup when needed. I’m sure plenty of developers would love to be able to take advantage of Siri, if only Apple would make it possible for them to do so. Unfortunately, the technology behind Siri makes that a significant challenge for the company.
There and back again
What we know as “Siri” is not just an app built into our phones and tablets, but rather a collection of software and Internet-based services that Apple operates in cooperation with a number of partners. By keeping the majority of the functionality server-side, the company can offload much of the work to powerful computers rather than taxing the limited resources of its mobile devices; plus, Apple can use the data it collects to continuously improve the service and offer new functionality without having to release a new version of iOS.
When you issue Siri a command, your device is mainly responsible for collecting the sound of your voice and converting it into an audio file, which it then sends to Apple’s data center for processing. This is not as trivial a task as it sounds—you’d be surprised at how much noise a microphone picks up, even when you’re in what seems like a quiet environment. For this reason, Apple has been investing heavily in technologies that make that sound as clear as possible: Most recent iOS devices feature multiple microphones, along with sophisticated hardware that analyzes the mics’ input to produce a signal scrubbed of most of its noise—a cleaner signal requires less data to transmit and is easier to process.
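For a sense of what that client-side hand-off involves, here is a minimal sketch of how an ordinary iOS app might capture a short voice command to a compressed file using AVFoundation. The format, sample rate, and file name are illustrative choices of mine, not a description of what Siri actually records or transmits.

```swift
import AVFoundation

// Illustrative only: start recording a short voice command to a compressed
// file, the kind of artifact a client could upload for server-side recognition.
// Requires microphone permission (NSMicrophoneUsageDescription in Info.plist);
// the caller must keep the returned recorder alive until recording finishes.
func startRecordingCommand(seconds: TimeInterval = 5) throws -> AVAudioRecorder {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.record, mode: .measurement, options: [])
    try session.setActive(true)

    let fileURL = FileManager.default.temporaryDirectory
        .appendingPathComponent("command.m4a")
    let settings: [String: Any] = [
        AVFormatIDKey: Int(kAudioFormatMPEG4AAC),  // compressed audio: less data to send
        AVSampleRateKey: 16_000,                   // speech needs far less than CD-quality rates
        AVNumberOfChannelsKey: 1                   // mono is enough for a voice command
    ]
    let recorder = try AVAudioRecorder(url: fileURL, settings: settings)
    _ = recorder.record(forDuration: seconds)      // stops on its own after `seconds`
    return recorder                                // recorder.url points at the captured file
}
```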
Once it reaches Apple’s servers, your audio file goes through a series of steps that progressively transform it into an action that a computer program can perform—such as figuring out what the weather looks like. The output of that action is then transformed back into text that can be read to you in a natural way.
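Conceptually, the round trip looks something like the chain below. Every type and function in this sketch is a stand-in of my own; the real stages belong to Apple and its partners and are far more elaborate.

```swift
// A bird's-eye sketch of the server-side round trip described above.
struct AudioClip {}
struct Transcript { let text: String }
struct Intent { let action: String }
struct SpokenResponse { let text: String }

func recognizeSpeech(_ clip: AudioClip) -> Transcript {
    Transcript(text: "what does the weather look like")   // step 1: audio to words
}

func interpret(_ transcript: Transcript) -> Intent {
    Intent(action: "weather.lookup")                      // step 2: words to an action
}

func respond(to intent: Intent) -> SpokenResponse {
    SpokenResponse(text: "It's sunny.")                   // step 3: result back to natural text
}

let answer = respond(to: interpret(recognizeSpeech(AudioClip())))
print(answer.text)   // "It's sunny."
```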
Recognition and context
The first of these steps consists of converting your spoken words into text—a task that Apple reportedly delegates to voice recognition powerhouse Nuance. Siri does a remarkable job here: Even with my Italian accent—“Noticeable,” as a friend once told me, “but you’ll never be picked as the next voice of Mario”—I find myself only rarely having to repeat a command.
However, Siri’s success at understanding me is possible only because it already “knows” the words I am likely to speak: The service uploads your contacts and other data about you so that it can recognize the information later on with a good degree of accuracy. Apple has programmed Siri to understand all the terms that are required to fulfill the tasks it supports, based on the context in which they are presented.
Due to the vagaries of human languages, this is not a simple problem to solve even with the most advanced technology. For example, the words byte and bite sound exactly the same, but a restaurant-review app is more likely to use the latter, while software destined for a technical audience will more often employ the former. Confusing the two could lead to a dead-wrong interpretation of the resulting text: Nobody wants a few chips of RAM with their dark rye sandwich, but a computer has no concept of the absurd.
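As a toy illustration of that idea (and nothing more; this is not how Siri’s recognizer actually works), imagine weighting homophone candidates by the vocabulary of whatever domain happens to be active:

```swift
// Toy example: weight homophone candidates by the vocabulary of the active
// domain. The domains, words, and weights are made up for illustration.
let domainVocabulary: [String: [String: Double]] = [
    "restaurant-reviews": ["bite": 0.9, "byte": 0.1],
    "developer-tools":    ["bite": 0.1, "byte": 0.9],
]

func pickHomophone(from candidates: [String], inDomain domain: String) -> String? {
    let weights = domainVocabulary[domain] ?? [:]
    // Choose whichever candidate the current domain considers most plausible.
    return candidates.max { (weights[$0] ?? 0) < (weights[$1] ?? 0) }
}

print(pickHomophone(from: ["byte", "bite"], inDomain: "restaurant-reviews") ?? "?") // bite
print(pickHomophone(from: ["byte", "bite"], inDomain: "developer-tools") ?? "?")    // byte
```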
In order to allow third parties to take advantage of Siri, Apple would have to figure out a way for developers to “teach” the service about the specific terminology that their software is going to use, and the context in which it is going to be used. As you can imagine, this would be difficult even for simple apps, and nearly impossible for others, particularly if they deal with complex concepts that do not lend themselves well to vocalization.
From words to concepts
Once voice is turned into text, Siri’s next job consists of understanding what the user is asking for, a process that relies on an area of science called natural language processing. If you thought voice recognition was difficult, this is many, many times harder, because humans have a nearly unlimited ability to express any given concept using endless combinations of words, and they often say one thing when they really mean another.
To tackle this problem, a natural language system like Siri usually starts by attempting to parse the syntactical structure of a piece of text, extracting things like nouns, adjectives, and verbs, as well as the general intonation of the sentences. That helps Siri determine, for example, whether the text is a question, or whether the person is phrasing things in a way that sounds like they are upset or excited.
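Apple already ships a building block for this kind of first pass: Foundation’s NSLinguisticTagger can label each word of a sentence with its lexical class. A quick sketch, using a sample request of my own:

```swift
import Foundation

// Label each word of a sample request with its part of speech.
let request = "What will the weather look like in Cupertino tomorrow?"
let tagger = NSLinguisticTagger(tagSchemes: [.lexicalClass], options: 0)
tagger.string = request

let wholeText = NSRange(location: 0, length: request.utf16.count)
let options: NSLinguisticTagger.Options = [.omitWhitespace, .omitPunctuation]

tagger.enumerateTags(in: wholeText, scheme: .lexicalClass, options: options) { tag, tokenRange, _, _ in
    guard let tag = tag, let wordRange = Range(tokenRange, in: request) else { return }
    print("\(request[wordRange]) -> \(tag.rawValue)")   // e.g. "weather -> Noun", "look -> Verb"
}
```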
Assuming that the user has a passable command of their chosen language, this is usually a relatively easy problem to handle. The hard part comes when all those words have to be turned into some sort of actionable content that an app can process; to do this well, the system must have what is called domain knowledge—in other words, it must know the subject area you’re talking about.
You’ve likely encountered a similar problem when asked to deal with a body of knowledge you’re unfamiliar with: Your doctor, for example, may tell you that you need to be treated for dyspepsia, but unless you are a medical professional, you probably won’t know that you just have indigestion and need an antacid or two. Apple would have to come up with a way for developers to explain to Siri what their apps can do, and provide all the appropriate terminology for those actions.
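To make that slightly more concrete, here is a deliberately simplified, entirely hypothetical sketch (not a proposed Apple API) of the sort of table an app might need to supply: the terms it understands, including the synonyms a person might actually say, tied to the actions it can perform.

```swift
// Entirely hypothetical: an app declares the actions it supports and the
// domain terms (including everyday synonyms) that should map to each one.
enum MedicalAction {
    case recommendAntacid
    case scheduleAppointment
}

let domainKnowledge: [String: MedicalAction] = [
    "dyspepsia":   .recommendAntacid,
    "indigestion": .recommendAntacid,
    "heartburn":   .recommendAntacid,
    "checkup":     .scheduleAppointment,
]

func action(forTerm term: String) -> MedicalAction? {
    return domainKnowledge[term.lowercased()]
}

if let match = action(forTerm: "Dyspepsia") {
    print(match)   // recommendAntacid
}
```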
Of all the parts that make up Siri, this natural language analysis is probably the toughest for developers to tackle, because apps differ greatly, and it’s hard to come up with a magic solution that can easily be applied to every possible situation. To make things worse, natural language analysis is not a familiar field for most programmers—who, until now, have mainly been concerned with point-and-click (or point-and-tap) interfaces.
Putting results to text
Once a request has been processed, Siri must convert the result back into text that can be spoken to the user. While not as hard as processing a user’s commands, this task, known as natural language generation, still presents some challenges.
It’s relatively easy to write software that uses data to cobble together syntactically correct sentences, but, without some hard work, the result is likely to sound artificial and unexciting. When you ask Siri about the weather, for example, instead of just rattling off a list of statistics on temperature, pressure, and cloud cover, the service gives you a conversational summary, such as “It’s sunny” or “It looks like rain.”
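A rough sketch of the difference, with data, thresholds, and phrasing invented purely for illustration:

```swift
// Illustrative only: turn raw weather data into something that sounds natural
// when spoken aloud, instead of a recitation of numbers.
struct WeatherReport {
    let temperature: Int      // degrees Fahrenheit
    let chanceOfRain: Double  // 0.0 through 1.0
}

func spokenSummary(for report: WeatherReport) -> String {
    switch (report.chanceOfRain, report.temperature) {
    case (0.6..., _):
        return "It looks like rain; you might want an umbrella."
    case (_, 80...):
        return "It's going to be a warm one, around \(report.temperature) degrees."
    default:
        return "Nothing dramatic: about \(report.temperature) degrees and mostly dry."
    }
}

print(spokenSummary(for: WeatherReport(temperature: 72, chanceOfRain: 0.1)))
// "Nothing dramatic: about 72 degrees and mostly dry."
```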
This touch of personality may seem unimportant, but it makes a big difference to a user, particularly during verbal communication. Luckily, there is a well-defined body of work that puts this capability well within the reach of most app developers. Even better, there is no need for this final portion of the Siri experience to take place on the server side; instead, Apple could conceivably come up with a technology that standardizes the creation of complex text, and then leave it to the apps to produce a response directly on each device without unduly taxing resources.
Siri for everyone
Allowing third-party apps to integrate with Siri would be a boon for both developers and users, but it’s going to require a lot of effort for everyone involved, in large part because it would represent a significant departure from the way we are used to designing and interacting with our software.
Still, it’s fair to say that the company is quietly laying the groundwork for putting more and more Siri-like capabilities within the reach of every programmer, starting with Apple’s ever-increasing investment in the back-end facilities that it needs to run Siri’s complex infrastructure.
For example, dictation is now built into both of Apple’s operating systems, though developers are not currently allowed to add their own specialized jargon to the vocabulary. Similarly, both OS X and iOS have recently acquired several programming interfaces that can be used to analyze the syntax of a text document, although they do not help much in the much harder task of interpreting its meaning. Finally, Apple’s software has long been adept at speech synthesis; right now this capability is used primarily by system tools like VoiceOver and is off-limits to developers (at least on iOS), but it wouldn’t take much work, from a technical perspective, to turn speech synthesis into a general-purpose tool that everyone could use.
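On OS X, in fact, the speech-synthesis piece has long been open to developers: a few lines of AppKit are enough to have the Mac read a generated response aloud. The sentence below is just a placeholder.

```swift
import AppKit

// Have the Mac's default voice speak a generated response aloud.
let synthesizer = NSSpeechSynthesizer()
_ = synthesizer.startSpeaking("It looks like rain in Cupertino this afternoon.")

// startSpeaking(_:) returns immediately; when running this as a standalone
// script, keep the process alive long enough for the speech to finish.
RunLoop.current.run(until: Date(timeIntervalSinceNow: 5))
```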
Ultimately, the shift toward natural language interaction is all but inevitable, and the keyboard, while not likely to disappear anytime soon, is going to become less and less relevant. The switch to a voice-based interface is going to be a hard one, with plenty of obstacles along the way; still, I look forward to the day when I will finally be able to stop typing on my devices and start communicating with them.