Jeopardy! as a Modern Turing Test: Did Watson Really Win?

By TBS Staff

1. Introduction

On January 14, 2011, a computer, IBM Watson, outplayed two humans at the popular television quiz show Jeopardy! The computer's opponents weren't merely two humans; they were the two best Jeopardy! champions of all time. IBM Watson beat them both in a live, real-time competition, with no tricks and no gimmicks. Watson analyzed the clues in a real game of Jeopardy!, under real-time competition conditions, and provided its answers according to the rules of the game. The event was held at IBM Research in Yorktown Heights, New York.

In 1997, IBM had made the news with its Deep Blue supercomputer, which beat then World Chess Champion Garry Kasparov in a live, televised match. The Deep Blue victory created a sensation, as chess was widely thought to represent human intelligence, and many felt that the “crossover” point of a computer beating a human at chess was still years, perhaps decades, away. Yet for all the shock of the Deep Blue victory over a seemingly invincible human opponent, what happened in Yorktown Heights in January of 2011 has proven more significant.

Almost five years later, Watson has become a major player in IBM's new Cognitive Computing research, and it has expanded to become a general platform for analyzing unstructured data, like text. Watson is now deployed in the medical domain, where it analyzes vast repositories of textual data to help doctors and medical practitioners make more informed decisions by answering questions that are difficult to answer without its help. The IBM brand has an enormous investment in Watson; we even see Watson in television commercials bantering with the likes of Bob Dylan! Watson, perhaps unlike Deep Blue, is here to stay.

But to tell the story of how Watson was developed and deployed against humans on a difficult, natural language-based task (namely, understanding textual clues and matching them to answers in the game of Jeopardy!), we first need to understand the challenges of open-domain question-answering (QA). For Watson, in the end, was a system developed for QA—a challenging problem in computer science and Artificial Intelligence (AI) with a long history, and much recent interest given the success of the Web and the unstructured textual and other data sources the Web has made available.

Given terabytes of Web pages like Wikipedia and other knowledge sources, how does one “crunch” all this unstructured data to reliably answer questions? If the answer to a question lies in, say, a Wikipedia page—how does one get a computer to search it for the answer? It seems the computer must somehow “understand” natural language. And the natural language processing (NLP) task is notoriously difficult for computers.

Watson generates a kind of excitement that Deep Blue did not, because it seems to excel at a task that, unlike chess, is not so easily reduced to logical rules. Watson seems to excel at processing natural language: the very same language that we “process” when we talk to each other, read the newspaper, and go about our daily lives. And so it's to a brief primer on natural language understanding as it applies to the QA task that we turn next.

2. Question-Answering and Natural Language Understanding

QA, or question-answering, is the task of providing an answer, given a question. For instance, a question such as “President Obama began his first term as President in what year?” requires a computer to parse the sentence, zero in on what is asked, figure out what searches to perform to answer what is asked, and finally to produce an answer representing its best “guess” or highest statistical confidence in its accuracy.

The QA task has a long pedigree in advanced computational research such as the field of AI. It has enormous practical, business applications in the area of search and retrieval (imagine a search engine that didn't just produce a list of pages, but rather could answer specific questions you posed to it), and requires advances in a number of important areas in AI: information retrieval (IR), natural language processing (NLP), machine learning, knowledge representation and reasoning (KRR), and human computer interfaces (HCI). QA, in other words, is a hard problem for AI, and success on it translates into important applications for everyday use (search), as well as important advances in a number of key areas of research in AI.

The key to understanding both the importance as well as the difficulty of QA is in the input to many QA systems—unstructured information. Most of human communication—in the form of writing, speech, or images—is unstructured. Web pages, for instance, are largely textual sources and images. Such information is easily understood by humans (assuming one speaks the language), but is notoriously difficult for computers to process.

Yet the success of the Web has resulted in an explosion of unstructured information, and so the incentive to develop computer systems that can better process unstructured information is only increasing. Moreover, unstructured information is growing faster than structured resources like relational databases, so the “problem” of unstructured content keeps growing. At any rate, the world of communication is not in a database. It's on the Web, and the Web is largely unstructured.

To pin this discussion down a bit, consider a simple textual source such as a blog. Blogs are roughly fifteen years old, and the number of blogs on the Web continues to explode. So many blogs are created on the World Wide Web each day that it is difficult to say exactly how many there are today. In 2011 (the age of dinosaurs in the ever-changing landscape of the Web) there were over 156 million public blogs on the Web. By 2014, there were 172 million Tumblr blogs alone, and the number keeps increasing.

For QA, the relevant point is this: how many questions could we answer if we could just search, and process, the hundreds of millions of blogs on the Web alone? At stake here are of course issues with reliability, verification, and so on. But even given these considerations, it's clear that a vast amount of unstructured information exists just in blogs. If this information could be effectively harnessed by computer systems, application areas such as open-domain QA would take a quantum leap forward. But how can computers “read” blogs? This is the key to understanding the challenges with QA, and to understanding the significance of Watson.

NLP techniques, also known as text analytics, convert unstructured text into usable information for computers by analyzing natural language in terms of syntax, semantics, context (pragmatics), and usage patterns (statistical analyses of language). The goal with NLP is to render unstructured textual information as structured, computer-usable information, all the while preserving the meaning and intent of the language expressions in the text.

NLP is nearly as old as the field of AI itself. Decades of research have led to well-defined areas of study within its purview, including (shallow and deep) syntactical parsing, information extraction techniques such as named entity recognition (NER), word sense disambiguation (WSD), and semantic analysis techniques such as extracting the predicate-argument structure (PAS) of grammatical sentences, as well as extracting relations between entities referred to in a sentence, such as “Friend Of” as in “Bob is a friend of John.”

NLP is at the core of such tasks as language translation, language interpretation (“understanding” a discourse or pieces of a discourse such as paragraphs or sentences or clauses), language generation (producing correct discourse in dialogue like conversation or in monologue such as automatically generated newswire text), and others. The well-known Turing Test, where a computer dialogues with a person via text or teletype and “passes” the test if the person cannot tell the difference between the computer and another person, would, if it could be realized, constitute a very complete example of natural language processing capabilities in a digital computer.

Yet, the Turing Test has, to date, not been passed. To be sure, computers have been deliberately programmed to fool people into thinking that they are human by exhibiting certain idiosyncratic traits (such as pretending to be a young, non-native English speaker). But no computer purporting to be a well-educated adult English speaker has come close to passing the test. Accordingly, some argue that the test is so inherently difficult that we shouldn't expect a computer to pass it for a very long time (if ever). What gives? Why is NLP so hard? (And: what can Watson do about it?)

As IBM's David Ferrucci points out, writing in the IBM Journal of Research & Development, our “interpretations of text can be heavily influenced by personal background knowledge about the topic or about the writer or about when the content was expressed.” Context is king, in other words. What do words mean in isolation? What do sentences mean in the context of a paragraph, of a discourse? What does an entire discourse—say, a newspaper story—mean apart from its context in a particular culture, at a particular time? It seems that language embodies our minds—our intentions, our interactions, our hidden and unspoken assumptions and shared knowledge. That is why NLP is hard, as Ferrucci points out.

Ferrucci should know: he is the head of the Watson project at IBM. In discussing the challenge to the IBM team presented by Watson, Ferrucci makes clear that a different approach to NLP would be required to make substantive progress on Watson. He continues:

Consider the expression “That play was bad!” What sort of thing does “play” refer to? A play on Broadway? A football play? Does “bad” mean “good” in this sentence? Clearly, more context is needed to interpret the intended meaning accurately.

If context is king, however, then the question of advancing open-domain QA becomes the challenge of how to handle context better. Ferrucci approached this problem by asking an interesting question (paraphrasing): “If there are limits to specific NLP techniques, how might a system be built that produces a superior result by combining those limited subsystems and techniques?” “How accurate does, say, a parser have to be, if the entire system works on the natural language texts to produce an eventual answer?”

Ferrucci was asking about systems, not techniques. And Watson, perhaps more than any NLP system or QA system ever, is an entire system for processing natural language. We turn next to Watson, the system, itself. More specifically, we turn to the “engine” of Watson—the part of the Watson system that processes natural language, called DeepQA: Deep Question Answering. But first, the precursor to this engine, namely, PIQUANT.

3. Lessons from PIQUANT: The Society of Minds

Watson was prefigured by an earlier IBM system known as PIQUANT. PIQUANT was a system developed in 1999 by IBM to perform open-domain QA. It was funded by government grants, and competed in the well-known Text REtrieval Conferences (TREC) sponsored by the National Institute of Standards and Technology (NIST), which provided datasets as a common benchmark for comparing system performance.

PIQUANT was, at the time, a state-of-the-art system that answered factoid questions using a pre-determined set of answer types, and a single “ontology” or knowledge repository (semantic terms stored in a database, like “Person,” “Organization,” “Place,” and so on) to map questions and text resources into this knowledge repository, to connect questions with plausible answers. PIQUANT was consistently a top performer in TREC, but it had a number of limitations that made it unsuitable for more complicated QA tasks like playing the game of Jeopardy!

One limitation was the static set of answer types (e.g., people, places, dates, numbers), which was too narrow to cover the much broader domain covered by Jeopardy questions. Another limitation was this: because there is a strict (monetary) penalty for incorrect answers in Jeopardy!, a premium must be placed on the system's confidence in the correctness of particular answers given a question during game play. If a system is not confident of the correctness of an answer, it should not “buzz in” or answer at all, as this may result in a penalty. Hence the importance of knowing-what-one-knows, or in knowing roughly how likely a given answer is in fact the correct answer, makes Jeopardy! play vastly more demanding than traditional open-domain QA competitions like TREC.
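The reasoning behind “knowing-what-one-knows” can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a symmetric payoff (a correct answer wins the clue's dollar value and an incorrect one loses the same amount). The function names are hypothetical and the break-even rule is an illustration only, not Watson's actual buzz-in strategy.

```python
def expected_gain(confidence: float, clue_value: int) -> float:
    """Expected change in score from buzzing in.

    Assumes a symmetric payoff: +clue_value if right, -clue_value if wrong.
    """
    return confidence * clue_value - (1.0 - confidence) * clue_value


def should_buzz(confidence: float, clue_value: int) -> bool:
    """Buzz in only when the expected gain is positive.

    Under the symmetric-payoff assumption the break-even confidence is 0.5.
    """
    return expected_gain(confidence, clue_value) > 0


if __name__ == "__main__":
    for confidence in (0.9, 0.6, 0.4):
        gain = expected_gain(confidence, 800)
        print(f"confidence={confidence:.1f} expected gain={gain:.0f} "
              f"buzz={should_buzz(confidence, 800)}")
```

Under these assumptions a system that cannot estimate its own confidence has no basis for deciding when to stay silent, which is the capability TREC-style evaluation never demanded.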

As a third limitation, PIQUANT was a designed and deployed system having a relatively static or fixed capability. By contrast, the complexity of playing Jeopardy! seemed to require a fully extensible system that could be easily modified, improved, and tested on a game-by-game basis. Watson had to “grow” into the game of Jeopardy! by repeated trial and error, while PIQUANT was a one-size-fits-all QA system that, although it performed well on TREC, could not be suitably extended. In short, the Watson system would need an entirely different approach.

The late AI pioneer Marvin Minsky wrote a now-famous book about the nature of computational intelligence titled The Society of Mind. His point in this classic AI text, first published in 1986, is that there is no central “CPU” for human intelligence. Our minds are rather composed of scores of little “minds” that are concerned with different parts of a whole. It is by combining these little “minds” together, each performing some task and essentially oblivious to others, that a full picture of human intelligence or thought emerges, according to Minsky.

While Minsky's speculations about the ultimate nature of the human mind can be debated, his insights were pivotal to Ferrucci and the IBM Research team who developed Watson. Watson is, quite literally, a “society” of separate and massively parallel processing components, all combined in a pipeline that takes at one end a (Jeopardy!) question, and produces at the other end the most likely answer given the evidence available to it.

Watson may not be a society of “minds,” but it follows Minsky's broad brush idea that intelligence must be a decomposable network of interacting subsystems, and not a monolithic homunculus or “central processing unit” that enforces a single interpretation on input data. Language requires context; context, according to Ferrucci and the IBM team, required a “society of minds” approach to bring it to heel computationally.

4. Watson is Born

In 2004, Jeopardy! champion Ken Jennings won a record 74 Jeopardy! games in a row, bringing the popular quiz show even further into the media spotlight. The head of IBM Research began challenging IBM researchers working on PIQUANT to develop an open-domain QA system that could compete at Jeopardy! This was the ultimate QA challenge and, like the chess-playing Deep Blue before it, one that promised to shine a media spotlight on IBM. IBM has a penchant for advancing its research in the public eye, and a QA system that could compete against a star like Jennings seemed the perfect IBM challenge. But could it be done?

By April 2007, after extensive research into the feasibility of a system like Watson, IBM Research committed to the challenge of building an open-domain QA system that could win at Jeopardy!—even against champions like Jennings. IBM was to build a superhuman contestant on Jeopardy! in three to five years. The IBM Watson project was born.

Immediately, there were a number of technical challenges to address. For one, the Watson system would have to behave like an actual Jeopardy! contestant. Like human contestants, the system could not access the Web for answers during competition, and would be subject to the same time constraints and other factors involved in actually playing Jeopardy! For instance, it would have to “buzz in” with an answer whenever it was confident its best answer was indeed correct, and it would have to make strategic wagers for Daily Doubles, and for Final Jeopardy—both parts of the game where players wager their earnings and can win, or lose, big.

In short, Watson would have to analyze Jeopardy! questions written in natural language, finding answers in its available knowledge resources. Moreover, it would be required to mimic a human contestant in order to actually “play” the game. But for Watson to do any of this, it would first have to produce answers to actual Jeopardy! questions. This was NLP, or more specifically open-domain QA, and nothing of the scale or complexity of playing Jeopardy! had ever been attempted before. The IBM researchers would need to understand the game and the current human level of game play in order to properly develop the system. So they turned, naturally, to looking at Jeopardy! contestants and their performances first.

IBM Research (hereafter “the Watson team”) gathered about 2,000 actual Jeopardy! games, analyzing them to determine a benchmark set of statistics for Watson. They found that winning human players in Jeopardy! attempt on average about 40 – 50% of questions offered in a game and get about 85 – 95% correct. For Watson to be competitive against winning players, it would have to meet or exceed these figures.

The Watson team's first target was to raise the number of questions attempted: 70% of questions attempted at 85% accuracy, abbreviated 85% Precision@70. The other target, necessary for playing Jeopardy! in real time, was speed of answer. On average, human players answered a question in less than six seconds after the question was displayed (almost immediately after the “buzzers” are enabled), with a mean time of three seconds.
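To make the Precision@70 figure concrete, here is a minimal sketch of how such a metric could be computed, assuming each question has a confidence score and a right/wrong label, and that the system “attempts” the questions it is most confident about. The function name `precision_at` and the sample data are hypothetical; the Watson team's exact evaluation code is not described in this article.

```python
from typing import List, Tuple


def precision_at(results: List[Tuple[float, bool]], percent_attempted: float) -> float:
    """Accuracy on the most-confident `percent_attempted` percent of questions.

    `results` holds one (confidence, is_correct) pair per question.
    """
    if not results:
        raise ValueError("results must not be empty")
    # Rank questions from most to least confident.
    ranked = sorted(results, key=lambda pair: pair[0], reverse=True)
    # Attempt only the top slice (at least one question).
    n_attempted = max(1, round(len(ranked) * percent_attempted / 100))
    attempted = ranked[:n_attempted]
    n_correct = sum(1 for _, is_correct in attempted if is_correct)
    return n_correct / n_attempted


if __name__ == "__main__":
    # Made-up data: ten questions with confidence scores and correctness.
    sample = [
        (0.95, True), (0.90, True), (0.85, True), (0.80, False), (0.70, True),
        (0.60, True), (0.55, True), (0.30, False), (0.20, True), (0.10, False),
    ]
    print(f"Precision@70 = {precision_at(sample, 70):.1%}")
```

Because the low-confidence questions are skipped, precision on the attempted 70% can be higher than accuracy over all questions, which is why confidence estimation and accuracy had to improve together.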

Playing Jeopardy! accurately, making smart decisions about its own confidence in its answers, and playing the game in real-time were major challenges for the Watson team as they began development on Watson in 2007. In what follows, we'll take a look at the actual construction of the system to see how the Watson team addressed these challenges.

5. The Watson Architecture

IBM Research, to facilitate natural language processing research and development, built the Unstructured Information Management Architecture (UIMA). Developed between 2001 and 2006, UIMA is a framework for building applications to process text and natural language. Specific algorithms or approaches can be “plugged-in” to the UIMA framework, in a “society of mind” approach that enables the designers to mix and match input, algorithms, and specific strategies in modules. The “society of mind” software framework UIMA, in other words, was already available, and seemed a perfect fit for the requirements of the Watson system the Watson team had identified from their initial tests, and examination of the game.

With UIMA, the complicated task of analyzing Jeopardy! questions, finding evidence for candidate answers, computing a confidence in these answers, and making a “go-no go” decision quickly could be decomposed into a number of subtasks, each playing a particular role, so that the complexity of the overall problem could be made manageable. Key to manageability here, also, was UIMA's support for massive parallelization. Separate software processes or what computer scientists call “threads” can handle different language processing tasks in the UIMA framework. This means that the system can be run in parallel, on a supercomputer platform, and so a thousand different tasks, all running in parallel, don't mean a thousand times slower results. In other words, IBM's UIMA was the society of mind framework Ferrucci and the others needed, enabling a Watson system to run quickly and efficiently by computing many tasks in parallel.
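The idea of independent components working on the same text in parallel can be sketched in a few lines. This is not the UIMA API: it is a toy illustration using Python's standard `concurrent.futures` module, and the three annotator functions and the `annotate` helper are hypothetical stand-ins for real language-processing components.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

CLUE = 'He was a bank clerk in the Yukon before he published "Songs of a Sourdough" in 1907.'


def find_years(text: str) -> List[str]:
    """Toy annotator: four-digit years from 1000 to 2099."""
    return re.findall(r"\b(?:1[0-9]{3}|20[0-9]{2})\b", text)


def find_quoted_titles(text: str) -> List[str]:
    """Toy annotator: anything inside straight double quotes."""
    return re.findall(r'"([^"]+)"', text)


def find_pronouns(text: str) -> List[str]:
    """Toy annotator: a handful of personal pronouns."""
    pronouns = {"he", "she", "it", "they"}
    return [w for w in re.findall(r"[A-Za-z]+", text.lower()) if w in pronouns]


# Each component is oblivious to the others; new ones can simply be added here.
ANNOTATORS: Dict[str, Callable[[str], List[str]]] = {
    "years": find_years,
    "titles": find_quoted_titles,
    "pronouns": find_pronouns,
}


def annotate(text: str) -> Dict[str, List[str]]:
    """Run every annotator concurrently and collect the results by name."""
    with ThreadPoolExecutor(max_workers=len(ANNOTATORS)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in ANNOTATORS.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    print(annotate(CLUE))
```

Python threads will not speed up CPU-bound work like this, so the sketch shows only the structure: separate components, one shared input, results merged at the end. Watson achieved real speedups by spreading such work across many machines.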

Thus the Watson team, by 2007, had identified their own UIMA as the underlying architecture for the Watson system. Additionally, they had the existing PIQUANT system used in TREC competitions to use in getting a rough “baseline” performance on Jeopardy!—something to beat with the new and improved system. From the baseline PIQUANT performance, they could set an aggressive target for the more powerful Watson system, and begin work building the necessary improvement in the UIMA framework.

As might be expected, PIQUANT was miserable at Jeopardy!, performing at a scant sixteen percent precision (16% Precision@70) in early testing on legacy Jeopardy! games, which would not qualify it to play any real Jeopardy! games, let alone get into the winner's circle. An entirely new approach thus became necessary. Enter the future of IBM Research. Enter DeepQA.

5.1. DeepQA

DeepQA is an “extensible software architecture” built on top of UIMA, specifically designed for natural language processing tasks (called a “pipeline” when the tasks are all computed from question to answer), to power state of the art QA applications. DeepQA is the “guts” of the Watson system. Simply put, DeepQA takes input (a question), parses the question to find out what is asked, and then outputs its best guess (the answer).

Importantly, the design philosophy underlying DeepQA is never to assume that any part of the system, by itself, “understands” the question and hence can simply look up the correct answer. Rather, different candidate answers are generated based on the analysis of the question, and question-answer pairs are scored to produce a ranked list at the end of all the processing.

The system design of DeepQA is intended to minimize the possibility of the system getting “locked-in” to a particular answer, when additional evidence discovered downstream in its processing pipeline might suggest another one is superior (which only becomes apparent later). In short, DeepQA was designed to use the “society of minds” approach in actual software, running on actual computers. It was designed to really play Jeopardy!

Broadly speaking, there are three main components to the DeepQA architecture. First, the question must be analyzed to “understand” what is being asked. Second, a search of internal knowledge repositories (structured and unstructured—to be discussed) must be performed to generate a large list of candidate answers. Third, the candidate answers must be analyzed, scored, and a “best guess” answer must be provided to the system, which will decide based on the confidence in accuracy whether it “buzzes in” to answer a question, or declines.
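The three stages just described can be sketched as a toy pipeline. This is a minimal sketch under heavy assumptions: the “knowledge repository” is three hand-written passages, question analysis is reduced to keyword extraction, and scoring is simple keyword overlap. All names here (`PASSAGES`, `analyze_question`, `generate_candidates`, `score_candidates`, `answer`) are hypothetical, and none of this reflects DeepQA's actual algorithms.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical toy knowledge repository: one descriptive passage per entity.
PASSAGES = {
    "Robert W. Service": "Poet and bank clerk in the Yukon who published Songs of a Sourdough in 1907.",
    "Jack London": "Novelist who wrote about the Yukon gold rush.",
    "Robert Frost": "American poet who published A Boy's Will in 1913.",
}

STOPWORDS = {"he", "was", "a", "in", "the", "before", "of", "and", "who", "about"}


def analyze_question(text: str) -> Set[str]:
    """Stage 1 (stand-in): reduce text to a set of lowercase content words."""
    words = {w.strip('.,"').lower() for w in text.split()}
    return {w for w in words if w and w not in STOPWORDS}


def generate_candidates(keywords: Set[str]) -> List[str]:
    """Stage 2: keep every entity whose passage shares at least one keyword."""
    return [name for name, passage in PASSAGES.items()
            if keywords & analyze_question(passage)]


def score_candidates(keywords: Set[str], candidates: List[str]) -> List[Tuple[str, float]]:
    """Stage 3: score each candidate by keyword overlap and rank, best first."""
    scored = [(name, len(keywords & analyze_question(PASSAGES[name])) / len(keywords))
              for name in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def answer(clue: str, threshold: float = 0.5) -> Optional[str]:
    """Return the best answer, or None (decline to buzz) if confidence is low."""
    keywords = analyze_question(clue)
    ranked = score_candidates(keywords, generate_candidates(keywords))
    if ranked and ranked[0][1] >= threshold:
        return ranked[0][0]
    return None


if __name__ == "__main__":
    clue = 'He was a bank clerk in the Yukon before he published "Songs of a Sourdough" in 1907.'
    print(answer(clue))
```

Note that no stage commits to an answer early: every plausible candidate survives until the final ranking, and the threshold alone decides whether the system answers or declines.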

We'll take a close look at these components of the DeepQA architecture in an upcoming section. We turn next to a second major feature of Watson—not a “piece” but a process, or rather a methodology.

5.2. AdaptWatson

DeepQA was in place by the end of 2007, but the Watson system wasn't tuned for performance or accuracy yet. In other words, the scaffolding of a new and much improved system had been built, but the actual pieces of it (its components, including advanced machine learning components) had not yet been determined. Given the intent to use learning components, the team needed a methodology that would help it quickly train and test the system, evaluate results, and make the required changes.

Machine learning researchers are familiar with the necessity of training and testing on datasets to make incremental improvements, but the Watson team needed a way of analyzing scores of different processing components, making sometimes minor adjustments to certain parts of the system, putting everything back together and recording the overall result. (Remember all those components of the pipeline? The Watson team needed a way of pin-pointing different ones, retrieving partial answers or output along the way, to fully tweak the Watson system. A tall order.)

Given that Watson would be playing a game only an hour long, some of this “tweaking” would need to be performed while a game was underway. (An even taller order.) The vision of getting Watson game-ready was like that of a race car, where rapid changes of tires and other modifications are made during the course of competition, while others are made before or after the competition, in anticipation of upcoming games.

This type of interaction with such a complicated system required its own methodology, which the Watson team gave a name: AdaptWatson. AdaptWatson was put in place in 2008, and it made possible the rapid research, development, integration, and evaluation of “more than 100 core algorithmic components.” Each member of Watson's “society of minds” could be reviewed while it was thinking, one might say.

With AdaptWatson in place, the Watson team grew to about twenty-five members, and included student researchers from university partnerships. Thousands of experiments were performed before Watson “went live,” resulting in incremental (and sometimes drastic) improvements to its performance on actual Jeopardy! questions. In the months that followed, the DeepQA pipeline using the AdaptWatson methodology, along with over 200 eight-core servers crunching all the data, bumped Watson's performance from the baseline PIQUANT score of 16% Precision@70 to over 85% Precision@70. This latter performance qualified Watson to play against human champions at Jeopardy!

This, then, is the story, the abridged version, of how IBM Research created a software system to play Jeopardy! But, how, exactly, does DeepQA “work”? What is the magic in the guts of the Watson system that enables it somehow to “read” Jeopardy! questions in real-time, and find answers reliably enough using its own onboard resources (the system, like human players, could not have recourse to the Web during game play), that it could play Jeopardy! against human champions?

No system had even approached this level of natural language processing prowess before. Yet Watson—for all its innovation—was essentially using techniques and methods already known to the scientific community. The “magic,” in other words, was in the system as a whole, and in the smart methodology that the Watson team put in place. In what follows, we'll unpack the DeepQA architecture to see just how the Watson team assembled a natural language processing engine that performed this feat.

6. The “Magic”: Natural Language Processing in Watson's DeepQA

Question analysis is the logical beginning for any QA system, including Watson: what is the question asking? And: what sort of answer is right for that question? Watson, as mentioned, uses common NLP components like syntactic parsers and semantic analysis subsystems, and the Watson team followed a design guideline to build general purpose (generally useful) NLP components so that future versions of “Watson” could perform other NLP-intensive tasks.

Still, the components were also “tuned” or “biased” to exploit specific features of the questions in the Jeopardy! game. The overall strategy here was to use the question analysis phase of Watson to identify key elements of each Jeopardy! question, which could then guide the system toward candidate answers that are likely good matches. In other words, like the human contestants, Watson would first have to “understand” (analyze) each question in the game.

In Watson, the important elements of Question Analysis were given specific names by the team: the focus, Lexical Answer Types (LATs), Question Classification, and Question Sections (QSections). The details here are somewhat technical, but we can briefly review the meaning of these terms to give a flavor for the Question Analysis capabilities and approach of Watson.

The focus refers to the part (word or words) of the question that literally makes a direct reference to the answer. Consider the following Jeopardy! question:

POETS & POETRY: He was a bank clerk in the Yukon before he published “Songs of a Sourdough” in 1907.

The focus in this question is the pronoun “he.” Knowing who “he” refers to in this question would, in fact, answer the question.

Likewise, the LATs (Lexical Answer Types) indicate what type of entity is asked for: Person? Place? Organization? In the example, LATs are “he,” “clerk,” and “poet,” indicating that an entity of type Person is asked for in the question. LATs are the syntactic clues about what to search for when looking for an answer. The pronoun tells us it's a person, and “clerk” or “poet” gives us hints about the occupation or other roles of the person, further narrowing the scope of candidate answers for the question.

Question Classification, by contrast, seeks a type for the entire question: what type of question is this? The example question is of type Factoid, which is the most common type of question in Jeopardy! (no surprise). Other question classes are specific to Jeopardy!, such as “Puzzle,” or “Common Bonds,” the latter a type that requires supplying a word that is common to a set of terms (e.g., string bean, string cheese). These question types are more specific to the intricacies of the game of Jeopardy!

Finally, QSections are parts of a question that are best considered separately, as subparts of the whole question. Constraints on the answers, such as lexical constraints (letter constraints) are a common example of QSections. For instance, in a question beginning “This 4-letter word means…”, the term “4-letter” is considered a QSection by Watson.
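The four elements of Question Analysis can be pictured as fields of a single record. The sketch below is a hypothetical data structure, not Watson's internal representation: the class name `QuestionAnalysis` and its fields are assumptions, the values for the example clue are filled in by hand from the discussion above, and the regular expression for letter-count constraints is only one simple way such a QSection might be spotted.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class QuestionAnalysis:
    """Hypothetical container for the elements of Question Analysis."""
    category: str
    clue: str
    focus: str                    # the word(s) that refer to the answer
    lats: List[str]               # Lexical Answer Types
    question_class: str           # e.g. "Factoid", "Puzzle", "Common Bonds"
    qsections: List[str] = field(default_factory=list)


def find_letter_constraints(clue: str) -> List[str]:
    """Spot simple lexical constraints such as '4-letter'."""
    return re.findall(r"\b\d+-letter\b", clue)


# Values taken by hand from the example discussed in the text.
service_clue = QuestionAnalysis(
    category="POETS & POETRY",
    clue='He was a bank clerk in the Yukon before he published "Songs of a Sourdough" in 1907.',
    focus="he",
    lats=["he", "clerk", "poet"],
    question_class="Factoid",
)

if __name__ == "__main__":
    print(service_clue)
    print(find_letter_constraints("This 4-letter word means..."))
```

Keeping the elements separate lets later stages use each one independently, for example filtering candidates by LAT while a different component checks letter constraints.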

Question analysis is thus decomposed into a number of tasks in Watson, using a divide-and-conquer strategy that runs throughout the design of the system. In earlier QA systems such as PIQUANT, questions were analyzed for type (e.g., Factoid), and answers were sought that matched syntactic or semantic features of the question given that type.

In Watson, questions are multiply analyzed in terms of the elements described above, and then further processing is performed on the question to extract information that helps launch many different searches given the many-faceted question analysis. The goal is to generate a very large list of candidate answers given the analysis of the question, which can then be used to gather additional evidence and produce a final scoring at the end of the DeepQA pipeline.

Generating a large initial list of candidate answers by deep analysis of the question facilitates finding some answer for a given question—a prerequisite for having the correct answer. (In computer science lingo, this is known as having a high recall: casting a net over a large set of candidate answers, raising the likelihood of finding the right one. You cannot return the correct answer at all, if it is not among the answers you are considering in the first place.)
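Recall at the candidate-generation stage can be measured directly: it is the fraction of questions whose correct answer appears anywhere in the candidate list. The sketch below uses made-up candidate lists, and the function name `candidate_recall` is hypothetical; it simply illustrates why a wider net raises the ceiling on what the later stages can achieve.

```python
from typing import List, Sequence


def candidate_recall(candidate_lists: Sequence[List[str]], gold_answers: Sequence[str]) -> float:
    """Fraction of questions whose correct answer is among the candidates."""
    if len(candidate_lists) != len(gold_answers) or not gold_answers:
        raise ValueError("need one non-empty candidate list entry per gold answer")
    hits = sum(1 for candidates, gold in zip(candidate_lists, gold_answers)
               if gold in candidates)
    return hits / len(gold_answers)


if __name__ == "__main__":
    gold = ["Robert W. Service", "Jack London", "Robert Frost", "Emily Dickinson"]

    # Made-up output of a narrow search: one candidate per question.
    narrow = [["Robert W. Service"], ["Mark Twain"], ["Walt Whitman"], ["Emily Dickinson"]]

    # Made-up output of a broader search: more candidates per question.
    broad = [
        ["Robert W. Service", "Jack London"],
        ["Mark Twain", "Jack London"],
        ["Walt Whitman", "Robert Frost"],
        ["Emily Dickinson", "Sylvia Plath"],
    ]

    print("narrow recall:", candidate_recall(narrow, gold))
    print("broad recall:", candidate_recall(broad, gold))
```

The broader lists contain more wrong answers too, which is why the scoring and ranking stages that follow matter so much.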

The actual NLP analysis of the question starts with a syntactic parse. Watson uses a parser known as the English Slot Grammar (ESG) to produce a parse or syntactic “tree” that exposes the basic syntax of the question (e.g., “he” is a pronoun, “this” is a determiner). The results of the ESG parse are then used for something called semantic analysis. In semantic analysis, types are identified (“Yukon” is a Place, “1907” is a Date), and then (even more complicated) predicate-argument structure (PAS) is extracted from the ESG parse and semantic analysis output.

For instance, publish(e1, he, “Songs of a Sourdough”) means that there is some publishing event e1 performed by the person indicated by the pronoun “he,” whose object is the poem (or creative work) “Songs of a Sourdough.” The details here get messy, but keep in mind that the extracted structure publish(e1, he, “Songs of a Sourdough”) is now in Watson-speak: Watson can use this to perform a search for knowledge that would satisfy the variables in the parsed structure.

Finally, relation extraction outputs any semantic relations that are exposed from the ESG parse together with semantic analysis and PAS processing:
authorOf(focus, “Songs of a Sourdough”)
temporalLink(publish(...), 1907)

Here, the relations are called predicates: “authorOf” means A is the author of B (it takes two “arguments”), and similarly “temporalLink” means that the first argument (event), that of publishing, was done in 1907.

The result of this syntactic and semantic processing of questions enables Watson to zero in on candidate answers. Specifically, it enables Watson to perform targeted searches for good candidate answers. For instance, given the binary relation authorOf(focus, “Songs of a Sourdough”) and given that question analysis has resolved the question focus to “he,” a good candidate answer would be the name of the person (author) who wrote “Songs of a Sourdough.” This is the key to understanding the divide-and-conquer approach of the Watson system, embodied in the “society of mind” vision of cognitive capabilities.

Continuing with this example, candidate answers culled from, say, Wikipedia pages (indexed and stored in Watson's knowledge base) mentioning “Songs of a Sourdough” might have the author, Robert W. Service, which could then satisfy the previously unknown value of the pronoun “he.” (Technically, this is called co-reference resolution in NLP research.) The NLP analysis of the question by DeepQA, then, is used to perform searches (many of them, all in parallel), in order to zero in on candidate answers. As mentioned before (and to be discussed shortly), these candidate answers are then combined at the end of the DeepQA pipeline, assigned a likelihood, and if that likelihood is above a threshold of confidence, returned as an answer by Watson, playing the game of Jeopardy! (One is tempted to add: amazing.)

Finally, note that Watson needs to store Wikipedia pages in its knowledge base rather than access them online. This is because Watson cannot search the Web during the course of a Jeopardy! game. Like its human opponents, it must instead store all the knowledge it will need to perform in a real game situation. Here it is on par with the human contestants. Hence, acquiring all this knowledge is a central concern for Watson. In what follows, we'll examine the types of information Watson has available to it, and the types of NLP analysis performed on these knowledge sources to enable Watson to pick out the needles in the haystack it needs to play Jeopardy! at a championship level.

7. Feeding Watson: Knowledge Acquisition and Information Extraction

The distinction between structured and unstructured information is key to understanding the design decisions made by the Watson team in “feeding” Watson enough facts about the world to enable championship Jeopardy! game play. For instance, traditional QA systems might rely on a repository of structured information such as found in typical relational databases. Such data is “structured” because it is described by the database format (called a “schema”), and can be accessed by structured query languages such as SQL—all computer readable techniques, from the get-go. Databases of facts are great, but how do you put terabytes of Web data into them in the first place? By hand?
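
As a minimal sketch of what “structured” means here, the following uses Python's built-in sqlite3 module. The table name, column names, and rows are assumptions made up for illustration; they are not Watson's actual schema or data store.

```python
import sqlite3

# In-memory database with a hypothetical schema: the table and column
# names are invented for illustration and are not part of Watson.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE works (title TEXT, author TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO works VALUES (?, ?, ?)",
    [
        ("Songs of a Sourdough", "Robert W. Service", 1907),
        ("Ulysses", "James Joyce", 1922),
    ],
)


def author_of(title):
    """Return the author stored for a title, or None if the fact is absent."""
    row = conn.execute(
        "SELECT author FROM works WHERE title = ?", (title,)
    ).fetchone()
    return row[0] if row else None


print(author_of("Songs of a Sourdough"))  # Robert W. Service
print(author_of("A title nobody entered"))  # None
```

The lookup is exact and reliable when the row exists, but it returns nothing for any fact that nobody has entered, which is the coverage problem raised by the questions at the end of the paragraph above.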

Even with advanced computational techniques, translating unstructured text data on the Web into a structured database format is impractical: there are simply too many data sources, containing too much unstructured (i.e., written) data, to make such an approach viable. The Watson team hit upon a compromise, reflecting yet again their penchant for breaking down a problem into smaller, manageable pieces. The team converted all the Web data into semi-structured data by extracting a layer of “meaning” from raw textual sources like Wikipedia. The approach didn't convert Wikipedia into a database, but neither did it leave it completely unstructured as simply text in the form of Web pages.

In fact, the automatic extraction of semi-structured knowledge from text data sources has become increasingly common in recent years. DBPedia, for instance, a repository of factoids like “Born: 1951” culled from Wikipedia pages (notably from the now ubiquitous Wikipedia infoboxes), is built and updated largely with automated methods, using tools from the field of information extraction (IE), an important subfield of NLP. Watson, too, uses such an approach to gather facts and knowledge about the world suitable for playing Jeopardy! Web pages are partially analyzed, and enough structure is identified to enable the Watson system to “understand” the results well enough to determine whether a particular result might be relevant. That's all that's needed.

As might be expected, Watson makes use of a large range of Web sources: Wikipedia pages, other online encyclopedias, online newswire, and databases containing specific, useful knowledge, such as geographical locations, are all part of its knowledge repository. Some fully structured (database-ready) information is also extracted into a custom knowledge base (KB), called PRISMATIC. PRISMATIC, along with the semi-structured text sources, makes up the knowledge repositories that Watson searches once it has analyzed a question. While the analogy is not exact, we might say that these sources are Watson's long-term “memory.” In what follows, we'll look at the “information extraction” techniques that Watson uses to extract usable information from text resources, in particular those that populate the PRISMATIC KB.

The NLP community has developed over the years (even decades) a set of “shallow” Information Extraction (IE) techniques. Watson's extraction strategy uses these techniques but also expands on and customizes them. In the Watson system, these techniques are used to convert Web pages into searchable formats, and because the Web consists almost entirely of such unstructured text, IE techniques are indispensable. The days of adding facts and knowledge from Web pages by hand are long gone; there's just way too much data today.

To take one example, syntactic parsing (mentioned previously) is used by Watson to identify the syntactic or grammatical structure of sentences in English. Parsing sentences is well explored territory in NLP, yet it remains an “unsolved” problem in language research. This is due mainly to the inherent ambiguity of language.

Noam Chomsky and many others who have performed seminal work on parsing natural language pointed out the basic problem long ago: words and sentences can mean more than one thing, depending on context. For instance, “time flies like an arrow” generates an ambiguous parse because of the polysemy, or “many meanings,” of its constituent words, which profoundly affects the interpretation of the whole sentence. (In the example, is “flies” a verb or a noun? If a verb, the sentence means one thing. If a noun, quite another. Try it!)
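
To make the ambiguity concrete, here is a small sketch that counts the parse trees of the example sentence with the standard CKY chart-parsing algorithm. The toy grammar and lexicon are assumptions written for this one sentence; they have nothing to do with the English Slot Grammar that Watson actually uses.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an assumption for illustration only).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "N", "N"),   # noun compound: "time flies" read as a kind of fly
    ("VP", "V", "NP"),
    ("VP", "V", "PP"),
    ("PP", "P", "NP"),
]

# Each word lists every category it may take; the ambiguity lives here.
LEXICON = {
    "time": ["N", "NP", "V"],
    "flies": ["N", "NP", "V"],
    "like": ["V", "P"],
    "an": ["Det"],
    "arrow": ["N"],
}


def count_parses(words, root="S"):
    """Count distinct parse trees rooted in `root` using a CKY chart."""
    n = len(words)
    # chart[(i, j)][cat] = number of trees of category cat over words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for cat in LEXICON[word]:
            chart[(i, i + 1)][cat] += 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # every split point of the span
                for parent, left, right in BINARY_RULES:
                    ways = chart[(i, k)][left] * chart[(k, j)][right]
                    if ways:
                        chart[(i, j)][parent] += ways
    return chart[(0, n)][root]


print(count_parses("time flies like an arrow".split()))  # 2
```

The two trees correspond to the two readings in the text: “flies” as the verb (time passes the way an arrow does) and “flies” as a noun (insects called “time flies” are fond of an arrow).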

This brings us back to the “society of mind.” How accurate does a syntactic parse need to be when it is embedded in an end-to-end system like Watson? The Watson system will consider many different processing results and combine thousands, even millions, of items of evidence to arrive at a final answer. Given that the system explores so many options, is a slightly inaccurate parse on one of them really a deal-breaker?

According to Ferrucci, in discussing the Watson project in the special edition of IBM Research Technical Reports, there may be (persistent) unavoidable error rates even in state-of-the-art parsers, but it doesn't matter. A complete system can factor in error rates of individual components of the DeepQA pipeline. And the Watson approach—DeepQA, AdaptWatson, and so on—may in the end yield better real world results than the traditional strategy of researchers, obsessed with reducing those error rates to zero (a task that has proved inherently difficult, again, due to the intrinsic properties of natural language). Watson may not “understand” English, but it nonetheless solves problems that stupefied earlier, state-of-the-art systems. Moreover, its success suggests that NLP progress is not always about optimizing specific results but rather optimizing entire, smartly developed systems.

Given that the conversion of Web content into usable data is so important to Watson, we'll take an even closer look at its IE capabilities in DeepQA.

8. The Primary Information Extraction Components of DeepQA

We've already seen the major IE components in the DeepQA pipeline: the English Slot Grammar (a syntactic parser), a predicate-argument structure (PAS) builder, and a named entity recognizer (NER) (for semantic types like Person). Add to these two more components: a co-reference resolution system (resolving pronouns like “he” to their intended referent, like “John Smith”), and the relation extraction system (“authorOf,” and so on). We can briefly review the features of these subsystems in this section, as some understanding of their purpose is necessary to understanding the overall language analysis capabilities of DeepQA. (This is the most technical part of the article, which can be skimmed by readers less interested in the “guts” of the Watson system.)

8.1. Parsing – English Slot Grammar Parses for “Shallow” plus “Deep” Structure (Oh My)

As noted earlier, parsing natural language is a hard problem because language is inherently variable, ambiguous, and polysemous (“many meanings”). Yet NLP systems often rely on some form of parsing in order to expose the basic structure of language: where are the verbs? The nouns? And how are they related in the sentence? Knowing that a particular word is a verb, for instance, is the first step in constructing an interpretation of a sentence that expresses an action, like running, or hitting, or what have you (the action is expressed by the verb).

The English Slot Grammar (ESG) used by Watson exposes this syntactic structure of English sentences. But it also exposes a “deep” structure that reveals logical connections between syntactic parts of the sentence. The logical structure in DeepQA is referred to as the predicate-argument structure, or PAS, which we mentioned briefly in the section on Question Analysis, but will here explain in a bit more detail.

Consider the following simple example: Mary gave John a book. In ESG, the surface (syntactic) structure of the sentence is this complicated expression:
Mary subj (Subject)
gave -
John iobject (Indirect Object)
a ndet (Noun Determiner)
book object (Direct Object)

We note here that the verb “gave” has no explicit “tag,” known as a “slot” for the ESG parser. The interpretation of the sentence “Mary gave John a book” by Watson's ESG parser is that the action of the sentence is “gave,” and that this action acts upon the other objects (John, book). In the purely syntactic sense, these “slots” for receiving the action consist of common grammatical roles, like object or indirect object (familiar to most grammar school students).

There is a semantic interpretation of this sentence, too, however, and this constitutes the deep level of an ESG parse. For instance, our example sentence could be interpreted logically (ignore details) as
(∃e)( ∃y)(book(y) · give1(e, Mary, y, John)), where e describes the event (gave), and y describes a book (some book), and Mary is the actor (the “giver”), and John is the receiver.

The key here is to note that now “John” and “Mary” are not just syntactic “slots” but have actual semantic meanings or roles, like “giver” and “receiver” with respect to the action.

Both the surface level syntactic analysis as well as the deeper logical structure of English sentences are generated by Watson's ESG parses in DeepQA. DeepQA can then exploit the surface and deep analysis of textual components in its pipeline, from question analysis to knowledge extraction of text resources (like Wikipedia), to the analysis of candidate answers. NLP, or more specifically IE, begins with the ESG parses in Watson.

8.2. The Predicate-Argument Structure Builder

After ESG, a predicate-argument structure (PAS) is built from the constituents of the parse. Returning to our original example, “He was a bank clerk in the Yukon before he published ‘Songs of a Sourdough’ in 1907” (an actual Jeopardy! question), Watson's PAS builder takes as input the surface and deep analyses from the ESG parser (just discussed), and produces a simplified logical “structure” for the sentences in each text. The structure is known as a predicate-argument structure, and it neatly captures a lot of information that can be used by Watson (and other NLP systems). Variations in sentence syntax are mapped to this common, simplified form, and this facilitates pattern matching on the PAS structures from a range of candidates, which will now match in a many-to-one fashion. For example, the PAS builder might produce the following for our example:
publish(e1, he, “Songs of a Sourdough”)
in(e2, e1, 1907)

These two PAS structures indicate quite a bit about the meaning of the original English sentence: that the publishing event, e1, occurred in 1907, and that the person who published was a “he” and that the name of the publication was “Songs of a Sourdough.”
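
The many-to-one mapping described above can be sketched as follows. The input format (a dictionary of slot labels with a "voice" key and a "by_agent" slot) is a hypothetical stand-in for a real parser's output, not the ESG format.

```python
from typing import NamedTuple


class PAS(NamedTuple):
    """A simplified predicate-argument structure: predicate(agent, theme)."""
    predicate: str
    agent: str
    theme: str


def build_pas(parse):
    """Map a slot-labelled parse (hypothetical format) to a PAS tuple."""
    if parse["voice"] == "passive":
        # "X was published by Y": the surface subject is the logical object.
        return PAS(parse["verb"], parse["by_agent"], parse["subj"])
    return PAS(parse["verb"], parse["subj"], parse["obj"])


# "He published Songs of a Sourdough"
active = {"voice": "active", "verb": "publish",
          "subj": "he", "obj": "Songs of a Sourdough"}
# "Songs of a Sourdough was published by him" (pronoun already normalized)
passive = {"voice": "passive", "verb": "publish",
           "subj": "Songs of a Sourdough", "by_agent": "he"}

print(build_pas(active) == build_pas(passive))  # True
```

Because both surface forms collapse to the same tuple, a single pattern is enough to match either phrasing in a text source.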

Watson's PAS results are useful in creating patterns for matching against extracted knowledge in its repositories. And yet, more processing is required. For instance, the pronoun “he” occurs twice in the question: do both occurrences refer to the same person, or to two different people? Resolving this issue, known as “reference ambiguity,” is the job of the co-reference resolution system in DeepQA. We will briefly review Watson's co-reference resolution capabilities next.

Co-reference resolution is a well-explored task in NLP that seeks to identify references introduced in natural language by, for instance, pronouns. Nominal, or noun, co-reference here is paradigmatic. In this case, nouns (often proper nouns) are introduced into a text (discourse), and then referred back to using pronouns such as “he,” “she,” “they” (in the plural case), or “it” (in the non-gendered case). The proper noun is known as the “referent” or the “anchor.” (When the referent occurs prior in the discourse, co-reference resolution is known as anaphora resolution. Less frequently, co-reference can refer to subsequent mentions. The problem is then known as cataphora resolution.)

Most referents are already mentioned somewhere in the text; in some cases, however, the referents are not in the text but are part of the broader, assumed knowledge of the reader (“the economy” and so on). Co-reference resolution in its full scope is a notoriously difficult problem in NLP, as at the limit it can require extensive knowledge about the broader world, and an appreciation of the context of discourse. Watson, it should be noted, does not solve this general problem, but rather tackles an important subset of it for use in the DeepQA pipeline—for use, that is, in playing the game of Jeopardy!

Returning to our example, Watson's co-reference resolution in DeepQA resolves the two occurrences of “he” in the question as referring to the same entity: a male person. (Note that this is also the focus in the question, so resolving the pronoun “he” to a proper noun or person name would also yield the answer to the question. Hurray!) The co-reference step is important in the processing of DeepQA, because, clearly, if the two occurrences of “he” in the question referred to two different persons, this would complicate the generation of candidate answers (which person is the answer to the question, if any?). We turn next to another semantic component important to Watson's DeepQA pipeline, called Named Entity Recognition, or NER.

8.3. Named Entity Recognition

Watson's NER system analyzes the results of the PAS builder to identify the semantic types of any entities mentioned in the text. In the example, the NER component would label “Yukon” as a geopolitical entity, and “Songs of a Sourdough” as a composition. NER, even more so than co-reference resolution, has received extensive attention from the NLP community and in the field of IE specifically. Named entity recognition typically relies on a previously determined set of types, or “tags.” Common tags are Person, Location, Organization, Date, and Time. Resolution of semantic types is often a prerequisite for additional processing, such as determining whether a particular entity mentioned in a sentence can play the role of agent (e.g., is the entity of a type such that it can “give” a book to another person?). The results of NER simplify the final component in Watson's DeepQA pipeline, relation extraction.
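
A deliberately simple sketch of type tagging is shown below. It uses a hand-written lookup table (a “gazetteer”) plus a regular expression for years; the table contents and the tag names are assumptions for illustration, and Watson's actual NER component is far more sophisticated.

```python
import re

# Hypothetical gazetteer: phrase -> semantic type. A real NER component
# would rely on much richer resources and statistical models.
GAZETTEER = {
    "yukon": "GeopoliticalEntity",
    "songs of a sourdough": "Composition",
}

YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def tag_entities(text):
    """Return (surface string, type) pairs in order of appearance."""
    found = []
    lowered = text.lower()
    for phrase, label in GAZETTEER.items():
        start = lowered.find(phrase)
        if start != -1:
            found.append((start, text[start:start + len(phrase)], label))
    for match in YEAR.finditer(text):
        found.append((match.start(), match.group(), "Date"))
    return [(surface, label) for _, surface, label in sorted(found)]


question = ("He was a bank clerk in the Yukon before he published "
            "'Songs of a Sourdough' in 1907")
print(tag_entities(question))
# [('Yukon', 'GeopoliticalEntity'), ('Songs of a Sourdough', 'Composition'), ('1907', 'Date')]
```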

Relation extraction is the task of finding relations between semantic types identified in text by NER. Relation extraction typically occurs in the scope of individual sentences (known as “intrasentential extraction”). The relations themselves are often predetermined, given a particular extraction task (e.g., “authorOf” when this relation is important for finding answers to Jeopardy! questions). The relations, once extracted from text, can then be mapped into a database (known sometimes as an “ontology”) of relations and terms, enabling them to be used for further logical analysis, such as generalizing on the terms to superclasses (“authorOf” might be a subclass of, say, “creatorOf” in some knowledge base expressing taxonomic relations between relations and their terms).
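
The generalization to superclasses mentioned above might look like the following sketch. The taxonomy is hypothetical: “authorOf” and “creatorOf” come from the text, while “composerOf” is added only to give the hierarchy a second branch.

```python
# Hypothetical taxonomy: each relation points to its direct superclass.
SUPER_RELATION = {
    "authorOf": "creatorOf",
    "composerOf": "creatorOf",  # invented for illustration
    "creatorOf": None,
}


def generalizations(relation):
    """Return the relation followed by all of its ancestors."""
    chain = []
    while relation is not None:
        chain.append(relation)
        relation = SUPER_RELATION.get(relation)
    return chain


def satisfies(query_relation, fact_relation):
    """A fact satisfies a query if its relation equals or specializes it."""
    return query_relation in generalizations(fact_relation)


print(generalizations("authorOf"))          # ['authorOf', 'creatorOf']
print(satisfies("creatorOf", "authorOf"))   # True
print(satisfies("authorOf", "creatorOf"))   # False
```

The asymmetry in the last two lines is the point: every author is a creator, so an “authorOf” fact can answer a “creatorOf” query, but not the other way around.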

Watson's DeepQA uses a large set of relations taken primarily from extant, and popular, knowledge bases such as the Wikipedia-inspired DBPedia, mentioned previously. DBPedia consists of “triples” which express binary relations, that is, relations with two arguments (e.g., authorOf(James Joyce, “Ulysses”)). These binary relations are automatically extracted from Wikipedia pages by Watson and comprise a large and growing set of facts about the people, places, and things discussed in Wikipedia.

Infoboxes, the (now well-known) tables of facts that accompany Wikipedia pages, for instance, are harvested automatically by an IE system applied to Wikipedia pages, and the extracted facts populate the DBPedia Knowledge Base. Watson makes use of these existing DBPedia relations in DeepQA, which is sensible because much of the information used by Watson to answer Jeopardy! questions comes from online sources like Wikipedia (more on this later). The structured resources like DBPedia can then be exploited without the need to manage separate efforts developing knowledge resources for use by the system.

In our example Jeopardy! question, we've already seen the relations Watson (or rather Watson's DeepQA) will extract from textual sources:
authorOf(focus, “Songs of a Sourdough”)
temporalLink(publish(…), 1907)

Such relations, and the so-called “triples” constructed using them (one relation with two arguments), provide knowledge “patterns” for DeepQA to use when matching questions to candidate answers. This is the primary reason for the IE pipeline in DeepQA: once a structure such as authorOf(focus, “Songs of a Sourdough”) has been extracted from a question, the task of finding a candidate answer reduces to the task of finding a suitable pattern that matches.

These patterns are now known in terms of their “meaning.” The first argument of the relation “authorOf” takes an entity of type Person (types, as noted earlier, are resolved by NER). Moreover, given that the argument of type Person is also the focus of the question, any person that satisfies the pattern matching the writing of “Songs of a Sourdough” in the year 1907 (and having location “Yukon,” additionally, in the particular question), will be a candidate answer to this question (and, it seems, quite a good one).
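
A minimal sketch of this kind of pattern matching follows, assuming a tiny in-memory list of triples and a type table. Both are invented for illustration (the relation name “publishedIn” in particular is hypothetical) and do not reflect how PRISMATIC or DBPedia are actually stored or queried.

```python
# Hypothetical mini knowledge base of (relation, arg1, arg2) triples.
TRIPLES = [
    ("authorOf", "Robert W. Service", "Songs of a Sourdough"),
    ("authorOf", "James Joyce", "Ulysses"),
    ("publishedIn", "Songs of a Sourdough", "1907"),
]

# Semantic types, as an NER step might assign them.
ENTITY_TYPES = {
    "Robert W. Service": "Person",
    "James Joyce": "Person",
    "Songs of a Sourdough": "Composition",
}

FOCUS = object()  # placeholder for the unknown the question asks about


def candidates(pattern, focus_type):
    """Return every entity that can fill the focus slot of the pattern."""
    relation, *args = pattern
    found = []
    for fact_relation, *fact_args in TRIPLES:
        if fact_relation != relation:
            continue
        binding = None
        for wanted, actual in zip(args, fact_args):
            if wanted is FOCUS:
                binding = actual      # this value would fill the focus
            elif wanted != actual:
                break                 # a known argument disagrees
        else:
            if binding is not None and ENTITY_TYPES.get(binding) == focus_type:
                found.append(binding)
    return found


print(candidates(("authorOf", FOCUS, "Songs of a Sourdough"), "Person"))
# ['Robert W. Service']
```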

This admittedly technical description of the core IE components of DeepQA should make clear, nonetheless, the basic task of each component, and how these tasks fit together to reduce the complexity of the question analysis, search, and candidate answer tasks in Watson's DeepQA. This is how the “NLP” of Watson works, and it repays some examination of the details of the system.

While it is outside the scope of this article to delve further into the intricacies of how the Watson team facilitated open-domain QA with the DeepQA system, the flavor of the NLP approach in Watson should by now be clear enough. We'll summarize this section, then, with a big picture wrap-up: (A) both questions and knowledge resources are “run through” the same set of IE components in the DeepQA pipeline, and (B) the primary purpose of this pipeline is to produce semantic structures, such as in the examples given above, that provide patterns to match: question patterns to answer patterns. This is, in essence, how Watson uses NLP to play Jeopardy!

Up to now, we have yet to describe the process by which Watson moves from question analysis to final “best guess” answers. What we have described is how the IE components are used in DeepQA, generally, to transform Watson's input into a pattern-matchable output. In what follows, we'll take a look at DeepQA from a different vantage: how it uses all this NLP horsepower to translate a question into multiple candidate answer searches, and how it then scores a final answer (and decides whether to buzz in). More technical details follow, but with the hope that a more complete understanding of Watson will emerge. To this task we turn next.

9. Watson Matches Questions to Answers

Watson needs a rich set of textual resources, analyzed by the information extraction components, and available for search and analysis in the system, to have a prayer of playing real-time Jeopardy! We just saw how it can analyze these text resources using its IE components in DeepQA, but exactly how are these components put together into a game-playing system? How does it all work? That is the subject of this section.

First, consider that the game of Jeopardy! is particularly challenging as a QA problem, inasmuch as there is no set of topics that can be pre-determined to apply to Jeopardy!: the game ranges over a very broad set of topics that can't really be known in advance. We do know that Jeopardy! questions have general interest to viewers, and so the Jeopardy! topics will tend to involve factoids and other generally accessible knowledge about the world. This makes online encyclopedia resources like Wikipedia particularly relevant.

For instance, the Watson team early on analyzed the percentage of questions (from their stock of prior games in Jeopardy!) whose answers are Wikipedia titles, and found that all but about 4.5% of answers, given a set of 3,500 randomly selected questions, were Wikipedia titles. Of those answers that were not Wikipedia titles, some were conjuncts (multiple entities like “Washington,” “Jefferson,” and “Adams”) and others were synthesized or arbitrarily constructed answers or puzzles, typical of the game of Jeopardy! At any rate, given that over 95% of Jeopardy! answers are likely to be titles of Wikipedia pages, the Wikipedia dataset is clearly front-and-center for Watson. (As we'll see, this all helps us understand how it can perform such “magic” with natural language.)

In addition to Wikipedia, other resources “fill out” Watson's knowledge resources, including Wiktionary, Wikiquote, multiple editions of the Bible, and popular books from sites such as Project Gutenberg. The initial testing (using the AdaptWatson methodology mentioned earlier) revealed that although many questions were answerable given this large set of open-domain knowledge resources, many failures to answer Jeopardy! questions resulted from a dearth of onboard knowledge on tap. In other words, even given all of these resources, the Watson team learned that Watson needed to “know” more about the world to play Jeopardy!—more than was ever required for previous QA challenges, like the earlier PIQUANT system and the TREC competitions. Big Data, you might say, applies to Watson as well.

The Watson team's discovery that more knowledge was required led to a second, or expansion, phase that included resources such as additional (online) encyclopedias and other fact-based textual collections, as well as additional dictionaries and quotations culled from available online sources. The final version of Watson incorporated all of these disparate textual resources, all analyzed and stored in DeepQA.

So, suitable information extraction provides the needed knowledge to Watson. But how does the system use it? Here's how. DeepQA executes a search, after it analyzes a question for patterns (see above), against the knowledge repositories, looking for any and all partial or full matches to the question “patterns” from question analysis. It is important here to keep in mind the distinction between unstructured and structured information sources again. While structured sources like knowledge bases typically yield high precision when there is a “hit,” the recall, or coverage, is often poor, resulting in a trade-off between precision (accuracy) and recall (coverage): high precision answers do not cover a very large percentage of the questions posed to the system. This is exactly what the Watson team discovered.

Recognition of the precision versus recall trade-off in Watson led to the use of both structured and unstructured sources in DeepQA (another “society of mind” trick, if you will). Structured resources were analyzed (i.e., by the IE pipeline), and the syntactic and semantic results were stored in a custom knowledge base known as PRISMATIC. Unstructured information, however, was also stored (indexed) as text or Web pages, and the search strategies used by DeepQA retrieve both pages, as in a traditional search query, and structures in PRISMATIC, depending on the results of question analysis. This enables the system to get very accurate answers from the knowledge (data) base when available, and greater coverage of candidate answers from the indexed Web pages.

Let's look at the unstructured case first. With unstructured data, DeepQA uses two separate open-source search engines to retrieve Web pages from its onboard indices: the popular Lucene as well as the Indri system. The team found that using two separate search engines resulted in greater overall accuracy than attempting to optimize one engine (again, “society of mind”). Both engines were used to retrieve whole documents that have high relevance to a particular question, and from the documents to retrieve highly relevant passages for further analysis.
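
The text does not say how the two engines' results were combined. Purely as an illustration of why two ranked lists can beat one, here is a sketch using reciprocal rank fusion, a generic merging technique that is an assumption on our part and not the DeepQA method. The result lists are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical top results from two different engines for the same query.
engine_a = ["The Sting", "Butch Cassidy and the Sundance Kid", "Paul Newman"]
engine_b = ["Robert Redford", "The Sting", "Paul Newman"]

fused = reciprocal_rank_fusion([engine_a, engine_b])
print(fused[0])  # The Sting
```

A document that both engines rank reasonably well rises above one that only a single engine ranks first, which is one intuition behind letting several components “weigh in.”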

Passage analysis, in general, was an extremely important part of the DeepQA search and candidate answer generation strategy. In general, an unstructured search for, say, a whole document with a title-match given the results of some question analysis, starts by extracting the most relevant parts or passages of the document to the question. It then analyzes the relevant (highly scored) passages to look for specific candidate answers. Note here, again, that even given the result of a highly relevant candidate answer, the final “best guess” by Watson is delayed until all other candidates are returned, additional evidence for good candidates is found and analyzed, and a final score is assigned to the candidates. This process will be discussed in more detail in upcoming sections. For the moment, we're looking more closely at how whole documents and, in particular, relevant passages of documents are used by DeepQA to find a fit or match to analyzed questions. Let's take a more direct look at search, now.

Search in DeepQA: Finding Candidate Answers in Structured and Unstructured Content

The overall goal of search in DeepQA is to retrieve as much relevant content as possible; recall is the initial goal, rather than precision (remember: we first have to have the answer in the candidate answers to have a prayer of winning). Search is about queries. Querying structured and unstructured knowledge (extracted content, as explained) in DeepQA is called primary search. Primary search is what Watson does after it finishes analyzing a question. In what follows, we'll discuss the basics of primary search in DeepQA. Additional searches performed by DeepQA are discussed later, in the broader discussion of searching for supporting or additional evidence given the identification of candidate answers after primary search.

As with human players, a Jeopardy! question must first be “understood” somehow before more complicated searches and other strategies to find plausible answers are performed. Importantly, however, DeepQA does not presuppose that the question has been analyzed correctly or completely. Mistakes in the analysis of a question (such as a misidentification of a LAT or a focus, say) might degrade overall performance, but because the system seeks evidence from far-ranging sources and puts off final scoring of answers, incomplete or inaccurate information about the question need not propagate downstream to the point that a failure at the question analysis stage guarantees a wrong answer.

Watson is, to some degree, “fault-tolerant” in this relevant sense: some amount of error need not steer it to the wrong answer. Watson's approach is thus robust to certain “dumb” errors typical of QA or NLP systems before it, historically. We turn next to Watson's treatment of all-important search, the next phase after question analysis.

In primary search, the goal is to optimize recall (coverage): find as much relevant content as possible. In DeepQA, the operative goal was set at 85% binary recall for the top 250 candidate answers. “Binary recall” is the percent of questions for which Watson's search returns a correct answer—even if that answer has not yet been identified in the results. Hence, 85% binary recall for the top 250 answer candidates means that in fully 85% of the total questions considered by the system, the correct answer appears in the top 250 ranked candidates.
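
The definition of binary recall above translates directly into a few lines of code. The candidate lists and gold answers below are toy data invented for the example, and k is set to 2 rather than 250 so the effect is visible.

```python
def binary_recall_at_k(ranked_candidates, gold_answers, k=250):
    """Fraction of questions whose correct answer is among the top k candidates."""
    hits = sum(
        1
        for candidates, gold in zip(ranked_candidates, gold_answers)
        if gold in candidates[:k]
    )
    return hits / len(gold_answers)


# One ranked candidate list per question (toy data).
ranked = [
    ["Jack London", "Robert W. Service"],
    ["The Sting", "Butch Cassidy and the Sundance Kid"],
    ["Dubliners", "Finnegans Wake"],
    ["Yukon", "Alaska"],
]
gold = ["Robert W. Service", "The Sting", "Ulysses", "Yukon"]

print(binary_recall_at_k(ranked, gold, k=2))  # 0.75
print(binary_recall_at_k(ranked, gold, k=1))  # 0.5
```

Note that the first question counts as a hit at k=2 even though the correct answer is ranked second: binary recall only asks whether the answer is present, not whether it is on top.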

As mentioned, search queries for candidate answers are generated from the analysis of a question. For each question, a query is generated that exploits syntactic and semantic features of the question discovered during question analysis, such as entities (from NER), relations (from relation extraction), and question-specific types such as the focus or LATs. Importantly, arguments to relations involving the focus are considered more salient in the context of the query. For instance, consider the Jeopardy! question:
MOVIE-“ING”: Robert Redford and Paul Newman starred in this depression-era grafter flick. (Answer: “The Sting”)

Question analysis by DeepQA reveals that the focus is “flick,” which is the head of the noun-phrase “this depression-era grafter flick.” There are no co-referential terms in the question, so the only plausible LAT is the focus itself. Relation detection generates two instances:
actorIn(Robert Redford, flick: focus)
actorIn(Paul Newman, flick: focus)

Given these results from question analysis, search query generation proceeds as follows. First, arguments to any relations having focus terms are given additional weight (the focus is more important!). For the example above, both relations extracted from the question contain the focus term: “flick.” This results in a query with numerical weights assigned to terms, reflecting the expected additional importance of terms closely related to the focus term:

(2.0 “Robert Redford”) (2.0 “Paul Newman”) star depression era grafter (1.5 flick)
Here, the numerical weights double the importance of the two actors in the query, while the focus term itself receives 1.5 times the importance relative to the other terms in the query (the terms “star depression era grafter”).

Additionally, if the LAT has modifiers, as is the case with the example question here (“depression era grafter” modifies “flick”), then a separate query is generated containing only the LAT and its modifier terms: depression era grafter flick. This additional query reflects the observation, discovered in testing DeepQA on Jeopardy! questions, that a query consisting only of the LAT and its modifiers can sometimes uniquely identify an answer (intuitively, “depression era grafter flick” may occur with the title of the movie, which is the answer, in Web resources such as the Internet Movie Database (IMDB), a Wikipedia page about the movie, or other resources).
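
The two queries described above can be generated mechanically from the question analysis. In the sketch below, the weights 2.0 and 1.5 come from the example in the text; the dictionary layout of the analysis result and the function name are assumptions made for illustration.

```python
def build_queries(analysis):
    """Return (weighted query, LAT-plus-modifiers query) as strings."""
    focus = analysis["focus"]
    # Arguments of relations that involve the focus get the highest weight.
    boosted = []
    for _relation, args in analysis["relations"]:
        if focus in args:
            boosted += [a for a in args if a != focus and a not in boosted]
    parts = [f'(2.0 "{term}")' for term in boosted]
    parts += analysis["other_terms"]
    parts.append(f"(1.5 {focus})")
    weighted = " ".join(parts)
    lat_query = " ".join(analysis["lat_modifiers"] + [analysis["lat"]])
    return weighted, lat_query


# Hypothetical output of question analysis for the MOVIE-"ING" clue.
analysis = {
    "focus": "flick",
    "lat": "flick",
    "lat_modifiers": ["depression", "era", "grafter"],
    "relations": [
        ("actorIn", ["Robert Redford", "flick"]),
        ("actorIn", ["Paul Newman", "flick"]),
    ],
    "other_terms": ["star", "depression", "era", "grafter"],
}

weighted, lat_only = build_queries(analysis)
print(weighted)
# (2.0 "Robert Redford") (2.0 "Paul Newman") star depression era grafter (1.5 flick)
print(lat_only)
# depression era grafter flick
```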

An important point to take away from Watson's search strategy is how tailored it is becoming to the actual nuts and bolts of Jeopardy! We'll return to this theme in the final sections. Next, we turn to the unstructured (Web page) search strategy used by Watson.

Unstructured Search Strategies: Document and Passage Search

As mentioned, DeepQA uses not one but two separate search engines for unstructured text search, each with different strengths and weaknesses. The popular open-source Lucene search engine is used to perform full document search in a first phase, then to extract relevant passages from retrieved documents in a second. Lucene's Document Search was modified by the Watson team to optimize for the QA requirements of the Jeopardy! game. For instance, three query-independent features were discovered that apply to individual sentences in documents searched after a Jeopardy! question: sentence offset (sentences that appear closer to the beginning of a document are more likely to be relevant), sentence length (longer ones are more likely to be relevant), and the number of named entities (the more named entities, the more likely the sentence containing them is relevant to a query). These query-independent scores are combined with other similarity scores in Lucene to create a modified (optimized) version of Lucene for use in DeepQA.
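To make the three sentence-level features concrete, here is a minimal sketch of how they might be folded into a single prior and mixed with a similarity score. The formulas, weights, and function names are assumptions chosen for illustration; the source does not say how the Watson team actually combined these features inside Lucene.

```python
import math


def sentence_prior(offset, length, num_entities, weights=(1.0, 0.5, 0.5)):
    """Score a sentence from query-independent features only.

    offset:       position of the sentence in its document (0 = first)
    length:       number of tokens in the sentence
    num_entities: number of named entities found in the sentence
    The weights are illustrative, not tuned values.
    """
    w_offset, w_length, w_entities = weights
    offset_score = 1.0 / (1.0 + offset)        # earlier sentences score higher
    length_score = math.log1p(length)          # longer sentences score higher
    entity_score = math.log1p(num_entities)    # more entities score higher
    return (w_offset * offset_score
            + w_length * length_score
            + w_entities * entity_score)


def combined_score(similarity, prior, mix=0.8):
    """Blend a query-dependent similarity score with the prior."""
    return mix * similarity + (1.0 - mix) * prior


early = sentence_prior(offset=0, length=20, num_entities=3)
late = sentence_prior(offset=12, length=20, num_entities=3)
print(early > late)  # an early sentence outranks an otherwise identical late one
```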

Passage-specific search was primarily performed by the open-source Indri search engine. Unlike Lucene, Indri searches can be tailored specifically to document passages, making the first phase of the Lucene search strategy unnecessary. However, the Watson team discovered that the combination of Lucene and Indri search yielded the best results overall. (Again, we see the “society of minds” theme, where many different components, subsystems, and approaches and strategies all “weigh in” on plausible answers, permeating the design and development of DeepQA and Watson throughout.) Further details can be found in the IBM Journal of Research and Development special issue on the Watson system. We turn now to the structured, knowledge base searches performed by Watson, including those against PRISMATIC.

Structured Search: Answer Lookup and PRISMATIC

In addition to unstructured document and passage queries using Lucene and Indri, DeepQA also performs structured data search. It uses a “lookup” strategy against existing structured knowledge bases such as the IMDB (Internet Movie Database), DBPedia, and the open-source ontology YAGO (Yet Another Great Ontology), and it also queries the Watson-generated PRISMATIC Knowledge Base (KB). It is noteworthy here that Watson's search strategy is primarily tailored to unstructured content (just discussed). In contrast to traditional NLP and QA, structured queries were found to lack adequate recall for the Watson system. As the Watson team put it: “problem of translating natural language into a machine-understandable form has proven too difficult to do reliably.”

This is an accurate, yet still astounding statement by the Watson team, because it admits that the major research direction of historical (and still ongoing) NLP and QA research is in effect tackling language problems that are simply too hard to reduce to automated techniques. Notably, however, Watson achieves a deep NLP result in open-domain QA without recourse to the traditional techniques and strategies—that of producing a logical interpretation of the question that enables an SQL-like or fully-structured query against a KB for a definite answer. In what follows, however, we'll look at the fully structured queries that Watson does perform, which is a small—yet important—part of the overall search strategy in the system. (Again: while they are not broad coverage for the range of Jeopardy! questions, when they match, they are high accuracy.)

Structured queries can be reduced to “look-up” whenever question analysis produces a structure (say, a relation) with a single argument spot that constitutes the answer. For instance, in the “flick” example discussed above, the name of the movie (substituting for the focus term “flick”) would, if found, produce a match to the relation, and also provide the answer to the question: “The Sting.” In this case, a structured query against, say, the IMDB results in a high precision candidate answer; there is no need, in such cases, to perform more exhaustive and open-ended unstructured queries. This process is called “Answer Lookup” in Watson.
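Answer Lookup can be pictured as a join over relation triples. The sketch below uses a tiny hand-made table rather than the real IMDB, and the relation name `setInEra` and the function `answer_lookup` are hypothetical; the point is only that once every relation has a single open slot, the answer falls out of intersecting the matches.

```python
# Toy triples of the form (relation, known argument, value of the open slot).
# Hand-written for illustration; this is not data pulled from the IMDB.
TRIPLES = [
    ("actorIn", "Robert Redford", "The Sting"),
    ("actorIn", "Paul Newman", "The Sting"),
    ("actorIn", "Robert Redford", "Butch Cassidy and the Sundance Kid"),
    ("actorIn", "Paul Newman", "Butch Cassidy and the Sundance Kid"),
    ("actorIn", "Paul Newman", "Cool Hand Luke"),
    ("setInEra", "Great Depression", "The Sting"),
    ("setInEra", "Old West", "Butch Cassidy and the Sundance Kid"),
]


def answer_lookup(triples, constraints):
    """Return the values that fill the open slot of every constraint.

    constraints: list of (relation, known argument) pairs whose remaining
    slot is the focus of the question.
    """
    candidate_sets = []
    for relation, known_arg in constraints:
        matches = {value for rel, arg, value in triples
                   if rel == relation and arg == known_arg}
        candidate_sets.append(matches)
    return set.intersection(*candidate_sets) if candidate_sets else set()


constraints = [("actorIn", "Robert Redford"),
               ("actorIn", "Paul Newman"),
               ("setInEra", "Great Depression")]
print(answer_lookup(TRIPLES, constraints))
```

Note that the two `actorIn` relations alone leave two films standing in this toy table, which is why the era constraint matters here.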

In addition to Answer Lookup, Watson also can query its onboard knowledge repository PRISMATIC, populated by automatically extracting relations, entities, and other structures from its unstructured resources (using the IE components in DeepQA, as explained previously). According to the IBM team, PRISMATIC contains “shallow semantic information” culled from analyzing massive amounts of free text. An example of a “shallow” semantic relation is the “instance of” or “is a” relation found in many taxonomies and knowledge bases with classes as well as instances of classes. Consider the following question:

“Unlike most sea animals, in the Sea Horse this pair of sense organs can move independently of one another.”

In this example, the unstructured document or passage search in Watson does not retrieve the answer, notably because the answer is relatively obscure given the words in the question. This type of question in Jeopardy! is rare, but it occurs with enough frequency to make certain structured query strategies a practical necessity in the system: having no answer to such questions would produce a noticeable and serious degradation in performance. PRISMATIC search uses the LAT, “sense organ,” and retrieves instances of sense organs from PRISMATIC. The 20 most popular sense organs are retrieved as candidate answers for the query.

Of these, the third most popular answer, eyes, is in fact the correct answer. Given that “sense organs” and the answer “eyes” may not co-occur in documents with much frequency (if at all), Watson's unstructured document or passage search strategy is clearly unsuitable for this question.

By contrast, executing a structured search against PRISMATIC given the LAT “sense organs” reveals the answer, even when such semantic connections (“sense organs,” “eyes”) do not naturally occur in text documents. Hence, the combination of unstructured and structured resources and the accompanying queries tailored to such resources result in a greater overall recall for Watson. This multitudinous approach to open-domain QA is itself an innovation of Watson or, rather, of the Watson team.
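A lookup of this kind amounts to ranking the known instances of a type by how often the “is a” relation was extracted for them. The sketch below assumes a simple dictionary of extraction counts; the counts are invented, arranged only so that “eyes” lands in third place as in the example, and `top_instances` is a hypothetical name rather than a PRISMATIC API.

```python
from collections import Counter

# Invented extraction counts for (instance, type) pairs.
ISA_COUNTS = {
    ("nose", "sense organ"): 50,
    ("ears", "sense organ"): 45,
    ("eyes", "sense organ"): 40,
    ("tongue", "sense organ"): 30,
    ("skin", "sense organ"): 20,
    ("heart", "organ"): 90,
}


def top_instances(isa_counts, lat, n=20):
    """Return up to n instances of the given type, most frequent first."""
    counts = Counter({instance: count
                      for (instance, type_name), count in isa_counts.items()
                      if type_name == lat})
    return [instance for instance, _ in counts.most_common(n)]


candidates = top_instances(ISA_COUNTS, "sense organ", n=20)
print(candidates)
```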

Candidate Answer Generation

After the retrieval of search results, Watson must now sift through the results to identify candidate answers. This is a separate task in the DeepQA pipeline, occurring after the results from search. For candidates retrieved from structured content, typically the structured results themselves are also the candidate answers. By contrast, results retrieved from unstructured queries must be further processed and analyzed by Watson (a passage from a Web page, for instance, must be further analyzed).

In the unstructured case, however, the dominance of title-oriented documents like Wikipedia pages in candidate answer generation substantially reduces the complexity of this part of the pipeline. As noted already, the Watson team ran tests on candidate answers and discovered that fully 95% of all answers in Jeopardy! match, specifically, Wikipedia titles. This means of course that Watson can perform very well at the game of Jeopardy! if only it can successfully analyze questions to reveal queries that return the correct Wikipedia titles as answers. In the remaining 5% of cases not covered by Wikipedia titles, structured queries and other approaches such as passage search can fill in missing items of information. This result is somewhat surprising and perhaps a bit ironic, as the huge complexity of the open-domain QA task in the game of Jeopardy! has noticeable—perhaps surprising—simplifications. In fact, Wikipedia titles carry much of the burden of the QA task for Watson, and this fact was discovered, and then built into the system, by the Watson team.

Still, Watson must identify the correct titles in millions of Wikipedia pages in order to produce correct answers using this strategy. Also, some questions cannot be answered by analysis of Wikipedia titles, necessitating PRISMATIC and other structured search and analyses. In general, the Watson team found in an analysis of 429 questions that roughly 75% of answer failures resulted from search failures: to execute the correct search, multiple analyses must be accurate enough to generate queries that can retrieve actual answers. Yet the complexity even of one-sentence questions in the game of Jeopardy! presents significant NLP challenges to Watson.

One notable difficulty discovered by the Watson team involves the inclusion of extraneous or “extra” terms in the question text, which have no relevance to the answer and serve only to generate queries with irrelevant terms, making for pointless searches that return irrelevant results. For example, in the question “Star chef Mario Batali lays on the lardo, which comes from the back of this animal's neck,” only the part about the lardo and the back of the animal's neck is relevant to answering the question; the other text, including “Mario Batali,” is completely irrelevant. In question analysis, however, Named Entity Recognition along with the ESG+PAS parse results, and relation extraction, would result in queries containing these terms.

Such irrelevancies are particularly hard to fix or guard against, because common nouns or verbs are typically less relevant to search results in QA—proper nouns, like “Mario Batali,” are more often relevant to answering questions. Hence, such cases represent anomalies that at present have no easy fix for the Watson team, and no NLP strategies present themselves as complete solutions, yet.

Hypothesis and Evidence Scoring

After an initial set of candidate answers is generated, a follow-up process of collecting additional confirming or disconfirming evidence for candidates begins. This process is highly parallelized, and can generate thousands of numerical scores attached to various pieces of evidence culled from Watson's extensive knowledge resources. All of these scores must somehow be combined to yield a final rank of candidate answers; this process is discussed in the upcoming section. Here, we take a closer look at this additional evidence gathering and scoring process in Watson.

Soft filtering is used to narrow the initial (potentially huge) set of candidate answers, so that computational resources are not expended on highly implausible candidates. However, some candidates might appear implausible initially, but given the accumulation of additional evidence may surface as highly likely answers, so the process of filtering initial candidates must use strategies that are themselves unlikely to throw out potential candidates, discovered downstream in the DeepQA pipeline.

For instance, a soft filtering technique used in Watson might compute the likelihood that a candidate answer is an instance of the LAT (computed from question analysis). If a candidate answer is not an instance of the LAT, it isn't a good candidate for answering the question, and so, given a reasonable confidence that the LAT was correctly identified in the first place, non-matching candidates can safely be filtered. This is a broad brush, first phase of filtering used by Watson.
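A soft filter of this kind can be sketched as a lenient threshold on a type-match likelihood. Everything below is an assumption made for illustration: the scores are invented, and `toy_type_likelihood` stands in for whatever type-checking component actually estimates whether a candidate is an instance of the LAT.

```python
# Invented likelihoods that a candidate is an instance of the LAT.
TOY_TYPE_SCORES = {
    ("The Sting", "flick"): 0.95,
    ("Chicago", "flick"): 0.30,       # ambiguous: a city, but also a film
    ("Robert Redford", "flick"): 0.02,
}


def toy_type_likelihood(candidate, lat):
    """Stand-in for a real type checker; unknown pairs score 0.0."""
    return TOY_TYPE_SCORES.get((candidate, lat), 0.0)


def soft_filter(candidates, lat, type_likelihood, threshold=0.2):
    """Keep candidates whose type likelihood clears a deliberately low bar.

    The threshold is kept low so that plausible but uncertain candidates
    survive to the later evidence-gathering stages.
    """
    return [c for c in candidates if type_likelihood(c, lat) >= threshold]


kept = soft_filter(["The Sting", "Robert Redford", "Chicago"],
                   "flick", toy_type_likelihood)
print(kept)
```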

Once soft filtering is applied to initial candidates, Watson searches for evidence for the candidates in order to either further confirm, or disconfirm, them as final answers. This is known as evidence retrieval, and involves scores of techniques that are either general in nature (given, say, QA broadly considered) or in some way specific to the game of Jeopardy! (A detailed discussion of evidence retrieval is outside the scope of this article, but an example might be to search for additional passages in free text that corroborate the initial candidates.)

After evidence retrieval, a complex set of machine learning techniques is applied to the candidates and the retrieved evidence to calculate the likelihood that given candidates are answers to the question. The scoring system, like the evidence retrieval system, is complex and multifaceted, and thus outside the scope of this essay. Still, we can illustrate scoring with an example.

Given the question “He was presidentially pardoned on September 8, 1974” and a retrieved passage “Ford pardoned Nixon on Sept. 8, 1974,” Watson's scorers might measure (a) the weighted word counts between question and passage, (b) the lengths of longest subsequences in common, or (c) the degree of match between the logical forms (LFs) of question and passage. Each scorer computes a separate score and adds it to the overall scoring for a candidate answer. When this process is completed, a final score is computed from the individual scores. Final scoring is discussed next.
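Scorers (a) and (b) are simple enough to sketch directly. The versions below are our own minimal implementations, not Watson's: the tokenizer is a crude regular expression, term weights default to 1.0 unless supplied, and the subsequence scorer works on whole tokens. Scorer (c), logical-form matching, needs a parser and is left out.

```python
import re


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_overlap(question, passage, weights=None, default=1.0):
    """Scorer (a): sum the weights of terms shared by question and passage."""
    weights = weights or {}
    shared = set(tokens(question)) & set(tokens(passage))
    return sum(weights.get(term, default) for term in shared)


def lcs_length(a, b):
    """Scorer (b): length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


question = "He was presidentially pardoned on September 8, 1974"
passage = "Ford pardoned Nixon on Sept. 8, 1974"
print(weighted_overlap(question, passage))             # shared: pardoned, on, 8, 1974
print(lcs_length(tokens(question), tokens(passage)))   # same four tokens, in order
```

Notice that neither scorer matches “September” with “Sept.”; closing gaps like that is one reason many different scorers are run side by side.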

Answer the Question! Ranking Candidate Answers in Watson

In Watson, most of the DeepQA processing pipeline is dedicated to (a) analyzing an incoming question, (b) executing queries against (analyzed or unanalyzed) knowledge resources, like Web pages such as Wikipedia, or IMDB, Wiktionary, and others, (c) identifying candidate answers based on matching features of the question to features of the query results (say, a passage in a Web page that fills in a part of an extracted relation), and (d) seeking additional evidence to support or reject the candidate answers by executing follow-up queries based on initial results.

At the end of this complicated process, executed in massively parallel fashion on the UIMA software framework used by Watson, all of this analysis is brought together, and Watson assigns final scores to the candidate answers, given the Jeopardy! question. The top-ranked answer is then either provided by Watson in the game (Watson “buzzes in” to answer the question), or, if Watson's confidence in the final answer is not high enough, the system passes on the question.

The top scoring answer, then, must be high enough to qualify, especially given that wrong answers result in loss of money in the game of Jeopardy! In other words, Watson must in some sense “know what it does not know,” and use this information to make intelligent game decisions.

In what follows, we'll take a look at the final stage in DeepQA processing, where Watson receives a final score for its generated candidate answers. We'll then cover, briefly, the game-specific implementation details of Watson that enable it to play live games of Jeopardy! in real-time. We'll conclude with some broader considerations of the status of Watson as an NLP system, and even more broadly, as a contribution to the ongoing project of building truly intelligent machines: Artificial Intelligence, or AI.

Final Merging and Ranking

There are, potentially, hundreds of thousands of individual scores generated by Watson's DeepQA pipeline for what the Watson team calls its hypotheses (same as candidate answers). Some of these hypotheses are different only in surface form, but are logically or semantically equivalent at a deeper level. Scoring semantically equivalent hypotheses separately would waste computational resources, and so the initial step in final scoring is to merge such candidates together. Various techniques involving, for instance, co-reference resolution (“he” is the same as “John Smith”) are used to perform the initial merging step.
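The merging step can be illustrated with a much simpler stand-in than co-reference resolution: plain string normalization plus an alias table. The function names, the alias table, and the choice to keep the highest score among merged variants are all assumptions of this sketch; the source does not say how Watson combines the scores of merged hypotheses.

```python
def canonical_form(surface, aliases=None):
    """Map a surface form to a canonical key (illustrative rules only)."""
    aliases = aliases or {}
    text = " ".join(surface.lower().split())   # lowercase, collapse spaces
    if text.startswith("the "):
        text = text[4:]                        # drop a leading article
    return aliases.get(text, text)


def merge_hypotheses(scored, aliases=None):
    """Merge equivalent hypotheses, keeping the best score for each key."""
    merged = {}
    for surface, score in scored:
        key = canonical_form(surface, aliases)
        merged[key] = max(score, merged.get(key, float("-inf")))
    return merged


scored = [("The Sting", 0.71), ("the sting", 0.64),
          ("Sting, The", 0.40), ("Cool Hand Luke", 0.22)]
aliases = {"sting, the": "sting"}   # hypothetical alias entry
merged = merge_hypotheses(scored, aliases)
print(merged)
```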

After candidates have been merged, Watson uses a machine learning approach. Again, the details of how machine learning works are outside the scope of this article, but in what follows we can discuss in broad outline how Watson makes use of a learning approach for scoring hypotheses. First, the machine learner is trained on a large set of questions with known answers (this is called the training set). The output of the machine learning algorithm, after it is trained, is a numerical score representing the confidence Watson should have in each candidate answer, given a question.

Machine learning is a sensible (probably necessary) strategy for combining evidence in the DeepQA system, inasmuch as manually writing rules to provide relative weights for hundreds of thousands of bits of evidence would be unfeasible. Rather, a confidence estimation framework was developed by the Watson team, facilitating the training, testing, and deploying of machine learning models (i.e., algorithms) that can efficiently and accurately handle the disparate evidence generated by the DeepQA pipeline. In other words, with hundreds of thousands of hypotheses, Watson uses a learning approach to find the best ones.

The machine learning system for final ranking takes as input a set of candidate answers along with their evidence scores. The model trained generates both (1) a rank (ordinal) order for each candidate passing the soft filter, and (2) a score representing the confidence that each rank is correct (i.e., a number representing what it knows, and another number representing how confident it is that it knows what it thinks it does). Most of the learned scores measure the “fit” of the answer to the question, but there are other scores that measure aspects of the question (e.g., the LAT).

Initial training consisted of about 25,000 Jeopardy! questions making 5.7 million machine learning instances, or training examples (question-answer pairs plus what are called “features” for the learner to use when training). Each instance had 550 distinct features. Although a number of powerful learning algorithms were tested, including approaches such as Support Vector Machines (SVMs), single and multilayer neural networks, decision trees and others, a technique known as regularized logistic regression was used by the final production system. As the Watson team discovered, the technique, while simple, nonetheless delivered consistently good performance on Jeopardy! (and that's what counts).
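For readers who have not met regularized logistic regression, the following is a minimal, dependency-free sketch of the idea: learn one weight per feature, penalize large weights, and squash the weighted sum into a confidence between 0 and 1. The two features, the six training rows, and the hyperparameters are invented for illustration and bear no relation to Watson's 550 features or its actual training setup.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, l2=0.1, lr=0.5, epochs=500):
    """Batch gradient descent on log-loss plus an L2 penalty on the weights."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [l2 * wj for wj in w]   # gradient of (l2 / 2) * ||w||^2
        grad_b = 0.0                     # the bias is not regularized
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def confidence(w, b, x):
    """Estimated probability that the candidate with features x is correct."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Invented features per candidate: [type_match, passage_score].
X = [[1.0, 0.9], [1.0, 0.8], [1.0, 0.6], [0.0, 0.3], [0.0, 0.2], [0.0, 0.1]]
y = [1, 1, 1, 0, 0, 0]   # 1 = candidate was the correct answer
w, b = train_logreg(X, y)
print(confidence(w, b, [1.0, 0.9]), confidence(w, b, [0.0, 0.1]))
```

The appeal of the method is visible even at this scale: the model is a handful of numbers, training is cheap, and the output is directly usable as a confidence.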

The learning framework for final ranking in Watson defined seven separate phases of learning, where each phase modified some aspect of the learning, re-running the learning algorithm on the modified result. The phased system increased the computational resource requirements for Watson, but significantly increased the overall accuracy of the final answer ranking. While an exact description of each phase is outside the scope of this essay, some example phases include so-called “Hitlist” normalization (ranking and retaining the top n answers for additional processing), answer merging (creating a canonical form for equivalent answers), and transfer learning (transferring the result of one phase of learning to another, in order to highlight certain features that may bear importantly on the learning problem but occur too rarely by themselves).

After many iterations of training, testing, and tweaking Watson's “confidence estimation” framework using the learning approach, the system was improved from about 70% precision@70 to the goal of 85% precision@70 (remember, this means 85% precision on 70% attempted). This consistent performance by Watson not only qualified it to play Jeopardy! with human champions but also guaranteed that the system would, in general, outperform the best human players at the game. Watson, in short, became a superhuman contestant on the popular gameshow. And what we've described, in some (perhaps excruciating) detail, is exactly how it—or rather the Watson team—accomplished this considerable feat.
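The precision@70 metric is easy to compute once every question has a confidence and a right/wrong label: sort by confidence, attempt the top 70%, and measure accuracy on just those. The ten results below are invented for illustration, and `precision_at` is our own helper name.

```python
def precision_at(results, attempted_fraction=0.70):
    """Precision over the most confident fraction of questions.

    results: list of (confidence, is_correct) pairs, one per question.
    """
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    k = round(len(ranked) * attempted_fraction)
    if k == 0:
        return 0.0
    attempted = ranked[:k]
    return sum(1 for _, correct in attempted if correct) / k


# Invented results for ten questions.
results = [(0.95, True), (0.90, True), (0.85, True), (0.80, False),
           (0.75, True), (0.70, True), (0.65, True), (0.40, False),
           (0.30, True), (0.10, False)]
print(precision_at(results))   # 6 correct out of the 7 attempted
```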

In the final technical section of this essay, we'll take a quick look at the components of Watson that enabled it to play real-time Jeopardy! against human opponents. While the DeepQA pipeline is the true “engine” of Watson enabling it to accurately—and quickly—find answers to Jeopardy! questions, the computer system needs to perform Jeopardy!-specific functions in order to actually play the game as a contestant. Among these functions, importantly, are Watson's capability to make intelligent, strategic wagers during gameplay. These features of Watson are thus important to end-to-end system performance, even if they are not part of the scientific advances made by the Watson team on, specifically, the open-domain QA task. We turn next to Watson's strategic “betting” system components that make it such a formidable contestant at Jeopardy!

How Watson Wagers: Strategy in Jeopardy!

There are four types of strategy decisions in the game of Jeopardy!: (1) wagering on a Daily Double, (2) wagering during Final Jeopardy!, (3) selecting the next square (question) when in control of the board, and (4) deciding whether to answer or “buzz in.” Strategy profoundly affects the outcome of Jeopardy! games; poor wagers on Daily Doubles or in particular on Final Jeopardy! can result in a loss, even if a player has been dominating the board up to that point. As such, Watson must not only perform well on the core task of analyzing questions and finding correct answers, but it must also have computational strategies for wagering and, in general, playing the game of Jeopardy!

The buzz-in decision is the mainstay of Jeopardy! play, as any Jeopardy! fan knows. After a clue is read by the host, each contestant has five seconds to answer. A light goes on, enabling the buzzers for each contestant, and the five second count begins. Prior to the light, the buzzers are disabled. Watson does not visually read the board as do human contestants; the system receives the clue via computer text before the buzzers are enabled but after the clue is read, so that information delivery coincides with that for the human contestants. It thus has no advantage, nor disadvantage, in this regard.

For Watson, as with human contestants, the question must be analyzed (understood), and a decision must be made within five seconds whether to buzz in with an answer, or to pass on the question. This is why such attention was paid to confidence estimation in developing the DeepQA framework. Watson must “know” whether its best answer is good enough to wager, since an incorrect answer will result in Watson losing the dollar amount for the clue. (No answer leaves a contestant's dollar amount unchanged, while the correct answer—assuming it is the first one buzzed in during the allowable time—results in that contestant earning the dollar amount of the clue.) Watson must receive not only a ranked list of answers from DeepQA, but also a confidence estimation for each answer representing the numerical likelihood that the answer is correct. If that score falls below some (empirically determined) threshold, Watson will not buzz in.
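The buzz-in rule reduces to a threshold test on the top answer's confidence. The sketch below also shows why a threshold near 0.5 is a natural starting point if one looks only at the clue's dollar value: a right answer gains the value and a wrong one loses it, so the expected gain is positive only above 50% confidence. This break-even view is our simplification; as noted, Watson's actual threshold was determined empirically, and the function names here are hypothetical.

```python
def expected_gain(confidence, clue_value):
    """Expected dollar change from buzzing in, ignoring game context."""
    return confidence * clue_value - (1.0 - confidence) * clue_value


def should_buzz(ranked_answers, threshold):
    """Return the top answer if its confidence clears the threshold, else None.

    ranked_answers: list of (answer, confidence), best answer first.
    """
    if not ranked_answers:
        return None
    answer, confidence = ranked_answers[0]
    return answer if confidence >= threshold else None


print(should_buzz([("The Sting", 0.92), ("Cool Hand Luke", 0.05)], 0.5))
print(should_buzz([("eyes", 0.31)], 0.5))   # too uncertain: pass
print(expected_gain(0.75, 800))
```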

In Final Jeopardy! contestants must wager a dollar amount less than or equal to their total earnings up to that point in the game. Final Jeopardy! wagering has received attention from game theorists, and a number of heuristics have been suggested over the years to optimize betting strategies in Final Jeopardy! given the dollar amounts of the three contestants. Since contestants do not know the clue before betting, very little subjective estimation of a priori confidence in the clue is possible. Each contestant is given the category for the clue (e.g., “American presidents”), and so an assessment of their strength in that category can be made.

From extensive analysis of archived Jeopardy! games (from the “J! Archive” Website), the Watson team discovered that contestants have about a 50% chance of answering a Final Jeopardy! (FJ) question correctly. Champions or Grand Champions have respectively about a 60% or 66% chance of answering correctly. It was discovered that the most important factor in FJ wagering is what's called “score positioning,” which is to say whether a contestant is in first, second, or third place going into FJ wagering. On the basis of this empirical observation, what are known as stochastic-process models were developed by the Watson team that optimize the chances of Watson's success (final dollar amount outcome conditioned on the other two contestants) given its score position in the game going into FJ betting.
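Why score position matters so much can be seen by enumerating the eight right/wrong outcomes of a Final Jeopardy! round. The sketch below is a simplified exact calculation, not the Watson team's stochastic-process model: it assumes the three contestants answer independently, that ties are split evenly, and that opponents' wagers are known. The scores and wagers are invented; the accuracies reuse the rough figures quoted above.

```python
from itertools import product


def win_probability(scores, wagers, accuracies, player=0):
    """Chance that `player` finishes first, over all right/wrong outcomes."""
    total = 0.0
    for outcome in product([True, False], repeat=len(scores)):
        prob = 1.0
        finals = []
        for score, wager, accuracy, right in zip(scores, wagers,
                                                 accuracies, outcome):
            prob *= accuracy if right else 1.0 - accuracy
            finals.append(score + wager if right else score - wager)
        best = max(finals)
        winners = [i for i, final in enumerate(finals) if final == best]
        if player in winners:
            total += prob / len(winners)   # assumption: ties split evenly
    return total


scores = (20000, 12000, 5000)      # invented scores going into FJ
accuracies = (0.66, 0.60, 0.50)    # leader, second, third
# Opponents wager everything; compare two wagers for the leader.
timid = win_probability(scores, (0, 12000, 5000), accuracies)
cover = win_probability(scores, (4001, 12000, 5000), accuracies)
print(timid, cover)
```

In this toy position, wagering nothing leaves the leader exposed whenever the second-place player doubles up, while a wager just large enough to cover that doubled score roughly doubles the leader's chance of winning.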

In Daily Double betting, a cell on the board (the Daily Double cell) is hidden until a player selects it. There is one Daily Double cell in the first round of play, and two Daily Double cells in the second round. When a contestant selects the Daily Double (DD) cell, he or she does not compete for the “buzzer” with other contestants but must provide an answer to the clue regardless of the contestant's confidence in the answer. He or she must also wager a portion of earnings: the minimum bet is $5, while the maximum is the greater of the contestant's current score or the highest clue value on the board in that round.

DD betting is different from FJ betting in that the game does not end with a DD bet, so the strategy requires forecasting how a bet will affect the long-term outcome of the game. As with FJ strategy, the Watson team developed Watson's DD betting strategy from close inspection of thousands of actual Jeopardy! games taken from the J! Archive website. The mean accuracy for human contestants was about 64%, while Champions won about 75% of DD bets, and Grand Champions fully 80.5%. This set became the baseline for developing a stochastic betting model for Watson.

To optimize Watson's DD strategy, a simulation model was created to run multiple iterations (simulations) of betting between two (simulated) contestants and Watson. This type of stochastic optimization is known as Monte Carlo modeling, and can be an extremely powerful technique for finding optimal solutions when various factors remain hidden or unknown. With Monte Carlo modeling, the simulations are run over and over again with different starting points, until an optimal solution begins to emerge from the data.

Technically, the Watson team trained a Game State Evaluator (GSE) over millions of such simulated games, using a “nonlinear function approximator” known as a multilayer perceptron (a neural network, basically), which estimates the likelihood that a given player will win a given game. Using these techniques, Watson would adjust its wagers based on (a) its confidence in the category, (b) its current earnings, and (c) the current earnings of the two opponents.
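The way a game-state evaluator feeds into a wager can be sketched as follows: for each candidate wager, weigh the value of the state reached after a right answer against the value of the state reached after a wrong one, and pick the wager with the best expectation. The evaluator below is a hypothetical stand-in, a logistic function of the current lead, and not the trained multilayer perceptron; the candidate wagers, the step size, and all names are likewise assumptions.

```python
import math


def toy_state_value(my_score, opp_scores, scale=5000.0):
    """Hypothetical stand-in for the GSE: win chance as a function of lead."""
    lead = my_score - max(opp_scores)
    return 1.0 / (1.0 + math.exp(-lead / scale))


def best_dd_wager(my_score, opp_scores, confidence, max_wager,
                  step=100, min_wager=5, value=toy_state_value):
    """Pick the wager that maximizes the expected state value."""
    candidates = [min_wager] + list(range(step, max_wager + 1, step))

    def expected_value(wager):
        right = value(my_score + wager, opp_scores)   # state after a hit
        wrong = value(my_score - wager, opp_scores)   # state after a miss
        return confidence * right + (1.0 - confidence) * wrong

    return max(candidates, key=expected_value)


opponents = (8000, 3000)
bold = best_dd_wager(10000, opponents, confidence=0.9, max_wager=10000)
cautious = best_dd_wager(10000, opponents, confidence=0.4, max_wager=10000)
print(bold, cautious)
```

With this crude evaluator the choice swings between the extremes; a learned evaluator that accounts for the clues remaining on the board would be expected to produce more graded wagers.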

The overall superiority of Watson's betting strategies in Jeopardy! was established by comparison with human contestants in the J! Archive data. For instance, historic win rates for humans on FJ are 65.3%, 28.2%, and 7.5% for first, second, and third place contestants going into FJ, respectively. Given Watson's Monte Carlo-Learning based optimization strategies, however, these rates—known as Watson's Best Response rates—increase to 67.0%, 34.4%, and 10.5%, respectively. This shows, definitively, that the Best Response rates developed for Watson outperform actual contestants' performances on actual Jeopardy! games, across the board (for all three places). Watson, in other words, is a superhuman contestant on Jeopardy! It not only answers more questions on Jeopardy! correctly than human contestants, but it also bets better on Daily Double and Final Jeopardy questions.

Here, then, is the Watson system. Armed with this knowledge of the actual nuts-and-bolts of the Watson system, we are now in a position to draw some more general and informed conclusions about it. In wrapping up this essay, we turn to these more general considerations.

Big Picture: Watson, NLP, and Artificial Intelligence Research

In the preceding admittedly technical discussion of the Watson system, we see a clear example of the computational treatment of natural language performing a well-defined, yet intuitively complex and difficult, language task: that of analyzing questions, and providing accurate answers in the time-constrained, gameplay environment of Jeopardy! Watson's success raises the obvious broader question: What is the significance of Watson's success to the overall goal of building the world's first truly intelligent machine?

This is a complicated question, in part because the definition of “intelligence” is rarely part of today's discussion about Artificial Intelligence (AI). Historically, systems were considered intelligent if they could perform a function that once required human intelligence. This definition makes intuitive sense, yet it includes too much: a pocket calculator can perform “cognitive” functions like arithmetic that would require human intelligence without the machine. Yet most people feel that pocket calculators are not “intelligent” in any real sense, and so these examples fail to capture what we mean. Likewise, even tasks previously thought to represent quintessential human cognitive performance, like playing chess, get excluded from consideration of “true” intelligence once AI systems demonstrate their superiority.

This back and forth about what constitutes intelligence has led some AI researchers (notably, MIT theorist Marvin Minsky, as well as futurist Ray Kurzweil) to complain that AI is engaged in the unfortunate game of constantly chasing what humans view as intelligence, where every success simply results in a new yardstick for the endeavor. There is some truth to this charge, but perhaps at a more fundamental level it signals our basic confusion over what counts as intelligence, at all. Turning the complaint on its head, we might also observe that AI has been stipulating convenient definitions of intelligence in order to demonstrate progress towards stated goals without ever tackling hard questions about the nature of minds, intelligence, and machines. This point is centrally important to the question of Watson, and so we pursue it a bit further in these concluding thoughts about the significance of Watson.

In what sense is Watson intelligent? Let's review the facts. First, Watson can play Jeopardy! in real-time contests against human contestants, adhering to the very same rules—no tricks. (Technically, this is not true, as Watson does not use a technology such as Optical Character Recognition, or OCR, to actually visually process the clues presented on the board as human contestants do. Rather, it receives an electronic version of the clue, yet in the same timeframe and according to the same rules as human contestants, giving it no advantage.) Given that Jeopardy! would seem to require a range of intelligent skills, including natural language interpretation, command of many facts about our world, and the ability to match sometimes tricky or puzzling questions to correct answers, it is hard to deny that Watson is in some sense “intelligent” in the commonsense notion of the word.

Yet, when we unpack the Watson system, we find a set of mathematical techniques, an intelligent software architecture that enables massive parallelization, and years of testing and improving by human experts. Is the intelligence in Watson, or in the Watson team? The question may seem academic or unfair until we realize that Watson, for all its sophistication, can't do anything but play Jeopardy! While it remains true that IBM has used the Watson QA framework for other domain-analysis tasks (such as health care data analysis), these “versions” of Watson were “ported over” to those domains, where the system was re-tested, re-trained, and re-developed to perform on a different, yet related, QA task.

To put it bluntly, Watson did not re-configure itself to perform another QA task in these cases, but rather the Watson team once again redeveloped a system using the architecture in order to perform computational tasks on another problem. This, though, does not seem the hallmark of intelligence. Rather, what seems to emerge is the following: computers can perform specific tasks very well, even superhumanly. At the same time, general human thinking (with goals, purposes, and sentience) seems to continually elude them. Watson here, thoroughly analyzed in this essay, seems not to give us any fundamental reason to allay our skepticism or outstrip our commonsense.

Here's a question to ponder: is there a bridge from mindless—yet superhuman—task specific performances, to actual minds that think, act, and feel? Watson, for all of its impressive success at Jeopardy!, cannot tell us. (Plausibly, what it can tell us, is that it's not part of a larger, philosophical answer here.)

This skeptical analysis rings true for many people, yet futurists and AI theorists often balk at the suggestion that AI is cold logic whereas human intelligence is something else. In AI circles it's common to distinguish between what's called “Narrow AI,” that is, AI that performs particular tasks in circumscribed domains, and “General AI” (previously Strong AI), which is supposed to display a general intelligence indistinguishable from our own. AI practitioners and their many supporters in popular culture tend to invoke a “bridge” principle: Narrow AI applications like Watson will gradually (or abruptly, to believe some) expand their capabilities until they are co-extensive with our own. Of course, the opposite might also be true: a principled difference may separate Narrow AI successes from the general intelligence of sentient beings like us. The question remains open. Watson, for its part, seems unable to tell us which of these possibilities obtains.

To make these philosophical points more sharply, and to tie them to the specifics of Watson's performance, consider the following. As described in the sections on the DeepQA pipeline, Watson's QA performance at Jeopardy! begins with an analysis of the clue, or question. In particular, a focus, as well as a Lexical Answer Type (LAT), is identified. Much of the success of downstream processing in Watson relies on the correct identification of these elements of the clue: if the question is not understood, it's intuitively clear that the answer analysis will be compromised. But in Jeopardy!, unlike in everyday conversation, the vast majority of questions are asking for factoids: 95% of answers are also Wikipedia titles.

Thus, the apparent massive natural language complexity of the game in fact reduces to a few “hooks” that the Watson team astutely exploited when developing Watson. Focus words, too, are relatively easy to identify in most cases: “this” or other demonstratives or noun references account for the majority of cases. As such, Watson's impressive performance is understandable without requiring an explanation of general natural language understanding. Watson seems much different from its predecessor, the chess-playing computer Deep Blue. And yet, in a very real sense, it's just a more complicated version of the same: IBM reducing a complicated, yet high-visibility, problem to a set of techniques that can be run on supercomputers. Impressive? Yes. A step towards actual language understanding? Hardly convincing.
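To illustrate how far a simple “hook” on demonstratives can go, here is a minimal sketch in Python. It is not Watson's actual method: the function name, the hand-picked list of boundary words, and the sample clues are all assumptions made for illustration. It treats the phrase introduced by “this” or “these” as the focus and the last word of that phrase as a rough stand-in for the LAT.

```python
import re

# Demonstratives that usually introduce the focus of a clue.
DEMONSTRATIVES = {"this", "these"}

# Hypothetical, hand-picked words that tend to end the focus phrase.
# A real system would use a parser; this list is only an assumption.
BOUNDARY = {"is", "are", "was", "were", "of", "in", "on", "at", "by",
            "for", "with", "who", "that", "which", "has", "had", "to"}


def find_focus_and_lat(clue):
    """Return (focus phrase, rough LAT) for a clue, or None if no hook is found."""
    # Split into words and single punctuation marks.
    tokens = re.findall(r"[A-Za-z0-9'-]+|[^\sA-Za-z0-9]", clue)
    for i, tok in enumerate(tokens):
        if tok.lower() not in DEMONSTRATIVES:
            continue
        phrase = []
        for nxt in tokens[i + 1:]:
            # Stop at punctuation or at a boundary word.
            if not nxt[0].isalnum() or nxt.lower() in BOUNDARY:
                break
            phrase.append(nxt)
        if phrase:
            focus = " ".join([tok] + phrase)
            lat = phrase[-1].lower()  # last word as a crude head noun
            return focus, lat
    return None


# Illustrative clues written for this sketch, not taken from the show.
print(find_focus_and_lat("In 1851 this American author of Moby-Dick also wrote Typee."))
print(find_focus_and_lat("These mammals are the only ones capable of true flight."))
print(find_focus_and_lat("Capital of France"))
```

A heuristic this crude breaks easily (for example, when a verb directly follows the focus phrase), but the fact that a few lines of pattern matching handle many clue shapes is consistent with the point above: the clue format itself does much of the work.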

What is general natural language understanding? Given Watson's vast complexity and the huge amount of effort invested in its construction, it is perhaps ironic that the answer to this question is simply the everyday exchanges and conversations humans have with each other, constantly. Here, “open domain” does not mean “over a range of topics” but rather “about anything that comes up, based on context, prior history, or any other factors considered relevant by the interlocutors.” This type of natural language performance is easy for humans to perform—we don't find it particularly cognitively challenging in most cases to carry on a conversation with a friend.

Yet, while such natural language performances are easy for us, they represent a high-water mark for computational systems, including Watson. The chess-playing supercomputer Deep Blue played excellent chess, but critics pointed out that it couldn't do anything but play chess. Chess, in contrast to conversation, seemed to require a high degree of intelligence in humans, yet the Deep Blue program could master grandmaster-level chess while having literally no conversational ability whatsoever. And while Watson plays Jeopardy! impressively, when it comes to everyday conversation it is in the same boat as Deep Blue: it can't “chit chat,” or converse with us outside the parameters of the game. Thus, while the media attention on Watson seems deserved for the virtuoso performance the Watson team managed with its system, broader speculation about Watson's intelligence at natural language understanding seems premature, and even naïve.

Where does this leave us? Early on in AI (in fact before the term was even coined), computer pioneer Alan Turing proposed his now-famous Turing Test: send questions to a human and a computer via teletype or text, and if a human judge can't tell the difference in response, we should agree that the computer is intelligent like a human. In other words, if a computer can carry on a conversation, we should grant it the status of human intelligence. Turing's Test was widely accepted as a good test of computational intelligence—indeed, the definitive benchmark for the fledgling field of AI. And yet to date, no computer has come even close to passing the Turing Test.

We can rightly celebrate the leap forward in open-domain Question Answering accomplished by the Watson team, which will surely bring other and perhaps unexpected computational capabilities to us in the months and certainly years to come. The question of whether machines can be made to “come alive” and begin taking an interest in the world, and talking to us about it, however, remains thoroughly unresolved. In particular, for all its impressive performance on Jeopardy!, the Watson system still tells us little about the bigger questions of AI.
