At a typical annual meeting of the Association for Computational Linguistics (ACL), the program is a parade of titles like “A Structured Variational Autoencoder for Contextual Morphological Inflection.” The same technical flavor permeates the papers, the research talks, and many hallway chats.

At this year’s conference in July, though, something felt different—and it wasn’t just the virtual format. Attendees’ conversations were unusually introspective about the core methods and objectives of natural-language processing (NLP), the branch of AI focused on creating systems that analyze or generate human language. Papers in this year’s new “Theme” track asked questions like: Are current methods really enough to achieve the field’s ultimate goals? What even are those goals?

My colleagues and I at Elemental Cognition, an AI research firm based in Connecticut and New York, see the angst as justified. In fact, we believe that the field needs a transformation, not just in system design, but in a less glamorous area: evaluation.

The current NLP zeitgeist arose from half a decade of steady improvements under the standard evaluation paradigm. Systems’ ability to comprehend has generally been measured on benchmark data sets consisting of thousands of questions, each accompanied by passages containing the answer. When deep neural networks swept the field in the mid-2010s, they brought a quantum leap in performance. Subsequent rounds of work kept inching scores ever closer to 100% (or at least to parity with humans).

So researchers would publish new data sets of even trickier questions, only to see even bigger neural networks quickly post impressive scores. Much of today’s reading comprehension research entails carefully tweaking models to eke out a few more percentage points on the latest data sets. “State of the art” has practically become a proper noun: “We beat SOTA on SQuAD by 2.4 points!”

But many people in the field are growing weary of such leaderboard-chasing. What has the world really gained if a massive neural network achieves SOTA on some benchmark by a point or two? It’s not as though anyone cares about answering these questions for their own sake; winning the leaderboard is an academic exercise that may not make real-world tools any better. Indeed, many apparent improvements emerge not from general comprehension abilities, but from models’ extraordinary skill at exploiting spurious patterns in the data. Do recent “advances” really translate into helping people solve problems?

Such doubts are more than abstract fretting; whether systems are truly proficient at language comprehension has real stakes for society. Of course, “comprehension” entails a broad collection of skills. For simpler applications—such as retrieving Wikipedia factoids or assessing the sentiment in product reviews—modern methods do pretty well. But when people imagine computers that comprehend language, they envision far more sophisticated behaviors: legal tools that help people analyze their predicaments; research assistants that synthesize information from across the web; robots or game characters that carry out detailed instructions.

Today’s models are nowhere close to achieving that level of comprehension—and it’s not clear that yet another SOTA paper will bring the field any closer.

How did the NLP community end up with such a gap between on-paper evaluations and real-world ability? In an ACL position paper, my colleagues and I argue that in the quest to reach difficult benchmarks, evaluations have lost sight of the real targets: those sophisticated downstream applications. To borrow a line from the paper, NLP researchers have been training to become professional sprinters by “glancing around the gym and adopting any exercises that look hard.”

To bring evaluations more in line with the targets, it helps to consider what holds today’s systems back.

A human reading a passage will build a detailed representation of entities, locations, events, and their relationships—a “mental model” of the world described in the text. The reader can then fill in missing details in the model, extrapolate a scene forward or backward, or even hypothesize about counterfactual alternatives.

This sort of modeling and reasoning is precisely what automated research assistants or game characters must do—and it’s conspicuously missing from today’s systems. An NLP researcher can usually stump a state-of-the-art reading comprehension system within a few tries. One reliable technique is to probe the system’s model of the world, which can leave even the much-ballyhooed GPT-3 babbling about cycloptic blades of grass.

Imbuing automated readers with world models will require major innovations in system design, as discussed in several Theme-track submissions. But our argument is more basic: however systems are implemented, if they need to have faithful world models, then evaluations should systematically test whether they have faithful world models.

Stated so baldly, that may sound obvious, but it’s rarely done. Research groups like the Allen Institute for AI have proposed other ways to harden the evaluations, such as targeting diverse linguistic structures, asking questions that rely on multiple reasoning steps, or even just aggregating many benchmarks. Other researchers, such as Yejin Choi’s group at the University of Washington, have focused on testing common sense, which pulls in aspects of a world model. Such efforts are helpful, but they generally still focus on compiling questions that today’s systems struggle to answer.

We’re proposing a more fundamental shift: to construct more meaningful evaluations, NLP researchers should start by thoroughly specifying what a system’s world model should contain to be useful for downstream applications. We call such an account a “template of understanding.”

One particularly promising testbed for this approach is fictional stories. Original stories are information-rich, un-Googleable, and central to many applications, making them an ideal test of reading comprehension skills. Drawing on cognitive science literature about human readers, our CEO, David Ferrucci, has proposed a four-part template for testing an AI system’s ability to understand stories:

  • Spatial: Where is everything located and how is it positioned throughout the story?
  • Temporal: What events occur and when?
  • Causal: How do events lead mechanistically to other events?
  • Motivational: Why do the characters decide to take the actions they take?

By systematically asking these questions about all the entities and events in a story, NLP researchers can score systems’ comprehension in a principled way, probing for the world models that systems actually need.
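
To make the idea concrete, here is a minimal sketch of how such systematic probing might be organized in code. It is an illustration only, not Elemental Cognition's evaluation tooling: the names `Dimension`, `Probe`, `build_probes`, and `score_by_dimension` are hypothetical, the question wording is adapted loosely from the four bullets above, and the sketch assumes that a human grader has already judged each answer correct or incorrect.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Dimension(Enum):
    """The four parts of the template, each with an illustrative question pattern."""
    SPATIAL = "Where is {item} located, and how is it positioned, throughout the story?"
    TEMPORAL = "When does {item} occur or change, relative to other events?"
    CAUSAL = "What leads mechanistically to {item}, and what does it lead to?"
    MOTIVATIONAL = "Why do the characters decide to take the actions involving {item}?"


@dataclass(frozen=True)
class Probe:
    """One question about one story element (hypothetical structure)."""
    dimension: Dimension
    item: str
    question: str


def build_probes(items):
    """Ask every template question about every entity or event in the story."""
    return [
        Probe(dim, item, dim.value.format(item=item))
        for item, dim in product(items, Dimension)
    ]


def score_by_dimension(probes, judgments):
    """Return the fraction of probes judged correct, broken out by dimension.

    `judgments` maps each Probe to True or False, as decided by a grader.
    Probes with no judgment are counted as incorrect (an assumption).
    """
    totals = {dim: [0, 0] for dim in Dimension}  # dim -> [correct, asked]
    for probe in probes:
        totals[probe.dimension][1] += 1
        if judgments.get(probe, False):
            totals[probe.dimension][0] += 1
    return {
        dim.name: (correct / asked if asked else 0.0)
        for dim, (correct, asked) in totals.items()
    }


# Usage: two story elements, with a made-up grader who rejects every causal answer.
probes = build_probes(["the lost kite", "the storm"])
judgments = {p: p.dimension is not Dimension.CAUSAL for p in probes}
scores = score_by_dimension(probes, judgments)
```

Reporting a score per dimension, rather than one aggregate number, is a design choice made for this sketch: it shows which part of the world model a system lacks instead of folding that gap into a single leaderboard figure.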

It’s heartening to see the NLP community reflect on what’s missing from today’s technologies. We hope this thinking will lead to substantial investment not just in new algorithms, but in new and more rigorous ways of measuring machines’ comprehension. Such work may not make as many headlines, but we suspect that investment in this area will push the field forward at least as much as the next gargantuan model.

Jesse Dunietz is a researcher at Elemental Cognition, where he works on developing rigorous evaluations for reading comprehension systems. He is also an educational designer for MIT’s Communication Lab and a science writer.