Gemini's data analytics capabilities are not as good as Google claims

One of the selling points of Google's flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, is the amount of data they can reportedly process and analyze. In press conferences and demos, Google has repeatedly claimed that its models can accomplish previously impossible tasks thanks to their “long context,” such as summarizing multiple hundred-page documents or searching through scenes in movie footage.

But new research suggests that the models aren't actually all that good at it.

Two different studies examined how well Google's Gemini models and other models make sense of a huge amount of data — think the length of “War and Peace.” Both find that Gemini 1.5 Pro and 1.5 Flash struggle to correctly answer questions about large data sets; in a series of document-based tests, the models provided the correct answer only 40% of the time.

“While models like Gemini 1.5 Pro can technically handle long contexts, we've seen many cases indicating that the models don't really 'understand' the content,” Marzena Karpinska, a postdoc at UMass Amherst and co-author of one of the studies, told JS.

Gemini's context window falls short

A model’s context, or context window, refers to input data (e.g., text) that the model considers before generating output (e.g., additional text). A simple question — “Who won the 2020 U.S. presidential election?” — can serve as context, as can a movie script, show, or audio clip. And as context windows grow, so does the size of the documents they can fit inside.

The latest versions of Gemini can handle more than 2 million tokens as context. (“Tokens” are subdivided pieces of raw data, such as the syllables “fan,” “tas” and “tic” in the word “fantastic.”) That's equivalent to about 1.4 million words, two hours of video or 22 hours of audio — the largest context of any commercially available model.
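For a sense of scale, those figures can be sanity-checked with back-of-envelope arithmetic. The words-per-token and words-per-page ratios below are common rules of thumb for English prose, not Google's official numbers:

```python
# Rough estimate of what a 2-million-token context window holds.
# Assumptions: ~0.7 words per token for English text, ~500 words per page.
context_tokens = 2_000_000
words = context_tokens * 0.7      # ≈ 1.4 million words
pages = words / 500               # ≈ 2,800 pages

print(f"~{words:,.0f} words, ~{pages:,.0f} pages")
```

At ~0.7 words per token, 2 million tokens lands right around the 1.4 million words the marketing cites; the exact ratio varies with the tokenizer and the content.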

At a briefing earlier this year, Google showed off several pre-recorded demos intended to illustrate the potential of Gemini's long-context capabilities. One had Gemini 1.5 Pro searching the transcript of the Apollo 11 moon landing telecast — some 402 pages long — for quotes containing jokes, then finding a scene in the broadcast footage that resembled a pencil sketch.

Oriol Vinyals, vice president of research at Google DeepMind and leader of the briefing, described the model as “magical.”

“[1.5 Pro] performs these sorts of reasoning tasks across every single page, every single word,” he said.

That may have been an exaggeration.

In one of the aforementioned studies benchmarking these capabilities, Karpinska and researchers at the Allen Institute for AI and Princeton asked the models to evaluate true/false statements about fiction books written in English. The researchers chose recent works so that the models couldn't “cheat” by relying on prior knowledge, and they peppered the statements with references to specific details and plot points that would be impossible to understand without reading the books in their entirety.

Given a statement like “Using her abilities as Apoth, Nusis is able to reverse engineer the type of portal opened by the key to the reagents found in Rona's wooden chest,” Gemini 1.5 Pro and 1.5 Flash — having ingested the relevant book — had to indicate whether the statement was true or false and explain their reasoning.

Image credits: UMass Amherst

Tested on a book of approximately 260,000 words (~520 pages), 1.5 Pro answered the true/false statements correctly 46.7% of the time, while Flash managed only 20%. That means a coin flip would outperform both models at answering questions about the book. Averaged across all benchmark results, neither model achieved better-than-chance question-answering accuracy.
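The coin-flip comparison can be made concrete with a quick simulation. The accuracies below are the ones reported in the study; the assumption that true/false labels are balanced (making 50% the chance baseline) is mine:

```python
import random

# Simulate a coin-flip baseline on a balanced true/false benchmark.
random.seed(0)
n = 100_000
labels = [random.choice([True, False]) for _ in range(n)]
coin = [random.choice([True, False]) for _ in range(n)]
baseline = sum(g == l for g, l in zip(coin, labels)) / n

# Accuracies reported in the UMass Amherst-led study.
reported = {"Gemini 1.5 Pro": 0.467, "Gemini 1.5 Flash": 0.20}
for model, acc in reported.items():
    print(f"{model}: {acc:.1%} vs. coin flip at {baseline:.1%}")
```

With balanced labels the coin converges on 50%, above both reported scores; if the benchmark's labels were skewed, the chance baseline would differ.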

“We noticed that the models have more difficulty verifying claims that require larger portions of the book, or even the entire book, compared to claims that can be resolved by retrieving sentence-level evidence,” Karpinska said. “Qualitatively, we also noticed that the models have difficulty verifying claims about implicit information that is obvious to a human reader but not explicit in the text.”

The second of the two studies, co-conducted by researchers at UC Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to “reason over” videos — that is, to search through them and answer questions about their content.

The co-authors created a dataset of images (e.g., a photo of a birthday cake) paired with questions for the model to answer about the objects depicted in them (e.g., “What cartoon character is on this cake?”). To evaluate the model, they picked one of the images at random and inserted “distractor” images before and after it to create slideshow-like footage.

Flash didn't perform as well. In a test where the model had to transcribe six handwritten digits from a “slideshow” of 25 images, Flash got about 50% of the transcriptions right. Accuracy dropped to about 30% with eight digits.
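A minimal sketch of this distractor setup might look like the following; the file names and helper function are illustrative, not the researchers' actual code:

```python
import random

def build_slideshow(target, distractors, length=25):
    """Hide one target image among randomly chosen distractors."""
    frames = random.sample(distractors, length - 1)
    frames.insert(random.randrange(length), target)
    return frames

# Illustrative file names, not from the study's dataset.
distractors = [f"distractor_{i}.jpg" for i in range(100)]
slideshow = build_slideshow("birthday_cake.jpg", distractors)

# The model would then be asked a question about the hidden target,
# e.g. "What cartoon character is on this cake?"
assert "birthday_cake.jpg" in slideshow and len(slideshow) == 25
```

The test isolates retrieval from reasoning: the model must first locate the relevant frame before it can answer anything about it, which is exactly where Flash's accuracy dropped.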

“On real question-answering tasks over images, it seems to be particularly hard for all the models we tested,” Michael Saxon, a PhD student at UC Santa Barbara and one of the study's co-authors, told JS. “That small amount of reasoning — recognizing that a number is in a frame and reading it — could be what's breaking the model.”

Google overpromises with Gemini

Neither of the two studies has been peer-reviewed, and neither examined the Gemini 1.5 Pro and 1.5 Flash releases with 2-million-token contexts. (Both tested the 1-million-token context versions.) And Flash isn't meant to match Pro's performance; Google advertises it as a low-cost alternative.

Still, both studies add fuel to the criticism that Google has overpromised — and underdelivered — with Gemini from the start. None of the models the researchers tested, including OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet, performed well. But Google is the only model provider to have given the context window top billing in its advertisements.

“There is nothing wrong with simply stating, 'Our model can use X number of tokens,' based on the objective technical details,” Saxon said. “But the question is: what useful things can you do with it?”

Generative AI, broadly speaking, is coming under increasing scrutiny as companies (and investors) become frustrated with the technology’s limitations.

In a pair of recent surveys from Boston Consulting Group, about half of respondents — all C-suite executives — said they don't expect generative AI to deliver substantial productivity gains and are concerned about the potential for errors and data compromises arising from generative AI-powered tools. PitchBook recently reported that early-stage generative AI dealmaking has declined for two consecutive quarters, falling 76% from its peak in Q3 2023.

Faced with meeting-summarizing chatbots that invent fictional details about people and AI search platforms that essentially amount to plagiarism generators, customers are hunting for promising differentiators. Google — which has at times clumsily raced to catch up with its generative AI rivals — has been desperate to make Gemini's context window one of those differentiators.

But the bet was premature, it seems.

“We haven’t found a way to actually show that there’s any ‘reasoning’ or ‘understanding’ happening over long documents, and basically every group that puts out these models is just cobbling together their own ad hoc evaluations to make these claims,” Karpinska said. “Without knowing how long context processing is implemented — and companies don’t share these details — it’s hard to say how realistic these claims are.”

Google did not respond to a request for comment.

Both Saxon and Karpinska believe the antidotes to overblown claims about generative AI are better benchmarks and, in the same vein, more emphasis on third-party criticism. Saxon notes that one of the most common tests for long context (which Google liberally cites in its marketing materials), “needle in a haystack,” only measures a model’s ability to extract certain information, like names and numbers, from datasets — it doesn’t answer complex questions about that information.
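A needle-in-a-haystack test of the kind Saxon describes can be sketched in a few lines; the `query_model` call, the planted fact and the filler text here are hypothetical stand-ins, not any vendor's actual harness:

```python
def make_haystack(needle, filler, n_chunks, position):
    """Plant one 'needle' sentence inside a long filler document."""
    chunks = [filler] * n_chunks
    chunks.insert(position, needle)
    return "\n".join(chunks)

needle = "The magic number is 7481."
haystack = make_haystack(needle, "Grass is green. " * 50,
                         n_chunks=1000, position=500)

# The test only checks whether the model can extract the planted fact:
# prompt = haystack + "\nWhat is the magic number?"
# answer = query_model(prompt)   # hypothetical LLM API call
# passed = "7481" in answer
```

As Saxon notes, passing this test shows retrieval, not understanding — the needle is the only sentence in the document that even mentions a number.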

“All scientists and most engineers who use these models essentially agree that our existing benchmarking culture is broken,” Saxon said, “so it's important for the public to understand that these giant reports containing numbers like 'general intelligence across benchmarks' should be taken with a massive grain of salt.”
