LMSYS Launches 'Multimodal Arena': GPT-4o Tops the Leaderboard, But AI Still Can't Beat Humans


The LMSYS organization today launched its “Multimodal Arena,” a new leaderboard comparing the performance of AI models on vision-related tasks. The arena collected over 17,000 user preference votes in over 60 languages in just two weeks, providing a glimpse into the current state of AI's visual processing capabilities.

OpenAI's GPT-4o model secured the top spot in the Multimodal Arena, with Anthropic's Claude 3.5 Sonnet and Google's Gemini 1.5 Pro following closely behind. This ranking reflects the fierce competition among tech giants to dominate the rapidly evolving field of multimodal AI.

Notably, the open-source model LLaVA-v1.6-34B achieved scores comparable to those of some proprietary models, such as Claude 3 Haiku. This development signals a potential democratization of advanced AI capabilities, one that could level the playing field for researchers and smaller companies that lack the resources of large tech companies.

The leaderboard covers a wide range of tasks, from captioning images and solving math problems to understanding documents and interpreting memes. This breadth is intended to provide a holistic view of each model's visual processing capabilities, reflecting the complex demands of real-world applications.



Reality check: AI still struggles with complex visual reasoning

While the Multimodal Arena provides valuable insights, it primarily measures user preference rather than objective accuracy. A more sobering picture emerges from the recently introduced CharXiv benchmark, developed by researchers at Princeton University to assess AI performance in understanding charts from scientific papers.

The CharXiv results reveal significant limitations of current AI capabilities. The best performing model, GPT-4o, achieved an accuracy of only 47.1%, while the best open-source model achieved only 29.2%. These scores pale in comparison to human performance of 80.5%, highlighting the significant gap that still exists in AI’s ability to interpret complex visual data.

This disparity highlights a crucial challenge in AI development: While models have made impressive progress in tasks like object recognition and simple image captioning, they still struggle with the nuanced reasoning and contextual understanding that humans effortlessly apply to visual information.

Bridging the gap: the next frontier in AI vision

The launch of the Multimodal Arena and insights from benchmarks such as CharXiv come at a pivotal time for the AI industry. As companies rush to integrate multimodal AI capabilities into products ranging from virtual assistants to autonomous vehicles, it’s increasingly important to understand the true limits of these systems.

These benchmarks serve as a reality check that tempers the often hyperbolic claims surrounding AI capabilities. They also provide a roadmap for researchers, highlighting specific areas where improvements are needed to achieve human-level visual understanding.

The gap between AI and human performance in complex visual tasks presents both a challenge and an opportunity. It suggests that significant breakthroughs in AI architecture or training methods may be needed to achieve truly robust visual intelligence. At the same time, it opens up exciting possibilities for innovation in areas such as computer vision, natural language processing and cognitive science.

As the AI community digests these findings, we can expect to see a renewed focus on developing models that can not only see the visual world, but truly understand it. The race is on to create AI systems that can match, and perhaps one day surpass, human-level understanding in even the most complex visual reasoning tasks.
