OpenAI starts creating new benchmarks that more accurately evaluate AI models across different languages and cultures

English is spoken by only about 20% of the world’s population, yet existing benchmarks for multilingual AI models are falling short. For example, MMMLU has become saturated, with top models clustering near the highest scores, and OpenAI says this makes it a poor indicator of real progress.

Additionally, existing multilingual benchmarks focus on translation and multiple-choice tasks and don’t necessarily measure how well a model understands regional context, culture, and history, OpenAI explained.

To remedy these issues, OpenAI is building new benchmarks for different languages and regions of the world, starting with India, its second-largest market. The new benchmark, IndQA, will “evaluate how well AI models understand and reason about questions that matter in Indian languages, across a wide range of cultural domains.”

There are 22 official languages in India, seven of which are spoken by at least 50 million people. IndQA includes 2,278 questions across 12 different languages and 10 cultural domains, and was created with help from 261 domain experts from the country, including journalists, linguists, scholars, artists, and industry practitioners.

The languages covered include Bengali, English, Hindi, Hinglish, Kannada, Marathi, Odia, Telugu, Gujarati, Malayalam, Punjabi, and Tamil. Hinglish is a mix of English and Hindi that OpenAI included to account for code-switching in conversations.

The cultural domains covered include Architecture & Design, Arts & Culture, Everyday Life, Food & Cuisine, History, Law & Ethics, Literature & Linguistics, Media & Entertainment, Religion & Spirituality, and Sports & Recreation.

According to OpenAI, each datapoint contains a culturally grounded prompt in one of the Indian languages, an English translation to make it auditable, rubric criteria for grading, and an expected answer from the domain experts.
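To make that structure concrete, here is a minimal Python sketch of what one such datapoint might look like in code. The class names, field names, per-criterion weights, and the `rubric_score` helper are all assumptions made for illustration; OpenAI's actual schema and grading procedure are not described in this article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricCriterion:
    """One grading criterion. The weight is an assumption, not from the article."""
    description: str
    weight: float = 1.0


@dataclass
class IndQADatapoint:
    """Hypothetical record mirroring the four parts the article lists."""
    prompt: str                    # culturally grounded prompt in an Indian language
    english_translation: str       # included so the item can be audited
    rubric: List[RubricCriterion]  # criteria used for grading
    expected_answer: str           # answer written by domain experts
    language: str = ""             # e.g. "Hindi" (illustrative extra field)
    domain: str = ""               # e.g. "Food & Cuisine" (illustrative extra field)


def rubric_score(datapoint: IndQADatapoint, criteria_met: List[bool]) -> float:
    """Return the fraction of rubric weight satisfied.

    This weighted-fraction rule is an assumed scoring scheme for illustration,
    not a description of how IndQA responses are actually graded.
    """
    if len(criteria_met) != len(datapoint.rubric):
        raise ValueError("need exactly one judgment per rubric criterion")
    total = sum(c.weight for c in datapoint.rubric)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c, met in zip(datapoint.rubric, criteria_met) if met)
    return earned / total
```

In this sketch, a grader (human or model) would decide which criteria a response satisfies and pass those judgments to `rubric_score`; the expected answer and English translation are there for reference and review rather than for exact-match comparison.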

OpenAI says that it plans to create similar benchmarks for other regions of the world, using IndQA as inspiration.

“IndQA style questions are especially valuable in languages or cultural domains that are poorly covered by existing AI benchmarks. Creating similar benchmarks to IndQA can help AI research labs learn more about languages and domains models struggle with today, and provide a north star for improvements in the future,” the company wrote in a blog post.
