Model Performance Benchmarking

8don MSN

Google releases Gemini 3.1 Pro: Benchmark performance, how to try it

Google says that its most advanced thinking model yet outperforms Claude and ChatGPT on Humanity's Last Exam and other key ...

InfoQ

Hugging Face Introduces Community Evals for Transparent Model Benchmarking

Hugging Face has launched Community Evals, a feature that enables benchmark datasets on the Hub to host their own leaderboards and automatically collect evaluation results from model repositories.

10don MSN

Anthropic releases Claude Sonnet 4.6: Benchmark performance, how to try it

Anthropic's latest flagship model, Claude Sonnet 4.6, is out now.

Decrypt

OpenAI Says Benchmark Used to Measure AI Coding Skill Is 'Contaminated'—Here's Why

OpenAI wants to retire the leading AI coding benchmark—and the reasons reveal a deeper problem with how the whole industry measures itself.

Google’s Latest Gemini 3.1 Pro Model Is a Benchmark Beast

Google just released its most capable Gemini 3.1 Pro AI model that beats all frontier models on Humanity's Last Exam and ...

eWeek

Small Model Benchmark Battle: Mistral Takes on Gemma 3 & GPT-4o mini

eSpeaks’ Corey Noles talks with Rob Israch, President of Tipalti, about what it means to lead with Global-First Finance and how companies can build scalable, compliant operations in an increasingly ...

insideHPC

MLCommons Releases MLPerf Inference v5.0 Benchmark Results

Today, MLCommons announced new results for its MLPerf Inference v5.0 benchmark suite, which delivers machine learning (ML) system performance benchmarking. The rorganization said the esults highlight ...

Business Wire

iAsk AI Outperforms OpenAI’s o1 Model in Comprehensive Generative AI Benchmark Test

CHICAGO--(BUSINESS WIRE)--iAsk, a Generative AI-powered answer engine designed for Gen Z, today announced that iAsk Pro, its most advanced model, has surpassed both human experts and the OpenAI o1 ...

Hosted on MSN

Yann LeCun: Meta 'fudged a little bit' when benchmark-testing Llama 4 model

Yann LeCun, Meta’s outgoing chief AI scientist, says his employer tested its latest Llama model in a way that may have made the model look better than it really was. In a recent Financial Times ...

InfoWorld

New AI benchmarking tools evaluate real world performance

Now open source, xbench uses an ever changing evaluation mechanism to look at an AI model's ability to execute real-world tasks and make it harder for model makers to train on the tests. A new AI ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results