Independent tests

According to Artificial Analysis, its test scores for Meta’s Llama 4 Maverick and Scout are materially lower than those reported by Meta. While the models have no obvious weaknesses across general reasoning, coding and maths, they don’t outperform their popular rivals in major areas as Meta claims. Overall, Llama 4 Maverick comes third, after DeepSeek V3 and GPT-4o. Llama 4 Scout’s results put it behind Gemini 2.0 Flash-Lite and Gemma 3 27B, the opposite of what the company said. It is, however, more efficient than DeepSeek V3, and unlike DeepSeek V3 it accepts image inputs.
Context: Sunny Madra, COO/President at Groq, has acknowledged the importance of Artificial Analysis’s work.

Artificial Analysis test results

TechCrunch: The version of the model that Meta deployed in LM Arena appears to differ from the version that’s widely available to developers. TechCrunch adds that LM Arena has never been the most reliable measure of an AI model’s performance. Developers have observed that the LM Arena version uses a lot of emojis and gives long-winded answers, while the one on together.ai doesn’t have much slop.
Medium article: The article collected feedback from developers on the Llama 4 models. It pointed out that developers are disappointed that the models struggle with step-by-step reasoning tasks and lack the conversational fluency of GPT-4o. They are also disappointed that the models don’t support audio inputs, a feature that’s increasingly becoming an industry standard.
MoneyControl: Its test established that Llama 4, as integrated in Meta AI, can’t turn photos into Ghibli art, whereas GPT and Grok can. MoneyControl noted, though, that Llama 4 generates images pretty fast. GPT had added 1 million users following the launch of Ghibli-style AI art.
Llama 4 Scout underperforms Gemini 2.0 Flash in the NYT Extended Words Connection test, while Llama 4 Maverick underperforms popular models.

Julian Goldie SEO-YouTube: Llama 4 Maverick generates code that is slightly better than that of Claude 3.7 Sonnet (min 2:08). Llama 4 struggles with step-by-step reasoning, which is not the case for DeepSeek R1 (min 5:52). Also, unlike DeepSeek R1, it fails to follow coding instructions, but its Snake-game code beats that of DeepSeek R1. Gemini 2.5 is better at coding than Llama 4 Maverick (min 9:15).
Gosu-Coder-YouTube: Gemini 2.0 Flash beats Llama 4 Maverick at creating pool and connect games, even when the tester uses the model directly from the Meta Platforms website (min 2:22). The tester also had problems with the context window, hitting a limit of around 500k (min 12:23). However, the model’s speed is pretty good.
echo.hive-X: Llama 4 models perform poorly on a reasoning test.
[Dr Karminski](https://www.reddit.com/r/LocalLLaMA/comments/1jsl37d/im_incredibly_disappointed_with_llama4/)-Reddit: Llama 4 models underperform GPT-4o, Gemini Flash, Grok 3, DeepSeek V3 and Sonnet 3.5/3.7 on the Kscores benchmark.

Flavio Adamo-X: His test found that Llama 4 underperforms Gemini 2.5 Pro and GPT-4o (new) but is closer to GPT-4o (old) when it comes to coding.
Sai Nemani-X: Llama 4 Maverick ranks fourth, behind GPT-4o (new), GPT-4o, and Gemini 2.0 Flash.
Andreas Köpf-X: Llama 4 Maverick ranks fifth in reasoning, behind QwQ-32B.