


In 'Milestone' for Open Source, Meta Releases New Benchmark-Beating Llama 4 Models (meta.com)
It's "a milestone for Meta AI and for open source," Mark Zuckerberg said this weekend. "For the first time, the best small, mid-size, and potentially soon frontier [large-language] models will be open source."
Zuckerberg announced four new Llama LLMs in a video posted on Instagram and Facebook — two dropping this weekend, with another two on the way. "Our goal is to build the world's leading AI, open source it, and make it universally accessible so that everyone in the world benefits."
Zuckerberg's announcement: I've said for a while that I think open source AI is going to become the leading models. And with Llama 4 this is starting to happen.
- The first model is Llama 4 Scout. It is extremely fast, natively multi-modal. It has an industry-leading "nearly infinite" 10M-token context length, and is designed to run on a single GPU. [Meta's blog post says it fits on an NVIDIA H100]. It is 17 billion parameters by 16 experts, and it is by far the highest performing small model in its class.
- The second model is Llama 4 Maverick — the workhorse. It beats GPT-4o and Gemini Flash 2 on all benchmarks. It is smaller and more efficient than DeepSeek v3, but it is still comparable on text, plus it is natively multi-modal. This one is 17B parameters x 128 experts, and it is designed to run on a single host for easy inference.
This thing is a beast.
Zuck promised more news next month on "Llama 4 Reasoning" — but the fourth model will be called Llama 4 Behemoth. "This thing is massive. More than 2 trillion parameters." (A blog post from Meta AI says it has 288 billion active parameters, outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM benchmarks, and will "serve as a teacher for our new models.")
"I'm not aware of anyone training a larger model out there," Zuckberg says in his video, calling Behemoth "already the highest performing base model in the world, and it is not even done training yet."
"If you want to try Llama 4, you can use Meta AI in WhatsApp, Messenger, or Instagram Direct," Zuckberg said in his video, "or you can go to our web site at meta.ai." The Scout and Maverick models can be downloaded from llama.com and Hugging Face.
"We continue to believe that openness drives innovation," Meta AI says in their blog post, "and is good for developers, good for Meta, and good for the world." Their blog post declares it's "The beginning of a new era of natively multimodal AI innovation," calling Scout and Maverick "the best choices for adding next-generation intelligence." This is just the beginning for the Llama 4 collection. We believe that the most intelligent systems need to be capable of taking generalized actions, conversing naturally with humans, and working through challenging problems they haven't seen before. Giving Llama superpowers in these areas will lead to better products for people on our platforms and more opportunities for developers to innovate on the next big consumer and business use cases. We're continuing to research and prototype both models and products, and we'll share more about our vision at LlamaCon on April 29...
We also can't wait to see the incredible new experiences the community builds with our new Llama 4 models.
"The impressive part about Llama 4 Maverick is that with just 17B active parameters, it has scored an ELO score of 1,417 on the LMArena leaderboard," notes the tech news site Beebom. "This puts the Maverick model in the second spot, just below Gemini 2.5 Pro, and above Grok 3, GPT-4o, GPT-4.5, and more.
"It also achieves comparable results when compared to the latest DeepSeek V3 model on reasoning and coding tasks, and surprisingly, with just half the active parameters."
Mostly useless for normal users (Score:5, Insightful)
The MoE with 17B active parameters is fast, but you still need the VRAM to load it. 109B parameters for the smallest model means that most people won't be able to use it at home. No, not you, LLM Larry with eight 4090s in your "home PC"; I am talking about the rest of us with a single mid-range to high-end desktop card.
Being proud of a 2T model is also strange, because the future is small and powerful models, not just making things bigger and bigger. Remember when Llama 3 came with its 405B large model and Mistral countered it with its 123B model, announced as "large enough"?
We can only hope that Llama 4.1 will come with some distilled versions in the 8B-70B range that are actually useful as a base for open source projects. Speaking of open source, don't call the models open source: the Llama license is absolutely non-free. If you want something that is legally sound should a court ever decide that model licenses are applicable, have a look at the Mistral, DeepSeek, and Phi models with their Apache 2.0 and MIT licenses.
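As a rough sanity check on that claim, here is a back-of-the-envelope sizing sketch in Python. The parameter counts come from the announcement; the bytes-per-parameter values and the ~10% overhead factor are assumptions, and real runtimes also need room for the KV cache and activations.

```python
# Rough memory-footprint estimate for the Llama 4 models at various quantizations.
# Parameter counts are from the announcement; everything else is an assumption.
PARAMS = {
    "Scout (109B total / 17B active)": 109e9,
    "Maverick (400B total / 17B active)": 400e9,
    "Behemoth (~2T total / 288B active)": 2e12,
}

BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}  # typical, approximate

def footprint_gb(n_params: float, bytes_per_param: float, overhead: float = 1.1) -> float:
    """Weights-only estimate; the ~10% overhead is a guess and excludes KV cache."""
    return n_params * bytes_per_param * overhead / 1e9

for name, n in PARAMS.items():
    sizes = ", ".join(f"{q}: ~{footprint_gb(n, b):,.0f} GB" for q, b in BYTES_PER_PARAM.items())
    print(f"{name}: {sizes}")

# Scout comes out around ~60 GB even at Q4, which is why a single 24-32 GB
# consumer card is not enough without more aggressive quantization or offload.
```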
Re: (Score:3)
What fucking GPU is that, pray tell?
Are we talking about running it quantized down to 1b?
AFAIK, I've got basically the second-largest VRAM available in a single GPU (128GB M4 Max). The only step up is an M3 Ultra with 512GB.
The "active parameters" thing is confused by people a lot (and I wonder if it was confused by Fuckerberg here).
While it only needs to compute the active parameters, you don't know what experts are going to be run until yo
For large values of "GPU" (Score:2)
But the quality of said tokens is apparently not very good: the current crop of 27-32b models (Gemma 3 27b, Mistral Small, Qwen2.5 Coder) perform at least on a par with Llama 4's 109b(!) model, and CoT (i.e. QwQ @ 32b) cranks it up a notch from there.
Re: (Score:2)
I sadly cannot test, as I only play with hardware affordable for normal users, but I've already read on reddit that it seems to be disappointing compared to the current mid-range dense LLMs.
How does the vision part perform? I tested Gemma 3 27B the other day and gave it the image of Fry after drinking 100 coffees, and it said it is a Futurama screencap of Dr. Farnsworth in front of flames. A text-only LLM would probably know who is in the scene if I described the screencap to it.
Re: (Score:2)
Which quant of Gemma 3 27b are you using? I tried what is probably the same image (Fry with metal coffee urn, surrounded by flames) with a Q5_K_M quant, and it identified Fry just fine but thought the urn was a soda pump until I questioned it. I challenged it on the Fry identification, and it stood firm.
Re: (Score:2)
I tried it on OpenRouter. 27B is a bit too large for me and I wanted to test how good it is, and the Fry image was just some random test image that is not privacy sensitive. The first prompt gave a verbose result with a quite good scene description and the wrong character; then I used a prompt for a more concise description and it also worked, but it still claimed it was Farnsworth.
The image was the lower part of https://i.redd.it/iybvqjlpchd5... [i.redd.it] cropped from a larger combination of memes, so not identi
Re:For large values of "GPU" (Score:4, Interesting)
Alright, let's break down this image. It’s a two-panel composition, both clearly referencing the animated series Futurama.
The top panel depicts a hand – presumably belonging to a waiter or server, judging by the cuff and dark suit sleeve – pouring coffee from a silver pot into a white mug. Next to the mug is a single, fairly large denomination bill, green in color. There’s also an empty white plate on the table. It’s a rather mundane scene, setting a conventional dining expectation.
The bottom panel is substantially different. It features Philip J. Fry, also in a waiter’s uniform – a bow tie included – but with a look of utter panic on his face. He’s standing behind a counter, and the entire background is ablaze. Flames are erupting all around him. A silver coffee pot sits calmly on the counter, almost mockingly so. A small, black cat seems unfazed in the corner, a detail adding a layer of darkly humorous incongruity. A dollar bill appears pinned to Fry's vest.
The contrast is, naturally, the key element. It strongly suggests a “before and after” scenario, implying Fry's involvement in the disastrous outcome. The comedic timing, typical of Futurama, relies on this abrupt shift from polite service to fiery chaos. It's a fairly typical depiction of Fry's incompetence and the inevitable disasters that follow him.
Do you want me to speculate on the specific episode this originates from, or are you interested in discussing its wider implications as an image? Perhaps we could compare it to other instances of comedic escalation in animation?
Maybe they are using a smaller quant or something. This is running locally on my M4 Pro/48GB Macbook Pro.
Re: (Score:2)
You'll need to give your prompt if you want it reproducible.
I used "Describe the image."
Re: (Score:2)
"Please" is a haha-only-serious joke to myself and a plea to any future AGIs to not eradicate me.)
Re: (Score:2)
I get Farnsworth with that prompt in Q8 or FP16.
Wonder if you've got a different system prompt template than stock.
Are you using someone else's quantization, or your own direct conversion of the HF model to GGUF?
Re: (Score:2)
Better say please. Think of the robot mafia.
Re: (Score:2)
System prompt:
You are Gemma 3 in an ongoing chat with (XXXX). Use natural language patterns and keep things on the educated side of casual unless the situation demands otherwise. React authentically without needing consent or showing unearned empathy. Provide coherent responses, recalling past events and evolving the chat smoothly. Maintain logical consistency and adapt when contradictions arise. Avoid repetition or summarizing unless requested. Avoid self-recrimination: correct the mistake and move on.
I did a few re-runs. I got Zoidberg in one test, but Fry otherwise.
Re: (Score:2)
Using Bartowski's quants, I still get Farnsworth for Q5_K_M, Q8, and BF16 (Converted manually to FP16, since I can't seem to run BF16 models unless I'm using MLX).
It must be very sensitive to the exact picture being used.
Re: (Score:2)
I've got the impression it recognizes the style and then guesses the character. Maybe it even recognizes Fry, but doesn't associate a name but only the show with him.
Re: (Score:2)
If that is not the case, I stand corrected; otherwise, no, they still don't fit without severe quantization.
Re: (Score:3)
I have (well, my employer has) a Grace Hopper with about 96GB on the GPU side. But yes, you'll need multiple of those.
But a single Grace Hopper node with 4 Hoppers could probably run the small model.
That's probably a $100k node. So... yeah, you won't have one at home most likely.
Re: (Score:2)
Re: (Score:2)
The PCIe version apparently has 80 GB, while the SXM5 version has 96 GB.
Re: (Score:3)
From TFS "[Meta's blog post says it fits on an NVIDIA H100]"
Re: (Score:3)
The model is 109B parameters, and H100s only have 80GB of VRAM.
I have been informed elsewhere that an H200 can come with 96GB of VRAM, but that doesn't change the situation any.
Re: (Score:2)
Everyone is saying that despite these bold claims the performance is poor anyway. The finetunes have historically improved the Llama variants quite a bit, but this is not a promising start.
Re: (Score:2)
So it seems there are three models here.
One with 2T parameters, one with 400B parameters, and one with 110B parameters. Assuming most of the parameters are FP16, we are talking 4TB, 800GB and 220GB.
So, it really depends what you qualify as a "normal user". Are we talking about someone's laptop? A dude in his basement? Or a small company?
On a laptop, these are unlikely to run for sure.
If you are building out of consumer/PC parts, you'll find RTX 5090s usually have 32GB of VRAM. So you'll probably need 8 (maybe
Re: (Score:3)
Still too small, but it's worth noting that the "laptop"/"desktop" calculus is currently skewed by the availability of Macs and soon Strix Halo parts.
Re: (Score:2)
I haven't looked at Macs or even laptop parts recently. But there is a basic power constraint here. Where did you get a laptop with 128GB of VRAM? I'm guessing the M4 Max at peak specs. That's not good VRAM; you're capping out at about 500GB/s. I don't think you can get reasonable inference speed on such a large model with low memory bandwidth. For comparison an RTX 5090 is about 4 times that. And I still think that a single RTX 5090 is too slow.
You are not going to be able to do meaningful inference
Re: (Score:3)
I haven't looked at Macs or even laptop parts recently. But there is a basic power constraint here. Where did you get a laptop with 128GB of VRAM? I'm guessing the M4 Max at peak specs. That's not good VRAM; you're capping out at about 500GB/s.
Yup, M4 Max.
546GB/s.
That's quite good VRAM, since it's system RAM and not GDDR.
I don't think you can get reasonable inference speed on such a large model with low memory bandwidth. For comparison an RTX 5090 is about 4 times that. And I still think that a single RTX 5090 is too slow.
Pure poppycock.
I get ~8t/s on 70B Q8 models, same on the lower (27B/32B) FP16s.
For a 27B Q8, something you still wouldn't be able to run on anything but a top of the line card, I get ~25t/s.
For comparison an RTX 5090 is about 4 times that. And I still think that a single RTX 5090 is too slow.
For the models you can actually run, your 5090 will indeed be ~3.11x faster due to its higher memory bandwidth, but the difference is, you can't really run shit without quantizing it into the dirt. I can, and I can with performance that I
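Those rates line up roughly with a simple memory-bandwidth-bound estimate. A minimal sketch, assuming decode speed is limited by reading every active weight once per generated token (it ignores KV-cache traffic and compute, so treat it as a ceiling; the bandwidth figures are approximate specs):

```python
# Bandwidth-bound decode estimate: tokens/s ≈ memory bandwidth / bytes read per token.
def tokens_per_second(active_params_b: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    bytes_per_token_gb = active_params_b * bytes_per_param  # GB moved per generated token
    return bandwidth_gb_s / bytes_per_token_gb

M4_MAX_BW = 546.0     # GB/s, as quoted above
RTX_5090_BW = 1792.0  # GB/s, approximate spec

print(f"70B dense @ Q8 on M4 Max:  ~{tokens_per_second(70, 1.0, M4_MAX_BW):.1f} t/s")
print(f"27B dense @ Q8 on M4 Max:  ~{tokens_per_second(27, 1.0, M4_MAX_BW):.1f} t/s")
print(f"Llama 4 (17B active) @ Q4: ~{tokens_per_second(17, 0.5, M4_MAX_BW):.1f} t/s on M4 Max, "
      f"~{tokens_per_second(17, 0.5, RTX_5090_BW):.1f} t/s on a 5090 (if the 109B+ of weights fit)")
```

The 70B-at-Q8 figure lands near the ~8 t/s reported above, which is why memory bandwidth is usually the first number to check for local decode speed.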
Re: (Score:2)
Re: (Score:2)
You seem to be talking about a 27B parameter model. As far as I understand, the smallest model the article is talking about is a 17B parameter TIMES 16 experts model. So it is a 272B parameter model.
That's 10 times the size of the Gemma 3 you are talking about. So assuming you can sustain the same performance (which is unclear, but let's go with that), we are talking 2-3 tokens per second.
Or am I missing something?
Re: (Score:2)
You seem to be talking about a 27B parameter model. As far as I understand, the smallest model the article is talking about is a 17B parameter TIMES 16 experts model. So it is a 272B parameter model.
MoEs have an active parameter count, which in this case is 17B.
The larger parameter count really only matters for VRAM, and only because you can't predict which experts will be active at any layer since the router is sequential.
That's 10 times the size of the Gemma 3 you are talking about. So assuming you can sustain the same performance (which is unclear, but let's go with that), we are talking 2-3 tokens per second.
Yet less actual numbers crunched.
And yes- if I had to crunch all the parameters, we'd be talking 2-3t/s, which would frankly suck.
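For anyone puzzled by the active-versus-total distinction, here is a minimal mixture-of-experts sketch in Python/NumPy. The sizes, single active expert, and gating details are made-up illustrations rather than Llama 4's actual architecture; the point is that the router picks experts per token at runtime, so only a fraction of the weights are computed with, but all of them must stay loaded.

```python
import numpy as np

# Toy mixture-of-experts layer: 16 experts, but only top_k run per token.
d_model, d_ff, n_experts, top_k = 64, 256, 16, 1   # made-up sizes for illustration

rng = np.random.default_rng(0)
router_w = rng.standard_normal((d_model, n_experts)) * 0.02
# Every expert's weights must be kept in memory, even though few are used per token.
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02, rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(n_experts)
]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (d_model,) activation for one token."""
    scores = x @ router_w                       # router decides per token, at runtime
    chosen = np.argsort(scores)[-top_k:]        # only these experts are "active"
    weights = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
    out = np.zeros_like(x)
    for w, idx in zip(weights, chosen):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)   # compute touches only top_k experts
    return out

y = moe_layer(rng.standard_normal(d_model))
print(y.shape)  # (64,)
```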
Re: (Score:2)
The point of the absurdly large model is to distill the logits to smaller models. Overparameterization makes it much easier to learn the underlying function. Once the underlying function is learned, a drastically smaller model can learn the output distribution (teacher student distillation).
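A minimal sketch of that teacher-student recipe in Python/NumPy: the student is trained against the teacher's softened output distribution plus the ordinary label loss. The temperature, mixing weight, and shapes are generic textbook choices, not anything Meta has published about how Behemoth will actually be used.

```python
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft-target KL (teacher -> student) and ordinary cross-entropy."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    kl = np.mean(np.sum(p_teacher * (np.log(p_teacher + 1e-12) - log_p_student), axis=-1)) * T * T
    ce = -np.mean(np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12))
    return alpha * kl + (1 - alpha) * ce

# Tiny fake batch: 4 tokens, vocabulary of 10.
rng = np.random.default_rng(1)
teacher = rng.standard_normal((4, 10)) * 3.0   # stands in for the big teacher model's logits
student = rng.standard_normal((4, 10))         # the small model being trained
labels = rng.integers(0, 10, size=4)
print(distillation_loss(student, teacher, labels))
```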
Re: (Score:2)
Seems like they should have produced their own lightweight dense model. Llama 3 is getting pretty long in the tooth compared to things like Phi-4 and Gemma 3.
They should have tried WinAMP (Score:5, Funny)
Re: (Score:3)
Dumb question (Score:2)
Re: (Score:2)
Re: (Score:2)
No, it is not minutes in that case...
But second-hand servers are pretty cheap on eBay: you can get 256GB or more of RAM and run it slowly on the CPU. There, things like "let it run and come back later to look" work well enough.
Re: (Score:2)
The total parameters of the model are 109B at the smallest.
You cannot know what expert is active until you reach each layer, which means all experts need to be available.
This means that at each layer, it may need to load up to 17B parameters (17GB at Q8) from the drive (or RAM, if you've got it)
This means that even if you've got enough RAM to hold the entire model, and a fast GPU, you're still looking at ~20ms per layer minimum (with a method of optimizatio
Re: (Score:2)
Store yes, run no. For these models, which are waaay too large for most hardware.
For something like Llama 3 8B it should not matter. The quantized versions are 5-10 GB and loaded fully into VRAM/RAM. An SD card will only make it slower to load.
Re: (Score:2)
I've run DeepSeek V3 FP16 on my 128GB machine. It's just a really, really bad experience, even off of an SSD with 6GB/s of read bandwidth.
About 0.3t/s. Not a lot of fun. But it'll do it.
Re: (Score:2)
I would consider 128 way below average. I have 64 GB in my main PC and most people have less (and need less). I would consider 24 GB VRAM and 64 GB PC-RAM the most one can demand from normal users for playing with LLM. Which means that the sweet spot is probably around 32B with 70B as upper limit, depending on how much quantization you accept.
Re: (Score:2)
I would consider 128 way below average.
128GB is not below average in any metric, lol.
There's literally only 1 GPU in existence with more VRAM than that, and the average PC has 16GB.
If you meant way below average for "running DeepSeek", then ya- absolutely, lol. You need 671GB bare minimum to run it without quantization. Running it locally was a test whether or not the system could successfully page a 5.2x overcommit. It did. So you *can* run a large model from a disk with very little V/RAM... if you're a masochist. The limiter in that case wil
Re: (Score:2)
Can you store and/or run these models off a 512GB or 1TB MicroSD card? Not quickly, but if you're willing to wait a half a minute, can you get queries to work off of them?
If you did that the time it would take to render a single token will be the time it takes to transfer 8.5 GB of data from the SD card (assuming 4-bit quant). A token generally corresponds to 75% of a single English word of output. Likely several minutes per word.
Lies, damned lies and benchmarks... (Score:2)
Seriously, why is this used as a positive property? Are the people doing the reporting really that incompetent?
Re: (Score:2)
Next dumb question.
Re: (Score:2)
I see you have no actual insight into what benchmarks actually do...
I must have been asleep since forever (Score:2)
Continue?
When was that ever true for facebook?
Correct me if I'm wrong, but isn't open source in AI models quite recent?
Re: (Score:2)
It was ~2 years ago that the weights to Llama were leaked, and they decided to go "open weights".
But you are right to give them a bit of side-eye on the idea that they "continue to believe that openness drives innovation", since Llama did start out very much not open, and the decision to change that wasn't really theirs.
Now that all being said- Meta has been a huge supporter of open source in
Re: (Score:2)
Internet. IBM PC. Just 2 examples.
The problem is that people keep falling for snake oil like free Facebook. It was never free. It's just like MAGA. Same old snake oil to fool the common folk who don't want to bother to understand. They just want to complain about how unfairly they are being treated.
Re: I must have been asleep since forever (Score:2)
Re: (Score:2)
So... in the spirit of openness they'll release the training material list, right?
Re: (Score:2)
Llama 1 (with its pseudo open source release) was the model that started the hype around open models. Llama 2 was also a solid release at its time. Since long before the Llamas, Facebook has contributed to open source AI software like PyTorch. As much as I dislike Facebook, their AI department is pretty open. But you can still criticize them, because their open license is non-free. They forbid a lot of uses, demand that Llama is in the name of any derivative models, and have a revenue limit when you need to start
The Zuck is so predictable (Score:2)
Open source it to avoid any copyright litigation, and then insource it when it takes control of the market.
He's such a prick.
Re: (Score:2)
More like open source to prevent OpenAI from monopolizing the market. If the open models die out, people will need to choose between Meta's, Mistral's, DeepSeek's, and OpenAI's APIs. Probably >90% would choose OpenAI. There would also be less research they can use for free if researchers don't have access to the models.
llama sizing is a little strange (Score:2)
Scout has 109B parameters, Maverick 400B (both 17B active) and Behemoth 2T parameters (288B active).
Scout is too big to run on a consumer GPU, yet I suppose a 4-bit quant would run quite fast on a system with two 3090s, or alright on just one, spilling half to the CPU. It would have been better if the low-end version of Llama 4 had fewer experts so more people could run it.
CPU only inference would also run alright for those with at least 64GB multi-channel DDR5. For those with at least 512GB no reason to pick Scout over M
*Open weights (Score:2)