AI Google Open Source

Google Claims Gemma 3 Reaches 98% of DeepSeek's Accuracy Using Only One GPU

Google says its new open-source AI model, Gemma 3, achieves nearly the same performance as DeepSeek AI's R1 while using just one Nvidia H100 GPU, compared to an estimated 32 for R1. ZDNet reports: Using "Elo" scores, a common measurement system used to rank chess players and athletes, Google claims Gemma 3 comes within 98% of the score of DeepSeek's R1, 1338 versus 1363 for R1. That means R1 is superior to Gemma 3. However, based on Google's estimate, the search giant claims that it would take 32 of Nvidia's mainstream "H100" GPU chips to achieve R1's score, whereas Gemma 3 uses only one H100 GPU.

Google's balance of compute and Elo score is a "sweet spot," the company claims. In a blog post, Google bills the new program as "the most capable model you can run on a single GPU or TPU," referring to the company's custom AI chip, the "tensor processing unit." "Gemma 3 delivers state-of-the-art performance for its size, outperforming Llama-405B, DeepSeek-V3, and o3-mini in preliminary human preference evaluations on LMArena's leaderboard," the blog post relates, referring to the Elo scores. "This helps you to create engaging user experiences that can fit on a single GPU or TPU host."

Google's model also tops Meta's Llama 3's Elo score, which Google estimates would require 16 GPUs. (Note that the numbers of H100 chips used by the competition are Google's estimates; DeepSeek AI has only disclosed an example of using 1,814 of Nvidia's less-powerful H800 GPUs to serve answers with R1.) More detailed information is provided in a developer blog post on HuggingFace, where the Gemma 3 repository is offered.
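For context on what that 25-point gap means head to head: the 98% in the headline is just the ratio of the two ratings, while the standard Elo expected-score formula turns the gap into a win probability. A quick sketch in Python (plain arithmetic, nothing model-specific):

    # What a 1338-vs-1363 Elo gap means head to head, via the standard
    # Elo expected-score formula.
    def expected_score(rating_a, rating_b):
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    gemma3, r1 = 1338, 1363
    print(f"Gemma 3 expected score vs. R1: {expected_score(gemma3, r1):.3f}")  # ~0.464
    print(f"Rating ratio (the '98%'):      {gemma3 / r1:.3f}")                 # ~0.982

In other words, a 25-point gap corresponds to R1 being preferred in roughly 54% of head-to-head matchups.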


Comments Filter:
  • I was getting all excited when I thought the article was talking about Gemma Chan… turns out it's just another generic AI bot.

  • Please clap.

  • We all saw the AI revolution coming. It will get more and more efficient. However, I would prefer that they do not suck all of the juice out of our electrical system.
    • Get a Mac and run it at home. My MacBook Pro consumes about 60 watts when doing inference, which is way less than the usual GPUs.
      • My M4 Max MBP generally uses between 90 and 100 watts.
        Still fantastic, given it has inference speed around that of a 220W Nvidia GPU, and 16x the VRAM.
        • That sounds kind of low! Seems like most people are pulling 150W+ during inference on non-binned Maxes.

          Excellent machines for sure. I've just been playing with Gemma 3 27b's vision functions and ... wow, the teenage Neuromancer-reading me of the mid 80s would not believe what senior menu me lived to see.
          • I've seen it spike to 150 before, but once it's settled down and is at 100% GPU (models fully loaded, inference flying along), it settles at 90-100.
            I suppose, also keep in mind that measurement is the full system power. Full brightness display is +10W, any TB peripherals being fed by your laptop can also be expensive (up to 15W each port, I think), so my baseline may be lower than others, being I'm doing this with my brightness 1 step from off, and nothing plugged in except the MagSafe.

            And ya, I took a p
            • This is not far off at all:

              Enhance 224 to 176. Enhance, stop. Move in, stop. Pull out, track right, stop. Center in, pull back. Stop. Track 45 right. Stop. Center and stop. Enhance 34 to 36. Pan right and pull back. Stop. Enhance 34 to 46. Pull back. Wait a minute, go right, stop. Enhance 57 to 19. Track 45 left. Stop. Enhance 15 to 23. Give me a hard copy right there.

              My binned M4 Pro/48GB gets hot enough as it is, your Max must be roasting!

              • My binned M4 Pro/48GB gets hot enough as it is, your Max must be roasting!

                It wants to. I find the system fan curve will allow it to get hot enough that it starts pulling back the GPU clocks.
                I'm using Temp Monitor [vimistudios.com] to set a "boost" mode, where if it detects the average GPU core temp hit 60C or above, it cranks the fans to 100%.

                Can't proactively set it to full fans, because the Mac refuses any fan commands until it itself turns on its fans (since they're off when temps are reasonable).
                This is an annoying change from my M1 Max MBP, which let me set the fan to whatever I wanted whene

  • Does this mean AI only needs a single brain cell??

    • I think you're going for funny, and on that basis it deserved to be FP. However, I think the significance of the story is pretty close to null. LOTS of room for optimization, though the claim of the second-system effect is that the biggest improvement is in the second round.

  • AI at that level these days has generally been something on the cloud that you pay fees to access. And it presumably has the entire history of your interaction with it, which is troubling. This improvement in efficiency (assuming true) makes it a lot easier for a modest-size corporation to contemplate owning the physical AI. It will result in faster proliferation of these machines. Let's hope we survive it.

    • I agree. Google is the example of a utility that spies on you. I would like an AI that is just mine. I think we are there.
      • We are there, but it's still a bit pricey- mostly due to VRAM requirements.
        There are several models that run well on machines with 128GB of VRAM, which is a budding but existing market.
    • Better. An H100 has 80GB of VRAM.
      Today, you can do that on a Mac, and soon you'll be able to do it with a Strix Halo. Probably not long from now, an Intel option too (assuming they can read the room).
      This means it's not just corporations: a person can do it.
  • That's really not much of a claim, is it?

    • I remember a time when people said: "who has a Z80 microprocessor lying around?"... well, not really, but you get the point.
    • Ya, H100 is still a steep ask. However, there are machines with more than an H100 worth of VRAM you can get your hands on.
      M2 Max (Mac Studio, MacBook Pro), M2 Ultra (Mac Studio), M3 Max (Mac Studio, MacBook Pro), M3 Ultra (Mac Studio), M4 Max (MacBook Pro).
      Soon Strix Halo will be available if AMD is your thing.

      On my M4 Max, I'm getting ~9t/s at FP16 and ~16t/s at Q8_0. Nice and usable.

      At lower quantizations (Q4, etc.) you could run it on top-of-the-line discrete cards (24GB VRAM, etc.).
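      A minimal local-run sketch of that kind of setup, assuming llama-cpp-python built with Metal support and a quantized Gemma 3 27B GGUF already on disk (the filename below is a placeholder, not an official artifact):

        from llama_cpp import Llama

        llm = Llama(
            model_path="gemma-3-27b-it-Q8_0.gguf",  # placeholder path to a local quant
            n_gpu_layers=-1,  # offload all layers to the GPU (Metal on Apple Silicon)
            n_ctx=4096,       # modest context keeps the KV cache small
        )

        out = llm("Explain Elo ratings in two sentences.", max_tokens=128)
        print(out["choices"][0]["text"])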
      • It does not seem to be the RAM that makes AI work. It seems to be the trillions of integer calculations the chip can do in a second. It seems like a statistical calculation as to what the next token should be.

          It does not seem to be the RAM that makes AI work. It seems to be the trillions of integer calculations the chip can do in a second.

          If this is correct, then it raises the question: why would this affect the accuracy of the AI's results at all?
          Shouldn't it just be slower producing it (not less accurate), when run on a slower machine?

          • There is an Arduino add-on for AI; it is slower. It should be as accurate as any cloud machine if you are patient. People want their AIs to respond in seconds like a person would, I am guessing. I am not that way.
            • You could not run an LLM on an Arduino.
              Imagine how you would get it to run the calculations across the billions of parameters, while maintaining gigabytes of hidden state.
              Assuming you could create a mechanism that streamed the necessary memory when required (some kind of paging), then slow wouldn't begin to describe your experience, no matter the speed of the computing elements. You'd be limited by the memory bandwidth. You would get a token every decade.
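              The bandwidth-bound arithmetic behind that, as a rough sketch (the bandwidth figures are ballpark assumptions, not measurements):

                # For a dense model, each generated token streams roughly the whole set of
                # weights through memory, so tokens/s is capped near bandwidth / model size.
                GB = 1e9
                model_q8 = 27 * GB  # 27B params at ~1 byte/param (Q8-ish)

                def max_tokens_per_sec(model_bytes, bandwidth_bytes_per_sec):
                    return bandwidth_bytes_per_sec / model_bytes

                for name, bw in [("M4 Max (~0.5 TB/s)", 500 * GB),
                                 ("Raspberry Pi 5 (~17 GB/s)", 17 * GB),
                                 ("paging off an SD card (~0.1 GB/s)", 0.1 * GB)]:
                    print(f"{name}: ~{max_tokens_per_sec(model_q8, bw):.3f} tokens/s")

              The first line lands near the ~16 t/s mentioned earlier in the thread; the last comes out to roughly a token every few minutes, and that's before any real-world overhead.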
                • I meant to say Raspberry Pi, there is a "hat" for it that does the calculations. You are correct, Arduino is slow, but the Pi seems to kick ass in my humble opinion.
                • A Pi could probably do a little better, but still abysmally slow.
                  The problem isn't the Pi, in particular- it's going to be limited by how quickly the Pi can feed those gigabytes of data to the AI hat.
                  For smaller AI models (not LLMs) that kind of thing is going to work just fine.
                  Vision classifiers and stuff like that.
                  • I have not had the want to test it yet, but I feel it coming. Having an AI agent does seem interesting to me.
        • VRAM affects how large of a model and what precision you can do those calculations on.
          For example, the model in question is a 27B model (27 billion parameters).
          At BF16 (its native precision), that requires 54GB of VRAM just for the weights.
          27GB if quantized down to INT8.

          So ya, RAM is the fundamental limiting factor for what kind of models you can run on your machine.
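          The arithmetic above as a quick sketch (weights only; the KV cache and activations add more on top):

            # Weight memory is just parameter count x bytes per parameter.
            def weight_gb(params_billion, bytes_per_param):
                return params_billion * bytes_per_param  # 1e9 params x bytes ~= GB

            for fmt, nbytes in [("BF16", 2.0), ("INT8/Q8_0", 1.0), ("~Q4", 0.5)]:
                print(f"Gemma 3 27B @ {fmt}: ~{weight_gb(27, nbytes):.0f} GB")
            # -> ~54 GB, ~27 GB, ~14 GB (why a 24GB card needs ~Q4 or smaller)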
  • Interesting point that US AI guys are no longer just comparing their work to each other.

    They're now genuinely treating Chinese AI models as a benchmark. A massive break with status quo.

    • I am almost glad that they spent trillions of dollars on this. It seems like... well, dollars pissed down a river.
    • R1 is good, so I'm not surprised.
      What's really surprising is that this is a really small model. 27B. Runs fast as hell on my MBP.
      If it really does compete with R1 on anything but a couple of very specific benchmarks, that would be pretty fucking amazing.
  • For large-scale access to the model, it's actually more expensive to run. R1 is FP8-native, and if you have 32 GPUs anyway to speed up requests, there is no lack of memory. Only active parameter count matters.

    That's the advantage of MoE: lots of memory for trivia, but at scale just as fast as a small model.
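    A back-of-envelope comparison of the two designs, using the published parameter counts (671B total / ~37B active for R1, 27B dense for Gemma 3) and the usual rule of thumb of ~2 FLOPs per active parameter per generated token:

      def summarize(name, total_b, active_b, bytes_per_param):
          weights_gb = total_b * bytes_per_param  # VRAM just to hold the weights
          gflops_per_token = 2 * active_b         # ~2 FLOPs per active param per token
          print(f"{name}: ~{weights_gb:.0f} GB of weights, ~{gflops_per_token:.0f} GFLOPs/token")

      summarize("Gemma 3 27B dense @ BF16", 27, 27, 2)  # fits on one big GPU, all params active
      summarize("DeepSeek R1 MoE @ FP8", 671, 37, 1)    # needs many GPUs, few params active per token

    Per token, the MoE does barely more arithmetic than the dense 27B model; the price is the ~671GB of weights that have to stay resident, which is why the GPU counts diverge.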
