AI Google Open Source

Google Claims Gemma 3 Reaches 98% of DeepSeek's Accuracy Using Only One GPU

Google says its new open-source AI model, Gemma 3, achieves nearly the same performance as DeepSeek AI's R1 while using just one Nvidia H100 GPU, compared to an estimated 32 for R1. ZDNet reports: Using "Elo" scores, a common measurement system used to rank chess players and athletes, Google claims Gemma 3 comes within 98% of the score of DeepSeek's R1, 1338 versus 1363 for R1. That means R1 is superior to Gemma 3. However, based on Google's estimate, the search giant claims that it would take 32 of Nvidia's mainstream "H100" GPU chips to achieve R1's score, whereas Gemma 3 uses only one H100 GPU.

Google's balance of compute and Elo score is a "sweet spot," the company claims. In a blog post, Google bills the new program as "the most capable model you can run on a single GPU or TPU," referring to the company's custom AI chip, the "tensor processing unit." "Gemma 3 delivers state-of-the-art performance for its size, outperforming Llama-405B, DeepSeek-V3, and o3-mini in preliminary human preference evaluations on LMArena's leaderboard," the blog post relates, referring to the Elo scores. "This helps you to create engaging user experiences that can fit on a single GPU or TPU host."

Google's model also tops Meta's Llama 3's Elo score, which Google estimates would require 16 GPUs. (Note that the numbers of H100 chips used by the competition are Google's estimates; DeepSeek AI has only disclosed an example of using 1,814 of Nvidia's less-powerful H800 GPUs to serve answers with R1.) More detailed information is provided in a developer blog post on HuggingFace, where the Gemma 3 repository is offered.
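For context on what that 25-point gap means head to head: the 98% in the headline is just the ratio of the two ratings, while the standard Elo expected-score formula turns the gap into a win probability. A quick sketch in Python (plain arithmetic, nothing model-specific):

    # What a 1338-vs-1363 Elo gap means head to head, via the standard
    # Elo expected-score formula.
    def expected_score(rating_a, rating_b):
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    gemma3, r1 = 1338, 1363
    print(f"Gemma 3 expected score vs. R1: {expected_score(gemma3, r1):.3f}")  # ~0.464
    print(f"Rating ratio (the '98%'):      {gemma3 / r1:.3f}")                 # ~0.982

In other words, a 25-point gap corresponds to R1 being preferred in roughly 54% of head-to-head matchups.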


Comments Filter:
  • I was getting all excited when I thought the article was talking about Gemma Chan… turns out it's just another generic AI bot.

  • Please clap.

  • We all saw the AI revolution coming. It will get more and more efficient. However, I would prefer that they do not suck all of the juice out of our electrical system.
    • Get a Mac and run it at home. My MacBook Pro consumes about 60 watts when doing inference, which is way less than the usual GPUs.
      • My M4 Max MBP generally uses between 90 and 100 watts.
        Still fantastic, given it has inference speed around that of a 220W Nvidia GPU, and 16x the VRAM.
        • That sounds kind of low! Seems like most people are pulling 150W+ during inference on non-binned Maxes.

          Excellent machines for sure. I've just been playing with Gemma 3 27b's vision functions and ... wow, the teenage Neuromancer-reading me of the mid 80s would not believe what senior menu me lived to see.
          • I've seen it spike to 150 before, but once it's settled down and is at 100% GPU (models fully loaded, inference flying along), it settles at 90-100.
            I suppose, also keep in mind that measurement is the full system power. Full brightness display is +10W, any TB peripherals being fed by your laptop can also be expensive (up to 15W each port, I think), so my baseline may be lower than others, being I'm doing this with my brightness 1 step from off, and nothing plugged in except the MagSafe.

            And ya, I took a p
            • This is not far off at all:

              Enhance 224 to 176. Enhance, stop. Move in, stop. Pull out, track right, stop. Center in, pull back. Stop. Track 45 right. Stop. Center and stop. Enhance 34 to 36. Pan right and pull back. Stop. Enhance 34 to 46. Pull back. Wait a minute, go right, stop. Enhance 57 to 19. Track 45 left. Stop. Enhance 15 to 23. Give me a hard copy right there.

              My binned M4 Pro/48GB gets hot enough as it is, your Max must be roasting!

              • My binned M4 Pro/48GB gets hot enough as it is, your Max must be roasting!

                It wants to. I find the system fan curve will allow it to get hot enough that it starts pulling back the GPU clocks.
                I'm using Temp Monitor [vimistudios.com] to set a "boost" mode, where if it detects the average GPU core temp hit 60C or above, it cranks the fans to 100%.

                Can't proactively set it to full fans, because the Mac refuses any fan commands until it itself turns on its fans (since they're off when temps are reasonable).
                This is an annoying change from my M1 Max MBP, which let me set the fan to whatever I wanted whene

  • Does this mean AI only needs a single brain cell??

    • I think you're going for funny, and on that basis it deserved to be FP. However, I think the significance of the story is pretty close to null. LOTS of room for optimization, though the claim of the second-system effect is that the biggest improvement is in the second round.

  • AI at that level these days has generally been something on the cloud that you pay fees to access. And it presumably has the entire history of your interaction with it, which is troubling. This improvement in efficiency (assuming true) makes it a lot easier for a modest-size corporation to contemplate owning the physical AI. It will result in faster proliferation of these machines. Let's hope we survive it.

    • I agree. Google is the example of a utility that spies on you. I would like an AI that is just mine. I think we are there.
      • We are there, but it's still a bit pricey- mostly due to VRAM requirements.
        There are several models that run well on machines with 128GB of VRAM, which is a budding but existing market.
    • Better. An H100 has 80GB of VRAM.
      Today, you can do that on a Mac, and soon you'll be able to do it with a Strix Halo. Probably not long from now, an Intel option too (assuming they can read the room).
      This means it's not just corporations: a person can do it.
  • That's really not much of a claim, is it?

    • I remember a time when people said: "who has a Z80 microprocessor lying around?"... well, not really, but you get the point.
    • Ya, H100 is still a steep ask. However, there are machines with more than an H100 worth of VRAM you can get your hands on.
      M2 Max (Mac Studio, MacBook Pro), M2 Ultra (Mac Studio), M3 Max (Mac Studio, MacBook Pro), M3 Ultra (Mac Studio), M4 Max (MacBook Pro).
      Soon Strix Halo will be available if AMD is your thing.

      On my M4 Max, I'm getting ~9t/s at FP16 and ~16t/s at Q8_0. Nice and usable.

      At lower quantizations (Q4, etc.) you could run it on top-of-the-line discrete cards (24GB VRAM, etc.).
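      A minimal local-run sketch of that kind of setup, assuming llama-cpp-python built with Metal support and a quantized Gemma 3 27B GGUF already on disk (the filename below is a placeholder, not an official artifact):

        from llama_cpp import Llama

        llm = Llama(
            model_path="gemma-3-27b-it-Q8_0.gguf",  # placeholder path to a local quant
            n_gpu_layers=-1,  # offload all layers to the GPU (Metal on Apple Silicon)
            n_ctx=4096,       # modest context keeps the KV cache small
        )

        out = llm("Explain Elo ratings in two sentences.", max_tokens=128)
        print(out["choices"][0]["text"])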
      • It does not seem to be the RAM that makes AI work. It seems to be the trillions of integer calculations the chip can do in a second. It seems like a statistical calculation as to what the next token should be.

          It does not seem to be the RAM that makes AI work. It seems to be the trillions of integer calculations the chip can do in a second.

          If this is correct, then it raises the question: why would this affect the accuracy of the AI's results at all?
          Shouldn't it just be slower producing it (not less accurate), when run on a slower machine?

          • There is an Arduino add-on for AI; it is slower. It should be as accurate as any cloud machine if you are patient. People want their AIs to respond in seconds like a person would, I am guessing. I am not that way.
            • You could not run an LLM on an Arduino.
              Imagine how you would get it to run the calculations across the billions of parameters, while maintaining gigabytes of hidden state.
              Assuming you could create a mechanism that streamed the necessary memory when required (some kind of paging), then slow wouldn't begin to describe your experience, no matter the speed of the computing elements. You'd be limited by the memory bandwidth. You would get a token every decade.
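              The bandwidth-bound arithmetic behind that, as a rough sketch (the bandwidth figures are ballpark assumptions, not measurements):

                # For a dense model, each generated token streams roughly the whole set of
                # weights through memory, so tokens/s is capped near bandwidth / model size.
                GB = 1e9
                model_q8 = 27 * GB  # 27B params at ~1 byte/param (Q8-ish)

                def max_tokens_per_sec(model_bytes, bandwidth_bytes_per_sec):
                    return bandwidth_bytes_per_sec / model_bytes

                for name, bw in [("M4 Max (~0.5 TB/s)", 500 * GB),
                                 ("Raspberry Pi 5 (~17 GB/s)", 17 * GB),
                                 ("paging off an SD card (~0.1 GB/s)", 0.1 * GB)]:
                    print(f"{name}: ~{max_tokens_per_sec(model_q8, bw):.3f} tokens/s")

              The first line lands near the ~16 t/s mentioned earlier in the thread; the last comes out to roughly a token every few minutes, and that's before any real-world overhead.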
                • I meant to say Raspberry Pi, there is a "hat" for it that does the calculations. You are correct, Arduino is slow, but the Pi seems to kick ass in my humble opinion.
                • A Pi could probably do a little better, but still abysmally slow.
                  The problem isn't the Pi, in particular- it's going to be limited by how quickly the Pi can feed those gigabytes of data to the AI hat.
                  For smaller AI models (not LLMs) that kind of thing is going to work just fine.
                  Vision classifiers and stuff like that.
                  • I have not had the want to test it yet, but I feel it coming. Having an AI agent does seem interesting to me.
        • VRAM affects how large of a model and what precision you can do those calculations on.
          For example, the model in question is a 27B model (27 billion parameters).
          At BF16 (its native precision), that requires 54GB of VRAM just for the weights.
          27GB if quantized down to INT8.

          So ya, RAM is the fundamental limiting factor for what kind of models you can run on your machine.
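          The arithmetic above as a quick sketch (weights only; the KV cache and activations add more on top):

            # Weight memory is just parameter count x bytes per parameter.
            def weight_gb(params_billion, bytes_per_param):
                return params_billion * bytes_per_param  # 1e9 params x bytes ~= GB

            for fmt, nbytes in [("BF16", 2.0), ("INT8/Q8_0", 1.0), ("~Q4", 0.5)]:
                print(f"Gemma 3 27B @ {fmt}: ~{weight_gb(27, nbytes):.0f} GB")
            # -> ~54 GB, ~27 GB, ~14 GB (why a 24GB card needs ~Q4 or smaller)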
  • Interesting point that US AI guys are no longer just comparing their work to each other.

    They're now genuinely treating Chinese AI models as a benchmark. A massive break with status quo.

    • I am almost glad that they spent trillions of dollars on this. It seems like... well, dollars pissed down a river.
    • R1 is good, so I'm not surprised.
      What's really surprising is that this is a really small model. 27B. Runs fast as hell on my MBP.
      If it really does compete with R1 on anything but a couple of very specific benchmarks, that would be pretty fucking amazing.
  • For large-scale access to the model, it's actually more expensive to run. R1 is FP8-native, and if you have 32 GPUs anyway to speed up requests, there is no lack of memory. Only active parameter count matters.

    That's the advantage of MoE: lots of memory for trivia, but at scale just as fast as a small model.
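    A back-of-envelope comparison of the two designs, using the published parameter counts (671B total / ~37B active for R1, 27B dense for Gemma 3) and the usual rule of thumb of ~2 FLOPs per active parameter per generated token:

      def summarize(name, total_b, active_b, bytes_per_param):
          weights_gb = total_b * bytes_per_param  # VRAM just to hold the weights
          gflops_per_token = 2 * active_b         # ~2 FLOPs per active param per token
          print(f"{name}: ~{weights_gb:.0f} GB of weights, ~{gflops_per_token:.0f} GFLOPs/token")

      summarize("Gemma 3 27B dense @ BF16", 27, 27, 2)  # fits on one big GPU, all params active
      summarize("DeepSeek R1 MoE @ FP8", 671, 37, 1)    # needs many GPUs, few params active per token

    Per token, the MoE does barely more arithmetic than the dense 27B model; the price is the ~671GB of weights that have to stay resident, which is why the GPU counts diverge.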
