Gemma 4 performance enhanced on NVIDIA GPUs
Google and NVIDIA have collaborated to optimise Gemma 4 for NVIDIA GPUs. The latest additions to the Gemma 4 family of open models, spanning E2B, E4B, 26B and 31B variants, are designed for efficient deployment from edge devices to high-performance GPUs*.
Open models like the Gemma 4 family perform well on NVIDIA GPUs because NVIDIA Tensor Cores accelerate AI inference workloads, delivering higher throughput and lower latency for local execution. The CUDA software stack also provides broad compatibility across leading frameworks and tools, so new models can run efficiently from day one.
This combination allows open models like Gemma 4 to scale across a wide range of systems — from Jetson Orin Nano at the edge to RTX PCs, workstations and DGX Spark — without requiring extensive optimisation.
The new generation of compact models supports a range of tasks, including:
- Reasoning: Strong performance on complex problem-solving tasks.
- Coding: Code generation and debugging for developer workflows.
- Agents: Native support for structured tool use (function calling).
- Vision, video and audio capabilities: Enables multimodal interactions for object recognition, automated speech recognition, and document or video intelligence.
- Interleaved multimodal input: Text and images can be combined in any order within a single prompt.
- Multilingual: Out-of-the-box support for 35+ languages, pretrained on 140+ languages.
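To make the structured tool use item concrete, here is a minimal sketch of the application side of function calling: the model emits a structured call, and local code looks up and runs the matching function. The `get_weather` tool, the `TOOLS` schema layout (the JSON-schema style many chat APIs use) and the shape of the emitted call are all assumptions for illustration, not Gemma 4's documented format.

```python
import json

# Hypothetical tool: stands in for any local function an agent might call.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

# Tool description in the JSON-schema style used by many chat APIs (assumed format).
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return a short weather summary for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Maps tool names to the Python callables that implement them.
REGISTRY = {"get_weather": get_weather}

def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted tool call and run the matching local function."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in REGISTRY:
        raise ValueError(f"unknown tool: {name}")
    return REGISTRY[name](**call["arguments"])

# A hand-written example of the kind of structured call a model might emit.
print(dispatch('{"name": "get_weather", "arguments": {"city": "Zurich"}}'))
```

Keeping an explicit registry means the model can only trigger functions the application has deliberately exposed.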
The E2B and E4B models are built for ultra-efficient inference at the edge, running completely offline with near-zero latency across many devices, including Jetson Nano modules.
The 26B and 31B models are designed for high-performance reasoning and developer-centric workflows, making them well suited for agentic AI. Optimised to deliver state-of-the-art, accessible reasoning, these models run efficiently on NVIDIA RTX GPUs and DGX Spark — powering development environments, coding assistants and agent-driven workflows.
The latest Gemma 4 models are compatible with OpenClaw, allowing users to build capable local agents that draw context from personal files, applications and workflows to automate tasks.
NVIDIA has collaborated with Ollama and llama.cpp to provide the best local deployment experience for each of the Gemma 4 models. To use Gemma 4 locally, users can download Ollama to run Gemma 4 models or install llama.cpp and pair it with the Gemma 4 GGUF Hugging Face checkpoint. Additionally, Unsloth provides day-one support with optimised and quantised models for efficient local finetuning and deployment via Unsloth Studio.
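As a rough illustration of the Ollama route, the sketch below sends a prompt to a locally running Ollama server over its REST API, using only the Python standard library. The endpoint is Ollama's default local address. The model tag `gemma4` is an assumption, so check the Ollama model library for the actual name.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4"  # assumed tag; confirm the real name in the Ollama library

def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt: str, model: str = MODEL_TAG) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]

# Usage, once the server is running and the model has been pulled:
# print(generate("Summarise what a Tensor Core does in one sentence."))
```

Setting `stream` to `False` returns the whole completion in a single JSON object, which keeps the example short. Interactive applications would more likely stream tokens.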
Other developments in the NVIDIA open source universe include the recently introduced NVIDIA NemoClaw, an open source stack that optimises OpenClaw experiences on NVIDIA devices by increasing security and supporting local models.
Accomplish.ai also announced Accomplish FREE, a no-cost version of its open source desktop AI agent with built-in models. It harnesses NVIDIA GPUs to run open-weight models locally, while a hybrid router dynamically balances workloads between local RTX hardware and the cloud, enabling fast, private, zero-configuration execution without requiring an application programming interface (API) key.
Explore
Learn how to run OpenClaw for free on RTX GPUs and DGX Spark, or follow the DGX Spark OpenClaw playbook.
Learn more about getting started with Gemma 4 on NVIDIA GPUs at the NVIDIA technical blog.
*All configurations measured using Q4_K_M quantisation, with batch size (BS) = 1, input sequence length (ISL) = 4096 and output sequence length (OSL) = 128, on NVIDIA GeForce RTX 5090 and Mac M3 Ultra desktops. Token generation throughput measured on llama.cpp b7789, using the llama-bench tool.
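For readers who want to try a similar measurement, this sketch assembles a `llama-bench` command line from the sequence lengths in the footnote, using the tool's `-m` (model), `-p` (prompt tokens) and `-n` (generated tokens) options. The model file name is hypothetical, and the exact invocation behind the published numbers is not given in the source, so this is an approximation rather than the original benchmark command.

```python
import shlex
from typing import List

def llama_bench_command(model_path: str, isl: int = 4096, osl: int = 128) -> List[str]:
    """Build a llama-bench argument list for a given prompt and generation length."""
    return ["llama-bench", "-m", model_path, "-p", str(isl), "-n", str(osl)]

# Hypothetical GGUF file name; substitute the checkpoint you downloaded.
cmd = llama_bench_command("gemma-4-Q4_K_M.gguf")
print(shlex.join(cmd))
```

The argument list can be passed directly to `subprocess.run` once llama.cpp is installed and the model file exists.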