Cloud-level quality for running agents on PCs

March 25, 2026

Source: NVIDIA. See an up-to-3.5x increase in large language model (LLM) inference performance on NVIDIA GPUs with llama.cpp. All configurations measured using Q4_K_M quantizations BS = 1, ISL = 1024 and OSL = 128 on NVIDIA RTX 5090 and Mac M3 Ultra desktops. Token generation throughput measured on llama.cpp b7789, using the llama-bench tool.

The next generation of local AI models have larger context windows, delivering the intelligence to run agents on PC. Combined with richer user context and powerful local tools, these advances are unlocking new possibilities on AI PCs.

- Nemotron 3 Super, released last week, is a 120‑billion‑parameter open model with 12 billion active parameters, designed to run complex agentic AI systems. Nemotron 3 Super is optimal for powering agents on the DGX Spark or NVIDIA RTX PRO workstations.

On PinchBench — a new benchmark for determining how well large language models perform with OpenClaw — Nemotron 3 Super scored 85.6%, making it the top open model in its class.

- Mistral Small 4, a 119-billion-parameter open model with 6 billion active parameters — 8 billion including all layers — unifies the capabilities of Mistral’s flagship models. Users now have an ultraefficient model optimised for general chat, coding and agentic tasks.

Both of these models run locally on DGX Spark, with its 128 GB of unified memory that supports models with more than 120 billion parameters, as well as RTX PRO GPUs.

For GeForce RTX users looking for smaller models, Nemotron 3 Nano 4B is the latest model to join the NVIDIA Nemotron 3 family of open models, providing a compact, capable starting point for building agents and assistants locally on RTX AI PCs.

The model is a strong fit for building action-taking conversational personas in games and apps that run on resource-constrained hardware, NVIDIA said. It is available across any NVIDIA GPU-enabled system and combines state-of-the-art instruction-following and exceptional tool use with a minimal VRAM footprint.

In addition, NVIDIA announced optimisations for Alibaba’s Qwen 3.5 models, which have demonstrated outstanding accuracy (27 B, 9 B and 4 B) and are suited for running local agents on NVIDIA GPUs. The new models natively support vision, multi-token prediction and a large 262,000-token context window. The dense 27-billion-parameter model excels when paired with an RTX 5090 GPU.

Users can try these models today via Ollama, LM Studio and llama.cpp, with accelerated inference powered by RTX GPUs and DGX Spark.

Expect faster creative AI with the latest RTX-optimised models as well. LTX 2.3, Lightricks’ state-of-the-art audio-video model, released earlier in March, now has support for NVFP4 and FP8 distilled models, accelerating performance by 2.1x.

Hashtags: #GTC, #GTC2026

Search This Blog

TechTouch Asia

Cloud-level quality for running agents on PCs

Comments

Post a Comment

Popular posts from this blog

Fortinet enhances FortiRecon to align with CTEM framework

SentinelOne recognised as a 2025 Gartner Peer Insights Customers’ Choice for XDR

AWS: AI adoption grows 20% in Singapore