NVIDIA releases Nemotron 3 Nano 4B, a Mamba-Transformer hybrid pruned from a 9B model, optimized for local inference on RTX GPUs and Jetson hardware.
NVIDIA released Nemotron 3 Nano 4B, a 4-billion-parameter open-source model using a hybrid Mamba-Transformer architecture, pruned and distilled from Nemotron Nano 9B v2 via the Nemotron Elastic framework. The model targets edge deployment on NVIDIA Jetson (Thor/Orin Nano), DGX Spark, and RTX GPUs, claiming the lowest VRAM footprint and best time to first token (TTFT) in its size class at high input sequence lengths. It supports instruction following, tool use, and hallucination avoidance, and runs across the Transformers, vLLM, TRT-LLM, and Llama.cpp inference stacks. The model is available now on Hugging Face with full deployment guides for Jetson via NVIDIA's Jetson AI Lab.
This is a genuinely deployable 4B model with a hybrid Mamba-Transformer architecture, not just another fine-tune. The Mamba component reduces memory bandwidth pressure at long context lengths, which is why it claims the lowest TTFT at high input sequence length (ISL) settings on an RTX 4070 with Q4_K_M quantization. It runs on Llama.cpp today, meaning zero new infrastructure is required if you're already running local models.
Pull the Q4_K_M GGUF from Hugging Face and benchmark it against Phi-3.5-mini or Qwen2.5-3B on your specific tool-calling workload using Llama.cpp — if TTFT is under 200ms on your target hardware, this becomes your default edge inference model.
Go to the Nemotron 3 Nano 4B Hugging Face page, download the Q4_K_M GGUF, then run: `llama-cli -m nemotron-3-nano-4b.Q4_K_M.gguf -p 'List 3 steps to reset a router. Respond in JSON.' --json-schema '{"type":"object"}'` (the `--json-schema` flag takes a JSON Schema string to constrain decoding) and check whether the structured output parses correctly on the first attempt.