Inference Engines Overview

To maximize performance, parallelization, and context window sizes on Blackwell GPUs, Nemotron 3 Super is optimized for four primary inference engines with native NVFP4 support.

Each engine has specific flags and configurations for optimal NVFP4 performance. Choose the engine that best fits your use case:

Engine Comparison

| Engine | NVFP4 Flag | KV Cache | Best For |
| --- | --- | --- | --- |
| vLLM | --quantization compressed-tensors | --kv-cache-dtype fp8 | High-throughput serving, OpenAI API compatibility |
| SGLang | --quantization modelopt_fp4 | Auto | Complex prompting, structured outputs |
| TensorRT-LLM | Build with --qformat nvfp4 | --kv_cache_mode fp8 | Maximum performance, NVIDIA-native optimization |
| Triton | TensorRT-LLM backend | Via backend | Scalable production deployments, dynamic batching |
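As a rough illustration of how the table's flags map onto each server's launch command, the sketch below shows the vLLM and SGLang cases; the model identifier is a placeholder, so substitute the published checkpoint name:

```bash
# vLLM: NVFP4 weights via compressed-tensors, FP8 KV cache
# (model ID below is a placeholder, not the official checkpoint name)
vllm serve nvidia/nemotron-3-super \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8

# SGLang: NVFP4 via ModelOpt quantization; KV cache dtype is selected automatically
python -m sglang.launch_server \
  --model-path nvidia/nemotron-3-super \
  --quantization modelopt_fp4
```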

Getting Started

Select an engine below for detailed setup instructions:

  • vLLM - Industry-standard serving built on PagedAttention and continuous batching
  • SGLang - High-performance serving with strong structured output support
  • TensorRT-LLM - NVIDIA's native optimized inference engine
  • Triton - Scalable production-grade model serving
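Whichever engine you pick, vLLM and SGLang both expose an OpenAI-compatible HTTP API once running, so a quick smoke test looks the same for either. This is a sketch: the port is each server's default and the model name is a placeholder.

```bash
# Smoke-test the OpenAI-compatible endpoint
# (vLLM defaults to port 8000; SGLang defaults to 30000)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/nemotron-3-super",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 64
      }'
```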

Hardware Requirements

All engines require:

  • Blackwell Architecture GPU (RTX PRO 6000, B200/B300, GB200/GB300, DGX Spark)
  • CUDA 12.9+ or 13.x for native NVFP4 support
  • Docker with NVIDIA Container Toolkit
  • At least ~200 GB of NVMe storage for model weights plus working space
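Before pulling any engine image, it is worth verifying the CUDA level and the container runtime. A minimal check, assuming the NVIDIA Container Toolkit is installed (the CUDA image tag is an example; any CUDA base image works):

```bash
# Confirm driver and CUDA version on the host (CUDA 12.9+ or 13.x required)
nvidia-smi

# Confirm the NVIDIA Container Toolkit exposes the GPU inside containers
docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubuntu24.04 nvidia-smi
```

If the second command prints the same GPU listing as the first, containers can see the Blackwell GPU and any of the engines above can be run from their official images.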