# TensorRT-LLM Deployment Guide
TensorRT-LLM is NVIDIA's highly optimized inference library for LLMs, designed to maximize performance on NVIDIA GPUs (including the Blackwell architecture) with features such as a paged KV cache, in-flight (continuous) batching, and low-precision formats like FP8 and NVFP4.
## NVFP4 Deployment Process
Deploying Nemotron 3 Super with TensorRT-LLM involves two main steps:
- Quantization: Convert the PyTorch model to NVFP4 format using TensorRT Model Optimizer
- Engine Build: Compile the quantized model into an optimized TensorRT engine
### Step 1: Quantization with TensorRT Model Optimizer
First, quantize the model to NVFP4 format:
```bash
python hf_ptq.py \
  --pyt_ckpt_path /models/Nemotron-3-Super-120B-A12B-NVFP4 \
  --qformat nvfp4 \
  --export_fmt tensorrt_llm \
  --output_dir ./nemotron-3-super-120b-nvfp4
```

#### Quantization Parameters
| Parameter | Value | Description |
|---|---|---|
| `--pyt_ckpt_path` | `/path/to/model` | Path to the original PyTorch model |
| `--qformat` | `nvfp4` | Specify NVFP4 quantization format |
| `--export_fmt` | `tensorrt_llm` | Export format for TensorRT-LLM |
| `--output_dir` | `./nemotron-3-super-120b-nvfp4` | Directory to save the quantized model |
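
Before building the engine, it can help to confirm that the export step actually produced a TensorRT-LLM checkpoint. A minimal sketch, assuming the exported directory contains a `config.json` plus per-rank weight files (the exact file names and the `quant_algo` key depend on your Model Optimizer version):

```bash
# List the exported checkpoint; expect config.json plus rank*.safetensors weight shards
ls -lh ./nemotron-3-super-120b-nvfp4

# Check the quantization algorithm recorded in the exported config
# ("quant_algo" is the usual key, but treat it as an assumption for your version)
grep -o '"quant_algo"[^,}]*' ./nemotron-3-super-120b-nvfp4/config.json
```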
### Step 2: Building the TensorRT Engine
Next, build the optimized inference engine:
```bash
trtllm-build \
  --checkpoint_dir ./nemotron-3-super-120b-nvfp4 \
  --engine_dir ./nemotron-3-super-120b-nvfp4-engine \
  --gpt_attention_plugin fp8 \
  --kv_cache_mode fp8 \
  --max_batch_size 8 \
  --max_input_len 1024 \
  --max_output_len 1024
```

#### Key Build Parameters
| Parameter | Value | Description |
|---|---|---|
| `--checkpoint_dir` | `./nemotron-3-super-120b-nvfp4` | Directory with the quantized model |
| `--engine_dir` | `./nemotron-3-super-120b-nvfp4-engine` | Output directory for the engine |
| `--gpt_attention_plugin` | `fp8` | Use FP8 for the attention plugin |
| `--kv_cache_mode` | `fp8` | Enable FP8 KV cache |
| `--max_batch_size` | `8` | Maximum batch size (adjust based on use case) |
| `--max_input_len` | `1024` | Maximum input sequence length |
| `--max_output_len` | `1024` | Maximum output sequence length |
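
Before moving to a serving stack, a quick single-prompt generation run is an easy sanity check of the built engine. The sketch below assumes the `run.py` script shipped in the TensorRT-LLM `examples/` directory and a local tokenizer; flag names can vary slightly between releases:

```bash
# Smoke-test the engine with a short prompt (run.py is the TensorRT-LLM example runner)
python examples/run.py \
  --engine_dir ./nemotron-3-super-120b-nvfp4-engine \
  --tokenizer_dir /models/Nemotron-3-Super-120B-A12B-NVFP4 \
  --input_text "Deep learning is" \
  --max_output_len 32
```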
## Deployment with Triton Inference Server (Recommended for Production)
For scalable deployment, use TensorRT-LLM backend with Triton Inference Server:
### 1. Prepare Model Repository

```
model_repository/
└── nemotron_3_super/
    ├── config.pbtxt
    └── 1/
        ├── model.engine
        └── model.json
```
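
One way to assemble this layout from the artifacts produced in Steps 1 and 2 is sketched below; the source file names inside the engine directory are assumptions, so check the actual output of `trtllm-build` before copying:

```bash
# Create the version directory Triton expects
mkdir -p model_repository/nemotron_3_super/1

# Copy the built engine and its runtime config into place
# (adjust the source names to whatever trtllm-build actually emitted)
cp ./nemotron-3-super-120b-nvfp4-engine/*.engine model_repository/nemotron_3_super/1/model.engine
cp ./nemotron-3-super-120b-nvfp4-engine/config.json model_repository/nemotron_3_super/1/model.json
```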
### 2. Example config.pbtxt

```protobuf
name: "nemotron_3_super"
platform: "tensorrt_llm_plan"
max_batch_size: 8
input [
{
name: "input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "request_output_len"
data_type: TYPE_UINT32
dims: [ 1 ]
},
{
name: "runtime_top_k"
data_type: TYPE_UINT32
dims: [ 1 ]
},
{
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [ 1 ]
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
}
]
output [
{
name: "output_ids"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "sequence_length"
data_type: TYPE_UINT32
dims: [ 1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
}
]
parameters: {
key: "model_engine_dir"
value: {
string_value: "/models/nemotron-3-super-120b-nvfp4-engine"
}
}
```
### 3. Run Triton Server

```bash
docker run --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 \
  -v $(pwd)/model_repository:/models \
  nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 \
  tritonserver --model-repository=/models
```
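
Once the server logs show the model in a READY state, Triton's standard HTTP health endpoints give a quick confirmation that it is serving:

```bash
# Server-level readiness
curl -v localhost:8000/v2/health/ready

# Model-level readiness for the deployed engine
curl -v localhost:8000/v2/models/nemotron_3_super/ready
```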
## Performance Optimization Tips

### Quantization Considerations
- NVFP4 Format: Provides 4x memory reduction vs FP16 with minimal accuracy loss for many tasks
- Calibration: The quantization process uses a calibration dataset (if provided) to optimize scaling factors
- Validation: Always validate quantized model accuracy for your specific use case (see the evaluation sketch after this list)
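
One lightweight way to do that validation is the summarization accuracy script in the TensorRT-LLM `examples/` directory, which compares the engine against the original Hugging Face model on a ROUGE benchmark. A minimal sketch, assuming `summarize.py` from your TensorRT-LLM checkout (flag names may differ between releases):

```bash
# Compare the NVFP4 engine against the original HF checkpoint on a summarization benchmark
python examples/summarize.py \
  --test_trt_llm \
  --engine_dir ./nemotron-3-super-120b-nvfp4-engine \
  --hf_model_dir /models/Nemotron-3-Super-120B-A12B-NVFP4 \
  --batch_size 1
```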
### Engine Build Tuning
- Batch Size: Adjust `--max_batch_size` based on your latency/throughput requirements (see the rebuild sketch below)
- Sequence Lengths: Set `--max_input_len` and `--max_output_len` to match your application needs
- Plugin Selection: FP8 plugins provide the best performance on Blackwell GPUs
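
For example, a rebuild targeted at longer prompts and more concurrent sequences might look like the following; the limits shown are illustrative rather than tuned recommendations, and larger values increase activation and KV-cache memory use:

```bash
# Rebuild the engine with larger batch and sequence limits
trtllm-build \
  --checkpoint_dir ./nemotron-3-super-120b-nvfp4 \
  --engine_dir ./nemotron-3-super-120b-nvfp4-engine-long \
  --gpt_attention_plugin fp8 \
  --kv_cache_mode fp8 \
  --max_batch_size 32 \
  --max_input_len 4096 \
  --max_output_len 1024
```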
### Resource Allocation
- GPU Memory: TensorRT-LLM engines are typically smaller than the original model due to quantization
- Concurrent Requests: Use Triton's dynamic batching to maximize GPU utilization (the metrics check below is one way to monitor this)
- Memory Pool: Adjust the TensorRT memory pool size if needed for very long contexts
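
Triton exposes Prometheus metrics on port 8002, which makes it easy to watch GPU utilization and queueing behavior while tuning dynamic batching; the metric names below are standard Triton metrics:

```bash
# GPU utilization and per-model queue time from Triton's metrics endpoint
curl -s localhost:8002/metrics | grep -E 'nv_gpu_utilization|nv_inference_queue_duration'
```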
## Verification
After deployment with Triton, test the endpoint:
```bash
curl -X POST localhost:8000/v2/models/nemotron_3_super/infer \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "name": "input_ids",
        "shape": [1, 5],
        "datatype": "INT32",
        "data": [[128006, 29892, 29958, 29947, 29915]]
      },
      {
        "name": "request_output_len",
        "shape": [1],
        "datatype": "UINT32",
        "data": [10]
      }
    ]
  }'
```

## Troubleshooting
- Quantization Errors: Ensure you have the latest TensorRT Model Optimizer and that the model is supported
- Engine Build Failures: Check GPU memory availability and try reducing batch size or sequence lengths
- Triton Connection Issues: Verify the model repository structure and config.pbtxt syntax
- Performance Issues: Profile with `nvtx` ranges and TensorBoard to identify bottlenecks (a profiling sketch follows this list)
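
Since TensorRT-LLM and the Triton backend emit NVTX ranges, one way to capture and inspect them is Nsight Systems. A minimal sketch, assuming `nsys` is available inside the serving container and that a short capture window is acceptable:

```bash
# Capture a 60-second profile of the server, including NVTX ranges and CUDA activity
nsys profile -t cuda,nvtx -o triton_profile --duration=60 \
  tritonserver --model-repository=/models
```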