# Verda Deployment Guide
Verda provides GPU-optimized cloud infrastructure via REST API, focusing on low-latency, high-performance instances ideal for interactive LLM inference workloads.
## Why Verda for Nemotron 3 Super NVFP4?
- Low-latency instances: Optimized for interactive workloads with minimal overhead
- API-driven provisioning: Full control via REST API for automation
- Blackwell GPU support: Access to latest NVIDIA architecture for NVFP4
- Flexible instance types: Choose exact GPU/VRAM configurations needed
- Transparent pricing: Competitive rates for sustained workloads
## Prerequisites
- Verda account
- Verda API key
- Hugging Face Read-Only Access Token
- curl or another HTTP client for API calls (the examples below also use jq to parse JSON responses; a quick tooling check follows this list)
- SSH client for instance access
- Docker installed locally (for building images if needed)
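Before provisioning anything, it's worth confirming the local tooling is in place. A minimal sketch (it assumes the API key is exported as `VERDA_API_KEY`, as in Step 1):

```bash
# Check that the client-side tools used in this guide are installed
for tool in curl jq ssh docker; do
  command -v "$tool" >/dev/null || echo "Missing: $tool"
done

# The API examples below assume this variable is set
[ -n "$VERDA_API_KEY" ] || echo "VERDA_API_KEY is not set"
```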
## Deployment Overview
Verda instances are provisioned via API, then accessed via SSH to deploy your NVFP4 inference engine. The general workflow is:
1. Provision Instance: Use the Verda API to create a Blackwell GPU instance
2. Configure Instance: Install dependencies via SSH
3. Deploy Model: Transfer or pull the NVFP4-quantized model
4. Run Inference Engine: Start vLLM/SGLang/TensorRT-LLM/Triton
5. Access Endpoint: Connect to the instance's public IP/DNS
## Step-by-Step Deployment
### Step 1: Provision a Blackwell GPU Instance
Use the Verda API to create an instance with Blackwell GPU support:
```bash
# Export your API key
export VERDA_API_KEY="your_verda_api_key_here"
# List available GPU types (look for Blackwell/B200/H100 etc.)
curl -s -X GET "https://api.verda.com/v1/gpu-types" \
  -H "Authorization: Bearer $VERDA_API_KEY"

# Create an instance with a Blackwell GPU.
# "H100" is a stand-in until specific Blackwell types are listed (e.g., "B200", "DGX-Spark");
# "memory" and "storage" are in GB (storage is NVMe).
# Note: comments must stay outside the payload, since JSON does not allow them.
INSTANCE_RESPONSE=$(curl -s -X POST "https://api.verda.com/v1/instances" \
  -H "Authorization: Bearer $VERDA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "nemotron-3-super-nvfp4",
    "gpu_type": "H100",
    "gpu_count": 2,
    "cpus": 16,
    "memory": 128,
    "storage": 500,
    "os": "ubuntu_24_04",
    "image": "ubuntu-24.04-base",
    "region": "us-east-1"
  }')
# Extract instance ID from response
INSTANCE_ID=$(echo "$INSTANCE_RESPONSE" | jq -r '.id')
echo "Instance ID: $INSTANCE_ID"Step 2: Wait for Instance to be Ready
### Step 2: Wait for Instance to be Ready

```bash
# Wait for instance to reach running state
while true; do
  STATUS=$(curl -s -X GET "https://api.verda.com/v1/instances/$INSTANCE_ID" \
    -H "Authorization: Bearer $VERDA_API_KEY" | jq -r '.status')
  if [ "$STATUS" = "running" ]; then
    echo "Instance is running!"
    break
  fi
  echo "Waiting for instance to be ready... (current status: $STATUS)"
  sleep 15
done

# Get connection details
INSTANCE_INFO=$(curl -s -X GET "https://api.verda.com/v1/instances/$INSTANCE_ID" \
  -H "Authorization: Bearer $VERDA_API_KEY")
IP_ADDRESS=$(echo "$INSTANCE_INFO" | jq -r '.ip_address')
SSH_USER=$(echo "$INSTANCE_INFO" | jq -r '.default_user')
echo "Connect via: ssh $SSH_USER@$IP_ADDRESS"Step 3: Connect and Configure the Instance
### Step 3: Connect and Configure the Instance

```bash
# SSH into the instance (you may need to add SSH key first via Verda dashboard/API)
ssh $SSH_USER@$IP_ADDRESS
# Once connected, update system and install dependencies
sudo apt update && sudo apt upgrade -y
sudo apt install -y docker.io
# Add user to docker group
sudo usermod -aG docker $USER
newgrp docker # Activates docker group in the current shell
# Install the NVIDIA Container Toolkit (successor to the deprecated nvidia-docker2;
# the old apt-key based repo setup no longer works on Ubuntu 24.04)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Test Docker with GPU
sudo docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```
### Step 4: Deploy Your NVFP4 Inference Engine

Choose your preferred inference engine. Here's an example with vLLM:
#### Option A: Pull Model at Runtime (Simplest)
```bash
# Install the Hugging Face CLI and log in (you'll need your HF token)
pip install -U "huggingface_hub[cli]"
huggingface-cli login # Enter your HF token when prompted
# Create directory for models
mkdir -p ~/models && cd ~/models
# Pull the Nemotron 3 Super NVFP4 model
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--repo-type model \
--local-dir Nemotron-3-Super-120B-A12B-NVFP4
# Run vLLM container
docker run --gpus all -it --rm \
-v $(pwd)/Nemotron-3-Super-120B-A12B-NVFP4:/models/Nemotron-3-Super-120B-A12B-NVFP4 \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model /models/Nemotron-3-Super-120B-A12B-NVFP4 \
--served-model-name nemotron-3-super \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--dtype auto \
--quantization compressed-tensors \
--max-model-len 131072 \
--gpu-memory-utilization 0.95 \
--kv-cache-dtype fp8
```
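Loading a ~120B checkpoint takes several minutes even from local NVMe. vLLM exposes a `/health` endpoint, so a simple readiness wait looks like this:

```bash
# Poll until the vLLM server reports healthy
until curl -sf http://localhost:8000/health >/dev/null; do
  echo "Waiting for vLLM to finish loading..."
  sleep 15
done
echo "Server is ready"
```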
#### Option B: Pre-pull Model for Faster Startup

```bash
# Pull model once, then reuse
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--repo-type model \
--local-dir ~/models/Nemotron-3-Super-120B-A12B-NVFP4
# Create a reusable container command
alias run-nemotron='docker run --gpus all -it --rm \
-v $HOME/models/Nemotron-3-Super-120B-A12B-NVFP4:/models/Nemotron-3-Super-120B-A12B-NVFP4 \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model /models/Nemotron-3-Super-120B-A12B-NVFP4 \
--served-model-name nemotron-3-super \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--dtype auto \
--quantization compressed-tensors \
--max-model-len 131072 \
--gpu-memory-utilization 0.95 \
--kv-cache-dtype fp8'
# Run it
run-nemotron
```

#### Option C: Systemd Service for Persistent Operation
Create a service file for automatic startup:
```bash
# Create the service file
sudo tee /etc/systemd/system/nemotron-3-super.service > /dev/null <<EOF
[Unit]
Description=Nemotron 3 Super NVFP4 Inference Server
After=network-online.target docker.service
Requires=docker.service
[Service]
Type=simple
User=$USER
WorkingDirectory=$HOME/models
ExecStart=/usr/bin/docker run --rm \\
  --name nemotron-3-super \\
  --gpus all \\
  -v $HOME/models/Nemotron-3-Super-120B-A12B-NVFP4:/models/Nemotron-3-Super-120B-A12B-NVFP4 \\
  -p 8000:8000 \\
  vllm/vllm-openai:latest \\
  --model /models/Nemotron-3-Super-120B-A12B-NVFP4 \\
  --served-model-name nemotron-3-super \\
  --host 0.0.0.0 \\
  --port 8000 \\
  --tensor-parallel-size 2 \\
  --dtype auto \\
  --quantization compressed-tensors \\
  --max-model-len 131072 \\
  --gpu-memory-utilization 0.95 \\
  --kv-cache-dtype fp8
ExecStop=/usr/bin/docker stop nemotron-3-super
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable nemotron-3-super.service
sudo systemctl start nemotron-3-super.service
# Check status
sudo systemctl status nemotron-3-super.service
```
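Because the container runs under systemd, its output lands in the journal; startup progress and errors can be followed with:

```bash
# Follow the inference server logs
sudo journalctl -u nemotron-3-super.service -f
```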
### Step 5: Access Your Inference Endpoint

Once the server is running, access it via:

`http://$IP_ADDRESS:8000/v1/completions`

Example test:
```bash
curl http://$IP_ADDRESS:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron-3-super",
"prompt": "Hello, my name is",
"max_tokens": 10
}'
```
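The vLLM OpenAI-compatible server also exposes `/v1/chat/completions`, which is usually the better fit for an instruction-tuned model:

```bash
curl http://$IP_ADDRESS:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron-3-super",
    "messages": [{"role": "user", "content": "Summarize NVFP4 in one sentence."}],
    "max_tokens": 100
  }'
```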
## Configuration Options

### GPU Selection
When provisioning via API, specify the GPU type; use "H100" or another listed type until Blackwell types (e.g., "B200", "DGX-Spark") appear in the listing:

```json
{
  "gpu_type": "B200",
  "gpu_count": 2
}
```

### Storage Considerations
- Model Storage: ~200GB for Nemotron 3 Super NVFP4
- Recommended: 500GB+ NVMe for comfortable operation
- For evals/benchmarks: 1TB+ NVMe
- Storage Type: Verda typically offers NVMe storage for GPU instances
### Networking
- Verda instances typically get a public IP address by default
- Security groups/firewall rules may need to be configured to allow port 8000 (see the ufw sketch below)
- Consider using a reverse proxy or VPN for production access
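On the instance itself, Ubuntu's ufw can limit who reaches the inference port. A minimal sketch (the source address is an example placeholder):

```bash
# Keep SSH reachable, then expose port 8000 only to a trusted address
sudo ufw allow 22/tcp
sudo ufw allow from 198.51.100.7 to any port 8000 proto tcp
sudo ufw enable
```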
## Alternative Inference Engines on Verda
### SGLang
```bash
# Install SGLang (quotes keep the extras spec from being glob-expanded)
pip install "sglang[all]"
# Launch server
python -m sglang.launch_server \
--model-path /models/Nemotron-3-Super-120B-A12B-NVFP4 \
--quantization modelopt_fp4 \
--trust-remote-code \
--tp 2 \
--max-running-requests 256 \
--host 0.0.0.0 \
--port 30000
```

### TensorRT-LLM
Follow the TensorRT-LLM guide to:
- Quantize the model with `hf_ptq.py --qformat nvfp4`
- Build the engine with `trtllm-build` (see the sketch below)
- Deploy via Triton or directly
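A rough sketch of that flow follows; the script location and most flag names vary by TensorRT-LLM release (only `--qformat nvfp4` comes from the guide above, the directory arguments are illustrative), so follow the version-specific documentation:

```bash
# Quantize the checkpoint to NVFP4 (hf_ptq.py ships with the TensorRT-LLM /
# Model Optimizer quantization examples; directory flags here are illustrative)
python hf_ptq.py \
  --qformat nvfp4 \
  --model_dir ~/models/Nemotron-3-Super-120B-A12B-NVFP4 \
  --output_dir ./ckpt_nvfp4

# Build a TensorRT-LLM engine from the quantized checkpoint
trtllm-build --checkpoint_dir ./ckpt_nvfp4 --output_dir ./engine_nvfp4
```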
### Triton Inference Server
- Prepare TensorRT-LLM engine (as above)
- Set up model repository
- Run Triton container:
```bash
docker run --gpus all --rm -p8000:8000 \
-v $(pwd)/model_repository:/models \
nvcr.io/nvidia/tritonserver:24.07-py3 \
tritonserver --model-repository=/models
```
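Triton implements the standard KServe v2 health API on its HTTP port, so readiness can be verified with:

```bash
# Returns HTTP 200 once the server and models are ready
curl -v http://localhost:8000/v2/health/ready
```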
## Cost Optimization Strategies

### 1. Right-Sizing Instances
- Start with smaller configurations for development/testing
- Monitor GPU utilization and adjust instance type/size
- Consider fractional GPUs if Verda offers them (rarely practical for a model of this size, but useful for auxiliary services)
### 2. Instance Lifecycle Management
- Stop when not in use: Verda likely allows stopping instances (not terminating) to preserve storage while avoiding compute charges
- Automated shutdown: Implement idle detection to automatically stop instances (a sketch follows this list)
- Scheduled workloads: Run instances only during known peak usage times
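A minimal idle-detection sketch: it samples GPU utilization with `nvidia-smi` and calls a stop endpoint after three consecutive idle checks (~30 minutes). The stop endpoint is hypothetical, so substitute the real Verda API call:

```bash
# Stop the instance after three consecutive idle checks (~30 minutes)
idle=0
while true; do
  # Highest utilization across all GPUs, as a bare integer
  util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | sort -n | tail -1)
  if [ "$util" -lt 5 ]; then idle=$((idle + 1)); else idle=0; fi
  if [ "$idle" -ge 3 ]; then
    # Hypothetical endpoint; check Verda's API docs
    curl -s -X POST "https://api.verda.com/v1/instances/$INSTANCE_ID/stop" \
      -H "Authorization: Bearer $VERDA_API_KEY"
    break
  fi
  sleep 600
done
```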
### 3. Storage Efficiency
- Use snapshots/images to preserve state without running instance
- Compress model backups when storing long-term
- Consider storing model in object storage and pulling to instance on startup
### 4. Resource Monitoring
- Track GPU utilization, memory usage, and throughput (see the sketch below)
- Identify over-provisioned resources
- Adjust based on actual workload patterns
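For a lightweight baseline, `nvidia-smi` can log utilization and memory on a fixed interval (`-l` repeats the query every N seconds):

```bash
# Append a utilization/memory sample to a CSV every 60 seconds
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total \
  --format=csv -l 60 >> ~/gpu_usage.csv
```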
## Troubleshooting
### Common Issues
#### Instance Provisioning Failures
- Check API response for error messages
- Verify GPU type and quantity availability in your region
- Check account limits/quotas
#### Connection/SSH Issues
- Verify instance is in "running" state
- Check SSH key configuration in Verda dashboard
- Verify IP address and port (typically 22)
- Check local firewall settings
#### Docker/GPU Issues Inside Instance
- Verify NVIDIA drivers are installed: `nvidia-smi`
- Check Docker can access the GPU: `docker run --gpus all --rm nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi`
- Ensure your user is in the docker group
#### Model Loading Problems
- Verify Hugging Face token has read access to the model
- Check available disk space (>200GB needed)
- Ensure model was downloaded completely
#### Inference Engine Errors
- Check engine-specific logs
- Verify command syntax and flags
- Try reducing tensor parallel size if OOM occurs
- Ensure sufficient swap space if needed
### Diagnostic Commands
```bash
# Check Verda instance status
curl -X GET "https://api.verda.com/v1/instances/$INSTANCE_ID" \
-H "Authorization: Bearer $VERDA_API_KEY"
# Test GPU access from within instance
nvidia-smi
docker run --gpus all --rm nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
# Check model files
ls -lh /models/Nemotron-3-Super-120B-A12B-NVFP4/
du -sh /models/Nemotron-3-Super-120B-A12B-NVFP4/
# Test network connectivity
curl -I http://localhost:8000/health # vLLM health endpoint; adjust for other engines
```