# Modal Deployment Guide
Modal is a serverless platform that allows you to run containerized workloads in the cloud with per-second billing. It's particularly well-suited for bursty GPU workloads like LLM inference.
## Why Modal for Nemotron 3 Super NVFP4?
- **Per-second billing**: Only pay for actual GPU compute time
- **Automatic scaling**: Scale from zero to hundreds of GPUs instantly
- **Simple Python API**: Define functions and deploy with minimal configuration
- **Built-in secrets management**: Securely handle Hugging Face tokens and API keys
- **Container customization**: Full control over your Docker environment
## Prerequisites
- A Modal account
- The Modal CLI installed and authenticated (see below)
- A Hugging Face read-only access token (for accessing gated models)
- Docker installed locally (only needed if you build custom images)
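If the CLI isn't set up yet, installation and authentication are two commands (`modal setup` opens a browser window to link your account):

```bash
# Install the Modal client and authenticate
pip install modal
modal setup
```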
## Deployment Options
### Option 1: Using a Prebuilt Registry Image (Recommended for Quick Start)

For rapid deployment without building a custom image, pull NVIDIA's PyTorch base image from NGC and layer the Python dependencies on top:
```python
# deploy_modal.py
import modal

# Create a Modal app
app = modal.App("nemotron-3-super-nvfp4")

# Define the image with the necessary dependencies.
# Note: HF_TOKEN is injected at runtime by the Modal secret attached to
# the functions below, so there is no need to bake it into the image.
image = (
    modal.Image.from_registry("nvcr.io/nvidia/pytorch:24.05-py3")
    .apt_install("git")  # Add any system dependencies you need
    .pip_install(
        "vllm",  # or "sglang", etc.
        "huggingface_hub",
        "torch",
        "transformers",
    )
)

# Mount your model volume (if using persistent storage)
# model_volume = modal.Volume.from_name("nemotron-model", create_if_missing=True)

MODEL_PATH = "/models/Nemotron-3-Super-120B-A12B-NVFP4"

# Shared vLLM launch command. If you are not using a volume, replace
# MODEL_PATH with the model's Hugging Face repo ID to download at startup.
VLLM_CMD = [
    "vllm", "serve", MODEL_PATH,
    "--host", "0.0.0.0",
    "--port", "8000",
    "--tensor-parallel-size", "2",
    "--dtype", "auto",
    "--quantization", "compressed-tensors",
    "--max-model-len", "131072",
    "--gpu-memory-utilization", "0.95",
    "--kv-cache-dtype", "fp8",
]


@app.function(
    image=image,
    gpu="H100:2",  # Request 2 H100 GPUs (adjust based on your needs)
    # volumes={"/models": model_volume},  # Uncomment if using a persistent volume
    secrets=[modal.Secret.from_name("hf-token")],  # Reference your secret
    timeout=3600,  # 1 hour timeout
)
@modal.web_server(8000, startup_timeout=600)  # loading a 120B model takes minutes
def serve():
    import subprocess

    # Launch the vLLM server in the background; Modal keeps the container
    # alive and starts routing traffic once the port accepts connections.
    subprocess.Popen(VLLM_CMD)


# One-off test function: start the server in the same container and send
# a single request against it
@app.function(
    image=image,
    gpu="H100:2",
    secrets=[modal.Secret.from_name("hf-token")],
    timeout=1800,  # leave room for model loading
)
def test_inference():
    import subprocess
    import time

    import requests

    # Start the server in the background
    server_process = subprocess.Popen(VLLM_CMD)

    # Poll until the server is up instead of sleeping a fixed interval
    for _ in range(120):
        try:
            requests.get("http://localhost:8000/health", timeout=5)
            break
        except requests.exceptions.ConnectionError:
            time.sleep(10)

    # Test inference. The model name must match what vLLM serves, which is
    # the model path by default (unless --served-model-name is set).
    response = requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": MODEL_PATH,
            "prompt": "Hello, my name is",
            "max_tokens": 10,
        },
    )
    print(response.json())

    # Clean up
    server_process.terminate()
```
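Before deploying, you can smoke-test the app with `modal run`, which executes a single function in the cloud and streams its logs back to your terminal:

```bash
# Execute the one-off test function on Modal GPUs
modal run deploy_modal.py::test_inference
```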
### Option 2: Custom Docker Image (For Full Control)

If you need specific dependencies or want to optimize the image:
```dockerfile
# Dockerfile
FROM nvcr.io/nvidia/pytorch:24.05-py3

# Install system dependencies
RUN apt-get update && apt-get install -y \
    git \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Note: do not bake HF_TOKEN into the image; the Modal secret attached to
# the function injects it as an environment variable at runtime.

# Create non-root user (optional but recommended)
RUN useradd -m appuser
WORKDIR /home/appuser
USER appuser

# Default command (can be overridden)
CMD ["bash"]
```
```python
# deploy_custom_modal.py
import modal

app = modal.App("nemotron-3-super-custom")

# Build the custom image from the Dockerfile
image = modal.Image.from_dockerfile("Dockerfile")


@app.function(
    image=image,
    gpu="H100:2",
    secrets=[modal.Secret.from_name("hf-token")],
    timeout=3600,
)
@modal.web_server(8000)
def serve():
    import subprocess

    # Launch vLLM; Modal keeps the container alive while the port serves
    subprocess.Popen([
        "vllm", "serve",
        "/models/Nemotron-3-Super-120B-A12B-NVFP4",
        "--host", "0.0.0.0",
        "--port", "8000",
        "--tensor-parallel-size", "2",
        "--dtype", "auto",
        "--quantization", "compressed-tensors",
        "--max-model-len", "131072",
        "--gpu-memory-utilization", "0.95",
        "--kv-cache-dtype", "fp8",
    ])
```
## Deployment Steps

### 1. Set Up Secrets
```bash
# Store your Hugging Face token securely
modal secret create hf-token HF_TOKEN="your_hf_token_here"
```

Functions that attach this secret receive the token as the `HF_TOKEN` environment variable at runtime.

### 2. Deploy the Application
```bash
# For the quick start version
modal deploy deploy_modal.py

# For the custom image version
modal deploy deploy_custom_modal.py
```

### 3. Access Your Endpoint
After deployment, Modal will print a URL like `https://your-workspace--nemotron-3-super-nvfp4-serve.modal.run`. Use this URL in your hybridai-NVFP4 TUI or any client application.
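As a quick sanity check, here is a minimal Python client sketch against that endpoint; the URL is the placeholder above, and the model name assumes the volume path used in the serve command:

```python
# query_endpoint.py - minimal OpenAI-compatible completions request
import requests

ENDPOINT = "https://your-workspace--nemotron-3-super-nvfp4-serve.modal.run"

response = requests.post(
    f"{ENDPOINT}/v1/completions",
    json={
        "model": "/models/Nemotron-3-Super-120B-A12B-NVFP4",
        "prompt": "Hello, my name is",
        "max_tokens": 10,
    },
    timeout=120,  # allow for a cold start
)
print(response.json()["choices"][0]["text"])
```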
## Configuration Options
### GPU Types
Modal offers various GPU options. For Blackwell/NVFP4 optimization:
"H100:2"- 2x H100 GPUs (good starting point)"B200:1"- 1x B200 GPU (when available)"L40S:4"- 4x L40S GPUs (alternative)
Adjust based on your tensor parallelism needs and budget.
### Scaling Behavior
By default, Modal web functions scale to zero when idle. You can adjust:
```python
@app.function(
    image=image,
    gpu="H100:2",
    scaledown_window=300,  # Scale down after 5 min of no requests
    min_containers=1,      # Keep at least 1 container warm
)
@modal.web_server(8000)
def serve():
    ...
```

Note that `scaledown_window` and `min_containers` are parameters of `@app.function()`, not of `@modal.web_server()`.

### Volume Persistence
For persistent model storage (avoids re-downloading the weights on every cold start):
```bash
# Create the volume first
modal volume create nemotron-model
```

```python
# Then reference it in your function
model_volume = modal.Volume.from_name("nemotron-model")
# volumes={"/models": model_volume}
```
## Cost Optimization Tips

### 1. Right-Size GPU Selection
- Start with smaller configurations for testing
- Monitor utilization and adjust GPU count/type
- If your workload allows, batch work over time rather than scaling out to more GPUs
### 2. Use Efficient Quantization
- NVFP4 provides the best memory efficiency for this model
- An FP8 KV cache further reduces the memory footprint
- Avoid over-provisioning GPUs to compensate for an inefficient configuration
### 3. Implement Smart Scaling
- Set an appropriate `scaledown_window` to balance cost vs. cold start latency
- Consider `min_containers=0` for truly intermittent workloads
- Use batch processing when possible to maximize GPU utilization
### 4. Monitor and Optimize
- Use Modal's monitoring tools to track GPU utilization
- Log inference metrics to identify bottlenecks
- Adjust parameters based on actual usage patterns
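For example, logs for a deployed app can be streamed from the CLI (using this guide's app name):

```bash
# Stream logs for the deployed app
modal app logs nemotron-3-super-nvfp4
```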
## Troubleshooting
### Common Issues
**Container Start Failures**
- Check build logs for dependency issues
- Verify GPU availability in your selected region
- Ensure model path is correct
**Out of Memory Errors**

- Increase the tensor parallel size to spread the weights across more GPUs
- Lower `--gpu-memory-utilization` or `--max-model-len`
- Check that the model is properly quantized
**Connection Problems**

- Verify that the port in `@modal.web_server()` matches your server's port
- Check firewall settings (though Modal handles this)
- Ensure the server is binding to `0.0.0.0`
**Slow Cold Starts**
- Optimize Docker image size
- Consider keeping minimum containers warm
- Pre-download the model during the image build if feasible (see the sketch below)
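One way to pre-download during the build is Modal's `Image.run_function`, which executes a Python function at image-build time and snapshots the result into an image layer. This is a sketch: the repo ID is a placeholder, and baking very large weights into an image can make it slower to pull than reading from a volume.

```python
import modal


def download_model():
    # Runs at image-build time; downloaded files are snapshotted into
    # the image layer
    from huggingface_hub import snapshot_download

    snapshot_download(
        "nvidia/Nemotron-3-Super-120B-A12B-NVFP4",  # placeholder repo ID
        local_dir="/models/Nemotron-3-Super-120B-A12B-NVFP4",
    )


image = (
    modal.Image.from_registry("nvcr.io/nvidia/pytorch:24.05-py3")
    .pip_install("vllm", "huggingface_hub", "torch", "transformers")
    .run_function(download_model, secrets=[modal.Secret.from_name("hf-token")])
)
```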