SimplePod Deployment Guide

SimplePod provides Kubernetes-based GPU pod provisioning with transparent pricing and direct access to bare-metal performance. It's ideal for predictable, sustained LLM workloads.

Why SimplePod for Nemotron 3 Super NVFP4?

  • Bare-metal performance: No virtualization overhead
  • Transparent pricing: Per-second billing with no hidden fees
  • Kubernetes-native: Familiar deployment workflow for containerized workloads
  • Direct GPU access: Full utilization of Blackwell GPU capabilities
  • Simple CLI: Easy pod management and monitoring

Prerequisites

  1. SimplePod account
  2. SimplePod CLI installed
  3. Hugging Face Read-Only Access Token
  4. kubectl configured (SimplePod provides kubeconfig); see the verification commands below
  5. Docker installed (for building custom images if needed)
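
Before deploying, it's worth confirming that the kubeconfig is active and that GPU capacity is visible. A quick sanity check (output will vary by cluster):

bash
# Confirm cluster access using the SimplePod-provided kubeconfig
kubectl get nodes -o wide

# Confirm GPUs are advertised by the device plugin on the nodes
kubectl describe nodes | grep nvidia.com/gpu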

Deployment Options

Option 1: Using Official Images with Environment Variables

Create a pod specification that uses official inference engine images with configuration via environment variables:

yaml
# nemotron-simplepod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nemotron-3-super-vllm
  labels:
    app: nemotron-3-super
spec:
  restartPolicy: Never
  containers:
  - name: vllm-server
    image: vllm/vllm-openai:latest
    args:
      - --model
      - /models/Nemotron-3-Super-120B-A12B-NVFP4
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
      - --tensor-parallel-size
      - "2"
      - --dtype
      - "auto"
      - --quantization
      - "compressed-tensors"
      - --max-model-len
      - "131072"
      - --gpu-memory-utilization
      - "0.95"
      - --kv-cache-dtype
      - "fp8"
    env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-secret
          key: token
    resources:
      limits:
        nvidia.com/gpu: 2  # Request 2 GPUs for tensor parallel size 2
    volumeMounts:
    - name: model-storage
      mountPath: /models
    ports:
    - containerPort: 8000
  volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: nemotron-model-pvc  # Pre-created PVC with model data
---
# Optional: Service for easier access
apiVersion: v1
kind: Service
metadata:
  name: nemotron-3-super-service
spec:
  selector:
    app: nemotron-3-super
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
  type: LoadBalancer

Option 2: Custom Image with Pre-loaded Model

For faster startup times, bake the model into a custom image:

dockerfile
# Dockerfile.nemotron
FROM vllm/vllm-openai:latest

# Set environment variables for Hugging Face access
ARG HF_TOKEN
ENV HF_TOKEN=${HF_TOKEN}

# Create model directory
RUN mkdir -p /models

# Download the model during the build (requires the token at build time)
# NOTE: a token passed via ARG/ENV persists in the image layers; for production,
# prefer BuildKit secrets or the PVC approach in Option 3
RUN huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --repo-type model \
    --local-dir /models/Nemotron-3-Super-120B-A12B-NVFP4 \
    --token $HF_TOKEN

# Expose the port
EXPOSE 8000

# Default command
# The base image's entrypoint already launches the OpenAI-compatible server,
# so only the arguments are supplied here
CMD ["--model", "/models/Nemotron-3-Super-120B-A12B-NVFP4", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--tensor-parallel-size", "2", \
     "--dtype", "auto", \
     "--quantization", "compressed-tensors", \
     "--max-model-len", "131072", \
     "--gpu-memory-utilization", "0.95", \
     "--kv-cache-dtype", "fp8"]

Then reference it in your pod spec:

yaml
containers:
- name: vllm-server
  image: your-registry/nemotron-3-super:v1.0
  # ... rest same as above

Option 3: Using Persistent Volume for Model Storage

Recommended approach for balancing startup time and flexibility:

yaml
# nemotron-simplepod-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nemotron-model-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi  # Adjust based on your needs
  storageClassName: simplepod-standard  # or your preferred storage class
---
apiVersion: v1
kind: Pod
metadata:
  name: nemotron-3-super-pvc
spec:
  restartPolicy: Never
  containers:
  - name: vllm-server
    image: vllm/vllm-openai:latest
    args:
      - --model
      - /models/Nemotron-3-Super-120B-A12B-NVFP4
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
      - --tensor-parallel-size
      - "2"
      - --dtype
      - "auto"
      - --quantization
      - "compressed-tensors"
      - --max-model-len
      - "131072"
      - --gpu-memory-utilization
      - "0.95"
      - --kv-cache-dtype
      - "fp8"
    env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-secret
          key: token
    resources:
      limits:
        nvidia.com/gpu: 2
    volumeMounts:
    - name: model-storage
      mountPath: /models
  volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: nemotron-model-pvc

Deployment Steps

1. Prepare Hugging Face Secret

bash
# Create a secret for your HF token
kubectl create secret generic hf-secret \
  --from-literal=token="your_hf_token_here"

2. Create Persistent Volume Claim (if using PVC approach)

bash
kubectl apply -f nemotron-simplepod-pvc.yaml

3. Upload Model to Storage (if using PVC)

You'll need to populate the PVC with the model files before the first deployment. This can be done via:

  • A temporary pod that copies data to the PVC (see the sketch below)
  • SimplePod's file transfer mechanisms
  • Pre-signed URLs if your storage provider supports them
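
For the temporary-pod approach, a minimal downloader sketch; the pod name and image are illustrative, and the pod should be deleted once the download finishes, since the PVC is ReadWriteOnce and the inference pod needs it next:

yaml
# model-download-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nemotron-model-download
spec:
  restartPolicy: Never
  containers:
  - name: downloader
    image: python:3.11-slim
    command: ["/bin/sh", "-c"]
    args:
      - |
        pip install --quiet "huggingface_hub[cli]" &&
        huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
          --local-dir /models/Nemotron-3-Super-120B-A12B-NVFP4 \
          --token "$HF_TOKEN"
    env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-secret
          key: token
    volumeMounts:
    - name: model-storage
      mountPath: /models
  volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: nemotron-model-pvc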

4. Deploy the Pod

bash
kubectl apply -f nemotron-simplepod.yaml

5. Access the Service

If you created the LoadBalancer service:

bash
# Get the external IP
kubectl get service nemotron-3-super-service

# Then access via http://<EXTERNAL-IP>
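
Once an external IP is assigned, a quick smoke test against the OpenAI-compatible API (vLLM registers the model under the path passed to --model unless --served-model-name is set):

bash
# List the served models
curl -s http://<EXTERNAL-IP>/v1/models

# Request a short completion
curl -s http://<EXTERNAL-IP>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/models/Nemotron-3-Super-120B-A12B-NVFP4",
        "prompt": "Hello, world",
        "max_tokens": 32
      }'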

Configuration Options

GPU Resource Requests

Adjust based on your tensor parallelism needs:

yaml
resources:
  limits:
    nvidia.com/gpu: 2  # For tensor parallel size 2

For other configurations:

  • Tensor parallel size 4: nvidia.com/gpu: 4
  • Tensor parallel size 1: nvidia.com/gpu: 1 (requires sufficient memory on a single GPU)

Model Path Adjustments

If your model is stored in a different location within the volume:

yaml
args:
  - --model
  - /mnt/models/Nemotron-3-Super-120B-A12B-NVFP4  # Adjust path
  # ... rest of args

Alternative Inference Engines

Simply change the image and args:

  • SGLang: lmsysorg/sglang:latest with the appropriate launch command (see the sketch below)
  • TensorRT-LLM: Custom image with pre-built engine
  • Triton: nvcr.io/nvidia/tritonserver:24.07-py3 with model repository
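
For example, a container spec sketch for SGLang; the launch flags are a reasonable guess and should be checked against the SGLang release you deploy, in particular any quantization flag needed for NVFP4 checkpoints:

yaml
containers:
- name: sglang-server
  image: lmsysorg/sglang:latest
  command: ["python3", "-m", "sglang.launch_server"]
  args:
    - --model-path
    - /models/Nemotron-3-Super-120B-A12B-NVFP4
    - --host
    - "0.0.0.0"
    - --port
    - "8000"
    - --tp
    - "2"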

Scaling Options

Single Pod (For Development/Test)

The single-pod setup shown above is a good fit for testing and low-traffic scenarios.

Deployment for High Availability

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nemotron-3-super-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nemotron-3-super
  template:
    metadata:
      labels:
        app: nemotron-3-super
    spec:
      containers:
      - name: vllm-server
        # ... same container spec as in pod
      # Add pod anti-affinity for better distribution
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nemotron-3-super
            topologyKey: "kubernetes.io/hostname"

Horizontal Pod Autoscaler

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nemotron-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nemotron-3-super-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  # NOTE: the built-in Resource metric type supports only cpu and memory, so GPU
  # utilization must come from a custom metrics pipeline (for example the NVIDIA
  # DCGM exporter scraped by Prometheus and exposed through the Prometheus Adapter)
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL  # DCGM's GPU utilization gauge
      target:
        type: AverageValue
        averageValue: "80"

Cost Optimization Strategies

1. Right-Sizing Instances

  • Start with minimal GPU allocation for testing
  • Monitor utilization via SimplePod dashboard or kubectl top
  • Adjust based on actual workload patterns
  • Consider using heterogeneous node pools if available

2. Spot Instances / Preemptible Nodes

SimplePod may offer spot instances; check their documentation. If available:

  • Significantly lower cost than on-demand capacity
  • Best suited to fault-tolerant workloads
  • Implement checkpointing for long-running inference jobs so interruptions are recoverable

3. Efficient Resource Utilization

  • Use tensor parallelism to distribute model across multiple GPUs
  • Adjust batch size and max_num_seqs based on latency requirements (see the example below)
  • Monitor GPU memory usage and adjust utilization factor
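
For example, vLLM's batching limits can be tuned directly in the pod args; the values below are illustrative starting points, not recommendations:

yaml
args:
  # Maximum number of sequences processed per batch
  - --max-num-seqs
  - "128"
  # Cap on tokens batched together per scheduler step
  - --max-num-batched-tokens
  - "8192"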

4. Storage Optimization

  • Use appropriate storage class for your access pattern
  • Consider caching frequently accessed model layers
  • Archive older model versions if doing frequent updates

Monitoring and Logging

Basic Monitoring

bash
# View pod logs
kubectl logs -f nemotron-3-super-vllm

# Describe pod for events and resource usage
kubectl describe pod nemotron-3-super-vllm

# Top command for resource usage
kubectl top pod nemotron-3-super-vllm

Advanced Monitoring

SimplePod may integrate with:

  • Prometheus/Grafana for metrics
  • ELK stack for logs
  • Custom monitoring solutions

Check SimplePod documentation for available observability features.
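
Independently of platform integrations, the vLLM server exposes Prometheus-format metrics on its serving port, which you can scrape or spot-check directly:

bash
# Spot-check vLLM's built-in metrics via a port forward
kubectl port-forward pod/nemotron-3-super-vllm 8000:8000 &
curl -s http://localhost:8000/metrics | grep vllm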

Troubleshooting

Common Issues

  1. Pod Pending State

    • Check events: kubectl describe pod <name>
    • Common causes: insufficient GPU quota, PVC binding issues
    • Solution: Check quota with SimplePod dashboard or support
  2. Container Crashing

    • Check logs: kubectl logs <pod-name>
    • Common causes: missing HF token, incorrect model path, OOM
    • Solution: Fix configuration and redeploy
  3. Out of Memory (OOM)

    • Look for OOMKilled in pod status
    • Solutions:
      • Lower --gpu-memory-utilization (e.g., 0.95 → 0.85)
      • Reduce --max-model-len if the full context window isn't needed
      • Increase --tensor-parallel-size to spread weights across more GPUs (and raise the GPU request to match)
      • Verify the quantized (NVFP4) checkpoint is being loaded, not a full-precision one
  4. Connection Issues

    • Verify service type and ports
    • Check if LoadBalancer has external IP assigned
    • Test port forwarding: kubectl port-forward pod/nemotron-3-super-vllm 8000:8000
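
With the port forward active, a simple check that the server is up and the model is registered:

bash
curl -s http://localhost:8000/v1/models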

Diagnostic Commands

bash
# Check PVC status
kubectl get pvc nemotron-model-pvc

# Check node GPU allocation
kubectl describe nodes | grep -A 10 -B 5 "nvidia.com/gpu"

# Check events for namespace
kubectl get events --sort-by='.lastTimestamp'
