# SimplePod Deployment Guide
SimplePod provides Kubernetes-based GPU pod provisioning with transparent pricing and direct access to bare-metal performance. It's ideal for predictable, sustained LLM workloads.
## Why SimplePod for Nemotron 3 Super NVFP4?
- Bare-metal performance: No virtualization overhead
- Transparent pricing: Per-second billing with no hidden fees
- Kubernetes-native: Familiar deployment workflow for containerized workloads
- Direct GPU access: Full utilization of Blackwell GPU capabilities
- Simple CLI: Easy pod management and monitoring
## Prerequisites
- SimplePod account
- SimplePod CLI installed
- Hugging Face Read-Only Access Token
- kubectl configured (SimplePod provides kubeconfig)
- Docker installed (for building custom images if needed)
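
Before going further, it can help to sanity-check the tooling; a minimal sketch (assumes `huggingface-cli` is installed locally):

```bash
# Confirm cluster access with the SimplePod-provided kubeconfig
kubectl version --client
kubectl get nodes

# Confirm Docker and the Hugging Face token
docker --version
huggingface-cli whoami
```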
## Deployment Options
### Option 1: Using Official Images with Environment Variables
Create a pod specification that uses official inference engine images with configuration via environment variables:
```yaml
# nemotron-simplepod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nemotron-3-super-vllm
  labels:
    app: nemotron-3-super
spec:
  restartPolicy: Never
  containers:
    - name: vllm-server
      image: vllm/vllm-openai:latest
      args:
        - --model
        - /models/Nemotron-3-Super-120B-A12B-NVFP4
        - --host
        - "0.0.0.0"
        - --port
        - "8000"
        - --tensor-parallel-size
        - "2"
        - --dtype
        - "auto"
        - --quantization
        - "compressed-tensors"
        - --max-model-len
        - "131072"
        - --gpu-memory-utilization
        - "0.95"
        - --kv-cache-dtype
        - "fp8"
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token
      resources:
        limits:
          nvidia.com/gpu: 2 # Request 2 GPUs for tensor parallel size 2
      volumeMounts:
        - name: model-storage
          mountPath: /models
      ports:
        - containerPort: 8000
  volumes:
    - name: model-storage
      persistentVolumeClaim:
        claimName: nemotron-model-pvc # Pre-created PVC with model data
---
# Optional: Service for easier access
apiVersion: v1
kind: Service
metadata:
  name: nemotron-3-super-service
spec:
  selector:
    app: nemotron-3-super
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
  type: LoadBalancer
```

### Option 2: Custom Image with Pre-loaded Model
For faster startup times, bake the model into a custom image:
```dockerfile
# Dockerfile.nemotron
FROM vllm/vllm-openai:latest

# Set environment variables for Hugging Face access
ARG HF_TOKEN
ENV HF_TOKEN=${HF_TOKEN}

# Create model directory
RUN mkdir -p /models

# Download model during build (requires build-time secret)
# NOTE: For security, consider using volume mounts instead in production
RUN huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --repo-type model \
    --local-dir /models/Nemotron-3-Super-120B-A12B-NVFP4 \
    --token $HF_TOKEN

# Expose the port
EXPOSE 8000

# Default command. The base image ships its own server entrypoint, so clear
# it first to ensure CMD runs exactly as written.
ENTRYPOINT []
CMD ["vllm", "serve", \
     "/models/Nemotron-3-Super-120B-A12B-NVFP4", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--tensor-parallel-size", "2", \
     "--dtype", "auto", \
     "--quantization", "compressed-tensors", \
     "--max-model-len", "131072", \
     "--gpu-memory-utilization", "0.95", \
     "--kv-cache-dtype", "fp8"]
```

Then reference it in your pod spec:
```yaml
containers:
  - name: vllm-server
    image: your-registry/nemotron-3-super:v1.0
    # ... rest same as above
```
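
To build and push that image, a sketch (`your-registry` is a placeholder; note that passing the token as a build argument bakes it into image layers, per the warning in the Dockerfile):

```bash
# Build with the HF token as a build argument, then push to your registry
docker build -f Dockerfile.nemotron \
  --build-arg HF_TOKEN="your_hf_token_here" \
  -t your-registry/nemotron-3-super:v1.0 .
docker push your-registry/nemotron-3-super:v1.0
```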
### Option 3: Using Persistent Volume for Model Storage
Recommended approach for balancing startup time and flexibility:
```yaml
# nemotron-simplepod-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nemotron-model-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi # Adjust based on your needs
  storageClassName: simplepod-standard # or your preferred storage class
---
apiVersion: v1
kind: Pod
metadata:
  name: nemotron-3-super-pvc
spec:
  restartPolicy: Never
  containers:
    - name: vllm-server
      image: vllm/vllm-openai:latest
      args:
        - --model
        - /models/Nemotron-3-Super-120B-A12B-NVFP4
        - --host
        - "0.0.0.0"
        - --port
        - "8000"
        - --tensor-parallel-size
        - "2"
        - --dtype
        - "auto"
        - --quantization
        - "compressed-tensors"
        - --max-model-len
        - "131072"
        - --gpu-memory-utilization
        - "0.95"
        - --kv-cache-dtype
        - "fp8"
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token
      resources:
        limits:
          nvidia.com/gpu: 2
      volumeMounts:
        - name: model-storage
          mountPath: /models
  volumes:
    - name: model-storage
      persistentVolumeClaim:
        claimName: nemotron-model-pvc
```

## Deployment Steps
### 1. Prepare Hugging Face Secret
```bash
# Create a secret for your HF token
kubectl create secret generic hf-secret \
  --from-literal=token="your_hf_token_here"
```

### 2. Create Persistent Volume Claim (if using the PVC approach)
```bash
kubectl apply -f nemotron-simplepod-pvc.yaml
```

### 3. Upload Model to Storage (if using PVC)
You'll need to populate the PVC with the model files once. This can be done via:
- A temporary pod that copies data to the PVC (see the sketch after this list)
- SimplePod's file transfer mechanisms
- Pre-signed URLs, if your storage provider supports them
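
A minimal sketch of the temporary-pod approach, downloading the model straight into the PVC (the pod name, base image, and install step are illustrative; it reuses the `hf-secret` from step 1):

```bash
# Launch a throwaway pod that mounts the PVC and downloads the model into it
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: model-downloader
spec:
  restartPolicy: Never
  containers:
    - name: downloader
      image: python:3.11-slim
      command: ["/bin/sh", "-c"]
      args:
        - >
          pip install -U "huggingface_hub[cli]" &&
          huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
          --repo-type model --local-dir /models/Nemotron-3-Super-120B-A12B-NVFP4
          --token "$HF_TOKEN"
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token
      volumeMounts:
        - name: model-storage
          mountPath: /models
  volumes:
    - name: model-storage
      persistentVolumeClaim:
        claimName: nemotron-model-pvc
EOF

# Watch progress, then clean up once the download completes
kubectl logs -f model-downloader
kubectl delete pod model-downloader
```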
### 4. Deploy the Pod
```bash
kubectl apply -f nemotron-simplepod.yaml
```

### 5. Access the Service
If you created the LoadBalancer service:
```bash
# Get the external IP
kubectl get service nemotron-3-super-service
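# Then access the server at http://<EXTERNAL-IP>. A quick smoke test of the
# OpenAI-compatible API (the Service maps port 80 to the pod's 8000):
curl http://<EXTERNAL-IP>/v1/models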
```

## Configuration Options
### GPU Resource Requests
Adjust based on your tensor parallelism needs:
```yaml
resources:
  limits:
    nvidia.com/gpu: 2 # For tensor parallel size 2
```

For tensor parallel size 4, request `nvidia.com/gpu: 4`; for tensor parallel size 1, request `nvidia.com/gpu: 1` (requires sufficient memory on a single GPU).
### Model Path Adjustments
If your model is stored in a different location within the volume:
```yaml
args:
  - --model
  - /mnt/models/Nemotron-3-Super-120B-A12B-NVFP4 # Adjust path
  # ... rest of args
```

### Alternative Inference Engines
Simply change the image and args:
- SGLang: `lmsys/sglang:latest` with an appropriate launch command (see the sketch after this list)
- TensorRT-LLM: custom image with a pre-built engine
- Triton: `nvcr.io/nvidia/tritonserver:24.07-py3` with a model repository
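
For SGLang, for instance, a hedged launch command mirroring the vLLM settings above (verify the flags against the SGLang documentation for your image version):

```bash
# Replace the container's command/args with SGLang's server entrypoint
python3 -m sglang.launch_server \
  --model-path /models/Nemotron-3-Super-120B-A12B-NVFP4 \
  --host 0.0.0.0 \
  --port 8000 \
  --tp 2
```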
## Scaling Options
### Single Pod (for Development/Test)
As shown above; suitable for testing and low-traffic scenarios.
### Deployment for High Availability
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nemotron-3-super-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nemotron-3-super
  template:
    metadata:
      labels:
        app: nemotron-3-super
    spec:
      containers:
        - name: vllm-server
          # ... same container spec as in the pod above
      # Add pod anti-affinity for better distribution
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nemotron-3-super
              topologyKey: "kubernetes.io/hostname"
```

Note that the `ReadWriteOnce` PVC above can only be mounted on a single node; for multiple replicas, use a `ReadWriteMany` storage class or give each replica its own model volume.

### Horizontal Pod Autoscaler
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nemotron-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nemotron-3-super-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    # NOTE: GPU utilization is not a built-in HPA resource metric. Exposing it
    # typically requires an exporter (e.g., NVIDIA DCGM) plus a custom-metrics
    # adapter; treat this block as a sketch of the intended scaling policy.
    - type: Resource
      resource:
        name: gpu
        target:
          type: Utilization
          averageUtilization: 80
```

## Cost Optimization Strategies
### 1. Right-Sizing Instances
- Start with minimal GPU allocation for testing
- Monitor utilization via the SimplePod dashboard or `kubectl top` (see the sketch after this list)
- Adjust based on actual workload patterns
- Consider using heterogeneous node pools if available
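
Note that `kubectl top` reports CPU and memory only; GPU utilization is usually checked from inside the pod. A sketch:

```bash
# CPU/memory usage (requires metrics-server)
kubectl top pod nemotron-3-super-vllm

# GPU utilization and memory, sampled from inside the pod
kubectl exec nemotron-3-super-vllm -- nvidia-smi \
  --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
```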
### 2. Spot Instances / Preemptible Nodes
SimplePod may offer spot instances; check the documentation:
- Significantly lower cost
- Suitable for fault-tolerant workloads
- Implement checkpointing for long-running inference jobs
### 3. Efficient Resource Utilization
- Use tensor parallelism to distribute the model across multiple GPUs
- Adjust batch size and `--max-num-seqs` based on latency requirements (see the sketch after this list)
- Monitor GPU memory usage and adjust `--gpu-memory-utilization` accordingly
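
As a concrete example, a hedged vLLM invocation that caps concurrency and leaves memory headroom (the values are illustrative starting points, not tuned recommendations):

```bash
# Cap concurrent sequences and leave memory headroom for activation spikes
vllm serve /models/Nemotron-3-Super-120B-A12B-NVFP4 \
  --tensor-parallel-size 2 \
  --max-num-seqs 128 \
  --gpu-memory-utilization 0.90
```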
### 4. Storage Optimization
- Use appropriate storage class for your access pattern
- Consider caching frequently accessed model layers
- Archive older model versions if doing frequent updates
## Monitoring and Logging
### Basic Monitoring
```bash
# View pod logs
kubectl logs -f nemotron-3-super-vllm

# Describe pod for events and resource usage
kubectl describe pod nemotron-3-super-vllm

# Top command for resource usage
kubectl top pod nemotron-3-super-vllm
```

### Advanced Monitoring
SimplePod may integrate with:
- Prometheus/Grafana for metrics
- ELK stack for logs
- Custom monitoring solutions
Check SimplePod documentation for available observability features.
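
Independent of platform tooling, vLLM's OpenAI-compatible server also exposes Prometheus metrics on its HTTP port, which the stacks above can scrape; a quick manual check (reusing the access setup from step 5):

```bash
# Sample vLLM's built-in Prometheus metrics endpoint
curl http://<EXTERNAL-IP>/metrics | head -n 20
```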
## Troubleshooting
### Common Issues
#### Pod Pending State
- Check events: `kubectl describe pod <name>`
- Common causes: insufficient GPU quota, PVC binding issues
- Solution: check quota via the SimplePod dashboard or support
#### Container Crashing
- Check logs: `kubectl logs <pod-name>`
- Common causes: missing HF token, incorrect model path, OOM
- Solution: fix the configuration and redeploy
#### Out of Memory (OOM)
- Look for `OOMKilled` in the pod status
- Solutions:
  - Lower `--gpu-memory-utilization` (e.g., 0.9 → 0.8)
  - Increase `--tensor-parallel-size` (with a matching `nvidia.com/gpu` request) so the weights are sharded across more GPUs
  - Check that the model is properly quantized
  - Increase GPU allocation if possible
#### Connection Issues
- Verify service type and ports
- Check whether the LoadBalancer has an external IP assigned
- Test port forwarding: `kubectl port-forward pod/nemotron-3-super-vllm 8000:8000` (see the sketch below)
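
To rule out the Service layer entirely, forward the pod's port and query it directly; a sketch:

```bash
# Bypass the Service: forward the pod's port, then hit the API locally
kubectl port-forward pod/nemotron-3-super-vllm 8000:8000 &
curl http://localhost:8000/v1/models
kill %1  # stop the background port-forward when done
```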
### Diagnostic Commands
```bash
# Check PVC status
kubectl get pvc nemotron-model-pvc

# Check node GPU allocation
kubectl describe nodes | grep -A 10 -B 5 "nvidia.com/gpu"

# Check events for namespace
kubectl get events --sort-by='.lastTimestamp'
```