HybridMoE: Nemotron 3 Family Architecture
The Nemotron 3 family of models uses a hybrid Mamba-Transformer Mixture of Experts (MoE) architecture to deliver best-in-class throughput while matching or exceeding the accuracy of standard Transformers.
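To make the layer mixing concrete, here is a minimal PyTorch sketch of a hybrid stack that interleaves state-space (Mamba-style) blocks with occasional attention blocks. It is an illustration under stated assumptions, not Nemotron 3's implementation: `SimpleSSMBlock` is a heavily simplified stand-in for a real Mamba block, and the pattern of one attention layer per four layers is a hypothetical ratio.

```python
# Schematic hybrid Mamba-Transformer stack. All names, dimensions, and the
# SSM/attention interleave ratio are illustrative assumptions, not
# Nemotron 3's published configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSSMBlock(nn.Module):
    """Heavily simplified stand-in for a Mamba block:
    a gated, diagonal linear state-space scan with O(T) time
    and O(1) state per channel (no input-dependent dynamics)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Learned per-channel decay squashed into (0, 1).
        self.log_decay = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        decay = torch.sigmoid(self.log_decay)             # (D,)
        state = torch.zeros_like(u[:, 0])                 # (B, D)
        outs = []
        for t in range(u.size(1)):                        # recurrent scan over time
            state = decay * state + u[:, t]
            outs.append(state)
        h = torch.stack(outs, dim=1) * F.silu(gate)       # gated output
        return self.out_proj(h)


class AttentionBlock(nn.Module):
    """Standard multi-head self-attention (causal masking omitted for brevity)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


class HybridStack(nn.Module):
    """Mostly SSM layers, with an attention layer every `attn_every` layers."""

    def __init__(self, d_model=256, n_heads=4, n_layers=8, attn_every=4):
        super().__init__()
        self.layers = nn.ModuleList()
        self.norms = nn.ModuleList()
        for i in range(n_layers):
            if (i + 1) % attn_every == 0:
                self.layers.append(AttentionBlock(d_model, n_heads))
            else:
                self.layers.append(SimpleSSMBlock(d_model))
            self.norms.append(nn.LayerNorm(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for norm, layer in zip(self.norms, self.layers):
            x = x + layer(norm(x))  # pre-norm residual connection
        return x


if __name__ == "__main__":
    x = torch.randn(2, 16, 256)
    print(HybridStack()(x).shape)  # torch.Size([2, 16, 256])
```

The intuition behind the mix: the SSM layers carry sequence state cheaply in linear time, while the occasional attention layers retain precise token-to-token lookup where it matters.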
Key Benefits
- Combines Mamba's efficient state space modeling with Transformer's attention mechanisms
- Mixture of Experts (MoE) routing activates only a small subset of experts for each token, keeping per-token compute low (see the routing sketch after this list)
- Achieves superior throughput compared to dense or standard MoE models
- Maintains or improves accuracy on standard benchmarks
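The MoE bullet is what keeps compute sparse: a router scores all experts for each token, but only the top-k are actually evaluated. Below is a minimal top-k routing sketch in PyTorch; the expert count, `k=2`, the feed-forward expert shape, and the per-expert Python loop (production systems use batched dispatch kernels) are all illustrative assumptions.

```python
# Minimal top-k MoE feed-forward layer. Each token runs through only k of the
# n_experts networks, so per-token compute stays roughly constant as the
# expert pool grows. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=512, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # per-token expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        B, T, D = x.shape
        flat = x.reshape(B * T, D)
        logits = self.router(flat)                    # (N, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)    # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)          # renormalize over the chosen k
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed here
            if token_ids.numel() == 0:
                continue  # this expert received no tokens in the batch
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(flat[token_ids])
        return out.reshape(B, T, D)


if __name__ == "__main__":
    moe = TopKMoE()
    y = moe(torch.randn(2, 16, 256))
    print(y.shape)  # torch.Size([2, 16, 256])
```

Softmax over only the selected k logits is one common design choice; it keeps the combined expert outputs a convex mixture regardless of how many experts exist in total.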