HybridMoE: Nemotron 3 Family Architecture
The Nemotron 3 family of models uses a hybrid Mamba-Transformer Mixture of Experts (MoE) architecture to deliver best-in-class throughput while matching or exceeding the accuracy of standard Transformers.
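To make the layer mixing concrete, here is a minimal PyTorch sketch of a hybrid stack that interleaves state-space (Mamba-style) blocks with occasional attention blocks. It is an illustration under stated assumptions, not Nemotron 3's implementation: `SimpleSSMBlock` is a heavily simplified stand-in for a real Mamba block, and the pattern of one attention layer per four layers is a hypothetical ratio.

```python
# Schematic hybrid Mamba-Transformer stack. All names, dimensions, and the
# SSM/attention interleave ratio are illustrative assumptions, not
# Nemotron 3's published configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSSMBlock(nn.Module):
    """Heavily simplified stand-in for a Mamba block:
    a gated, diagonal linear state-space scan with O(T) time
    and O(1) state per channel (no input-dependent dynamics)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Learned per-channel decay squashed into (0, 1).
        self.log_decay = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        decay = torch.sigmoid(self.log_decay)             # (D,)
        state = torch.zeros_like(u[:, 0])                 # (B, D)
        outs = []
        for t in range(u.size(1)):                        # recurrent scan over time
            state = decay * state + u[:, t]
            outs.append(state)
        h = torch.stack(outs, dim=1) * F.silu(gate)       # gated output
        return self.out_proj(h)


class AttentionBlock(nn.Module):
    """Standard multi-head self-attention (causal masking omitted for brevity)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


class HybridStack(nn.Module):
    """Mostly SSM layers, with an attention layer every `attn_every` layers."""

    def __init__(self, d_model=256, n_heads=4, n_layers=8, attn_every=4):
        super().__init__()
        self.layers = nn.ModuleList()
        self.norms = nn.ModuleList()
        for i in range(n_layers):
            if (i + 1) % attn_every == 0:
                self.layers.append(AttentionBlock(d_model, n_heads))
            else:
                self.layers.append(SimpleSSMBlock(d_model))
            self.norms.append(nn.LayerNorm(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for norm, layer in zip(self.norms, self.layers):
            x = x + layer(norm(x))  # pre-norm residual connection
        return x


if __name__ == "__main__":
    x = torch.randn(2, 16, 256)
    print(HybridStack()(x).shape)  # torch.Size([2, 16, 256])
```

The intuition behind the mix: the SSM layers carry sequence state cheaply in linear time, while the occasional attention layers retain precise token-to-token lookup where it matters.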
Key Benefits
- Combines Mamba's efficient state space modeling with Transformer's attention mechanisms
- Mixture of Experts (MoE) routing activates only a small subset of experts for each token, keeping per-token compute low (see the routing sketch after this list)
- Achieves superior throughput compared to dense or standard MoE models
- Maintains or improves accuracy on standard benchmarks
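The MoE bullet is what keeps compute sparse: a router scores all experts for each token, but only the top-k are actually evaluated. Below is a minimal top-k routing sketch in PyTorch; the expert count, `k=2`, the feed-forward expert shape, and the per-expert Python loop (production systems use batched dispatch kernels) are all illustrative assumptions.

```python
# Minimal top-k MoE feed-forward layer. Each token runs through only k of the
# n_experts networks, so per-token compute stays roughly constant as the
# expert pool grows. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=512, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # per-token expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        B, T, D = x.shape
        flat = x.reshape(B * T, D)
        logits = self.router(flat)                    # (N, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)    # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)          # renormalize over the chosen k
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed here
            if token_ids.numel() == 0:
                continue  # this expert received no tokens in the batch
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(flat[token_ids])
        return out.reshape(B, T, D)


if __name__ == "__main__":
    moe = TopKMoE()
    y = moe(torch.randn(2, 16, 256))
    print(y.shape)  # torch.Size([2, 16, 256])
```

Softmax over only the selected k logits is one common design choice; it keeps the combined expert outputs a convex mixture regardless of how many experts exist in total.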