Offline inference · 64 prompts × 512 input + 512 output tokens · float16 · Triton attention
| Model | Params | Quant | Root Cause |
|---|---|---|---|
| Qwen3.5-397B-A17B | 397B | GPTQ-Int4 | ~200 GB of weights leave too little usable VRAM for the KV cache |
| Nemotron-30B-A3B | 31.6B | FP16 | Mamba-2 Triton kernels OOM on V100; compressed-tensors needs CC ≥ 7.5 |
| MiniMax M2.5 | 228B | AWQ-4bit | compressed-tensors requires compute capability ≥ 7.5 (Turing+) |
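
The first root cause comes down to a memory budget. A rough sketch of that arithmetic, assuming eight 32 GB V100s, weights sharded evenly across GPUs, vLLM's default `gpu_memory_utilization` of 0.90, and a hypothetical 2 GB per GPU for activations and runtime overhead (that last figure is illustrative, not measured):

```python
def kv_headroom_gb(weights_gb, num_gpus=8, vram_per_gpu_gb=32.0,
                   utilization=0.90, overhead_per_gpu_gb=2.0):
    """Estimate per-GPU memory left for the KV cache, in GB.

    overhead_per_gpu_gb is a placeholder assumption, not a measured value.
    """
    budget = vram_per_gpu_gb * utilization   # memory the engine may claim
    weights_per_gpu = weights_gb / num_gpus  # assumes perfectly even sharding
    return budget - weights_per_gpu - overhead_per_gpu_gb


# ~200 GB of GPTQ-Int4 weights, as in the Qwen3.5-397B-A17B row
print(f"KV headroom per GPU: {kv_headroom_gb(200.0):.1f} GB")
```

Under these assumptions the headroom is a small fraction of each card, which is consistent with the failure reported in the table; the real margin depends on the actual overhead.
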

Throughput is nearly flat from 0.8B to 9B (121 → 115 tok/s), which points to HBM2 bandwidth (~900 GB/s) rather than compute as the bottleneck. Model size has little effect until multi-GPU TP is forced.

Throughput drops ~38% from TP=1 to TP=4. NVLink all-reduce synchronization on every decode step dominates the cost.

The 35B-A3B MoE (3B active) matches the 27B dense model despite 8× total params. Sparse activation makes MoE memory-bound, not compute-bound.

The 122B-A10B MoE at GPTQ-Int4 with TP=8 is only 28% slower than the 35B MoE despite having 3.5× more parameters.

| Feature | V100 (CC 7.0) | Min Required |
|---|---|---|
| FP16 inference | Supported | CC 6.0 |
| BF16 inference | Not supported | CC 8.0 |
| GPTQ quantization | Supported | CC 6.0 |
| AWQ quantization | Partial | CC 7.5 |
| Flash Attention 2 | Not supported | CC 8.0 |
| Mamba-2 (Triton) | OOM at autotune | CC 8.0+ |
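
The "Min Required" column can be turned into a simple pre-flight check. The sketch below is a minimal illustration: the dictionary copies the thresholds from the table, and `unsupported_features` is a hypothetical helper name, not part of any library.

```python
# Minimum compute capability per feature, copied from the table above.
MIN_CC = {
    "FP16 inference": (6, 0),
    "BF16 inference": (8, 0),
    "GPTQ quantization": (6, 0),
    "AWQ quantization": (7, 5),
    "Flash Attention 2": (8, 0),
    "Mamba-2 (Triton)": (8, 0),
}


def unsupported_features(cc):
    """Return features whose minimum compute capability exceeds `cc`.

    `cc` is a (major, minor) tuple, e.g. (7, 0) for a V100.
    """
    return [name for name, needed in MIN_CC.items() if cc < needed]


for feature in unsupported_features((7, 0)):
    print("below minimum on V100:", feature)
```

On a live system the tuple could come from `torch.cuda.get_device_capability()`, which returns `(major, minor)`. Note that the check is strictly threshold-based, so it flags AWQ on the V100 even though the table lists it as partially working.
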

llama-bench · GGUF models · 8× V100-SXM2-32GB · highest quant per model
| Model | llama-bench descriptor (arch, size, quant) | Type | PP tok/s | TG tok/s |
|---|---|---|---|---|
| Qwen3.5-0.8B | QWEN35 0.8B Q8_0 | Dense | 11,644.8 | 138.5 |
| Qwen3.5-2B | QWEN35 2B Q8_0 | Dense | 8,069.5 | 121.0 |
| Qwen3.5-4B | QWEN35 4B Q8_0 | Dense | 3,868.3 | 72.3 |
| Qwen3.5-9B | QWEN35 9B Q8_0 | Dense | 2,579.9 | 55.9 |
| Nemotron-30B-A3B | NEMOTRON_H_MOE 31B.A3.5B Q8_0 | MoE | 1,151.6 | 78.2 |
| Qwen3.5-27B | QWEN35 27B Q8_0 | Dense | 936.3 | 19.8 |
| Qwen3.5-35B-A3B | QWEN35MOE 35B.A3B Q8_0 | MoE | 747.3 | 56.9 |
| Qwen3.5-122B-A10B | QWEN35MOE 122B.A10B Q8_0 | MoE | 471.9 | 33.6 |
| MiniMax-M2.5 | MINIMAX-M2 230B.A10B Q4_K - MEDIUM | MoE | 278.6 | 43.4 |
| Qwen3.5-397B-A17B | QWEN35MOE 397B.A17B Q4_K - MEDIUM | MoE | 215.5 | 25.1 |
| GLM-4.7 | GLM4MOE 355B.A32B Q4_K - MEDIUM | MoE | 154.5 | 17.4 |
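
One way to read the TG column for the dense models is against a bandwidth ceiling: each generated token has to stream all weights from memory once, so tokens per second cannot exceed bandwidth divided by weight bytes. The sketch below makes several assumptions that are not from the benchmark itself: parameter counts are the nominal sizes in the model names, Q8_0 is taken as 8.5 bits per weight (32 int8 values plus one fp16 scale per block), and a single GPU's ~900 GB/s is used as the bound on the assumption that layers execute on one GPU at a time.

```python
HBM2_BANDWIDTH_GBPS = 900.0   # per-GPU figure quoted earlier in this report
Q8_0_BITS_PER_WEIGHT = 8.5    # 34 bytes per block of 32 weights

# (model, nominal params in billions, measured TG tok/s from the table)
DENSE_Q8 = [
    ("Qwen3.5-0.8B", 0.8, 138.5),
    ("Qwen3.5-2B", 2.0, 121.0),
    ("Qwen3.5-4B", 4.0, 72.3),
    ("Qwen3.5-9B", 9.0, 55.9),
    ("Qwen3.5-27B", 27.0, 19.8),
]


def tg_ceiling(params_b, bits_per_weight=Q8_0_BITS_PER_WEIGHT,
               bandwidth_gbps=HBM2_BANDWIDTH_GBPS):
    """Upper bound on single-stream decode tok/s if weights are read once per token."""
    weight_gb = params_b * bits_per_weight / 8
    return bandwidth_gbps / weight_gb


for name, params_b, measured in DENSE_Q8:
    ceiling = tg_ceiling(params_b)
    print(f"{name:13s} ceiling {ceiling:7.1f} tok/s  "
          f"measured {measured:6.1f}  ({measured / ceiling:.0%} of ceiling)")
```

The estimate ignores KV-cache reads and kernel launch overhead, so it is only an upper bound; the MoE rows are left out because their per-token traffic depends on active rather than total parameters.
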