vLLM 0.18.0 Throughput Benchmarks

Offline inference · 64 prompts × 512 input + 512 output tokens · float16 · Triton attention

Models benchmarked: 8 · Peak throughput: 121.6 tok/s · Qwen3.5 models: 7 · Failed (hardware limit): 3

Results

| Model | Quant | Type | TP | Total tok/s | PP tok/s | TG tok/s |
|---|---|---|---|---|---|---|
| Qwen3.5-0.8B | FP16 | Dense | 1 | 121.6 | 60.8 | 60.8 |
| Qwen3.5-2B | FP16 | Dense | 1 | 120.8 | 60.4 | 60.4 |
| Qwen3.5-4B | FP16 | Dense | 1 | 116.9 | 58.5 | 58.5 |
| Qwen3.5-9B | FP16 | Dense | 1 | 114.7 | 57.3 | 57.3 |
| Qwen3.5-27B | GPTQ-Int4 | Dense | 4 | 75.7 | 37.8 | 37.8 |
| Qwen3.5-35B-A3B | GPTQ-Int4 | MoE 3B/35B | 8 | 74.0 | 37.0 | 37.0 |
| Qwen3.5-122B-A10B | GPTQ-Int4 | MoE 10B/122B | 8 | 53.2 | 26.6 | 26.6 |
| GLM-4.7 (REAP-218B) | W4A16 | MoE 32B/218B | 4 | 53.9 | 27.0 | 27.0 |

Throughput Comparison

[Bar chart] Total tok/s, descending: Qwen3.5-0.8B 121.6 · Qwen3.5-2B 120.8 · Qwen3.5-4B 116.9 · Qwen3.5-9B 114.7 · Qwen3.5-27B 75.7 · Qwen3.5-35B-A3B 74.0 · GLM-4.7 53.9 · Qwen3.5-122B 53.2

Failed Benchmarks

| Model | Params | Quant | Root Cause |
|---|---|---|---|
| Qwen3.5-397B-A17B | 397B | GPTQ-Int4 | ~200 GB weights exceed usable VRAM for KV cache |
| Nemotron-30B-A3B | 31.6B | FP16 | Mamba-2 Triton kernels OOM on V100; compressed-tensors needs CC ≥ 7.5 |
| MiniMax M2.5 | 228B | AWQ-4bit | compressed-tensors requires compute capability ≥ 7.5 (Turing+) |
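The VRAM shortfall behind the 397B failure can be sketched with rough arithmetic. This is an illustrative estimate only: the 0.9 fraction mirrors vLLM's `gpu_memory_utilization` default, and per-GPU overheads (CUDA context, activations, dequant workspace, sharding imbalance) are not modeled.

```python
# Rough VRAM arithmetic for Qwen3.5-397B-A17B at GPTQ-Int4 on 8x V100-32GB
# (illustrative sketch, not measured values).

GB = 1e9

total_vram = 8 * 32 * GB            # 8x V100-SXM2-32GB
params = 397e9
bytes_per_param = 0.5               # GPTQ-Int4 ~ 4 bits per weight
weights = params * bytes_per_param  # ~198.5 GB, matching the "~200 GB" above

# vLLM reserves a fraction of VRAM for itself (gpu_memory_utilization,
# default 0.9); weights then leave the remainder for KV cache.
usable = total_vram * 0.9
left_for_kv = usable - weights

print(f"weights: {weights / GB:.1f} GB")
print(f"left for KV cache + activations: {left_for_kv / GB:.1f} GB")
```

On paper ~32 GB remains, but spread over 8 shards and charged for activations and framework overhead per GPU, the headroom for a usable KV cache disappears, which is the failure observed above.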

Analysis

Memory-Bandwidth Ceiling

Nearly flat throughput from 0.8B to 9B (121.6 → 114.7 tok/s) confirms HBM2 bandwidth (~900 GB/s) as the bottleneck: at these sizes compute is not the limit, and model size barely matters until it forces multi-GPU tensor parallelism.
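That flatness is easy to quantify from the results table: if compute were the limit, throughput would fall roughly in proportion to parameter count rather than by a few percent. Illustrative arithmetic only:

```python
# Throughput vs parameter count for the single-GPU FP16 models
# (numbers copied from the vLLM results table above).
results = {"0.8B": (0.8e9, 121.6), "9B": (9e9, 114.7)}

p_small, t_small = results["0.8B"]
p_large, t_large = results["9B"]

growth = p_large / p_small            # 11.25x more parameters
drop = (1 - t_large / t_small) * 100  # ~5.7% less throughput

print(f"{growth:.2f}x the parameters, only {drop:.1f}% less throughput")
```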

Multi-GPU Penalty

Throughput drops ~38% from TP=1 to TP=4 (121.6 → 75.7 tok/s). NVLink all-reduce synchronization on every decode step dominates the cost.
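The synchronization cost per decode step can be sketched: each transformer layer typically performs two all-reduces of the hidden state under tensor parallelism (after the attention output projection and after the MLP down projection). The layer count and hidden size below are hypothetical placeholders, not the published Qwen3.5-27B config.

```python
# All-reduce traffic per decode step under TP (illustrative sketch;
# hidden size and layer count are placeholder values, not real configs).

hidden, layers, batch = 5120, 64, 64    # hypothetical model + benchmark batch
bytes_per_elem = 2                      # fp16 activations
allreduces = 2 * layers                 # attn out-proj + MLP down-proj per layer
step_bytes = allreduces * batch * hidden * bytes_per_elem

print(f"all-reduce payload per decode step: {step_bytes / 1e6:.1f} MB")
# The payload is small; the cost is the latency of 128 synchronization
# points per generated token, which bandwidth alone cannot hide.
```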

MoE Efficiency

The 35B-A3B MoE matches the 27B dense model (74.0 vs 75.7 tok/s) despite activating only 3B of its 35B parameters per token. Sparse activation keeps the MoE memory-bound, not compute-bound.
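The per-token weight traffic behind that comparison can be sketched from the active-parameter counts in the model names. The byte-per-parameter figure is an approximation for GPTQ-Int4 (~4 bits per weight), and routing/shared-layer details are ignored.

```python
# Approximate weights streamed from HBM per generated token
# (illustrative sketch using active-parameter counts from the table).

def active_bytes(active_params, bytes_per_param):
    return active_params * bytes_per_param

dense_27b = active_bytes(27e9, 0.5)  # Qwen3.5-27B, GPTQ-Int4: all params active
moe_35b   = active_bytes(3e9, 0.5)   # Qwen3.5-35B-A3B: ~3B active per token

print(f"27B dense: {dense_27b / 1e9:.1f} GB per token")
print(f"35B-A3B  : {moe_35b / 1e9:.1f} GB per token")
```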

122B at 53 tok/s

The 122B-A10B MoE (GPTQ-Int4, TP=8) is only 28% slower than the 35B MoE despite having 3.5× as many total parameters.

V100 Compatibility

| Feature | V100 (CC 7.0) | Min Required |
|---|---|---|
| FP16 inference | Supported | CC 6.0 |
| BF16 inference | Not supported | CC 8.0 |
| GPTQ quantization | Supported | CC 6.0 |
| AWQ quantization | Partial | CC 7.5 |
| Flash Attention 2 | Not supported | CC 8.0 |
| Mamba-2 (Triton) | OOM at autotune | CC 8.0+ |
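The table reduces to a simple gate on CUDA compute capability, the kind of check inference frameworks perform at startup. A minimal sketch (feature names are informal labels, not vLLM identifiers):

```python
# Gate features on CUDA compute capability (CC), per the table above.
# Tuples compare lexicographically, so (7, 0) < (7, 5) < (8, 0).

REQUIRED_CC = {
    "fp16_inference": (6, 0),
    "bf16_inference": (8, 0),
    "gptq": (6, 0),
    "awq": (7, 5),
    "flash_attention_2": (8, 0),
    "mamba2_triton": (8, 0),
}

def supported(feature, cc=(7, 0)):
    """True if a GPU with compute capability `cc` meets the feature's
    minimum. V100 is CC 7.0."""
    return cc >= REQUIRED_CC[feature]

print({f: supported(f) for f in REQUIRED_CC})
```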

llama.cpp Throughput Benchmarks

llama-bench · GGUF models · 8x V100-SXM2-32GB · highest quant per model

Models benchmarked: 11 · Peak PP: 11,644.8 tok/s · Peak TG: 138.5 tok/s · Failed: 0

Results

| Model | Quant | Type | PP tok/s | TG tok/s |
|---|---|---|---|---|
| Qwen3.5-0.8B | QWEN35 0.8B Q8_0 | Dense | 11,644.8 | 138.5 |
| Qwen3.5-2B | QWEN35 2B Q8_0 | Dense | 8,069.5 | 121.0 |
| Qwen3.5-4B | QWEN35 4B Q8_0 | Dense | 3,868.3 | 72.3 |
| Qwen3.5-9B | QWEN35 9B Q8_0 | Dense | 2,579.9 | 55.9 |
| Nemotron-30B-A3B | NEMOTRON_H_MOE 31B.A3.5B Q8_0 | MoE | 1,151.6 | 78.2 |
| Qwen3.5-27B | QWEN35 27B Q8_0 | Dense | 936.3 | 19.8 |
| Qwen3.5-35B-A3B | QWEN35MOE 35B.A3B Q8_0 | MoE | 747.3 | 56.9 |
| Qwen3.5-122B-A10B | QWEN35MOE 122B.A10B Q8_0 | MoE | 471.9 | 33.6 |
| MiniMax-M2.5 | MINIMAX-M2 230B.A10B Q4_K - Medium | MoE | 278.6 | 43.4 |
| Qwen3.5-397B-A17B | QWEN35MOE 397B.A17B Q4_K - Medium | MoE | 215.5 | 25.1 |
| GLM-4.7 | GLM4MOE 355B.A32B Q4_K - Medium | MoE | 154.5 | 17.4 |

Prompt Processing (PP) Throughput

[Bar chart] PP tok/s, descending: Qwen3.5-0.8B 11,644.8 · Qwen3.5-2B 8,069.5 · Qwen3.5-4B 3,868.3 · Qwen3.5-9B 2,579.9 · Nemotron-30B-A3B 1,151.6 · Qwen3.5-27B 936.3 · Qwen3.5-35B-A3B 747.3 · Qwen3.5-122B-A10B 471.9 · MiniMax-M2.5 278.6 · Qwen3.5-397B-A17B 215.5 · GLM-4.7 154.5

Token Generation (TG) Throughput

[Bar chart] TG tok/s, descending: Qwen3.5-0.8B 138.5 · Qwen3.5-2B 121.0 · Nemotron-30B-A3B 78.2 · Qwen3.5-4B 72.3 · Qwen3.5-35B-A3B 56.9 · Qwen3.5-9B 55.9 · MiniMax-M2.5 43.4 · Qwen3.5-122B-A10B 33.6 · Qwen3.5-397B-A17B 25.1 · Qwen3.5-27B 19.8 · GLM-4.7 17.4
