vLLM 0.18.0 Throughput Benchmarks

Offline inference · 64 prompts × 512 input + 512 output tokens · float16 · Triton attention

Peak throughput: 121.6 tok/s

Results

8 models

Model	Quant	Type	TP	Total tok/s	PP tok/s	TG tok/s
Qwen3.5-0.8B	FP16	Dense	1	121.6	60.8	60.8
Qwen3.5-2B	FP16	Dense	1	120.8	60.4	60.4
Qwen3.5-4B	FP16	Dense	1	116.9	58.5	58.5
Qwen3.5-9B	FP16	Dense	1	114.7	57.3	57.3
Qwen3.5-27B	GPTQ-Int4	Dense	4	75.7	37.8	37.8
Qwen3.5-35B-A3B	GPTQ-Int4	MoE 3B/35B	8	74.0	37.0	37.0
Qwen3.5-122B-A10B	GPTQ-Int4	MoE 10B/122B	8	53.2	26.6	26.6
GLM-4.7 (REAP-218B)	W4A16	MoE 32B/218B	4	53.9	27.0	27.0
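In every row the PP and TG figures are each half of the total, which is what one would expect if all three rates are computed over the same wall-clock time with equal input and output token counts. A minimal sketch of that arithmetic, assuming this definition (the source does not state it) and using an illustrative helper name:

```python
# Sketch only: assumes Total, PP and TG are all measured over the same
# wall-clock time, which the source does not state explicitly.
NUM_PROMPTS = 64
INPUT_TOKENS = 512    # per prompt
OUTPUT_TOKENS = 512   # per prompt


def split_throughput(total_tok_s):
    """Derive elapsed time and per-phase rates from a total tok/s figure."""
    input_total = NUM_PROMPTS * INPUT_TOKENS
    output_total = NUM_PROMPTS * OUTPUT_TOKENS
    elapsed_s = (input_total + output_total) / total_tok_s
    return {
        "elapsed_s": elapsed_s,
        "pp_tok_s": input_total / elapsed_s,
        "tg_tok_s": output_total / elapsed_s,
    }


result = split_throughput(121.6)  # Qwen3.5-0.8B row
print(f"PP ~ {result['pp_tok_s']:.1f} tok/s, TG ~ {result['tg_tok_s']:.1f} tok/s")
```

Under this assumption the split depends only on the input-to-output token ratio, so a 512/512 workload always yields PP = TG = Total / 2.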

Throughput Comparison

Model	Total tok/s
Qwen3.5-0.8B	121.6
Qwen3.5-2B	120.8
Qwen3.5-4B	116.9
Qwen3.5-9B	114.7
Qwen3.5-27B	75.7
Qwen3.5-35B-A3B	74.0
GLM-4.7	53.9
Qwen3.5-122B-A10B	53.2

Failed Benchmarks

3 models

Model	Params	Quant	Root Cause
Qwen3.5-397B-A17B	397B	GPTQ-Int4	~200 GB of weights leave too little VRAM for the KV cache
Nemotron-30B-A3B	31.6B	FP16	Mamba-2 Triton kernels OOM on V100; compressed-tensors needs CC ≥ 7.5
MiniMax-M2.5	228B	AWQ-4bit	compressed-tensors requires compute capability ≥ 7.5 (Turing+)
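The first row is a memory-budget problem. A rough sketch of the arithmetic, assuming the 8x V100-SXM2-32GB node named in the llama.cpp section, vLLM's default `gpu_memory_utilization` of 0.9, and the ~200 GB weight figure from the table; the function name is illustrative:

```python
def kv_cache_headroom_gb(num_gpus, vram_per_gpu_gb, weights_gb, utilization=0.9):
    """Return (total, per-GPU) memory left after loading weights, in GB.

    `utilization` mirrors vLLM's gpu_memory_utilization default (0.9).
    Activations, CUDA graphs and fragmentation are ignored, so the real
    headroom is smaller than this estimate.
    """
    usable = num_gpus * vram_per_gpu_gb * utilization
    total_left = usable - weights_gb
    return total_left, total_left / num_gpus


total_left, per_gpu_left = kv_cache_headroom_gb(8, 32, 200)
print(f"{total_left:.1f} GB total, {per_gpu_left:.1f} GB per GPU")
```

Even this optimistic estimate leaves only a few GB per GPU for KV cache and activations, which is consistent with the failure reported in the table.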

Analysis

Memory-Bandwidth Ceiling

Throughput is nearly flat from 0.8B to 9B (121.6 to 114.7 tok/s, about a 6% drop), which points to HBM2 bandwidth (~900 GB/s) as the bottleneck rather than compute. Model size has little effect until multi-GPU TP is forced.

Multi-GPU Penalty

Throughput drops ~38% going from TP=1 (121.6 tok/s peak) to TP=4 (75.7 tok/s for Qwen3.5-27B). NVLink all-reduce synchronization on every decode step dominates the cost.

MoE Efficiency

The 35B-A3B MoE (3B active, TP=8) roughly matches the 27B dense model (74.0 vs 75.7 tok/s) despite having about 1.3× the total parameters. Sparse activation makes MoE memory-bound, not compute-bound.

122B at 53 tok/s

The 122B-A10B MoE (GPTQ-Int4, TP=8) is only 28% slower than the 35B MoE (53.2 vs 74.0 tok/s) despite having 3.5× the parameters.
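The percentages quoted in this Analysis section can be re-derived from the Total tok/s figures in the Results. The sketch below assumes the TP=1 baseline for the multi-GPU penalty is the 0.8B peak; using the 9B figure instead gives a drop closer to 34%. The helper name is illustrative.

```python
def pct_drop(before, after):
    """Relative throughput drop from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Total tok/s copied from the vLLM Results.
total = {
    "0.8B": 121.6,
    "9B": 114.7,
    "27B": 75.7,
    "35B-A3B": 74.0,
    "122B-A10B": 53.2,
}

flat_range = pct_drop(total["0.8B"], total["9B"])            # single-GPU range
tp_penalty = pct_drop(total["0.8B"], total["27B"])           # TP=1 -> TP=4
moe_slowdown = pct_drop(total["35B-A3B"], total["122B-A10B"])
param_ratio = 122 / 35                                       # total parameters

print(f"flat range: {flat_range:.1f}%")
print(f"TP penalty: {tp_penalty:.1f}%")
print(f"122B vs 35B MoE: {moe_slowdown:.1f}% slower at {param_ratio:.1f}x params")
```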

V100 Compatibility

Feature	V100 (CC 7.0)	Min Required
FP16 inference	Supported	CC 6.0
BF16 inference	Not supported	CC 8.0
GPTQ quantization	Supported	CC 6.0
AWQ quantization	Partial	CC 7.5
Flash Attention 2	Not supported	CC 8.0
Mamba-2 (Triton)	OOM at autotune	CC 8.0+
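The table above can be turned into a simple lookup for checking a GPU before launching a run. This is a sketch that transcribes the "Min Required" column only; it treats each feature as a hard threshold, so the "Partial" AWQ status on V100 is reported as unsupported. The names `MIN_CC` and `supported_features` are illustrative.

```python
# Minimum compute capability per feature, transcribed from the table above.
MIN_CC = {
    "FP16 inference": (6, 0),
    "BF16 inference": (8, 0),
    "GPTQ quantization": (6, 0),
    "AWQ quantization": (7, 5),
    "Flash Attention 2": (8, 0),
    "Mamba-2 (Triton)": (8, 0),
}


def supported_features(cc):
    """Return the features whose minimum compute capability is <= cc.

    `cc` is a (major, minor) tuple, e.g. (7, 0) for V100.
    """
    return [name for name, needed in MIN_CC.items() if cc >= needed]


# With PyTorch installed, torch.cuda.get_device_capability() returns a
# (major, minor) tuple that can be passed in directly.
print(supported_features((7, 0)))
```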

llama.cpp Throughput Benchmarks

llama-bench · GGUF models · 8x V100-SXM2-32GB · highest quant per model

Peak PP: 11,644.8 tok/s · Peak TG: 138.5 tok/s

Results

11 models

Model	Quant	Type	PP tok/s	TG tok/s
Qwen3.5-0.8B	QWEN35 0.8B Q8_0	Dense	11,644.8	138.5
Qwen3.5-2B	QWEN35 2B Q8_0	Dense	8,069.5	121.0
Qwen3.5-4B	QWEN35 4B Q8_0	Dense	3,868.3	72.3
Qwen3.5-9B	QWEN35 9B Q8_0	Dense	2,579.9	55.9
Nemotron-30B-A3B	NEMOTRON_H_MOE 31B.A3.5B Q8_0	MoE	1,151.6	78.2
Qwen3.5-27B	QWEN35 27B Q8_0	Dense	936.3	19.8
Qwen3.5-35B-A3B	QWEN35MOE 35B.A3B Q8_0	MoE	747.3	56.9
Qwen3.5-122B-A10B	QWEN35MOE 122B.A10B Q8_0	MoE	471.9	33.6
MiniMax-M2.5	MINIMAX-M2 230B.A10B Q4_K - MEDIUM	MoE	278.6	43.4
Qwen3.5-397B-A17B	QWEN35MOE 397B.A17B Q4_K - MEDIUM	MoE	215.5	25.1
GLM-4.7	GLM4MOE 355B.A32B Q4_K - MEDIUM	MoE	154.5	17.4

Prompt Processing (PP) Throughput

Model	PP tok/s
Qwen3.5-0.8B	11,644.8
Qwen3.5-2B	8,069.5
Qwen3.5-4B	3,868.3
Qwen3.5-9B	2,579.9
Nemotron-30B-A3B	1,151.6
Qwen3.5-27B	936.3
Qwen3.5-35B-A3B	747.3
Qwen3.5-122B-A10B	471.9
MiniMax-M2.5	278.6
Qwen3.5-397B-A17B	215.5
GLM-4.7	154.5

Token Generation (TG) Throughput

Model	TG tok/s
Qwen3.5-0.8B	138.5
Qwen3.5-2B	121.0
Nemotron-30B-A3B	78.2
Qwen3.5-4B	72.3
Qwen3.5-35B-A3B	56.9
Qwen3.5-9B	55.9
MiniMax-M2.5	43.4
Qwen3.5-122B-A10B	33.6
Qwen3.5-397B-A17B	25.1
Qwen3.5-27B	19.8
GLM-4.7	17.4

Full Results Table

Model	Quant	Type	PP tok/s	TG tok/s
Qwen3.5-0.8B	QWEN35 0.8B Q8_0	Dense	11,644.8	138.5
Qwen3.5-2B	QWEN35 2B Q8_0	Dense	8,069.5	121.0
Qwen3.5-4B	QWEN35 4B Q8_0	Dense	3,868.3	72.3
Qwen3.5-9B	QWEN35 9B Q8_0	Dense	2,579.9	55.9
Nemotron-30B-A3B	NEMOTRON_H_MOE 31B.A3.5B Q8_0	MoE	1,151.6	78.2
Qwen3.5-27B	QWEN35 27B Q8_0	Dense	936.3	19.8
Qwen3.5-35B-A3B	QWEN35MOE 35B.A3B Q8_0	MoE	747.3	56.9
Qwen3.5-122B-A10B	QWEN35MOE 122B.A10B Q8_0	MoE	471.9	33.6
MiniMax-M2.5	MINIMAX-M2 230B.A10B Q4_K - MEDIUM	MoE	278.6	43.4
Qwen3.5-397B-A17B	QWEN35MOE 397B.A17B Q4_K - MEDIUM	MoE	215.5	25.1
GLM-4.7	GLM4MOE 355B.A32B Q4_K - MEDIUM	MoE	154.5	17.4
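To re-rank or post-process these results (for example, to reproduce the TG ordering), a tab-separated copy of the table can be parsed as sketched below. The three rows are a subset copied from the table, the thousands separators need stripping before conversion, and the function name is illustrative.

```python
import csv
import io

# Subset of the table above, tab-separated, deliberately out of TG order.
ROWS = (
    "Model\tPP tok/s\tTG tok/s\n"
    "Qwen3.5-27B\t936.3\t19.8\n"
    "Qwen3.5-35B-A3B\t747.3\t56.9\n"
    "Qwen3.5-0.8B\t11,644.8\t138.5\n"
)


def rank_by(text, column):
    """Parse tab-separated results and sort rows by `column`, descending."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        # Strip thousands separators such as "11,644.8" before converting.
        value = float(row[column].replace(",", ""))
        rows.append((row["Model"], value))
    return sorted(rows, key=lambda item: item[1], reverse=True)


for model, tg in rank_by(ROWS, "TG tok/s"):
    print(f"{model}\t{tg}")
```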
