Large Language Model Inference Framework Throughput Comparison: vLLM | SGLang | LMDeploy

This article compares the throughput of three large language model inference engines, vLLM, SGLang, and LMDeploy, in a short-input, long-output scenario. Throughput is measured in output tokens per second; the remaining test parameters are listed below the table, and a minimal measurement sketch follows them.

Throughput (output tokens/s) by concurrency level:

| Concurrency | vLLM 0.6.1.post2 | vLLM 0.6.3.post1 | LMDeploy 0.6.0a0 | LMDeploy 0.6.2 | SGLang 0.3.4.post2 | SGLang 0.3.4.post2 (--disable-cuda-graph) |
|---|---|---|---|---|---|---|
| 1 | 28.73 | 28.76 | 56.19 | 57.24 | 37.23 | 29.96 |
| 2 | 71.53 | 73.26 | 113.12 | 113.48 | 73.59 | 58.28 |
| 4 | 133.38 | 136.05 | 205.51 | 199.01 | 136.73 | 111.24 |
| 8 | 246.14 | 251.59 | 398.73 | 393.48 | 258.21 | 215.53 |
| 16 | 394.25 | 401.67 | 704.69 | 709.27 | 461.89 | 444.48 |
| 32 | 480.26 | 481.75 | 967.34 | 973.24 | 562.36 | 557.93 |
| 64 | 520.11 | 526.01 | 1119.22 | 1123.07 | 594.03 | 602.36 |
| 128 | 479.02 | 481.63 | 989.14 | 890.44 | 534.69 | 582.97 |
  • Test model: Qwen2.5-14B-Instruct-AWQ
  • Hardware: Intel Xeon E5-2680 v4 + 1× NVIDIA RTX 2080 Ti 22 GB
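
The original benchmark harness is not included here, so the snippet below is only a minimal sketch of the methodology: fire a wave of concurrent requests at an OpenAI-compatible endpoint (all three engines expose one) and divide the total generated tokens by wall-clock time. The server address, prompt, and max_tokens value are illustrative assumptions, not the original test parameters.

```python
# Minimal throughput probe, not the original harness. Assumes an
# OpenAI-compatible server at BASE_URL; address, prompt, and token
# budget below are illustrative values only.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://127.0.0.1:8000/v1"        # assumed server address
MODEL = "Qwen/Qwen2.5-14B-Instruct-AWQ"
CONCURRENCY = 8                               # one of the table's settings
PROMPT = "Write a long story about a robot."  # short input, long output
MAX_TOKENS = 1024                             # assumed output length

def one_request(_: int) -> int:
    """Send one non-streaming chat completion; return tokens generated."""
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": PROMPT}],
            "max_tokens": MAX_TOKENS,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

def main() -> None:
    start = time.time()
    # One wave of CONCURRENCY parallel requests; a serious run would add
    # warmup and many more requests to approximate steady-state load.
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        total_tokens = sum(pool.map(one_request, range(CONCURRENCY)))
    elapsed = time.time() - start
    print(f"concurrency={CONCURRENCY}: "
          f"{total_tokens / elapsed:.2f} output tokens/s")

if __name__ == "__main__":
    main()
```

Repeating the run for each concurrency level in the table yields a curve like the charts below; absolute numbers also depend on sampling settings, warmup, and request count, so treat this only as a methodology sketch.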

![Throughput comparison chart 1](/ob/static/images/Pasted%20image%2020241123103935.png)
![Throughput comparison chart 2](/ob/static/images/Pasted%20image%2020241123103943.png)