Slower than Qwen3-8B despite claimed 3x inference speedup

#16
by coszeros - opened

Environment

  • Hardware: NVIDIA GeForce RTX 4090 24GB
  • vLLM Version: 0.10.1
  • Task: Classification Q&A (primarily prefilling-dependent)

Problem Description

I'm experiencing performance issues that contradict the claimed "up to 3x inference" speedup mentioned in the technical report.

Test Setup

  • Input length: ~4096 tokens
  • Output length: ~1 token (minimal generation)
  • Task type: Classification Q&A (prefilling-heavy workload)
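
For reference, here is a minimal sketch of how such a prefill-heavy measurement could be reproduced with vLLM's offline Python API. The prompt below is a synthetic stand-in for the real ~4096-token classification inputs and the batch size is arbitrary; max_tokens=1 mirrors the ~1-token outputs above:

  import time
  from vllm import LLM, SamplingParams

  # Two-GPU deployment, matching the TP=2 setup described in this post.
  llm = LLM(
      model="/home/nvidia/NVIDIA-Nemotron-Nano-9B-v2",  # local checkpoint path used later in this thread
      tensor_parallel_size=2,
      trust_remote_code=True,
  )

  # Synthetic ~4k-token prompt; replace with real classification inputs.
  prompts = ["document chunk " * 1300 + "\nLabel (yes/no):"] * 32

  # max_tokens=1 keeps the run prefill-dominated.
  params = SamplingParams(max_tokens=1, temperature=0.0)

  start = time.time()
  outputs = llm.generate(prompts, params)
  elapsed = time.time() - start

  prompt_tokens = sum(len(o.prompt_token_ids) for o in outputs)
  print(f"prefill throughput: {prompt_tokens / elapsed:.0f} prompt tok/s")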

Observations

  1. Memory Requirements:

    • This model requires deployment across 2 GPUs (TP=2)
    • Qwen3-8B can run on a single GPU
  2. Performance Comparison:

    • For fair comparison, I also deployed Qwen3-8B across 2 GPUs (TP=2)
    • Result: This model is approximately 1/7 (~14%) slower than Qwen3-8B

Expected Behavior

Based on the technical report claiming "up to 3x inference" speedup, I expected this model to perform significantly faster than Qwen3-8B, especially for prefilling-heavy tasks.

Actual Behavior

The model performs ~14% slower than Qwen3-8B under identical hardware conditions (2x RTX 4090).

Questions

  1. Is this performance degradation expected for prefilling-heavy workloads?
  2. Are there specific optimization settings recommended for this use case?

Any insights into optimizing performance for this specific use case would be greatly appreciated.

You would need to generate more tokens to get a more meaningful measurement.

Hi, can I ask if you can run this model on a single 4090 GPU?

In my setup, at least two NVIDIA RTX 4090 GPUs are necessary.

Have you tried setting max_num_seqs=32 and gpu_memory_utilization=0.98? This allows the model to load on a single 24GB GPU. (I use an A10G.)
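
For reference, the corresponding flags when launching the OpenAI-compatible server are --max-num-seqs 32 and --gpu-memory-utilization 0.98. With vLLM's offline Python API, a single-GPU configuration along these lines might look roughly like the sketch below (not a tuned recipe; the model path is the one used later in this thread):

  from vllm import LLM, SamplingParams

  # Fit the 9B checkpoint on one 24GB GPU by capping concurrent sequences
  # and raising the memory utilization target, per the suggestion above.
  llm = LLM(
      model="/home/nvidia/NVIDIA-Nemotron-Nano-9B-v2",
      tensor_parallel_size=1,
      max_num_seqs=32,
      gpu_memory_utilization=0.98,
      trust_remote_code=True,
  )

  # Quick smoke test that generation works.
  print(llm.generate(["Hello!"], SamplingParams(max_tokens=8))[0].outputs[0].text)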

Thanks for the suggestions! After setting these parameters, the model loads successfully on a single 24GB GPU. However, it is still slower than Qwen3-8B (around 7,500 tok/s vs. 10,000 tok/s). I'm wondering whether longer decoding runs are needed to see the inference speedup benefits.
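
If it helps, one way to check whether the speedup only appears with longer generations is to sweep the output length and compare generated tokens per second at each setting. A rough sketch with the offline API, reusing the single-GPU settings above (the prompt is synthetic, the batch size arbitrary, and ignore_eos is used only to force fixed-length outputs):

  import time
  from vllm import LLM, SamplingParams

  llm = LLM(
      model="/home/nvidia/NVIDIA-Nemotron-Nano-9B-v2",  # path used later in this thread
      max_num_seqs=32,
      gpu_memory_utilization=0.98,
      trust_remote_code=True,
  )

  prompt = "Summarize the following passage. " * 120  # synthetic stand-in prompt

  for out_len in (1, 256, 2048):
      params = SamplingParams(max_tokens=out_len, ignore_eos=True, temperature=0.0)
      start = time.time()
      outs = llm.generate([prompt] * 16, params)
      elapsed = time.time() - start
      generated = sum(len(o.outputs[0].token_ids) for o in outs)
      print(f"max_tokens={out_len}: {generated / elapsed:.0f} generated tok/s in {elapsed:.1f}s")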

I also tried to compare nano-v2-9B and Qwen3-8B in terms of latency, but I found that nano-v2-9B is not obviously faster than Qwen3-8B and may even be slower than Qwen3-32B. My test prompt asks the LLM to write a paper of about 6,000 tokens, and the prompt itself contains roughly 500 tokens of requirements. Does anyone have an idea what is going on? Any help would be appreciated.

Has anyone tried this on their machine with the same setup? Or is it an issue with my settings?

env: vLLM

command:

*start nano-v2-9b: python -m vllm.entrypoints.openai.api_server --model "/home/nvidia/NVIDIA-Nemotron-Nano-9B-v2" --dtype float16 --api-key abc --tensor-parallel-size 1 --gpu-memory-utilization 0.95 --trust_remote_code --mamba_ssm_cache_dtype auto

*start qwen3-8b: python -m vllm.entrypoints.openai.api_server --model "/home/Qwen/Qwen3-8B" --dtype float16 --api-key abc --tensor-parallel-size 1 --gpu-memory-utilization 0.95

[attached image: latency comparison results]
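
For completeness, here is a minimal sketch of how the latency of such a long-generation request can be timed against the OpenAI-compatible servers started above (the prompt is a placeholder for the real ~500-token requirements prompt; the API key "abc" and vLLM's default port 8000 come from the launch commands):

  import time
  from openai import OpenAI

  # Point base_url at whichever server (nano-v2-9B or Qwen3-8B) is being tested.
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="abc")

  prompt = "Write a ~6000-token paper that satisfies the following requirements: ..."  # placeholder

  start = time.time()
  resp = client.chat.completions.create(
      model="/home/nvidia/NVIDIA-Nemotron-Nano-9B-v2",  # served model name defaults to the --model path
      messages=[{"role": "user", "content": prompt}],
      max_tokens=6000,
      temperature=0.6,
  )
  elapsed = time.time() - start

  out_tokens = resp.usage.completion_tokens
  print(f"latency: {elapsed:.1f}s, output tokens: {out_tokens}, "
        f"decode rate: {out_tokens / elapsed:.1f} tok/s")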

Is the test configuration aligned? I've noticed discrepancies in input token usage.
How many test samples are used? More test samples may be necessary to establish a reliable comparison.

Thank you for your quick feedback. They are the same prompt, but due to differences in the tokenizers of the two models, the input token counts are not identical. Since the differences are minimal, I am assuming the inputs are approximately comparable for now. I tested with three samples by varying the questions, and the conclusion remained consistent. The table currently shows the results for only one of the prompts.

Or do you have any successful examples that I could try to reproduce on my side? I'm feeling a bit confused :(

I'm a bit confused as well. Does anyone have practical ideas or examples about speeding up inference?
