Slower than Qwen3-8B despite claimed 3x inference speedup
Environment
- Hardware: NVIDIA GeForce RTX 4090 24GB
- vLLM Version: 0.10.1
- Task: Classification Q&A (primarily prefilling-dependent)
Problem Description
I'm experiencing performance issues that contradict the claimed "up to 3x inference" speedup mentioned in the technical report.
Test Setup
- Input length: ~4096 tokens
- Output length: ~1 token (minimal generation)
- Task type: Classification Q&A (prefilling-heavy workload; a minimal sketch of the run follows below)
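For reference, this is roughly what the prefill-heavy run looks like with the vLLM Python API (a minimal sketch, not my exact harness; the model path, prompt, and batch size are placeholders):

```python
import time
from vllm import LLM, SamplingParams

# Placeholder path; TP=2 matches the two RTX 4090s described above.
llm = LLM(
    model="/home/nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    trust_remote_code=True,
    tensor_parallel_size=2,
)

# ~4096 input tokens, ~1 output token: almost all time is spent in prefill.
params = SamplingParams(max_tokens=1, temperature=0.0)
long_prompt = "Classify the following document.\n" + ("filler text " * 1300) + "\nLabel:"

start = time.perf_counter()
outputs = llm.generate([long_prompt] * 64, params)
print(f"64 requests in {time.perf_counter() - start:.1f}s")
```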
Observations
Memory Requirements:
- This model requires deployment across 2 GPUs (TP=2)
- Qwen3-8B can run on a single GPU
Performance Comparison:
- For fair comparison, I also deployed Qwen3-8B across 2 GPUs (TP=2)
- Result: This model is roughly one seventh (~14%) slower than Qwen3-8B
Expected Behavior
Based on the technical report claiming "up to 3x inference" speedup, I expected this model to perform significantly faster than Qwen3-8B, especially for prefilling-heavy tasks.
Actual Behavior
The model performs ~14% slower than Qwen3-8B under identical hardware conditions (2x RTX 4090).
Questions
- Is this performance degradation expected for prefilling-heavy workloads?
- Are there specific optimization settings recommended for this use case?
Any insights into optimizing performance for this specific use case would be greatly appreciated.
You need to generate more tokens to get a more meaningful measurement.
Hi, can I ask if you can run this model on a single 4090 GPU?
In my setup, at least two NVIDIA RTX 4090 GPUs are necessary.
Have you tried setting max_num_seqs=32 and gpu_memory_utilization=0.98? This allows the model to load on a single 24 GB GPU. (I use an A10G.)
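A quick sketch of those settings with the vLLM Python API (values as suggested above; the model path is a placeholder):

```python
from vllm import LLM

# Single-GPU load on a 24 GB card, using the suggested settings.
llm = LLM(
    model="/home/nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    trust_remote_code=True,
    max_num_seqs=32,               # cap concurrent sequences to reduce memory pressure
    gpu_memory_utilization=0.98,   # let vLLM use nearly the whole 24 GB
)
```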
Thanks for the suggestions! After setting these parameters, the model loads successfully on a single 24 GB GPU. However, it still runs slower than Qwen3-8B (around 7,500 tok/s vs. 10,000 tok/s). I'm wondering whether longer decoding runs are needed to see the inference speedup benefits.
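One way to check that would be to time runs at different output lengths; a rough sketch (single GPU with the settings above; prompt and batch size are placeholders):

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="/home/nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    trust_remote_code=True,
    max_num_seqs=32,
    gpu_memory_utilization=0.98,
)

prompt = "Write a short report about GPU inference.\n"
for max_tokens in (1, 256, 2048):  # prefill-only vs. increasingly decode-heavy
    params = SamplingParams(max_tokens=max_tokens, temperature=0.0, ignore_eos=True)
    start = time.perf_counter()
    outputs = llm.generate([prompt] * 32, params)
    elapsed = time.perf_counter() - start
    gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"max_tokens={max_tokens}: {gen_tokens / elapsed:.0f} generated tok/s")
```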
I also tried comparing nano-v2-9B and Qwen3-8B in latency, but I found that nano-v2-9B is not obviously faster than Qwen3-8B and may even be slower than Qwen3-32B. My test prompt asks the LLM to write a paper of about 6000 tokens, and the prompt itself is about 500 tokens of requirements. Does anyone have any ideas? Any help would be appreciated.
Has anyone tried this on their own machine with my setup? Or is it an issue with my settings?
env: vllm
command:
*start nano-v2-9b: python -m vllm.entrypoints.openai.api_server --model "/home/nvidia/NVIDIA-Nemotron-Nano-9B-v2" --dtype float16 --api-key abc --tensor-parallel-size 1 --gpu-memory-utilization 0.95 --trust_remote_code --mamba_ssm_cache_dtype auto
*start qwen3-8b: python -m vllm.entrypoints.openai.api_server --model "/home/Qwen/Qwen3-8B" --dtype float16 --api-key abc --tensor-parallel-size 1 --gpu-memory-utilization 0.95
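For timing against these servers, a simple client-side sketch using the OpenAI-compatible API (port 8000 is the vLLM default; the prompt is a placeholder for my ~500-token requirements):

```python
import time
from openai import OpenAI

# Point at whichever server is running; the api_key matches the --api-key flag above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="abc")

prompt = "Requirements (placeholder, ~500 tokens in the real test): write a paper of about 6000 tokens."
start = time.perf_counter()
resp = client.chat.completions.create(
    model="/home/nvidia/NVIDIA-Nemotron-Nano-9B-v2",   # or "/home/Qwen/Qwen3-8B"
    messages=[{"role": "user", "content": prompt}],
    max_tokens=6000,
    temperature=0.0,
)
elapsed = time.perf_counter() - start
out_tokens = resp.usage.completion_tokens
print(f"{out_tokens} output tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.1f} tok/s")
```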
Is the test configuration aligned? I've noticed discrepancies in input token usage.
How many test samples are used? More test samples may be necessary to establish a reliable comparison.
Thank you for your quick feedback. They are the same prompt, but due to differences between the two models' tokenizers, the input token counts are not identical. Since the differences are minimal, we are treating the inputs as approximately comparable for now. I tested with three samples using different questions, and the conclusion remained consistent. The table currently shows the results of only one of the prompts.
Or do you have any successful examples that I could try to reproduce on my side? I'm feeling a bit confused :(
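To see how far apart the two tokenizers actually are on the same prompt, a quick check (paths are the local checkpoints from the commands above; the prompt file name is a placeholder):

```python
from transformers import AutoTokenizer

prompt = open("test_prompt.txt").read()  # the shared test prompt
for path in ("/home/nvidia/NVIDIA-Nemotron-Nano-9B-v2", "/home/Qwen/Qwen3-8B"):
    tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
    print(path, "->", len(tok(prompt)["input_ids"]), "input tokens")
```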
I'm a bit confused as well. Does anyone have practical ideas or examples about speeding up inference?
