Slower than Qwen3-8B despite claimed 3x inference speedup

#16
by coszeros - opened

Environment

  • Hardware: NVIDIA GeForce RTX 4090 24GB
  • vLLM Version: 0.10.1
  • Task: Classification Q&A (primarily prefilling-dependent)

Problem Description

I'm experiencing performance issues that contradict the claimed "up to 3x inference" speedup mentioned in the technical report.

Test Setup

  • Input length: ~4096 tokens
  • Output length: ~1 token (minimal generation)
  • Task type: Classification Q&A (prefilling-heavy workload)
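
For reference, here is a minimal sketch of how such a prefill-heavy measurement could be reproduced with vLLM's offline Python API. The prompt below is a synthetic stand-in for the real ~4096-token classification inputs and the batch size is arbitrary; max_tokens=1 mirrors the ~1-token outputs above:

  import time
  from vllm import LLM, SamplingParams

  # Two-GPU deployment, matching the TP=2 setup described in this post.
  llm = LLM(
      model="/home/nvidia/NVIDIA-Nemotron-Nano-9B-v2",  # local checkpoint path used later in this thread
      tensor_parallel_size=2,
      trust_remote_code=True,
  )

  # Synthetic ~4k-token prompt; replace with real classification inputs.
  prompts = ["document chunk " * 1300 + "\nLabel (yes/no):"] * 32

  # max_tokens=1 keeps the run prefill-dominated.
  params = SamplingParams(max_tokens=1, temperature=0.0)

  start = time.time()
  outputs = llm.generate(prompts, params)
  elapsed = time.time() - start

  prompt_tokens = sum(len(o.prompt_token_ids) for o in outputs)
  print(f"prefill throughput: {prompt_tokens / elapsed:.0f} prompt tok/s")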

Observations

  1. Memory Requirements:

    • This model requires deployment across 2 GPUs (TP=2)
    • Qwen3-8B can run on a single GPU
  2. Performance Comparison:

    • For fair comparison, I also deployed Qwen3-8B across 2 GPUs (TP=2)
    • Result: This model is approximately 1/7 (~14%) slower than Qwen3-8B

Expected Behavior

Based on the technical report claiming "up to 3x inference" speedup, I expected this model to perform significantly faster than Qwen3-8B, especially for prefilling-heavy tasks.

Actual Behavior

The model performs ~14% slower than Qwen3-8B under identical hardware conditions (2x RTX 4090).

Questions

  1. Is this performance degradation expected for prefilling-heavy workloads?
  2. Are there specific optimization settings recommended for this use case?

Any insights into optimizing performance for this specific use case would be greatly appreciated.

You would need to generate more tokens to get a more meaningful measurement.

Hi, can I ask if you can run this model on a single 4090 GPU?

In my setup, at least two NVIDIA RTX 4090 GPUs are necessary.

Have you tried setting max_num_seqs=32 and gpu_memory_utilization=0.98? This allows the model to load on a single 24GB GPU. (I use an A10G.)
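
For reference, the corresponding flags when launching the OpenAI-compatible server are --max-num-seqs 32 and --gpu-memory-utilization 0.98. With vLLM's offline Python API, a single-GPU configuration along these lines might look roughly like the sketch below (not a tuned recipe; the model path is the one used later in this thread):

  from vllm import LLM, SamplingParams

  # Fit the 9B checkpoint on one 24GB GPU by capping concurrent sequences
  # and raising the memory utilization target, per the suggestion above.
  llm = LLM(
      model="/home/nvidia/NVIDIA-Nemotron-Nano-9B-v2",
      tensor_parallel_size=1,
      max_num_seqs=32,
      gpu_memory_utilization=0.98,
      trust_remote_code=True,
  )

  # Quick smoke test that generation works.
  print(llm.generate(["Hello!"], SamplingParams(max_tokens=8))[0].outputs[0].text)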

Thanks for the suggestions! After setting these parameters, the model loads successfully on a single 24GB GPU. However, it is still slower than Qwen3-8B (around 7,500 tok/s vs. 10,000 tok/s). I'm wondering whether longer decoding runs are needed to see the inference speedup benefits.
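
If it helps, one way to check whether the speedup only appears with longer generations is to sweep the output length and compare generated tokens per second at each setting. A rough sketch with the offline API, reusing the single-GPU settings above (the prompt is synthetic, the batch size arbitrary, and ignore_eos is used only to force fixed-length outputs):

  import time
  from vllm import LLM, SamplingParams

  llm = LLM(
      model="/home/nvidia/NVIDIA-Nemotron-Nano-9B-v2",  # path used later in this thread
      max_num_seqs=32,
      gpu_memory_utilization=0.98,
      trust_remote_code=True,
  )

  prompt = "Summarize the following passage. " * 120  # synthetic stand-in prompt

  for out_len in (1, 256, 2048):
      params = SamplingParams(max_tokens=out_len, ignore_eos=True, temperature=0.0)
      start = time.time()
      outs = llm.generate([prompt] * 16, params)
      elapsed = time.time() - start
      generated = sum(len(o.outputs[0].token_ids) for o in outs)
      print(f"max_tokens={out_len}: {generated / elapsed:.0f} generated tok/s in {elapsed:.1f}s")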

I also tried to compare nano-v2-9B and Qwen3-8B in terms of latency, but I found that nano-v2-9B is not obviously faster than Qwen3-8B and may even be slower than Qwen3-32B. My test prompt asks the LLM to write a paper of about 6,000 tokens, and the prompt itself contains roughly 500 tokens of requirements. Does anyone have an idea what is going on? Any help would be appreciated.

Has anyone tried this on their machine with the same setup? Or is it an issue with my settings?

env: vLLM

command:

*start nano-v2-9b: python -m vllm.entrypoints.openai.api_server --model "/home/nvidia/NVIDIA-Nemotron-Nano-9B-v2" --dtype float16 --api-key abc --tensor-parallel-size 1 --gpu-memory-utilization 0.95 --trust_remote_code --mamba_ssm_cache_dtype auto

*start qwen3-8b: python -m vllm.entrypoints.openai.api_server --model "/home/Qwen/Qwen3-8B" --dtype float16 --api-key abc --tensor-parallel-size 1 --gpu-memory-utilization 0.95

[attached image: latency comparison results]
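
For completeness, here is a minimal sketch of how the latency of such a long-generation request can be timed against the OpenAI-compatible servers started above (the prompt is a placeholder for the real ~500-token requirements prompt; the API key "abc" and vLLM's default port 8000 come from the launch commands):

  import time
  from openai import OpenAI

  # Point base_url at whichever server (nano-v2-9B or Qwen3-8B) is being tested.
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="abc")

  prompt = "Write a ~6000-token paper that satisfies the following requirements: ..."  # placeholder

  start = time.time()
  resp = client.chat.completions.create(
      model="/home/nvidia/NVIDIA-Nemotron-Nano-9B-v2",  # served model name defaults to the --model path
      messages=[{"role": "user", "content": prompt}],
      max_tokens=6000,
      temperature=0.6,
  )
  elapsed = time.time() - start

  out_tokens = resp.usage.completion_tokens
  print(f"latency: {elapsed:.1f}s, output tokens: {out_tokens}, "
        f"decode rate: {out_tokens / elapsed:.1f} tok/s")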

Is the test configuration aligned? I've noticed discrepancies in input token usage.
How many test samples are used? More test samples may be necessary to establish a reliable comparison.

Thank you for your quick feedback. They are the same prompt, but due to differences in the tokenizers of the two models, the input token counts are not identical. Since the differences are minimal, I am assuming the inputs are approximately comparable for now. I tested with three samples by varying the questions, and the conclusion remained consistent. The table currently shows the results for only one of the prompts.

Or do you have any successful examples that I could try to reproduce on my side? I'm feeling a bit confused :(

I'm a bit confused as well. Does anyone have practical ideas or examples about speeding up inference?
