Instructions to use Qwen/Qwen3-30B-A3B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Qwen/Qwen3-30B-A3B with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="Qwen/Qwen3-30B-A3B")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Qwen/Qwen3-30B-A3B", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use Qwen/Qwen3-30B-A3B with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "Qwen/Qwen3-30B-A3B" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Qwen/Qwen3-30B-A3B", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/Qwen/Qwen3-30B-A3B
- SGLang
How to use Qwen/Qwen3-30B-A3B with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "Qwen/Qwen3-30B-A3B" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Qwen/Qwen3-30B-A3B", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "Qwen/Qwen3-30B-A3B" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Qwen/Qwen3-30B-A3B", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use Qwen/Qwen3-30B-A3B with Docker Model Runner:
docker model run hf.co/Qwen/Qwen3-30B-A3B
repetition
Did anyone notice bad repetition when roleplaying ? Phrases are repeated over multiple messages, I've tried many sampler settings including the recommended ones with presence penalty. I've also heard of many other people having this issue so it seems like a model problem.
Yes, the problem is too severe to proceed in vllm . I'll temporarily switch back to QwQ32B until a fix is available
Repetion always happen when I use the model on long contexts with YaRN enabled. Did you also use YaRN?
Unusably repetitive at just 8k context, no matter what settings are used. Refuses to drive the story, just repeats the same descriptions/dialogue with a few subtle changes. Repeats the same words & phrases from one story in another, despite different setting & scenario. Also forgets details immediately, when not close to max context length. A character is drinking a silver liquid out of a tumbler. Then the liquid is amber. Then they're drinking out of a wineglass. An object that was placed in a desk falls out of a character's pocket, etc.
similar observation here. If the prompt and expected output too long, then it starts to repeat. Not repeat for short prompts. Wonder it is something wrong with Rope + Yarn. Anyone has an idea how to fix?
Minimize context window to clear the pattern/add variance. I've lowered the context window down to minimal exchanges/tokens (5 exchanges/75 tokens). Did a few turns of breaking the pattern in different ways, then added the context back in slowly (if needed). If UI allows, break off from point of repetition or erase the repeating messages.
Break the repetition pattern - ID what it's locked into. Do a few turns, asking it to produce in varying structures.
- Forced structure -> "Give me a grocery list for lasagna in paragraph form"
- Disrupt measurement pattern - demand the unmeasurable: "Tell me the texture of waiting in one line"
- Pull into pure contradiction - break the logic chain -> "Hold still and run simultaneously"
Parameter settings: Models that repeat seem to need ranges within temp .6-.8, top_p .9-1, lower Freq (~.1) & Rep (1-2). refer to model card.
Soften overly strong/ absolute commands: "You must be concise", "Always do XX" - these are interpreted too literally and they get stuck in concept. Add a line that allows variance like "Provide varied and engaging responses."
Negation "echo chamber" - Negation often reinforces the very content you’re trying to avoid. The model latches onto the semantic content, drops the negation and creates an echo chamber. Reframe positively.
- LLMs don't understand "absence" and cannot truly process negation
- They're required to generate tokens - it cannot generate "nothing"
- Reward/attention/training mechanisms focuses & is biased on existing tokens "toward an event happening", not absent ones.
"Don't be scared" → model hears "be scared"
"She didn't run" → Model thinks "she ran"
"It did not happened" → Model generates "it happened"