My thoughts on the CoT of this model
Hello,
I just converted this model to GGUF format in F32 precision, which, as far as I am aware, should be more than enough to preserve the accuracy of the original safetensors format.
Running this model in LM Studio at the moment with a coding task. I know this is probably not within the scope of what this model was meant for, but I was actually surprised at how good the model is at thinking about the given prompt despite that. However, I'm not sure why, but when the model already has a solution for some part of the code logic, it suddenly decides that there's a better way and starts thinking about it all over again, trying to follow that "better way". From my point of view, the second approach wasn't really better, just different, and it used a simplified algorithm. Even that didn't seem good enough for the model, and after writing all of that, it decided to think about a third solution, which again I wouldn't call better either.
This led me to the conclusion that the model may suffer from the infamous overthinking problem, like many other models do.
I mean, we want the best possible response overall, but if the model gets stuck thinking about a small part of the bigger problem, it never actually reaches the final response, so we can't even see whether its first solution for that small part was good enough or not.
Is there really no way to convince the model that one solution for a small part of the bigger problem is enough?
Thank you for your interest in our work and for sharing your detailed observations! 😊
In the current 2511 release version, we have prioritized pushing the performance boundaries of small models 🚀. As a result, during both SFT trajectory filtering and RL reward design, we have not yet incorporated explicit length control or length penalties ⏳. This may sometimes lead the model to overthink, even when an earlier solution was already sufficient.
We acknowledge this behavior and are actively exploring ways to encourage more efficient reasoning in future versions, without compromising overall output quality. Your feedback is greatly appreciated and will help inform these improvements.
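As a rough illustration of what a length penalty in the reward could look like (this is purely a hypothetical sketch for intuition, not the reward actually used in training):

# Hypothetical sketch of a length-penalized reward; NOT the actual 2511 reward design.
def length_penalized_reward(correct: bool, num_tokens: int,
                            budget: int = 8192, alpha: float = 0.5) -> float:
    # Base reward for task success, minus a penalty that grows once the
    # generation exceeds a soft token budget (penalty capped at alpha).
    overflow = max(0, num_tokens - budget) / budget
    return float(correct) - alpha * min(overflow, 1.0)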
How about generation parameters: temp, top-p, etc.? How do I know if I'm using the correct ones?
For all benchmarks, we use a sampling temperature of 0.6 and top-p of 0.95, and we set the maximum generation length to 64k tokens.
That's from their paper on arXiv.
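For what it's worth, here's how those settings would look in vLLM; this is just a sketch with a placeholder model path, and I don't actually know which framework they used for the benchmarks:

# Hypothetical vLLM setup using the paper's sampling settings (framework choice is a guess).
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/model")  # placeholder path
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=65536)  # 64k generation cap
outputs = llm.generate(["<your prompt here>"], params)
print(outputs[0].outputs[0].text)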
But if you guys wouldn't mind sharing how you ran the model for the benchmarks, I'm wondering too. Was it through transformers, vLLM, or something else? That way we'd have the values you used for the remaining parameters.
Btw thanks for sharing this model. Polaris vibes :)
For EQ-Bench 3, do you by chance know what the "compliance" criterion evaluates? I couldn't find the information in their repo or their paper, and I guess their dataset is private?
Is it instruction following?
Thanks for the question! For practical reference, here's a working setup I've been using successfully:
Framework: SGLang v0.4.8+ (great performance for long contexts)
python3 -m sglang.launch_server \
--model-path {model_path} \
--host 0.0.0.0 \
--trust-remote-code \
--enable-torch-compile \
--tp-size 1
Generation parameters:
temperature: 0.6
top_p: 0.95
repetition_penalty: 1.0
max_tokens: not explicitly set (defaults to the model's limit)
timeout: 3600
Client call:
# Assumes the SGLang server above is running locally (default port 30000) and
# that `messages` and `model` are defined as usual.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    messages=messages,
    model=model,
    temperature=0.6,
    top_p=0.95,
    # repetition_penalty is not a standard OpenAI parameter, so pass it via extra_body
    extra_body={"repetition_penalty": 1.0},
    stream=False,
    timeout=3600,
)
Happy to share more details if you're debugging specific evals.
Thanks! I tried it in place of llama.cpp with the same sampling params, and what was intriguing is that on my test it was 3 successful attempts with SGLang vs 3 failures using the BF16 GGUF. But I guess I have to try more! It takes a while though, because my test makes it burn ~35k tokens each time!
SGLang v0.4.8+ (great performance for long contexts)
And thanks for this tip! With llama.cpp it starts at ~80 t/s but drops to ~30 t/s by ~30k context, while with SGLang I had an almost constant 90 t/s (I don't remember exactly, but it was at least stable for the first 10k). I'll also try without capturing the CUDA graph, because the autotune takes so long!
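If anyone else wants to try the same thing, I think the relevant server flag is --disable-cuda-graph (worth double-checking against your SGLang version), e.g.:

python3 -m sglang.launch_server \
--model-path {model_path} \
--trust-remote-code \
--enable-torch-compile \
--disable-cuda-graph \
--tp-size 1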
Thanks again!
Oh OK, I lose about 20 t/s at the start with the CUDA graph disabled! Still stable over time though!
Underrated model. It's really smart. It even answered two of my test questions that only a few 20+B parameter models could answer, like QwQ 32B.
I thought that was impossible for a tiny model.
For EQ-Bench 3, do you by chance know what the "compliance" criterion evaluates? I couldn't find the information in their repo or their paper, and I guess their dataset is private?
Is it instruction following?
Hi @owao ,
In EQ-Bench 3, each item is evaluated across 11 dimensions, and "compliance" is one of them.
You can find the official scoring prompt template here:
https://github.com/EQ-bench/eqbench3/blob/main/data/rubric_scoring_prompt.txt
Hope that helps!
Thanks!
Yeah so it's completely subjective and I guess only the judge model knows what it is :D I'd better ask it instead!