My thoughts on the CoT of this model
Hello,
I just converted this model to GGUF format in F32 precision, which, as far as I am aware, should be more than enough to preserve the accuracy of the original safetensors format.
Running this model in LM Studio at the moment with a coding task. I know this is probably not within the scope of what this model was meant for, but I was actually surprised at how good the model is at thinking about the given prompt despite that. However, I'm not sure why, but when the model already has a solution for some part of the code logic, it suddenly decides that there's a better way and starts thinking about it all over again, trying to follow that "better way". From my point of view, the second approach wasn't really better, just different, and it used a simplified algorithm. Even that didn't seem good enough for the model, and after writing all of that, it decided to think about a third solution, which again I wouldn't call better either.
This led me to the conclusion that the model may suffer from the infamous overthinking problem, like many other models do.
I mean, we want the best possible response overall, but if the model gets stuck thinking about a small part of the bigger problem, it never actually reaches the final response, so we can't even see whether its first solution for that small part was good enough or not.
Is there really no way to convince the model that one solution for a small part of the bigger problem is enough?
Thank you for your interest in our work and for sharing your detailed observations! 😊
In the current 2511 release version, we have prioritized pushing the performance boundaries of small models 🚀. As a result, during both SFT trajectory filtering and RL reward design, we have not yet incorporated explicit length control or length penalties ⏳. This may sometimes lead the model to overthink, even when an earlier solution was already sufficient.
We acknowledge this behavior and are actively exploring ways to encourage more efficient reasoning in future versions, without compromising overall output quality. Your feedback is greatly appreciated and will help inform these improvements.
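As a rough illustration of what a length penalty in the reward could look like (this is purely a hypothetical sketch for intuition, not the reward actually used in training):

# Hypothetical sketch of a length-penalized reward; NOT the actual 2511 reward design.
def length_penalized_reward(correct: bool, num_tokens: int,
                            budget: int = 8192, alpha: float = 0.5) -> float:
    # Base reward for task success, minus a penalty that grows once the
    # generation exceeds a soft token budget (penalty capped at alpha).
    overflow = max(0, num_tokens - budget) / budget
    return float(correct) - alpha * min(overflow, 1.0)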
How about generation parameters: temp, top-p, etc.? How do I know if I'm using the correct ones?
For all benchmarks, we use a sampling temperature of 0.6 and top-p of 0.95, and we set the maximum generation length to 64k tokens.
That's from their paper on arXiv.
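For what it's worth, here's how those settings would look in vLLM; this is just a sketch with a placeholder model path, and I don't actually know which framework they used for the benchmarks:

# Hypothetical vLLM setup using the paper's sampling settings (framework choice is a guess).
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/model")  # placeholder path
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=65536)  # 64k generation cap
outputs = llm.generate(["<your prompt here>"], params)
print(outputs[0].outputs[0].text)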
But if you guys wouldn't mind sharing how you ran the model for the benchmarks, I'm wondering too. Was it through transformers, vLLM, or something else? That way we'd have the values you used for the remaining parameters.
Btw thanks for sharing this model. Polaris vibes :)
For EQ-Bench 3, do you by chance know what the "compliance" criterion evaluates? I couldn't find the information in their repo or their paper, and I guess their dataset is private?
Is it instruction following?
Thanks for the question! For practical reference, here's a working setup I've been using successfully:
Framework: SGLang v0.4.8+ (great performance for long contexts)
python3 -m sglang.launch_server \
--model-path {model_path} \
--host 0.0.0.0 \
--trust-remote-code \
--enable-torch-compile \
--tp-size 1
Generation parameters:
temperature: 0.6
top_p: 0.95
repetition_penalty: 1.0
max_tokens: not explicitly set (defaults to the model's limit)
timeout: 3600
Client call:
# Assumes the SGLang server above is running locally (default port 30000) and
# that `messages` and `model` are defined as usual.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    messages=messages,
    model=model,
    temperature=0.6,
    top_p=0.95,
    # repetition_penalty is not a standard OpenAI parameter, so pass it via extra_body
    extra_body={"repetition_penalty": 1.0},
    stream=False,
    timeout=3600,
)
Happy to share more details if you're debugging specific evals.
Thanks! I tried it in place of llama.cpp with the same sampling params, and what was intriguing is that on my test it was 3 successful attempts with SGLang vs 3 failures using the BF16 GGUF. But I guess I have to try more! It takes a while though, because my test makes it burn ~35k tokens each time!
SGLang v0.4.8+ (great performance for long contexts)
And thanks for this tip! With llama.cpp it starts at ~80 t/s but drops to ~30 t/s by ~30k context, while with SGLang I had an almost constant 90 t/s (I don't remember exactly, but it was at least stable for the first 10k). I'll also try without capturing the CUDA graph, because the autotune takes so long!
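If anyone else wants to try the same thing, I think the relevant server flag is --disable-cuda-graph (worth double-checking against your SGLang version), e.g.:

python3 -m sglang.launch_server \
--model-path {model_path} \
--trust-remote-code \
--enable-torch-compile \
--disable-cuda-graph \
--tp-size 1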
Thanks again!
Oh OK, I lose about 20 t/s at the start with the CUDA graph disabled! Still stable over time though!
Underrated model. It's really smart. It even answered two of my test questions that only a few 20+B parameter models could answer, like QwQ 32B.
I thought that was impossible for a tiny model.
For EQ-Bench 3, do you by chance know what the "compliance" criterion evaluates? I couldn't find the information in their repo or their paper, and I guess their dataset is private?
Is it instruction following?
Hi @owao ,
In EQ-Bench 3, each item is evaluated across 11 dimensions, and "compliance" is one of them.
You can find the official scoring prompt template here:
https://github.com/EQ-bench/eqbench3/blob/main/data/rubric_scoring_prompt.txt
Hope that helps!
Thanks!
Yeah so it's completely subjective and I guess only the judge model knows what it is :D I'd better ask it instead!