One question about this GGUF

#18
by alexloops - opened

Is this GGUF the same quality as the non-MTP version if I use it without MTP?
That way I wouldn't have to keep both files, one non-MTP and one MTP.

Since the full model checks every single token, MTP should give identical output quality even though it uses the draft model pre-pass. However, I think more people need to test the GGUF before anyone can say there aren't any bugs.

Not sure if it is a bug or something else, but I'm getting absurd memory usage with the UD-Q6: it's using 12GB more than the Q8. This goes away with MTP disabled. Not sure if this is an issue with llama or the GGUF itself, but that seems extremely excessive. I know the number of lookahead tokens can be adjusted and that it needs to map more memory, but ouch.
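To illustrate the "full model checks every token" argument, here is a minimal toy sketch in Python. It is not how llama or this GGUF implements MTP: `target_model`, `draft_model`, and `speculative_decode` are hypothetical stand-ins, and it assumes greedy (argmax) decoding. Under that assumption a drafted token is only kept when it matches what the full model would have produced, so the result matches plain decoding token for token.

```python
from typing import List, Tuple


def target_model(prefix: List[int]) -> int:
    """Toy stand-in for the full model: deterministic next token for a prefix."""
    return (sum(prefix) * 31 + len(prefix) * 7) % 50


def draft_model(prefix: List[int]) -> int:
    """Toy stand-in for the draft/MTP head: usually right, deliberately wrong sometimes."""
    guess = target_model(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Plain decoding: the full model produces every token itself."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_model(tokens))
    return tokens


def speculative_decode(prompt: List[int], n_new: int, lookahead: int = 4) -> Tuple[List[int], int]:
    """Draft proposes `lookahead` tokens; the full model verifies each one."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    accepted = 0
    while len(tokens) < goal:
        # Draft pre-pass: propose a short run of tokens.
        proposal: List[int] = []
        for _ in range(lookahead):
            proposal.append(draft_model(tokens + proposal))
        # Verification: a real implementation would score all positions in one
        # batched pass; here each position is checked one at a time for clarity.
        for tok in proposal:
            if len(tokens) >= goal:
                break
            expected = target_model(tokens)
            if tok == expected:
                tokens.append(tok)       # draft was right, keep it
                accepted += 1
            else:
                tokens.append(expected)  # draft was wrong, use the full model's token
                break                    # discard the rest of the proposal
    return tokens, accepted


if __name__ == "__main__":
    prompt = [1, 2, 3]
    spec, accepted = speculative_decode(prompt, 20)
    print("identical to plain decoding:", spec == greedy_decode(prompt, 20))
    print("draft tokens accepted:", accepted)
```

The draft only changes how many full-model steps are needed, not which tokens come out. With temperature sampling the guarantee is about the output distribution rather than exact tokens, and real runs may still show small numerical differences between batched and unbatched evaluation.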

shimmyshimmer changed discussion status to closed
