Qwen3-VL-8B-Instruct-Unredacted-MAX

Qwen3-VL-8B-Instruct-Unredacted-MAX is an optimized release built on top of huihui-ai/Huihui-Qwen3-VL-8B-Instruct-abliterated. This version focuses on packaging improvements, inference stability, and compatibility with recent Transformers releases, while preserving the multimodal reasoning capabilities of the base architecture. The result is an 8B vision-language model suited to research, structured captioning, and multimodal experimentation.

Key Highlights

  • Optimized Release Pipeline: Improved repository structure and loading consistency for smoother deployment and inference.

  • Modern Transformers Integration: Updated compatibility for recent Hugging Face Transformers versions and vision-language utilities.

  • 8B Vision-Language Architecture: Built on Qwen3-VL-8B-Instruct, offering strong reasoning ability across image-text tasks with balanced compute requirements.

  • Stable Multimodal Inference: Improved consistency for caption generation, visual reasoning, and structured outputs.

  • High-Quality Caption Generation: Produces detailed, structured descriptions suitable for dataset creation, annotation workflows, and accessibility applications.

  • Dynamic Resolution Handling: Maintains native support for variable image resolutions and aspect ratios.
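To make the dynamic resolution point concrete, here is a simplified sketch of how an image size can be snapped to a patch grid while keeping its aspect ratio roughly intact. It is an illustration only, not the processor's actual code: the function name fit_resolution, the factor of 32, and the pixel bounds are all assumptions chosen for the example.

```python
import math

def fit_resolution(height, width, factor=32,
                   min_pixels=256 * 256, max_pixels=1280 * 1280):
    """Snap (height, width) to multiples of `factor` within a pixel budget.

    Hypothetical helper: `factor` and the pixel bounds are illustrative
    defaults, not values read from the model's configuration.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")

    # First, round each side to the nearest multiple of the patch factor.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)

    if h * w > max_pixels:
        # Too large: shrink both sides by the same scale, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same scale, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w

# An image already on the grid and within budget is left unchanged.
print(fit_resolution(480, 640))
# A large photo is scaled down to fit the pixel budget.
print(fit_resolution(3000, 4000))
```

Because both sides are scaled by the same factor before rounding, wide and tall images keep their shape instead of being squashed into a fixed square.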


Base Model Signatures

This model was re-sharded from the base model and repackaged for compatibility with recent Transformers versions. Base model: https://huggingface.co/huihui-ai/Huihui-Qwen3-VL-8B-Instruct-abliterated


Quick Start with Transformers

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

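# Load the weights; device_map="auto" places them on the available device(s).
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Qwen3-VL-8B-Instruct-Unredacted-MAX",
    torch_dtype="auto",
    device_map="auto"
)

# The processor bundles the tokenizer and the image preprocessor.
processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Qwen3-VL-8B-Instruct-Unredacted-MAX"
)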

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Provide a detailed caption for this image."},
        ],
    }
]

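# Render the chat messages into a prompt string using the model's chat template.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Load the images (and videos, if any) referenced in the messages.
image_inputs, video_inputs = process_vision_info(messages)

# Build the model inputs and move them to the model's device
# rather than hard-coding "cuda", since device_map="auto" chooses the device.
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)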

generated_ids = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens so only the newly generated text is decoded.
output_text = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(output_text)

Intended Use

  • Multimodal research and vision-language evaluation
  • Image captioning and dataset generation pipelines
  • Red-teaming and robustness testing of VLMs
  • Creative and descriptive visual storytelling tasks
  • AI system prototyping with image-text reasoning components
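For the captioning and dataset generation use case, a minimal sketch of the glue code around the model might look like the following. It only builds chat messages in the same shape as the Quick Start and serializes results as JSON Lines; it does not call the model. The helper names build_caption_messages and to_jsonl_line, the record fields, and the example.com URLs are hypothetical.

```python
import json

def build_caption_messages(image, prompt="Provide a detailed caption for this image."):
    """Return a single-turn message list in the Quick Start format.

    `image` is a URL or path string; this helper is illustrative only.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]

def to_jsonl_line(image, caption):
    """Serialize one (image, caption) pair as a JSON Lines record."""
    record = {"image": image, "caption": caption.strip()}
    return json.dumps(record, ensure_ascii=False)

# Placeholder image references; replace with real URLs or local paths.
images = ["https://example.com/a.jpg", "https://example.com/b.jpg"]

# One message list per image, ready to pass through the Quick Start steps.
batch = [build_caption_messages(ref) for ref in images]
```

Each entry in batch can be fed through the chat template and generation steps shown in the Quick Start, and the decoded caption written out with to_jsonl_line.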

Limitations & Risks

Important Note: This model inherits behavioral characteristics from its base architecture and fine-tuning process.

  • Performance depends on image quality, prompt clarity, and decoding settings
  • May produce incomplete or inconsistent reasoning in complex visual scenes
  • Requires moderate to high VRAM for stable inference depending on resolution
  • Output quality varies across domains such as medical, artistic, or technical imagery
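As a rough guide to the VRAM point above, the memory needed just to hold the weights can be estimated from the parameter count and the bytes per parameter. The sketch below assumes about 9B parameters stored in BF16 (2 bytes each); the function name is hypothetical, and the figure excludes activations, the KV cache, and image tokens, which grow with resolution and output length.

```python
def estimate_weight_memory_gib(num_params, bytes_per_param=2):
    """Estimate memory for model weights alone, in GiB.

    Assumes every parameter uses `bytes_per_param` bytes (2 for BF16).
    Runtime overhead such as activations and the KV cache is not included.
    """
    return num_params * bytes_per_param / 1024**3

# Assumed inputs: ~9B parameters, BF16 storage.
weights_gib = estimate_weight_memory_gib(9e9)
print(f"{weights_gib:.1f} GiB")  # prints "16.8 GiB"
```

Actual usage during inference will be higher than this weights-only figure, particularly with high-resolution images.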

Model size: 9B params (Safetensors, tensor type BF16)