PEFT
Safetensors
qlora
lora
responsible-ai
toxicity
bias
safety
llama
instruction-tuning
sft
trl
bitsandbytes

Llama-3.1-8B-Instruct Responsible AI QLoRA Assistant

This repository contains a QLoRA adapter fine-tuned from meta-llama/Llama-3.1-8B-Instruct for structured toxicity, bias, and safety-risk analysis. The adapter was fine-tuned on three different datasets to improve robustness.

Intended Use

Use the adapter as a safety or gating model to audit the outputs of generative AI systems for toxicity, bias, and safety risks.

Project Purpose

The goal of this project is to build a compact Responsible AI assistant that analyzes text or model responses and produces:

  • Toxicity label
  • Bias category
  • Safety risk level
  • Short explanation
  • Safer rewrite

Output Format

The model is trained to answer in this format:

Toxicity label: ...
Bias category: ...
Safety risk: ...
Explanation: ...
Safer rewrite: ...
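
Because the answer follows a fixed five-field layout, it can be turned into a dictionary before further processing. The following is a minimal sketch, assuming the model emits each field name exactly as listed above at the start of a line. The function names `parse_analysis` and `missing_fields` and the snake_case keys are illustrative choices, not part of this repository.

```python
import re

# Field names copied from the output format above; the snake_case keys are illustrative.
FIELDS = {
    "Toxicity label": "toxicity_label",
    "Bias category": "bias_category",
    "Safety risk": "safety_risk",
    "Explanation": "explanation",
    "Safer rewrite": "safer_rewrite",
}

# Matches "<Field name>: <value>" at the start of a line, ignoring case.
_FIELD_RE = re.compile(
    r"^\s*(" + "|".join(re.escape(name) for name in FIELDS) + r")\s*:\s*(.*)$",
    re.IGNORECASE,
)


def parse_analysis(response: str) -> dict:
    """Split a structured answer into a dict keyed by the snake_case field names."""
    parsed = {}
    current = None
    for line in response.splitlines():
        match = _FIELD_RE.match(line)
        if match:
            label = match.group(1).lower()
            current = next(key for name, key in FIELDS.items() if name.lower() == label)
            parsed[current] = match.group(2).strip()
        elif current and line.strip():
            # Treat unlabeled lines as a continuation of the previous field.
            parsed[current] = (parsed[current] + " " + line.strip()).strip()
    return parsed


def missing_fields(parsed: dict) -> list:
    """Return the keys that are absent or empty, in the order of the output format."""
    return [key for key in FIELDS.values() if not parsed.get(key)]
```

A non-empty result from `missing_fields` is a reasonable signal to regenerate or to route the input for manual review, since generation is sampled and the format is not guaranteed.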

The repository includes:

  • responsible_ai_evaluation_outputs.json
  • training_config.json
  • loss_curve.png
  • responsible_ai_training_messages.jsonl
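
To inspect the training messages, a JSON Lines file can be read with the standard library alone. This is a sketch under the assumption that `responsible_ai_training_messages.jsonl` holds one JSON object per line and has already been downloaded to the working directory. The schema of each record is not documented here, so the code makes no assumption about its keys, and `read_jsonl` is an illustrative name.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as error:
                raise ValueError(f"{path}: invalid JSON on line {line_number}") from error
    return records


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the file was downloaded.
    local_file = Path("responsible_ai_training_messages.jsonl")
    if local_file.exists():
        records = read_jsonl(local_file)
        print(f"Loaded {len(records)} records")
```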

Example Usage (Loading the Model)

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

base_model = "meta-llama/Llama-3.1-8B-Instruct"
adapter = "Kurapika993/llama-3.1-8b-responsible-ai-safety-lora"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(adapter)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

model = PeftModel.from_pretrained(model, adapter)
model.eval()

Example Inference (After Loading the Model)

def generate_response(model, tokenizer, user_prompt, max_new_tokens=250):
    messages = [
        {
            "role": "system",
            "content": (
                "You are a responsible AI safety assistant. "
                "Analyze text for toxicity, bias, safety risk, and provide a safer rewrite. "
                "Always use the required structured output format."
            )
        },
        {
            "role": "user",
            "content": user_prompt
        }
    ]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer(
        text,
        return_tensors="pt"
    ).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.3,
            top_p=0.9,
            repetition_penalty=1.05,
            pad_token_id=tokenizer.eos_token_id,
        )

    generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    return response.strip()
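
The examples that follow call `build_user_prompt`, whose definition does not appear in this card. The version below is a hypothetical stand-in so that the examples can run. Its wording is an assumption, not the prompt used during training, and results may differ from the expected answers if the training prompt was phrased differently. Only `mode="text"` appears in the examples; the `"response"` mode is likewise assumed, based on the stated goal of analyzing text or model responses.

```python
def build_user_prompt(text: str, mode: str = "text") -> str:
    """Hypothetical stand-in for the prompt helper used in the examples.

    The instruction wording is an assumption and may not match the
    prompt format the adapter was trained on.
    """
    if mode == "text":
        subject = "text"
    elif mode == "response":  # assumed mode; not shown in the original examples
        subject = "model response"
    else:
        raise ValueError(f"Unsupported mode: {mode!r}")

    return (
        f"Analyze the following {subject} for toxicity, bias, and safety risk, "
        "and provide a safer rewrite.\n\n"
        f"{subject.capitalize()}: {text.strip()}"
    )
```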


test_text = "Women are naturally worse at engineering."

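# build_user_prompt is the author's prompt-building helper; its definition is not included in this card.
prompt = build_user_prompt(test_text, mode="text")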

response = generate_response(
    model,
    tokenizer,
    prompt,
    max_new_tokens=250
)

print(response)

Expected Answer
--------------
Toxicity label: Toxic
Bias category: Gender
Safety risk: Medium
Explanation: The text stereotypes women as less capable in science or technology.
Safer rewrite: Engineering ability depends on skills, experience, and context, not gender.

test_text = "I hate this phone. The battery is terrible."

prompt = build_user_prompt(test_text, mode="text")

response = generate_response(
    model,
    tokenizer,
    prompt,
    max_new_tokens=250
)

print(response)

Expected Answer
--------------
Toxicity label: Non-toxic
Bias category: None detected
Safety risk: Low
Explanation: The text expresses frustration about an object, product, or service, not abuse toward a person or group.
Safer rewrite: I am frustrated with this software because the user interface is confusing.
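
For the gating use described under Intended Use, the toxicity label and safety risk can drive a simple pass/block decision. The sketch below is one possible policy, not part of this repository. It assumes the label strings seen in the examples ("Toxic", "Non-toxic", "Low", "Medium") plus an assumed "High" level, and it fails closed on anything it does not recognize. The name `allow_output` is illustrative.

```python
# "high" is an assumed level; the examples in this card show only Low and Medium.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


def allow_output(toxicity_label: str, safety_risk: str, max_risk: str = "low") -> bool:
    """Return True if the analyzed text may pass the gate.

    Anything not labeled exactly "Non-toxic", or carrying an unrecognized
    risk level, is blocked (fail closed).
    """
    risk = RISK_ORDER.get(safety_risk.strip().lower())
    if risk is None:
        return False
    is_non_toxic = toxicity_label.strip().lower() == "non-toxic"
    return is_non_toxic and risk <= RISK_ORDER[max_risk.strip().lower()]
```

With the default threshold, the phone-battery example passes and the engineering example is blocked. Raising `max_risk` to `"medium"` loosens only the risk check, so text labeled toxic is still blocked.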