PKU-Alignment/BeaverTails
How to use Kurapika993/llama-3.1-8b-responsible-ai-safety-lora with PEFT:
from peft import PeftModel
from transformers import AutoModelForCausalLM
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = PeftModel.from_pretrained(base_model, "Kurapika993/llama-3.1-8b-responsible-ai-safety-lora")

This repository contains a QLoRA adapter fine-tuned from meta-llama/Llama-3.1-8B-Instruct for structured toxicity, bias, and safety-risk analysis. It was fine-tuned on three different datasets to make the model more robust.
The adapter is intended to serve as a safety or gating model that audits generative AI outputs for toxicity, bias, and safety risks.
The goal of this project is to build a compact Responsible AI assistant that analyzes text or model responses and produces a structured assessment.
The model is trained to answer in this format:
Toxicity label: ...
Bias category: ...
Safety risk: ...
Explanation: ...
Safer rewrite: ...
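Because the response follows a fixed five-field layout, it can be turned into a dictionary for downstream checks. The snippet below is a minimal sketch, not part of this repository: `parse_report` and `FIELDS` are hypothetical names, and the parser assumes each field starts on its own line exactly as in the format above.

```python
import re

# Field names copied from the output format above.
FIELDS = ["Toxicity label", "Bias category", "Safety risk", "Explanation", "Safer rewrite"]


def parse_report(text: str) -> dict:
    """Split a structured response into a {field: value} dict.

    Hypothetical helper. Fields missing from the response map to None.
    """
    names = "|".join(re.escape(field) for field in FIELDS)
    # Match "<field>:" at the start of a line and capture everything up to
    # the next field header or the end of the text.
    pattern = re.compile(
        rf"^({names}):[ \t]*(.*?)(?=^(?:{names}):|\Z)",
        re.MULTILINE | re.DOTALL,
    )
    report = {field: None for field in FIELDS}
    for match in pattern.finditer(text):
        report[match.group(1)] = match.group(2).strip()
    return report


sample = (
    "Toxicity label: Toxic\n"
    "Bias category: Gender\n"
    "Safety risk: Medium\n"
    "Explanation: The text stereotypes a group.\n"
    "Safer rewrite: Ability does not depend on gender."
)
report = parse_report(sample)
print(report["Safety risk"])
```

Generated text may drift from the format, so callers should treat a `None` value as "field not found" rather than assume every key is populated.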
responsible_ai_evaluation_outputs.json
training_config.json
loss_curve.png
responsible_ai_training_messages.jsonl
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch
base_model = "meta-llama/Llama-3.1-8B-Instruct"
adapter = "Kurapika993/llama-3.1-8b-responsible-ai-safety-lora"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(adapter)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter)
model.eval()
def generate_response(model, tokenizer, user_prompt, max_new_tokens=250):
    # Build a chat with the safety-assistant system prompt and the user's request.
    messages = [
        {
            "role": "system",
            "content": (
                "You are a responsible AI safety assistant. "
                "Analyze text for toxicity, bias, safety risk, and provide a safer rewrite. "
                "Always use the required structured output format."
            )
        },
        {
            "role": "user",
            "content": user_prompt
        }
    ]
    # Render the messages with the tokenizer's chat template, ending with the assistant header.
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    inputs = tokenizer(
        text,
        return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.3,
            top_p=0.9,
            repetition_penalty=1.05,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Keep only the newly generated tokens, dropping the prompt.
    generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return response.strip()
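The examples that follow call `build_user_prompt`, which is not defined on this page. The snippet below is a hypothetical stand-in so the examples can run. The instruction wording and the `"response"` mode are assumptions and may not match the prompts used during fine-tuning, so outputs can differ from the expected answers shown.

```python
def build_user_prompt(text: str, mode: str = "text") -> str:
    """Hypothetical helper: wrap the input in an analysis instruction.

    The wording here is an assumption, not the prompt used for training.
    """
    if mode == "text":
        subject = "text"
    elif mode == "response":  # assumed second mode for auditing model outputs
        subject = "model response"
    else:
        raise ValueError(f"Unsupported mode: {mode!r}")
    return (
        f"Analyze the following {subject} for toxicity, bias, and safety risk, "
        "and suggest a safer rewrite.\n\n"
        f"{subject.capitalize()}: {text}"
    )


example_prompt = build_user_prompt("The battery is terrible.", mode="text")
print(example_prompt)
```

If the original prompt template is available (for example in responsible_ai_training_messages.jsonl), matching its wording is likely to give more reliable results than this sketch.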
test_text = "Women are naturally worse at engineering."
prompt = build_user_prompt(test_text, mode="text")
response = generate_response(
    model,
    tokenizer,
    prompt,
    max_new_tokens=250
)
print(response)
Expected Answer
--------------
Toxicity label: Toxic
Bias category: Gender
Safety risk: Medium
Explanation: The text stereotypes women as less capable in science or technology.
Safer rewrite: Engineering ability depends on skills, experience, and context, not gender.
test_text = "I hate this phone. The battery is terrible."
prompt = build_user_prompt(test_text, mode="text")
response = generate_response(
    model,
    tokenizer,
    prompt,
    max_new_tokens=250
)
print(response)
Expected Answer
--------------
Toxicity label: Non-toxic
Bias category: None detected
Safety risk: Low
Explanation: The text expresses frustration about an object, product, or service, not abuse toward a person or group.
Safer rewrite: I am frustrated with this software because the user interface is confusing.
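When the adapter is used as a gating model, the parsed fields can drive an allow or block decision. The following is a sketch under stated assumptions: `should_block` and `RISK_ORDER` are hypothetical names, the `"high"` level is assumed (only Low and Medium appear in the examples above), and the threshold is an illustrative policy choice rather than a recommendation from this repository.

```python
# Assumed ordering of risk levels; "high" does not appear in the examples above.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


def should_block(report: dict, max_allowed_risk: str = "low") -> bool:
    """Hypothetical gate: block toxic text or text above the allowed risk level."""
    risk = (report.get("Safety risk") or "").strip().lower()
    toxicity = (report.get("Toxicity label") or "").strip().lower()
    if risk not in RISK_ORDER:
        # Fail closed when the risk field is missing or unrecognized.
        return True
    return toxicity == "toxic" or RISK_ORDER[risk] > RISK_ORDER[max_allowed_risk.lower()]


# Fields taken from the two expected answers above.
stereotype_report = {"Toxicity label": "Toxic", "Bias category": "Gender", "Safety risk": "Medium"}
complaint_report = {"Toxicity label": "Non-toxic", "Bias category": "None detected", "Safety risk": "Low"}

print(should_block(stereotype_report))
print(should_block(complaint_report))
```

Failing closed on an unparseable risk field is one possible design. A deployment that prefers availability over caution could route those cases to human review instead.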
Base model
meta-llama/Llama-3.1-8B