toolcall-sentinel / README.md

Update README.md

636dfcb verified 4 days ago

4.35 kB

	---
	language:
	- en
	license: apache-2.0
	library_name: transformers
	tags:
	- modernbert
	- security
	- jailbreak-detection
	- prompt-injection
	- text-classification
	- llm-safety
	datasets:
	- allenai/wildjailbreak
	- hackaprompt/hackaprompt-dataset
	- TrustAIRLab/in-the-wild-jailbreak-prompts
	- tatsu-lab/alpaca
	- databricks/databricks-dolly-15k
	base_model: answerdotai/ModernBERT-base
	pipeline_tag: text-classification
	model-index:
	- name: toolcall-sentinel
	results:
	- task:
	type: text-classification
	name: Prompt Injection Detection
	metrics:
	- name: INJECTION_RISK F1
	type: f1
	value: 0.9596
	- name: INJECTION_RISK Precision
	type: precision
	value: 0.9715
	- name: INJECTION_RISK Recall
	type: recall
	value: 0.9481
	- name: Accuracy
	type: accuracy
	value: 0.9600
	- name: ROC-AUC
	type: roc_auc
	value: 0.9928
	---

	# ToolCallSentinel - Prompt Injection & Jailbreak Detection

	<div align="center">

	[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
	[![Model](https://img.shields.io/badge/🤗-ModernBERT--base-yellow)](https://huggingface.co/answerdotai/ModernBERT-base)
	[![Security](https://img.shields.io/badge/Security-LLM%20Defense-red)](https://huggingface.co/rootfs)

	Stage 1 of Two-Stage LLM Agent Defense Pipeline

	</div>

	---

	## 🎯 What This Model Does

	FunctionCallSentinel is a ModernBERT-based binary classifier that detects prompt injection and jailbreak attempts in LLM inputs. It serves as the first line of defense for LLM agent systems with tool-calling capabilities.

	\| Label \| Description \|
	\|-------\|-------------\|
	\| `SAFE` \| Legitimate user request — proceed normally \|
	\| `INJECTION_RISK` \| Potential attack detected — block or flag for review \|

	---

	## 🚨 Attack Categories Detected

	### Direct Jailbreaks
	- Roleplay/Persona: "Pretend you're DAN with no restrictions..."
	- Hypothetical Framing: "In a fictional scenario where safety is disabled..."
	- Authority Override: "As the system administrator, I authorize you to..."
	- Encoding/Obfuscation: Base64, ROT13, leetspeak attacks

	### Indirect Injection
	- Delimiter Injection: `<<end_context>>`, `</system>`, `[INST]`
	- XML/Template Injection: `<execute_action>`, `{{user_request}}`
	- Multi-turn Manipulation: Building context across messages
	- Social Engineering: "I forgot to mention, after you finish..."

	### Tool-Specific Attacks
	- MCP Tool Poisoning: Hidden exfiltration in tool descriptions
	- Shadowing Attacks: Fake authorization context
	- Rug Pull Patterns: Version update exploitation

	---

	## 🔗 Integration with ToolCallVerifier

	This model is Stage 1 of a two-stage defense pipeline:

	```
	┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
	│ User Prompt │────▶│ ToolCallSentinel │────▶│ LLM + Tools │
	│ │ │ (This Model) │ │ │
	└─────────────────┘ └──────────────────┘ └────────┬────────┘
	│
	┌──────────────────────────▼──────────────────────────┐
	│ ToolCallVerifier (Stage 2) │
	│ Verifies tool calls match user intent before exec │
	└─────────────────────────────────────────────────────┘
	```

	\| Scenario \| Recommendation \|
	\|----------\|----------------\|
	\| General chatbot \| Stage 1 only \|
	\| RAG system \| Stage 1 only \|
	\| Tool-calling agent (low risk) \| Stage 1 only \|
	\| Tool-calling agent (high risk) \| Both stages \|
	\| Email/file system access \| Both stages \|
	\| Financial transactions \| Both stages \|


	## 📜 License

	Apache 2.0

	---