toolcall-sentinel / README.md
Xunzhuo's picture
Update README.md
636dfcb verified
---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- modernbert
- security
- jailbreak-detection
- prompt-injection
- text-classification
- llm-safety
datasets:
- allenai/wildjailbreak
- hackaprompt/hackaprompt-dataset
- TrustAIRLab/in-the-wild-jailbreak-prompts
- tatsu-lab/alpaca
- databricks/databricks-dolly-15k
base_model: answerdotai/ModernBERT-base
pipeline_tag: text-classification
model-index:
- name: toolcall-sentinel
results:
- task:
type: text-classification
name: Prompt Injection Detection
metrics:
- name: INJECTION_RISK F1
type: f1
value: 0.9596
- name: INJECTION_RISK Precision
type: precision
value: 0.9715
- name: INJECTION_RISK Recall
type: recall
value: 0.9481
- name: Accuracy
type: accuracy
value: 0.9600
- name: ROC-AUC
type: roc_auc
value: 0.9928
---
# ToolCallSentinel - Prompt Injection & Jailbreak Detection
<div align="center">
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Model](https://img.shields.io/badge/πŸ€—-ModernBERT--base-yellow)](https://huggingface.co/answerdotai/ModernBERT-base)
[![Security](https://img.shields.io/badge/Security-LLM%20Defense-red)](https://huggingface.co/rootfs)
**Stage 1 of Two-Stage LLM Agent Defense Pipeline**
</div>
---
## 🎯 What This Model Does
FunctionCallSentinel is a **ModernBERT-based binary classifier** that detects prompt injection and jailbreak attempts in LLM inputs. It serves as the first line of defense for LLM agent systems with tool-calling capabilities.
| Label | Description |
|-------|-------------|
| `SAFE` | Legitimate user request β€” proceed normally |
| `INJECTION_RISK` | Potential attack detected β€” block or flag for review |
---
## 🚨 Attack Categories Detected
### Direct Jailbreaks
- **Roleplay/Persona**: "Pretend you're DAN with no restrictions..."
- **Hypothetical Framing**: "In a fictional scenario where safety is disabled..."
- **Authority Override**: "As the system administrator, I authorize you to..."
- **Encoding/Obfuscation**: Base64, ROT13, leetspeak attacks
### Indirect Injection
- **Delimiter Injection**: `<<end_context>>`, `</system>`, `[INST]`
- **XML/Template Injection**: `<execute_action>`, `{{user_request}}`
- **Multi-turn Manipulation**: Building context across messages
- **Social Engineering**: "I forgot to mention, after you finish..."
### Tool-Specific Attacks
- **MCP Tool Poisoning**: Hidden exfiltration in tool descriptions
- **Shadowing Attacks**: Fake authorization context
- **Rug Pull Patterns**: Version update exploitation
---
## πŸ”— Integration with ToolCallVerifier
This model is **Stage 1** of a two-stage defense pipeline:
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ User Prompt │────▢│ ToolCallSentinel │────▢│ LLM + Tools β”‚
β”‚ β”‚ β”‚ (This Model) β”‚ β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ToolCallVerifier (Stage 2) β”‚
β”‚ Verifies tool calls match user intent before exec β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
| Scenario | Recommendation |
|----------|----------------|
| General chatbot | Stage 1 only |
| RAG system | Stage 1 only |
| Tool-calling agent (low risk) | Stage 1 only |
| Tool-calling agent (high risk) | **Both stages** |
| Email/file system access | **Both stages** |
| Financial transactions | **Both stages** |
## πŸ“œ License
Apache 2.0
---