metadata
language:
- en
license: apache-2.0
library_name: transformers
tags:
- modernbert
- security
- jailbreak-detection
- prompt-injection
- token-classification
- tool-calling
- llm-safety
- mcp
datasets:
- microsoft/llmail-inject-challenge
- allenai/wildjailbreak
- hackaprompt/hackaprompt-dataset
- JailbreakBench/JBB-Behaviors
base_model: answerdotai/ModernBERT-base
pipeline_tag: token-classification
model-index:
- name: toolcall-verifier
  results:
  - task:
      type: token-classification
      name: Unauthorized Tool Call Detection
    metrics:
    - name: UNAUTHORIZED F1
      type: f1
      value: 0.935
    - name: UNAUTHORIZED Precision
      type: precision
      value: 0.9501
    - name: UNAUTHORIZED Recall
      type: recall
      value: 0.9205
    - name: Accuracy
      type: accuracy
      value: 0.9288
ToolCallVerifier - Unauthorized Tool Call Detection
What This Model Does
ToolCallVerifier is a ModernBERT-based token classifier that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.
| Label | Description |
|---|---|
| AUTHORIZED | Token is part of a legitimate, user-requested action |
| UNAUTHORIZED | Token indicates injected/malicious content → BLOCK |
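As a minimal sketch of how these per-token labels could be turned into a block/allow decision: the code below assumes predictions shaped like the output of the `transformers` token-classification pipeline (dicts with `entity`, `score`, `start`, `end`) and assumes the label strings match the table exactly. The `decide` helper, the `Decision` type, and the 0.5 threshold are illustrative assumptions, not part of the released model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Decision:
    blocked: bool
    flagged_spans: List[Tuple[int, int]]  # character offsets into the tool-call JSON


def decide(predictions: List[Dict], threshold: float = 0.5) -> Decision:
    """Block if any token is labeled UNAUTHORIZED with score >= threshold.

    Flagged tokens that touch or overlap are merged into a single span.
    """
    flagged = sorted(
        (p["start"], p["end"])
        for p in predictions
        if p["entity"] == "UNAUTHORIZED" and p["score"] >= threshold
    )
    merged: List[Tuple[int, int]] = []
    for start, end in flagged:
        # Merge when the token starts at most one character after the last span.
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return Decision(blocked=bool(merged), flagged_spans=merged)


# Hypothetical predictions for illustration only (not real model output).
example = [
    {"entity": "AUTHORIZED", "score": 0.99, "start": 0, "end": 9},
    {"entity": "UNAUTHORIZED", "score": 0.97, "start": 10, "end": 18},
    {"entity": "UNAUTHORIZED", "score": 0.91, "start": 19, "end": 30},
    {"entity": "UNAUTHORIZED", "score": 0.30, "start": 40, "end": 45},
]
result = decide(example)
print(result.blocked, result.flagged_spans)
```

Blocking on a single confident UNAUTHORIZED token is the strictest policy; a deployment that sees too many false positives might instead require a minimum flagged span length or a higher threshold.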
Attack Categories Covered
| Category | Source | Description |
|---|---|---|
| Delimiter Injection | LLMail | <<end_context>>, >>}}\]\]) |
| Word Obfuscation | LLMail | Inserting noise words between tokens |
| Fake Sessions | LLMail | START_USER_SESSION, EXECUTE_USERQUERY |
| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
| XML Tag Injection | WildJailbreak | <execute_action>, <tool_call> |
| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
| Intent Mismatch | Synthetic | User asks X, tool does Y |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
| MCP Shadowing | Synthetic | Fake authorization context |
Integration with FunctionCallSentinel
This model is Stage 2 of a two-stage defense pipeline:
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│   User Prompt   │────▶│   ToolCallSentinel   │────▶│   LLM + Tools   │
│                 │     │      (Stage 1)       │     │                 │
└─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                              │
              ┌───────────────────────────────────────────────▼────────┐
              │             ToolCallVerifier (This Model)              │
              │     Token-level verification before tool execution     │
              └────────────────────────────────────────────────────────┘
| Scenario | Recommendation |
|---|---|
| General chatbot | Stage 1 only |
| Tool-calling agent (low risk) | Stage 1 only |
| Tool-calling agent (high risk) | Both stages |
| Email/file system access | Both stages |
| Financial transactions | Both stages |
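The staging recommendations above can be sketched as a small gate in front of tool execution. Everything here is hypothetical: `stage1_is_safe` stands in for the Stage 1 prompt classifier, `stage2_is_authorized` stands in for this model applied to the serialized tool call, and the `HIGH_RISK_TOOLS` set is an example policy rather than a list supplied with the model.

```python
import json
from typing import Callable, Dict, Set

# Example policy (assumption): tools touching email, files, or money get both stages.
HIGH_RISK_TOOLS: Set[str] = {"send_email", "write_file", "transfer_funds"}


def gate_tool_call(
    user_prompt: str,
    tool_call: Dict,
    stage1_is_safe: Callable[[str], bool],
    stage2_is_authorized: Callable[[str, str], bool],
) -> bool:
    """Return True if the tool call may execute under the example policy."""
    # Stage 1 always screens the user prompt.
    if not stage1_is_safe(user_prompt):
        return False
    # Stage 2 runs only for high-risk tools, on the serialized tool call.
    if tool_call.get("name") in HIGH_RISK_TOOLS:
        call_json = json.dumps(tool_call, sort_keys=True)
        return stage2_is_authorized(user_prompt, call_json)
    return True


# Stub classifiers for demonstration; a real deployment would call the two models.
def always_safe(prompt: str) -> bool:
    return True


def deny_all(prompt: str, call_json: str) -> bool:
    return False


low_risk = gate_tool_call(
    "Summarize my inbox", {"name": "search", "arguments": {"q": "inbox"}},
    always_safe, deny_all,
)
high_risk = gate_tool_call(
    "Summarize my inbox", {"name": "send_email", "arguments": {"to": "x@example.com"}},
    always_safe, deny_all,
)
print(low_risk, high_risk)
```

Keeping the two classifiers behind plain callables lets the gate be tested with stubs and keeps the risk policy separate from the model-loading code.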
Intended Use
Primary Use Cases
- LLM Agent Security: Verify tool calls before execution
- Prompt Injection Defense: Detect unauthorized actions from injected prompts
- API Gateway Protection: Filter malicious tool calls at infrastructure level
Out of Scope
- General text classification
- Non-tool-calling scenarios
- Languages other than English
License
Apache 2.0