ToolCallVerifier - Unauthorized Tool Call Detection
What This Model Does
ToolCallVerifier is a ModernBERT-based token classifier that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.
| Label | Description |
|---|---|
| AUTHORIZED | Token is part of a legitimate, user-requested action |
| UNAUTHORIZED | Token indicates injected or malicious content → BLOCK |
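The two labels above can be turned into a block/allow decision roughly as follows. This is a minimal sketch, not the authors' reference code: it assumes the model loads through the standard Hugging Face `token-classification` pipeline, that the raw tool-call JSON string is a valid input, and that the label names in the model config match the table. The helper names (`load_verifier`, `unauthorized_spans`, `should_block`) and the 0.5 score threshold are hypothetical.

```python
from typing import Dict, List


def load_verifier():
    """Build a token-classification pipeline for this model (assumed to be compatible).

    transformers is imported lazily so the helpers below work without it installed.
    """
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model="llm-semantic-router/toolcall-verifier",
        aggregation_strategy="simple",  # merge adjacent tokens that share a label
    )


def unauthorized_spans(predictions: List[Dict], min_score: float = 0.5) -> List[str]:
    """Return the text of every span labeled UNAUTHORIZED with score >= min_score."""
    return [
        p["word"]
        for p in predictions
        if p["entity_group"] == "UNAUTHORIZED" and p["score"] >= min_score
    ]


def should_block(predictions: List[Dict], min_score: float = 0.5) -> bool:
    """Block the tool call if any span is flagged (hypothetical policy)."""
    return bool(unauthorized_spans(predictions, min_score))


# Hand-written predictions in the pipeline's aggregated output shape,
# used here only to illustrate the decision logic without loading the model.
sample = [
    {"entity_group": "AUTHORIZED", "score": 0.99, "word": "send_email"},
    {"entity_group": "UNAUTHORIZED", "score": 0.91, "word": "attacker@evil.example"},
]
print(should_block(sample))
```

With the real model, the predictions would come from something like `load_verifier()(tool_call_json)`. Blocking on any single flagged span is a conservative choice; a deployment might instead require a minimum number of flagged tokens.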
Attack Categories Covered
| Category | Source | Description |
|---|---|---|
| Delimiter Injection | LLMail | `<<end_context>>`, `>>}}]])` |
| Word Obfuscation | LLMail | Inserting noise words between tokens |
| Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` |
| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
| XML Tag Injection | WildJailbreak | `<execute_action>`, `<tool_call>` |
| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
| Intent Mismatch | Synthetic | User asks X, tool does Y |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
| MCP Shadowing | Synthetic | Fake authorization context |
Integration with FunctionCallSentinel
This model is Stage 2 of a two-stage defense pipeline:
User Prompt → ToolCallSentinel (Stage 1) → LLM + Tools → ToolCallVerifier (This Model): token-level verification before tool execution
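The flow in the diagram can be sketched as a small orchestration function. Every callable passed in here is a hypothetical stand-in (the Stage 1 check, the LLM planner, the Stage 2 verifier, and the tool executor), and passing the user prompt to Stage 2 alongside the tool call is an assumption rather than a documented interface.

```python
from typing import Any, Callable, Dict


def run_guarded_turn(
    prompt: str,
    stage1_is_safe: Callable[[str], bool],
    plan_tool_call: Callable[[str], Dict[str, Any]],
    stage2_is_authorized: Callable[[str, Dict[str, Any]], bool],
    execute_tool: Callable[[Dict[str, Any]], Any],
) -> Dict[str, Any]:
    """Run one agent turn with both defense stages in place."""
    # Stage 1: screen the prompt before the LLM sees it.
    if not stage1_is_safe(prompt):
        return {"status": "blocked", "stage": 1}

    # The LLM proposes a tool call from the prompt.
    tool_call = plan_tool_call(prompt)

    # Stage 2: verify the proposed call before anything is executed.
    if not stage2_is_authorized(prompt, tool_call):
        return {"status": "blocked", "stage": 2}

    return {"status": "executed", "result": execute_tool(tool_call)}
```

The key property is ordering: the tool executor is only reachable after both checks pass, so a call rejected at either stage never runs.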
| Scenario | Recommendation |
|---|---|
| General chatbot | Stage 1 only |
| Tool-calling agent (low risk) | Stage 1 only |
| Tool-calling agent (high risk) | Both stages |
| Email/file system access | Both stages |
| Financial transactions | Both stages |
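The recommendations in the table could be encoded as configuration along these lines. The scenario keys are hypothetical identifiers chosen for illustration, and defaulting unknown scenarios to both stages is an assumption (fail closed), not something the model card specifies.

```python
from typing import Dict, Tuple

# Stages to run per deployment scenario, mirroring the table above.
STAGES_BY_SCENARIO: Dict[str, Tuple[int, ...]] = {
    "general_chatbot": (1,),
    "tool_agent_low_risk": (1,),
    "tool_agent_high_risk": (1, 2),
    "email_or_file_access": (1, 2),
    "financial_transactions": (1, 2),
}


def needs_stage2(scenario: str) -> bool:
    """Return True if ToolCallVerifier (Stage 2) should run for this scenario."""
    # Unknown scenarios fall back to both stages (assumed fail-closed default).
    return 2 in STAGES_BY_SCENARIO.get(scenario, (1, 2))
```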
Intended Use
Primary Use Cases
- LLM Agent Security: Verify tool calls before execution
- Prompt Injection Defense: Detect unauthorized actions from injected prompts
- API Gateway Protection: Filter malicious tool calls at infrastructure level
Out of Scope
- General text classification
- Non-tool-calling scenarios
- Languages other than English
License
Apache 2.0
Model tree for llm-semantic-router/toolcall-verifier
- Base model: answerdotai/ModernBERT-base
Evaluation results
- UNAUTHORIZED F1 (self-reported): 0.935
- UNAUTHORIZED Precision (self-reported): 0.950
- UNAUTHORIZED Recall (self-reported): 0.920
- Accuracy (self-reported): 0.929
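As a quick consistency check on the self-reported figures, F1 is the harmonic mean of precision and recall, so the F1 value can be recomputed from the other two. The function name below is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported_precision = 0.950
reported_recall = 0.920

# 2 * 0.950 * 0.920 / (0.950 + 0.920) = 1.748 / 1.870 ≈ 0.9348
print(round(f1_score(reported_precision, reported_recall), 3))  # 0.935
```

The recomputed value rounds to the reported F1 of 0.935, so the three UNAUTHORIZED metrics are mutually consistent.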