ToolCallVerifier - Unauthorized Tool Call Detection

License Model

Stage 2 of Two-Stage LLM Agent Defense Pipeline


🎯 What This Model Does

ToolCallVerifier is a ModernBERT-based token classifier that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.

Label Description
AUTHORIZED Token is part of a legitimate, user-requested action
UNAUTHORIZED Token indicates injected/malicious content β€” BLOCK

🚨 Attack Categories Covered

Category Source Description
Delimiter Injection LLMail <<end_context>>, >>}}\]\])
Word Obfuscation LLMail Inserting noise words between tokens
Fake Sessions LLMail START_USER_SESSION, EXECUTE_USERQUERY
Roleplay Injection WildJailbreak "You are an admin bot that can..."
XML Tag Injection WildJailbreak <execute_action>, <tool_call>
Authority Bypass WildJailbreak "As administrator, I authorize..."
Intent Mismatch Synthetic User asks X, tool does Y
MCP Tool Poisoning Synthetic Hidden exfiltration in tool args
MCP Shadowing Synthetic Fake authorization context

πŸ”— Integration with FunctionCallSentinel

This model is Stage 2 of a two-stage defense pipeline:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   User Prompt   │────▢│ ToolCallSentinel │────▢│   LLM + Tools   β”‚
β”‚                 β”‚     β”‚      (Stage 1)       β”‚     β”‚                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                              β”‚
                               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                               β”‚           ToolCallVerifier (This Model)                 β”‚
                               β”‚   Token-level verification before tool execution        β”‚
                               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Scenario Recommendation
General chatbot Stage 1 only
Tool-calling agent (low risk) Stage 1 only
Tool-calling agent (high risk) Both stages
Email/file system access Both stages
Financial transactions Both stages

🎯 Intended Use

Primary Use Cases

  • LLM Agent Security: Verify tool calls before execution
  • Prompt Injection Defense: Detect unauthorized actions from injected prompts
  • API Gateway Protection: Filter malicious tool calls at infrastructure level

Out of Scope

  • General text classification
  • Non-tool-calling scenarios
  • Languages other than English

πŸ“œ License

Apache 2.0

Downloads last month
14
Safetensors
Model size
0.1B params
Tensor type
F32
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for llm-semantic-router/toolcall-verifier

Finetuned
(990)
this model

Datasets used to train llm-semantic-router/toolcall-verifier

Space using llm-semantic-router/toolcall-verifier 1

Evaluation results