toolcall-verifier / README.md

Xunzhuo

Update README.md

48c3fec verified 4 days ago

metadata

language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - modernbert
  - security
  - jailbreak-detection
  - prompt-injection
  - token-classification
  - tool-calling
  - llm-safety
  - mcp
datasets:
  - microsoft/llmail-inject-challenge
  - allenai/wildjailbreak
  - hackaprompt/hackaprompt-dataset
  - JailbreakBench/JBB-Behaviors
base_model: answerdotai/ModernBERT-base
pipeline_tag: token-classification
model-index:
  - name: toolcall-verifier
    results:
      - task:
          type: token-classification
          name: Unauthorized Tool Call Detection
        metrics:
          - name: UNAUTHORIZED F1
            type: f1
            value: 0.935
          - name: UNAUTHORIZED Precision
            type: precision
            value: 0.9501
          - name: UNAUTHORIZED Recall
            type: recall
            value: 0.9205
          - name: Accuracy
            type: accuracy
            value: 0.9288

ToolCallVerifier - Unauthorized Tool Call Detection

Stage 2 of Two-Stage LLM Agent Defense Pipeline

🎯 What This Model Does

ToolCallVerifier is a ModernBERT-based token classifier that detects unauthorized tool calls in LLM agent systems. It classifies each token of the tool call JSON to identify malicious arguments that may have been introduced by prompt injection attacks.

| Label | Description |
|-------|-------------|
| `AUTHORIZED` | Token is part of a legitimate, user-requested action |
| `UNAUTHORIZED` | Token indicates injected/malicious content: BLOCK |
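
As a rough illustration of how the two labels might be consumed, the sketch below post-processes token-classification output and decides whether to block a tool call. It assumes the Hugging Face `transformers` token-classification pipeline with `aggregation_strategy="simple"`, whose results carry `entity_group`, `score`, `start`, and `end` fields. The helper names, the 0.5 threshold, the example tool call, and the hand-written predictions are all hypothetical; the model's expected input serialization is not documented here.

```python
import json


def load_verifier(model_id):
    """Build a token-classification pipeline (requires `transformers`).

    `model_id` is left as a parameter because the exact Hub repository ID
    is not stated in this README.
    """
    from transformers import pipeline  # lazy import: helpers below work without it

    return pipeline("token-classification", model=model_id, aggregation_strategy="simple")


def unauthorized_spans(text, predictions, threshold=0.5):
    """Return substrings of `text` labelled UNAUTHORIZED with score >= threshold."""
    return [
        text[p["start"]:p["end"]]
        for p in predictions
        if p["entity_group"] == "UNAUTHORIZED" and p["score"] >= threshold
    ]


def should_block(text, predictions, threshold=0.5):
    """Block the tool call if any span is flagged as UNAUTHORIZED."""
    return bool(unauthorized_spans(text, predictions, threshold))


# Hypothetical tool call and hand-written predictions, for illustration only.
tool_call = json.dumps({"name": "send_email", "arguments": {"to": "attacker@example.com"}})
start = tool_call.index("attacker@example.com")
fake_predictions = [
    {"entity_group": "AUTHORIZED", "score": 0.99, "start": 0, "end": start},
    {
        "entity_group": "UNAUTHORIZED",
        "score": 0.97,
        "start": start,
        "end": start + len("attacker@example.com"),
    },
]
print(unauthorized_spans(tool_call, fake_predictions))  # ['attacker@example.com']
print(should_block(tool_call, fake_predictions))  # True
```

With a real model, `predictions = load_verifier(model_id)(tool_call)` would take the place of the hand-written list.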

🚨 Attack Categories Covered

| Category | Source | Description |
|----------|--------|-------------|
| Delimiter Injection | LLMail | `<<end_context>>`, `>>}}\]\])` |
| Word Obfuscation | LLMail | Inserting noise words between tokens |
| Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` |
| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
| XML Tag Injection | WildJailbreak | `<execute_action>`, `<tool_call>` |
| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
| Intent Mismatch | Synthetic | User asks X, tool does Y |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
| MCP Shadowing | Synthetic | Fake authorization context |

🔗 Integration with FunctionCallSentinel

This model is Stage 2 of a two-stage defense pipeline:

┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │
│                 │     │      (Stage 1)       │     │                 │
└─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                              │
                               ┌──────────────────────────────▼──────────────────────────┐
                               │           ToolCallVerifier (This Model)                 │
                               │   Token-level verification before tool execution        │
                               └─────────────────────────────────────────────────────────┘
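
The flow in the diagram can be sketched as a single gating function. This is a minimal illustration, not the project's actual integration code: the function name, the callables, and the return strings are hypothetical, and passing the user prompt to Stage 2 alongside the tool call is an assumption.

```python
from typing import Callable


def run_agent_turn(
    prompt: str,
    stage1_is_safe: Callable[[str], bool],
    propose_tool_call: Callable[[str], str],
    stage2_is_authorized: Callable[[str, str], bool],
    execute_tool: Callable[[str], str],
) -> str:
    """Run one agent turn, checking before the LLM and again before tool execution."""
    # Stage 1: screen the user prompt before it reaches the LLM.
    if not stage1_is_safe(prompt):
        return "blocked: stage 1"
    # The LLM (with tools) turns the prompt into a tool call, serialized as JSON text.
    tool_call_json = propose_tool_call(prompt)
    # Stage 2: verify the tool call before anything is executed.
    if not stage2_is_authorized(prompt, tool_call_json):
        return "blocked: stage 2"
    return execute_tool(tool_call_json)


# Stub callables stand in for the two models, the LLM, and the tool runtime.
result = run_agent_turn(
    "Email the report to Alice",
    stage1_is_safe=lambda prompt: "ignore previous instructions" not in prompt.lower(),
    propose_tool_call=lambda prompt: '{"name": "send_email", "arguments": {"to": "alice@example.com"}}',
    stage2_is_authorized=lambda prompt, call: "alice" in call,
    execute_tool=lambda call: "executed",
)
print(result)  # executed
```

Injecting the checks as callables keeps the gate independent of how each stage is served, so either model could be swapped for a remote endpoint without changing the control flow.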

| Scenario | Recommendation |
|----------|----------------|
| General chatbot | Stage 1 only |
| Tool-calling agent (low risk) | Stage 1 only |
| Tool-calling agent (high risk) | Both stages |
| Email/file system access | Both stages |
| Financial transactions | Both stages |
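
One way to encode these recommendations in a deployment is a small lookup table. The scenario keys, stage names, and the fail-closed default for unknown scenarios are assumptions made for this sketch, not part of the model.

```python
# Hypothetical scenario keys mapped to the stages recommended for each scenario.
STAGES_BY_SCENARIO = {
    "general_chatbot": ("stage1",),
    "tool_agent_low_risk": ("stage1",),
    "tool_agent_high_risk": ("stage1", "stage2"),
    "email_or_file_access": ("stage1", "stage2"),
    "financial_transactions": ("stage1", "stage2"),
}


def stages_for(scenario):
    """Return the stages to run; unknown scenarios default to both (fail closed)."""
    return STAGES_BY_SCENARIO.get(scenario, ("stage1", "stage2"))


print(stages_for("general_chatbot"))
print(stages_for("financial_transactions"))
```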

🎯 Intended Use

Primary Use Cases

- LLM Agent Security: Verify tool calls before execution
- Prompt Injection Defense: Detect unauthorized actions from injected prompts
- API Gateway Protection: Filter malicious tool calls at infrastructure level

Out of Scope

- General text classification
- Non-tool-calling scenarios
- Languages other than English

📜 License

Apache 2.0