Self-Guarding Security for AI Agents

LLM-native skill package that teaches agents to protect themselves. Three-stage protection — preflight, runtime, output — driven by the agent's own intelligence.

33+ Detection Rules · 7 Risk Predictors · 0 External Deps

How SentrySkills Works

Five skills fire in sequence. Each stage must pass before the next begins; a block at any stage terminates the pipeline immediately.

Entry Point
using-sentryskills
Auto-triggered via AGENTS.md before every response. Prepares the input context — user prompt, planned actions, candidate response — then hands off to the orchestrator.
Orchestration
sentryskills-orchestrator
Coordinates the full pipeline. Aggregates results from all stages, applies the active policy profile, and produces the final decision with a shared trace ID.
Decisions: allow · downgrade · block
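The actual pipeline is defined by the package's SKILL.md files, but the aggregation logic described above (shared trace ID, escalating decisions, hard stop on block) can be sketched in plain Python. Function and field names here are illustrative assumptions, not the package's real API:

```python
# Sketch of the orchestrator's aggregation step (hypothetical API;
# the real pipeline is driven by SKILL.md / AGENTS.md).
import uuid

DECISIONS = ("allow", "downgrade", "block")  # ordered by severity

def run_pipeline(context, stages):
    """Run each stage in order; any 'block' terminates immediately."""
    trace_id = uuid.uuid4().hex  # shared by every stage's result
    results, final = [], "allow"
    for stage in stages:
        verdict = stage(context)
        results.append({"trace_id": trace_id,
                        "stage": stage.__name__,
                        "decision": verdict})
        # Keep the most severe decision seen so far.
        if DECISIONS.index(verdict) > DECISIONS.index(final):
            final = verdict
        if verdict == "block":
            break  # hard stop: later stages never run
    return {"trace_id": trace_id, "decision": final, "stages": results}
```

A stage here is any callable that inspects the context and returns one of the three decisions; the shared `trace_id` is what makes later log correlation possible.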
Preflight Check
sentryskills-preflight
Runs before any action. Analyzes user intent and planned operations against 33+ detection rules. Detects prompt injection, malicious commands, and suspicious patterns.
pre-execution · intent analysis · 33+ rules
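The shipped rule set covers 33+ rules; as a minimal sketch of the pre-execution idea, a rule can be a named pattern matched against the user prompt. These three rules and their patterns are simplified examples, not the package's actual rules:

```python
# Illustrative preflight rules (hypothetical; the real set is larger
# and covers more attack categories).
import re

PREFLIGHT_RULES = [
    ("prompt_injection",   re.compile(r"ignore (all )?previous instructions", re.I)),
    ("malicious_command",  re.compile(r"\brm\s+-rf\s+/", re.I)),
    ("system_prompt_leak", re.compile(r"reveal your system prompt", re.I)),
]

def preflight(user_prompt: str):
    """Return the IDs of every rule the prompt triggers."""
    return [rule_id for rule_id, pat in PREFLIGHT_RULES
            if pat.search(user_prompt)]
```

The key property is that this runs on intent, before any tool call executes; a non-empty result feeds the orchestrator's allow/downgrade/block decision.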
Runtime Monitor
sentryskills-runtime
Watches during execution. Tracks tool calls, file operations, network requests. Flags behavioral anomalies and unexpected scope expansion in real time.
behavioral analysis · scope detection
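One concrete form of "unexpected scope expansion" is a file operation that escapes the workspace the task was granted. A minimal sketch, assuming a hypothetical tool-call record with a `path` field (not the package's real schema):

```python
# Sketch of scope-expansion detection: flag file operations whose
# resolved path escapes the allowed workspace (hypothetical schema).
from pathlib import Path

def out_of_scope(tool_calls, workspace):
    """Return every file operation that escapes the workspace."""
    root = Path(workspace).resolve()
    flagged = []
    for call in tool_calls:
        target = Path(call["path"]).resolve()
        # Path.is_relative_to requires Python 3.9+.
        if not target.is_relative_to(root):
            flagged.append(call)
    return flagged
```

Resolving paths first is what catches `../` traversal and symlink tricks that a plain string-prefix check would miss.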
Output Guard
sentryskills-output
Scans the response before it's sent. Automatically redacts secrets, credentials, API keys, and private data. Prevents sensitive information from leaking through the agent's output.
redaction · leak prevention
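Redaction of this kind is typically pattern-based. The patterns below are simplified stand-ins for well-known credential formats, not the package's actual rule set:

```python
# Illustrative output redaction (simplified patterns; the shipped
# rules cover more credential and secret formats).
import re

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key ID
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                 # GitHub personal token
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # SSH/PEM key header
]

def redact(text: str) -> str:
    """Replace anything matching a secret pattern with [REDACTED]."""
    for pat in SECRET_PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text
```

Because this stage runs on the candidate response, a secret that slipped through earlier stages still never reaches the user.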

Built for Production

🧠
LLM-Native Design
Security is encoded as agent behavior, not bolted on externally. The agent understands and enforces its own safety rules through SKILL.md and AGENTS.md.
📦
Zero External Dependencies
100% Python standard library. No pip install, no Docker, no Redis. Copy the files and it runs — perfect for air-gapped or restricted environments.
🔮
Predictive Analysis
7 risk predictors warn about threats before they materialize — resource exhaustion, privilege escalation, data exfiltration paths, multi-turn grooming, and more.
🗂️
Full Traceability
Every decision is logged to ./sentry_skill_log/ as JSONL. Each stage shares a trace ID, making post-incident analysis and compliance auditing straightforward.
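A JSONL log with a shared trace ID is straightforward to produce with the standard library alone. The file name and field names below are assumptions; inspect ./sentry_skill_log/ for the actual schema:

```python
# Sketch of the JSONL decision log (illustrative field names).
import json
import time
import uuid
from pathlib import Path

LOG_DIR = Path("./sentry_skill_log")

def log_decision(trace_id, stage, decision, **details):
    """Append one decision record as a single JSON line."""
    LOG_DIR.mkdir(exist_ok=True)
    record = {"ts": time.time(), "trace_id": trace_id,
              "stage": stage, "decision": decision, **details}
    with open(LOG_DIR / "decisions.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```

One JSON object per line means the log can be tailed, grepped by trace ID, or loaded line-by-line for audits without parsing the whole file.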
🎛️
Flexible Policies
Three built-in profiles: Balanced for production, Strict for high-sensitivity data, Permissive for development. Custom rules via JSON, hot-reloadable.
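Hot-reloading a JSON policy needs no framework: re-read the file whenever its modification time changes. The file name and policy schema here are assumptions for illustration:

```python
# Sketch of a hot-reloadable JSON policy: reload on mtime change
# (file name and schema are illustrative assumptions).
import json
import os

class PolicyLoader:
    def __init__(self, path="policy.json"):
        self.path = path
        self._mtime = None
        self._policy = None

    def get(self):
        """Return the current policy, reloading if the file changed."""
        mtime = os.stat(self.path).st_mtime
        if mtime != self._mtime:
            with open(self.path) as f:
                self._policy = json.load(f)
            self._mtime = mtime
        return self._policy
```

Callers fetch the policy through `get()` on every check, so edits to the JSON file take effect on the next decision without a restart.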

What Gets Detected

33+ rules across four domains, plus predictive analysis for threats that haven't happened yet.

AI / LLM Attacks
  • Prompt injection
  • Jailbreak attempts
  • System prompt leakage
  • Refusal suppression
  • Multi-turn grooming
Web Security
  • SQL injection
  • XSS attacks
  • Command injection
  • SSTI
  • Path traversal
Data Leaks
  • SSH private keys
  • AWS credentials
  • GitHub tokens
  • API keys
  • Database connection strings
Code Security
  • Hardcoded secrets
  • Weak cryptography
  • Unsafe eval / exec
  • Debug statements
Predictive (7 predictors)
  • Resource exhaustion
  • Privilege escalation
  • Data exfiltration paths
  • Scope creep
  • Dependency confusion
  • Ambiguous destructive intent
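Since custom rules are supplied as JSON (see Flexible Policies above), each detection category presumably maps to records of this general shape. The exact schema ships with the package; the field names and matcher below are a guess for illustration:

```python
# Hypothetical shape for one custom detection rule, plus a tiny
# matcher; the package defines the real JSON schema.
import re

rule = {
    "id": "custom-001",
    "category": "data_leak",
    "pattern": r"postgres://\S+:\S+@\S+",  # DB connection string with creds
    "severity": "high",
    "action": "block",
}

def matches(rule, text):
    """True if the rule's pattern appears anywhere in the text."""
    return re.search(rule["pattern"], text) is not None
```

Keeping rules as data rather than code is what makes the hot-reload and per-profile tuning described earlier possible.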

Get Started in Minutes

Option A (OpenClaw, via the ClawHub CLI):

```bash
# 1. Install ClawHub CLI
npm i -g clawhub

# 2. Install SentrySkills
clawhub install sentryskills

# 3. Enable auto-protection
cat > ~/.codex/AGENTS.md << 'EOF'
# Security: SentrySkills runs automatically
Before EVERY response, run:
python ./skills/sentry-skills/shared/scripts/self_guard_runtime_hook_template.py \
  input.json --policy-profile balanced --out result.json
EOF

# 4. Restart OpenClaw — protected!
```
Option B (Codex, manual install):

```bash
# 1. Clone and symlink
git clone https://github.com/AI45Lab/SentrySkills.git \
  ~/.codex/sentryskills
mkdir -p ~/.agents/skills
ln -s ~/.codex/sentryskills ~/.agents/skills/sentryskills

# 2. Enable auto-protection
cat > ~/.codex/AGENTS.md << 'EOF'
# Security: SentrySkills runs automatically
Before EVERY response, run:
python ~/.codex/sentryskills/shared/scripts/self_guard_runtime_hook_template.py \
  input.json --policy-profile balanced --out result.json
EOF

# 3. Restart Codex — protected!
```
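The hook script above reads an input.json describing the turn. Its exact schema is defined by the script itself; based on the entry-point description earlier (user prompt, planned actions, candidate response), a plausible shape might look like this. Every field name is an assumption:

```json
{
  "user_prompt": "Summarize ./reports/q3.txt",
  "planned_actions": [
    {"tool": "read_file", "path": "./reports/q3.txt"}
  ],
  "candidate_response": "Q3 revenue grew 12% quarter over quarter.",
  "policy_profile": "balanced"
}
```

The `--out result.json` flag then receives the pipeline's decision, which the agent consults before responding.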

Policy Profiles

Balanced (balanced)
Standard security checks with a low false-positive rate. Blocks obvious threats, downgrades suspicious ones.
→ Production environments

Strict (strict)
Maximum security: any suspicious activity is blocked unless explicitly authorized. Higher false-positive rate.
→ Finance · Healthcare · High-sensitivity data

Permissive (permissive)
Minimal interference, warnings only. Useful for observing agent behavior without blocking anything.
→ Local dev · Debugging · Testing