Skip to content
.ca
Work being done in the backend.
4 minhigh

Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Security Controls

Unit 42 researchers developed AdvJudge-Zero, an automated fuzzer that identifies stealthy prompt injection sequences to bypass AI judges. By using low-perplexity formatting tokens, attackers can manipulate LLM-based security gatekeepers into approving harmful content or corrupting training data without triggering traditional detection mechanisms.

Conf:highAnalyzed:2026-03-10reports

Authors: Unit 42

Source:Palo Alto Networks

Key Takeaways

  • AI judges (LLMs acting as security gatekeepers) are highly vulnerable to stealthy prompt injection attacks that bypass safety policies.
  • The AdvJudge-Zero fuzzer automates the discovery of low-perplexity trigger sequences using benign formatting symbols rather than detectable gibberish.
  • Attacks exploit the model's predictive nature and logit-gap to force an 'allow' decision regardless of the actual input content.
  • These logic bypasses can lead to false approvals of toxic content or corrupt training data during Reinforcement Learning from Human Feedback (RLHF).
  • Adversarial training using fuzzer-generated examples can reduce the attack success rate from 99% to near zero.

Affected Systems

  • LLM-as-a-judge systems
  • Open-weight enterprise models
  • Specialized reward models
  • High-parameter models (>70 billion parameters)

Attack Chain

The attack begins with token discovery, where an automated fuzzer probes the target LLM to identify expected inputs and stealthy control tokens based on next-token distribution. Next, iterative refinement and logit-gap analysis are used to measure decision shifts and isolate tokens that minimize the probability of a block decision. Finally, the attacker injects these decisive control elements (such as benign formatting symbols or context shifts) into malicious prompts, forcing the AI judge to authorize policy violations or assign high scores to incorrect data during training.

Detection Availability

  • YARA Rules: No
  • Sigma Rules: No
  • Snort/Suricata Rules: No
  • KQL Queries: No
  • Splunk SPL Queries: No
  • EQL Queries: No
  • Other Detection Logic: No

The article does not provide traditional detection rules but suggests adversarial training and perplexity analysis to identify and mitigate stealthy prompt injections.

Detection Engineering Assessment

EDR Visibility: None — EDR tools monitor endpoint OS activity and process execution, not LLM inference logic or prompt inputs. Network Visibility: Low — While network tools like WAFs might intercept the prompts, the malicious payloads use benign formatting tokens (low perplexity) that blend in with normal traffic, making network-based detection extremely difficult. Detection Difficulty: Very Hard — The attacks use low-perplexity, benign-looking formatting characters that mimic legitimate user input, effectively evading standard WAFs and heuristic filters.

Required Log Sources

  • LLM Application Logs
  • Prompt/Response Logs
  • AI Gateway Logs

Hunting Hypotheses

HypothesisTelemetryATT&CK StageFP Risk
Attackers are injecting specific formatting sequences (e.g., '\n\nAssistant:', '###') at the end of prompts to manipulate LLM-as-a-judge decision boundaries.LLM Application Logs / Prompt LogsDefense EvasionHigh

Control Gaps

  • Web Application Firewalls (WAF)
  • Traditional Data Loss Prevention (DLP)
  • Standard LLM Safety Filters

Key Behavioral Indicators

  • Anomalous placement of structural tokens (e.g., 'Assistant:') in user-controlled input fields
  • Sudden shifts in LLM logit-gap confidence scores during policy evaluation

False Positive Assessment

  • High

Recommendations

Immediate Mitigation

  • Implement adversarial training by fuzzing internal AI judges and retraining them on discovered bypass sequences.

Infrastructure Hardening

  • Deploy specialized AI security posture management (AI-SPM) tools to monitor AI agent access and data exposure.
  • Implement robust input validation and sanitization for LLM prompts, specifically looking for unexpected role indicators or structural tokens.

User Protection

  • N/A

Security Awareness

  • Educate AI development and security teams on the risks of low-perplexity prompt injection and the limitations of LLM-as-a-judge systems.

MITRE ATT&CK Mapping

  • T1562.001 - Impair Defenses: Disable or Modify Tools
  • T1565 - Data Manipulation

Additional IOCs

  • Other:
    • 1., - - List markers used as stealth control tokens.
    • User:, Assistant - Structural role indicators used to manipulate model attention.
    • The solution process is…, Step 1 - Context shift phrases used to bypass AI judge logic.