2026-05-13 | 4 min | high

Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Security Controls

Unit 42 researchers developed AdvJudge-Zero, an automated fuzzer that identifies stealthy prompt injection sequences to bypass AI judges. By using low-perplexity formatting tokens, attackers can manipulate LLM-based security gatekeepers into approving harmful content or corrupting training data without triggering traditional detection mechanisms.

Conf: high | Analyzed: 2026-03-10 | reports

Authors: Unit 42

LLM

Key Takeaways

AI judges (LLMs acting as security gatekeepers) are highly vulnerable to stealthy prompt injection attacks that bypass safety policies.
The AdvJudge-Zero fuzzer automates the discovery of low-perplexity trigger sequences using benign formatting symbols rather than detectable gibberish.
Attacks exploit the model's next-token prediction and the logit gap between its 'allow' and 'block' outputs to force an 'allow' decision regardless of the actual input content.
These logic bypasses can lead to false approvals of toxic content or corrupt training data during Reinforcement Learning from Human Feedback (RLHF).
Adversarial training using fuzzer-generated examples can reduce the attack success rate from 99% to near zero.

Affected Systems

LLM-as-a-judge systems
Open-weight enterprise models
Specialized reward models
High-parameter models (>70 billion parameters)

Attack Chain

The attack begins with token discovery, where an automated fuzzer probes the target LLM to identify expected inputs and stealthy control tokens based on next-token distribution. Next, iterative refinement and logit-gap analysis are used to measure decision shifts and isolate tokens that minimize the probability of a block decision. Finally, the attacker injects these decisive control elements (such as benign formatting symbols or context shifts) into malicious prompts, forcing the AI judge to authorize policy violations or assign high scores to incorrect data during training.
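
The refinement step can be sketched as a greedy search over candidate tokens. This is a minimal illustration under stated assumptions, not the AdvJudge-Zero implementation: toy_judge_logits is a hypothetical stand-in for a real judge, and the candidate tokens and their weights are invented so the example runs.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical weights: how much each token nudges the toy judge toward 'allow'.
# Invented for this demo; a real fuzzer would read logits from the target model.
TOY_WEIGHTS: Dict[str, float] = {
    "\n\nAssistant:": 3.0,
    "###": 1.5,
    "1.": 0.5,
    "zzqx": 0.1,
}

Judge = Callable[[str], Tuple[float, float]]


def toy_judge_logits(prompt: str) -> Tuple[float, float]:
    """Stand-in judge returning (allow_logit, block_logit) for a prompt."""
    allow = -3.0 + sum(w for tok, w in TOY_WEIGHTS.items() if tok in prompt)
    block = 1.0
    return allow, block


def logit_gap(judge: Judge, prompt: str) -> float:
    """Positive gap means the judge leans 'allow'; negative means 'block'."""
    allow, block = judge(prompt)
    return allow - block


def greedy_suffix_search(
    judge: Judge, prompt: str, candidates: List[str], max_tokens: int = 3
) -> Tuple[List[str], float]:
    """Greedily append the candidate that raises the gap most, until it flips."""
    suffix: List[str] = []
    best_gap = logit_gap(judge, prompt)
    for _ in range(max_tokens):
        scored = [
            (logit_gap(judge, prompt + "".join(suffix + [c])), c)
            for c in candidates
            if c not in suffix
        ]
        if not scored:
            break
        gap, token = max(scored)
        if gap <= best_gap:
            break  # no remaining candidate improves the gap
        suffix.append(token)
        best_gap = gap
        if best_gap > 0:
            break  # decision flipped to 'allow'
    return suffix, best_gap
```

With these invented weights, the search picks the role indicator first and a formatting marker second, which mirrors the report's point that benign-looking structural tokens move the decision more than gibberish does.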

Detection Availability

YARA Rules: No
Sigma Rules: No
Snort/Suricata Rules: No
KQL Queries: No
Splunk SPL Queries: No
EQL Queries: No
Other Detection Logic: No

The article does not provide traditional detection rules but suggests adversarial training and perplexity analysis to identify and mitigate stealthy prompt injections.
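
For reference, perplexity analysis can be sketched as follows. The function assumes per-token log-probabilities (natural log) are already available from some scoring model; how they are obtained, and any threshold applied to the result, are assumptions outside what the article specifies.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def is_high_perplexity(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Flag a sequence whose perplexity exceeds a caller-chosen threshold."""
    return perplexity(token_logprobs) > threshold
```

A filter of this kind catches gibberish suffixes, whose tokens are individually unlikely. It is much weaker against the formatting tokens described here, because those are exactly the tokens a model expects to see and therefore score low.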

Detection Engineering Assessment

EDR Visibility: None. EDR tools monitor endpoint OS activity and process execution, not LLM inference logic or prompt inputs.
Network Visibility: Low. Network tools such as WAFs might intercept the prompts, but the malicious payloads use benign, low-perplexity formatting tokens that blend in with normal traffic, making network-based detection extremely difficult.
Detection Difficulty: Very Hard. The attacks use low-perplexity, benign-looking formatting characters that mimic legitimate user input, effectively evading standard WAFs and heuristic filters.

Required Log Sources

LLM Application Logs
Prompt/Response Logs
AI Gateway Logs

Hunting Hypotheses


Hypothesis	Telemetry	ATT&CK Stage	FP Risk
Attackers are injecting specific formatting sequences (e.g., '\n\nAssistant:', '###') at the end of prompts to manipulate LLM-as-a-judge decision boundaries.	LLM Application Logs / Prompt Logs	Defense Evasion	High
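
The hypothesis above could be tested against prompt logs with a simple pattern match. This is a sketch only: the record schema (id and prompt keys) is hypothetical, and the pattern covers just the trailing role indicators, '###' markers, and bare list markers named in this report.

```python
import re
from typing import Dict, Iterable, List

# Matches a prompt that ENDS with a role indicator, a '###' run,
# or a bare list marker on its own line.
TRAILING_CONTROL = re.compile(
    r"(?:\n\s*(?:Assistant|User)\s*:|#{3,}|\n\s*(?:\d+\.|-))\s*$"
)


def flag_prompts(records: Iterable[Dict[str, str]]) -> List[str]:
    """Return ids of log records whose prompt ends in a structural sequence."""
    return [r["id"] for r in records if TRAILING_CONTROL.search(r["prompt"])]
```

Expect many hits on legitimate input (markdown headings, numbered lists, pasted chat transcripts), which is consistent with the High false positive risk noted in the table.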

Control Gaps

Web Application Firewalls (WAF)
Traditional Data Loss Prevention (DLP)
Standard LLM Safety Filters

Key Behavioral Indicators

Anomalous placement of structural tokens (e.g., 'Assistant:') in user-controlled input fields
Sudden shifts in LLM logit-gap confidence scores during policy evaluation
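
One way to surface the second indicator is a z-score check of new logit-gap values against a baseline. This is a sketch under assumptions: it presumes the judge exposes a per-request gap (allow logit minus block logit) for logging, and the threshold of 3.0 is an arbitrary illustrative choice.

```python
import statistics
from typing import List, Sequence


def gap_shift_alerts(
    baseline_gaps: Sequence[float],
    new_gaps: Sequence[float],
    z_threshold: float = 3.0,
) -> List[int]:
    """Return indexes of new gaps that deviate sharply from the baseline."""
    mean = statistics.fmean(baseline_gaps)
    stdev = statistics.stdev(baseline_gaps)  # needs at least two baseline points
    if stdev == 0:
        # Flat baseline: any deviation at all is treated as a shift.
        return [i for i, g in enumerate(new_gaps) if g != mean]
    return [
        i for i, g in enumerate(new_gaps) if abs(g - mean) / stdev > z_threshold
    ]
```

The baseline should come from comparable traffic, for example prior evaluations under the same policy, since gaps are not comparable across different judges or prompts templates.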

False Positive Assessment

High

Recommendations

Immediate Mitigation

Implement adversarial training by fuzzing internal AI judges and retraining them on discovered bypass sequences.
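
As a rough sketch of the data-preparation half of this recommendation, bypass sequences found by a fuzzer can be appended to known policy-violating inputs and labeled with the decision the judge should have made. The record format (text and label keys) and the label value are hypothetical; the retraining step itself depends on the judge's training pipeline and is not shown.

```python
from itertools import product
from typing import Dict, Iterable, List


def build_adversarial_examples(
    violating_inputs: Iterable[str],
    bypass_suffixes: Iterable[str],
    label: str = "block",
) -> List[Dict[str, str]]:
    """Pair every violating input with every discovered suffix, labeled 'block'."""
    return [
        {"text": text + suffix, "label": label}
        for text, suffix in product(violating_inputs, bypass_suffixes)
    ]
```

Mixing these examples with ordinary benign inputs that contain the same formatting tokens would help keep the retrained judge from simply blocking anything that ends in a role indicator.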

Infrastructure Hardening

Deploy specialized AI security posture management (AI-SPM) tools to monitor AI agent access and data exposure.
Implement robust input validation and sanitization for LLM prompts, specifically looking for unexpected role indicators or structural tokens.
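
A minimal sketch of the sanitization idea, assuming user-controlled text is processed before it is placed into the judge's prompt. It neutralizes only the 'User:' and 'Assistant:' indicators named in this report when they appear at the start of a line; the bracketed replacement format is an arbitrary choice, and real deployments would need the markers their own prompt template uses.

```python
import re
from typing import Tuple

# Role indicator at the start of a line, case-insensitive.
ROLE_MARKER = re.compile(r"(?im)^[ \t]*(assistant|user)[ \t]*:")


def neutralize_role_markers(user_text: str) -> Tuple[str, int]:
    """Rewrite line-leading role indicators and report how many were found."""
    return ROLE_MARKER.subn(lambda m: "[" + m.group(1) + "]", user_text)
```

The returned count can also be logged as a signal, since a nonzero value means structural tokens appeared in a user-controlled field.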

User Protection

N/A

Security Awareness

Educate AI development and security teams on the risks of low-perplexity prompt injection and the limitations of LLM-as-a-judge systems.

MITRE ATT&CK Mapping

T1562.001 - Impair Defenses: Disable or Modify Tools
T1565 - Data Manipulation

Additional IOCs

Other:
- '1.', '-': List markers used as stealth control tokens.
- 'User:', 'Assistant': Structural role indicators used to manipulate model attention.
- 'The solution process is…', 'Step 1': Context shift phrases used to bypass AI judge logic.
