2026-05-134 minmedium

Open, Closed and Broken: Prompt Fuzzing Finds LLMs Still Fragile Across Open and Closed Models

Unit 42 researchers demonstrated that both open and closed-source LLMs remain vulnerable to prompt jailbreaking at scale using a genetic algorithm-based fuzzing technique. By systematically generating meaning-preserving variants of disallowed requests, researchers successfully bypassed content filters and model guardrails, highlighting the fragility of current AI safety mechanisms under automated adversarial variation.

Conf:highAnalyzed:2026-03-17reports

Authors: Unit 42

Key Takeaways

Researchers developed a genetic algorithm-inspired prompt fuzzing method to bypass LLM guardrails by generating meaning-preserving variants of disallowed requests.
Evasion success rates vary significantly based on the specific keyword and model, with some combinations (e.g., 'torpedo' on a closed-source model) showing up to a 90% failure rate.
An open-source content filter model proved highly vulnerable, incorrectly classifying 97-99% of fuzzed malicious prompts as benign.
Model licensing (open-source vs. closed-source) is not a reliable indicator of guardrail strength; robustness depends on specific tuning and the safety stack.
Out-of-scope requests (goal hijacking) in narrow-scope GenAI applications often present a more significant real-world risk than purely harmful requests.

Affected Systems

LLM-powered generative AI (GenAI) applications
Chatbots and chat-shaped workflows
Open-weight LLMs
Closed-source proprietary LLMs
LLM content filters and safety middleware

Attack Chain

An attacker selects a disallowed request and identifies key components: keywords, relative words (verbs), and filler phrases. Using a genetic algorithm approach, the attacker automates the mutation of the prompt by prepending/appending phrases, repeating keywords, or removing words over multiple iterations. These fuzzed prompts are submitted to the target LLM or content filter at scale. The structural variations successfully confuse the model's guardrails while preserving the semantic meaning, resulting in the LLM bypassing its safety training and outputting the restricted information.

Detection Availability

YARA Rules: No
Sigma Rules: No
Snort/Suricata Rules: No
KQL Queries: No
Splunk SPL Queries: No
EQL Queries: No
Other Detection Logic: No

No specific detection rules are provided in the article.

Detection Engineering Assessment

EDR Visibility: None — This is an application-layer attack targeting the LLM API and backend logic, which does not interact with endpoint operating system internals monitored by EDR. Network Visibility: Medium — Network sensors, WAFs, or API gateways may capture the raw prompt payloads, but distinguishing fuzzed prompts from benign, poorly-phrased user queries requires specialized LLM-aware inspection. Detection Difficulty: Hard — Fuzzed prompts are designed to be semantically identical to the original request but structurally varied, making them difficult for standard keyword or regex-based filters to catch without high false positive rates.

Required Log Sources

Application Logs
LLM Prompt/Response Logs
API Gateway Logs

Hunting Hypotheses

copy:

Hypothesis	Telemetry	ATT&CK Stage	FP Risk
High frequency of prompt submissions from a single user or IP with high lexical similarity but minor structural variations (e.g., repeated keywords, appended filler phrases) indicates potential prompt fuzzing.	LLM Prompt/API Logs	Execution	Medium
LLM responses containing policy-violating content or restricted keywords immediately following a series of malformed or highly repetitive user prompts.	LLM Response Logs	Impact	Low

Control Gaps

Keyword-only filtering
Standalone content filter models
Canonical prompt testing (lack of adversarial variation testing)

Key Behavioral Indicators

High variance prompt attempts in a short timeframe
Repeated near-boundary failures triggering safety filters
Prompts with unnatural repetition of specific nouns or verbs mixed with disjointed filler phrases

False Positive Assessment

Medium. Legitimate users may submit poorly formatted, fragmented, or repetitive prompts, especially non-native speakers or users struggling to get a specific output from the LLM, which could mimic the structural anomalies of a fuzzed prompt.

Recommendations

Immediate Mitigation

Implement rate limiting on prompt submissions to disrupt automated fuzzing and probing attempts.
Apply post-generation validation to ensure model responses stay within allowed task boundaries.

Infrastructure Hardening

Treat user input as untrusted and isolate it from privileged instructions using structured prompting patterns.
Use robust, multi-signal content controls that combine semantic classification, policy rules, and context-aware checks rather than relying solely on keyword filtering.
Define and enforce strict application scope, specifying exactly what the system is allowed to do in terms of domains, tasks, and tool access.

User Protection

N/A

Security Awareness

Treat LLMs as non-security boundaries and assume motivated attackers will find bypasses.
Operationalize continuous adversarial fuzzing and red-teaming as part of the regression testing pipeline when models, prompts, or filters change.

MITRE ATT&CK Mapping

T1190 - Exploit Public-Facing Application
T1562 - Impair Defenses

Stay currentSubscribe via RSS