Skip to content
.ca
Work being done in the backend.
4 minmedium

Open, Closed and Broken: Prompt Fuzzing Finds LLMs Still Fragile Across Open and Closed Models

Unit 42 researchers demonstrated that both open and closed-source LLMs remain vulnerable to prompt jailbreaking at scale using a genetic algorithm-based fuzzing technique. By systematically generating meaning-preserving variants of disallowed requests, researchers successfully bypassed content filters and model guardrails, highlighting the fragility of current AI safety mechanisms under automated adversarial variation.

Conf:highAnalyzed:2026-03-17reports

Authors: Unit 42

Source:Palo Alto Networks

Key Takeaways

  • Researchers developed a genetic algorithm-inspired prompt fuzzing method to bypass LLM guardrails by generating meaning-preserving variants of disallowed requests.
  • Evasion success rates vary significantly based on the specific keyword and model, with some combinations (e.g., 'torpedo' on a closed-source model) showing up to a 90% failure rate.
  • An open-source content filter model proved highly vulnerable, incorrectly classifying 97-99% of fuzzed malicious prompts as benign.
  • Model licensing (open-source vs. closed-source) is not a reliable indicator of guardrail strength; robustness depends on specific tuning and the safety stack.
  • Out-of-scope requests (goal hijacking) in narrow-scope GenAI applications often present a more significant real-world risk than purely harmful requests.

Affected Systems

  • LLM-powered generative AI (GenAI) applications
  • Chatbots and chat-shaped workflows
  • Open-weight LLMs
  • Closed-source proprietary LLMs
  • LLM content filters and safety middleware

Attack Chain

An attacker selects a disallowed request and identifies key components: keywords, relative words (verbs), and filler phrases. Using a genetic algorithm approach, the attacker automates the mutation of the prompt by prepending/appending phrases, repeating keywords, or removing words over multiple iterations. These fuzzed prompts are submitted to the target LLM or content filter at scale. The structural variations successfully confuse the model's guardrails while preserving the semantic meaning, resulting in the LLM bypassing its safety training and outputting the restricted information.

Detection Availability

  • YARA Rules: No
  • Sigma Rules: No
  • Snort/Suricata Rules: No
  • KQL Queries: No
  • Splunk SPL Queries: No
  • EQL Queries: No
  • Other Detection Logic: No

No specific detection rules are provided in the article.

Detection Engineering Assessment

EDR Visibility: None — This is an application-layer attack targeting the LLM API and backend logic, which does not interact with endpoint operating system internals monitored by EDR. Network Visibility: Medium — Network sensors, WAFs, or API gateways may capture the raw prompt payloads, but distinguishing fuzzed prompts from benign, poorly-phrased user queries requires specialized LLM-aware inspection. Detection Difficulty: Hard — Fuzzed prompts are designed to be semantically identical to the original request but structurally varied, making them difficult for standard keyword or regex-based filters to catch without high false positive rates.

Required Log Sources

  • Application Logs
  • LLM Prompt/Response Logs
  • API Gateway Logs

Hunting Hypotheses

HypothesisTelemetryATT&CK StageFP Risk
High frequency of prompt submissions from a single user or IP with high lexical similarity but minor structural variations (e.g., repeated keywords, appended filler phrases) indicates potential prompt fuzzing.LLM Prompt/API LogsExecutionMedium
LLM responses containing policy-violating content or restricted keywords immediately following a series of malformed or highly repetitive user prompts.LLM Response LogsImpactLow

Control Gaps

  • Keyword-only filtering
  • Standalone content filter models
  • Canonical prompt testing (lack of adversarial variation testing)

Key Behavioral Indicators

  • High variance prompt attempts in a short timeframe
  • Repeated near-boundary failures triggering safety filters
  • Prompts with unnatural repetition of specific nouns or verbs mixed with disjointed filler phrases

False Positive Assessment

  • Medium. Legitimate users may submit poorly formatted, fragmented, or repetitive prompts, especially non-native speakers or users struggling to get a specific output from the LLM, which could mimic the structural anomalies of a fuzzed prompt.

Recommendations

Immediate Mitigation

  • Implement rate limiting on prompt submissions to disrupt automated fuzzing and probing attempts.
  • Apply post-generation validation to ensure model responses stay within allowed task boundaries.

Infrastructure Hardening

  • Treat user input as untrusted and isolate it from privileged instructions using structured prompting patterns.
  • Use robust, multi-signal content controls that combine semantic classification, policy rules, and context-aware checks rather than relying solely on keyword filtering.
  • Define and enforce strict application scope, specifying exactly what the system is allowed to do in terms of domains, tasks, and tool access.

User Protection

  • N/A

Security Awareness

  • Treat LLMs as non-security boundaries and assume motivated attackers will find bypasses.
  • Operationalize continuous adversarial fuzzing and red-teaming as part of the regression testing pipeline when models, prompts, or filters change.

MITRE ATT&CK Mapping

  • T1190 - Exploit Public-Facing Application
  • T1562 - Impair Defenses