From Narrative to Knowledge Graph | LLM-Driven Information Extraction in Cyber Threat Intelligence
SentinelLabs explores the use of Large Language Models (LLMs) to automate the extraction of indicators of compromise (IOCs) and contextual data from Cyber Threat Intelligence (CTI) narratives. The research demonstrates that LLMs can accurately parse unstructured reports into structured knowledge graphs, significantly reducing processing time while highlighting the importance of custom data models, prompt optimization, and evidence-grading frameworks.
Authors: SentinelLabs
Source: SentinelOne
Key Takeaways
- LLMs can significantly accelerate CTI extraction, achieving an average 18.3x speed-up over manual processing.
- A three-phase workflow (Sanitization, LLM Extraction, Knowledge Graph Assembly) effectively transforms narrative CTI into structured, machine-readable data.
- Custom data models and carefully designed LLM prompts with strict evidence-grading scales are crucial for accurate extraction and minimizing hallucinations.
- LLM ensembling is effective only when applied selectively, based on task-specific error correlation and disagreement rates.
- Allowing LLMs an explicit abstention option (e.g., 'None') when evidence is insufficient reduces false discovery rates and improves overall output reliability.
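The three-phase workflow named above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the SentinelLabs implementation: the `Triple` data model, the refanging rules in `sanitize`, and especially `extract_with_llm`, where a simple regex stands in for the LLM call so the example runs offline.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """Hypothetical data model: one (subject, relation, object) statement."""
    subject: str
    relation: str
    obj: str


def sanitize(text: str) -> str:
    """Phase 1 (assumed rules): refang common defanged indicators, collapse whitespace."""
    text = text.replace("[.]", ".").replace("hxxp", "http")
    return re.sub(r"\s+", " ", text).strip()


def extract_with_llm(text: str) -> list[Triple]:
    """Phase 2 placeholder: a real pipeline would prompt an LLM here.

    A naive domain regex stands in for the model; it will also match
    things like file names, which a real extractor must handle.
    """
    domains = re.findall(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", text.lower())
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return [Triple("report", "mentions_domain", d) for d in dict.fromkeys(domains)]


def assemble_graph(triples: list[Triple]) -> dict:
    """Phase 3: turn triples into a node set and an edge list."""
    nodes: set[str] = set()
    edges: list[tuple[str, str, str]] = []
    for t in triples:
        nodes.update((t.subject, t.obj))
        edges.append((t.subject, t.relation, t.obj))
    return {"nodes": nodes, "edges": edges}


def run_pipeline(report: str) -> dict:
    """Chain the three phases over a single narrative report."""
    return assemble_graph(extract_with_llm(sanitize(report)))


if __name__ == "__main__":
    print(run_pipeline("The actor used evil-c2[.]example  for C2."))
```

Keeping each phase as a separate function means the placeholder extractor can be swapped for a real model call without touching sanitization or graph assembly.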
Attack Chain
N/A - This article discusses a methodology for extracting Cyber Threat Intelligence using Large Language Models and does not detail a specific cyberattack chain.
Detection Availability
- YARA Rules: No
- Sigma Rules: No
- Snort/Suricata Rules: No
- KQL Queries: No
- Splunk SPL Queries: No
- EQL Queries: No
- Other Detection Logic: No
No specific detection rules are provided as this is a research article on CTI extraction methodology.
Detection Engineering Assessment
- EDR Visibility: None. The article discusses CTI processing methodologies, not endpoint behaviors.
- Network Visibility: None. The article focuses on text analysis of CTI reports, not network traffic.
- Detection Difficulty: N/A. Not applicable to this research paper.
Hunting Hypotheses
| Hypothesis | Telemetry | ATT&CK Stage | FP Risk |
|---|---|---|---|
| Threat actors may use newly established Command and Control infrastructure (T1071) that becomes known only once automated CTI extraction surfaces it; retroactively hunting for the extracted indicators in network telemetry can reveal previously undetected intrusions. | DNS logs, Proxy logs, Network flow data | Command and Control | Medium |
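A retroactive hunt of this kind can be sketched in a few lines of Python. The event fields (`ts`, `host`, `query`) and the function names are hypothetical; real DNS telemetry schemas differ by product, and a production hunt would normally run as a query inside the log platform instead of in-memory Python.

```python
def matches_indicator(queried: str, indicators: set[str]) -> bool:
    """Return True if the queried name or any parent domain is a known indicator."""
    labels = queried.lower().rstrip(".").split(".")
    # Check the full name first, then each successively shorter parent domain.
    return any(".".join(labels[i:]) in indicators for i in range(len(labels)))


def retro_hunt(dns_events: list[dict], indicators: set[str]) -> list[dict]:
    """Filter historical DNS events down to those touching extracted indicators."""
    normalized = {i.lower().rstrip(".") for i in indicators}
    return [e for e in dns_events if matches_indicator(e["query"], normalized)]


if __name__ == "__main__":
    # Illustrative data only; the domains use reserved example names.
    events = [
        {"ts": "2024-01-05T10:00:00Z", "host": "ws-01", "query": "cdn.evil-c2.example."},
        {"ts": "2024-01-05T10:01:00Z", "host": "ws-02", "query": "www.example.org"},
    ]
    for hit in retro_hunt(events, {"evil-c2.example"}):
        print(hit["ts"], hit["host"], hit["query"])
```

Matching parent domains as well as exact names catches subdomains of extracted infrastructure, at the cost of more false positives when an indicator is a shared hosting or CDN domain.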
False Positive Assessment
- Low
Recommendations
Immediate Mitigation
- N/A
Infrastructure Hardening
- Integrate automated CTI extraction pipelines with Threat Intelligence Platforms (TIPs) to rapidly deploy blocking rules for newly identified adversary infrastructure.
User Protection
- N/A
Security Awareness
- Consider integrating LLM-driven information extraction tools into CTI workflows to accelerate the processing of threat intelligence reports.
- Establish clear evidence-grading scales and custom data models when using LLMs for automated IOC extraction to minimize false positives and hallucinations.
- Train CTI analysts on the limitations and non-deterministic nature of LLMs, emphasizing the need for human-in-the-loop verification for ambiguous threat intelligence.
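The evidence-grading, abstention, and human-in-the-loop recommendations above could be wired together roughly as follows. This is a sketch under stated assumptions: the three-level `Evidence` scale and the `accept` / `review` / `drop` routing are hypothetical and do not reproduce the scale used in the SentinelLabs research. The `disagreement_rate` helper shows one simple way to measure how often two models differ on the same items, which is one of the signals the article names for deciding whether ensembling is worthwhile.

```python
from enum import IntEnum
from typing import Optional


class Evidence(IntEnum):
    """Hypothetical evidence-grading scale attached to each extracted value."""
    NONE = 0      # model abstained: the report does not support an answer
    IMPLIED = 1   # inferred from context, not stated outright
    EXPLICIT = 2  # stated directly in the report text


def triage(value: Optional[str], grade: Evidence) -> str:
    """Route one extraction: drop abstentions, review weak evidence, accept the rest."""
    if value is None or grade == Evidence.NONE:
        return "drop"
    if grade == Evidence.IMPLIED:
        return "review"  # queue for analyst verification
    return "accept"


def disagreement_rate(model_a: list, model_b: list) -> float:
    """Fraction of aligned items on which two models gave different answers."""
    if len(model_a) != len(model_b):
        raise ValueError("outputs must cover the same items in the same order")
    if not model_a:
        return 0.0
    return sum(a != b for a, b in zip(model_a, model_b)) / len(model_a)


if __name__ == "__main__":
    print(triage(None, Evidence.NONE))
    print(triage("ExampleActor", Evidence.IMPLIED))
    print(disagreement_rate(["a", None, "b", "c"], ["a", "x", "b", "d"]))
```

Treating `None` as a first-class answer lets downstream code distinguish "the model found nothing" from "the model guessed", which is the property the abstention takeaway relies on.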