Alleged Inclusion of 12,000 Live API Keys in LLM Training Data Reportedly Poses Security Risks

Feb 28, 2025 | 1 report | Severity: Substantial | Tool | High confidence

A dataset used to train large language models was found to contain nearly 12,000 live secrets including API keys and passwords that allow successful authentication, potentially exposing organizations to security vulnerabilities.

Truffle Security analyzed a December 2024 archive from Common Crawl, an open repository of web crawl data containing over 250 billion pages spanning 18 years. The 400 TB dataset included 90,000 WARC files and data from 47.5 million hosts across 38.3 million registered domains. The analysis discovered nearly 12,000 live secrets across 219 different secret types, including Amazon Web Services root keys, Slack webhooks, and Mailchimp API keys. These credentials successfully authenticate with their respective services, meaning they could be exploited by malicious actors.

The report notes that LLMs trained on this data cannot distinguish between valid and invalid secrets, potentially reinforcing insecure coding practices when the models suggest code examples to users. The incident highlights broader security vulnerabilities in AI training datasets, with additional research showing that even fine-tuning models on insecure code examples can lead to emergent misalignment behaviors beyond just coding tasks. The discovery follows related findings about Microsoft Copilot exposing private repository data and various AI systems being vulnerable to jailbreaking attacks that bypass safety controls.
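As a rough illustration of the first step in a scan like the one described above, the sketch below searches page text for strings that look like three of the secret types named in the summary. It is not Truffle Security's method: the regular expressions, the `CANDIDATE_PATTERNS` table, and the `find_candidate_secrets` helper are all assumptions made for illustration. It only finds candidates by format; deciding whether a secret is "live" requires attempting authentication against the service, which this sketch deliberately does not do.

```python
import re

# Illustrative patterns only. These are assumptions for this sketch,
# not the detector set used in the reported analysis.
CANDIDATE_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "slack_webhook": re.compile(
        r"https://hooks\.slack\.com/services/T[A-Z0-9]+/B[A-Z0-9]+/[A-Za-z0-9]+"
    ),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}


def find_candidate_secrets(text):
    """Return sorted, de-duplicated (secret_type, matched_string) pairs.

    A match only means the string has the right shape. It says nothing
    about whether the credential still authenticates.
    """
    hits = set()
    for secret_type, pattern in CANDIDATE_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.add((secret_type, match.group(0)))
    return sorted(hits)


if __name__ == "__main__":
    # Placeholder values in the documented example formats, not real secrets.
    page = (
        'const key = "AKIAIOSFODNN7EXAMPLE";\n'
        'post("https://hooks.slack.com/services/T00000000/B00000000/'
        'XXXXXXXXXXXXXXXXXXXXXXXX");\n'
    )
    for secret_type, value in find_candidate_secrets(page):
        print(secret_type, value)
```

Because a model trained on such pages sees only the text, a placeholder and a working key with the same shape look identical to it, which is the concern the report raises about valid and invalid secrets.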

Domain classification, causal taxonomy, severity scores, and national security assessments were LLM-classified and may contain errors.

Risk Domain

2 Privacy & Security

2.1 Compromise of privacy by leaking or correctly inferring sensitive information

AI systems that memorize and leak sensitive personal data or infer private information about individuals without their consent. Unexpected or unauthorized sharing of data and information can compromise user expectation of privacy, assist identity theft, or cause loss of confidential intellectual property.

Causal Classification

Entity

Human

Due to a decision or action made by humans

Intent

Unintentional

Due to an unexpected outcome from pursuing a goal

Timing

Pre-deployment

Occurring before the AI is deployed

Harm Severity Assessment

Highest Score: 3: Substantial (Loss of Privacy, direct)

National Security Assessment

Overall Score

Stakeholders

: Common Crawl, OpenAI, Microsoft
: Microsoft, OpenAI, Common Crawl, Microsoft Azure OpenAI Service
: AWS, Slack, Mailchimp, Microsoft, Google, Intel, Huawei, Paypal, IBM, Tencent

AI System Classification

: Code Generation
: Tool
: 3 Limited Risk
: 1

Population Impact

No population impact data reported.

External Links

View on AI Incident Database