A dataset used to train large language models was found to contain nearly 12,000 live secrets including API keys and passwords that allow successful authentication, potentially exposing organizations to security vulnerabilities.
Truffle Security analyzed a December 2024 archive from Common Crawl, an open repository of web crawl data containing over 250 billion pages spanning 18 years. The 400TB dataset included 90,000 WARC files and data from 47.5 million hosts across 38.3 million registered domains. The analysis discovered nearly 12,000 live secrets across 219 different secret types, including Amazon Web Services root keys, Slack webhooks, and Mailchimp API keys. These credentials successfully authenticate with their respective services, meaning they could be exploited by malicious actors. The report notes that LLMs trained on this data cannot distinguish between valid and invalid secrets, potentially reinforcing insecure coding practices when the models suggest code examples to users. The incident highlights broader security vulnerabilities in AI training datasets, with additional research showing that even fine-tuning models on insecure code examples can lead to emergent misalignment behaviors beyond just coding tasks. The discovery follows related findings about Microsoft Copilot exposing private repository data and various AI systems being vulnerable to jailbreaking attacks that bypass safety controls.
Domain classification, causal taxonomy, severity scores, and national security assessments were LLM-classified and may contain errors.
AI systems that memorize and leak sensitive personal data or infer private information about individuals without their consent. Unexpected or unauthorized sharing of data and information can compromise user expectation of privacy, assist identity theft, or cause loss of confidential intellectual property.
Human
Due to a decision or action made by humans
Unintentional
Due to an unexpected outcome from pursuing a goal
Pre-deployment
Occurring before the AI is deployed
No population impact data reported.