Meta deliberately downloaded and used copyrighted books from Library Genesis, a pirated library containing over 7.5 million books, to train its Llama 3 AI model despite knowing it presented 'medium-high legal risk' and that the material was pirated.
In 2023, Meta developed its flagship AI model Llama 3 and needed large amounts of high-quality text for training. When licensing legitimate content proved expensive and slow, Meta employees turned to Library Genesis (LibGen), a notorious pirated library containing over 7.5 million books and 81 million research papers. Internal communications disclosed in court filings show that Meta employees acknowledged this presented 'medium-high legal risk' and that LibGen was 'a dataset we know to be pirated.' The decision was escalated to CEO Mark Zuckerberg (referred to as 'MZ'), who approved downloading and using the dataset. Meta employees used BitTorrent to download the content, with one employee noting that 'torrenting from a corporate laptop doesn't feel right.' To mitigate legal risk, employees discussed removing copyright notices from the material and training the model to refuse requests to reproduce copyrighted content. These revelations emerged through court filings in copyright infringement lawsuits brought by authors including Sarah Silverman, Junot Díaz, and others. The Meta AI assistant built on Llama has been used by hundreds of millions of people across Facebook, WhatsApp, and Instagram. OpenAI was also revealed to have used LibGen in the past, though its download methods are unknown.
Domain classification, causal taxonomy, severity scores, and national security assessments were LLM-classified and may contain errors.
AI systems that memorize and leak sensitive personal data or infer private information about individuals without their consent. Unexpected or unauthorized sharing of data and information can compromise users' expectations of privacy, assist identity theft, or cause loss of confidential intellectual property.
Human: Due to a decision or action made by humans
Intentional: Due to an expected outcome from pursuing a goal
Pre-deployment: Occurring before the AI is deployed