Meta and OpenAI Accused of Using LibGen’s Pirated Books to Train AI Models

Feb 28, 2023 | 4 reports | Severity: Catastrophic | High confidence

Meta deliberately downloaded and used copyrighted books from Library Genesis, a pirated library containing over 7.5 million books, to train its Llama 3 AI model despite knowing it presented 'medium-high legal risk' and that the material was pirated.

In 2023, Meta developed its flagship AI model Llama 3 and needed large amounts of high-quality text for training. When licensing legitimate content proved expensive and slow, Meta employees turned to Library Genesis (LibGen), a notorious pirated library containing over 7.5 million books and 81 million research papers. Internal court documents reveal that Meta employees acknowledged this presented 'medium-high legal risk' and that LibGen was 'a dataset we know to be pirated.' The decision was escalated to CEO Mark Zuckerberg ('MZ'), who approved downloading and using the dataset. Meta employees used BitTorrent to download the content, with one employee noting that 'torrenting from a corporate laptop doesn't feel right.' To mitigate legal risks, employees discussed removing copyright notices and training the model to refuse requests to reproduce copyrighted content. These revelations emerged through court filings in copyright infringement lawsuits brought by authors including Sarah Silverman, Junot Díaz, and others. The Meta AI assistant built on Llama has been used by hundreds of millions of people across Facebook, WhatsApp, and Instagram. OpenAI was also revealed to have used LibGen in the past, though its download methods are unknown.

Domain classification, causal taxonomy, severity scores, and national security assessments were LLM-classified and may contain errors.

Risk Domain

2 Privacy & Security

2.1 Compromise of privacy by leaking or correctly inferring sensitive information

AI systems that memorize and leak sensitive personal data or infer private information about individuals without their consent. Unexpected or unauthorized sharing of data and information can compromise user expectation of privacy, assist identity theft, or cause loss of confidential intellectual property.

Causal Classification

Entity

Human

Due to a decision or action made by humans

Intent

Intentional

Due to an expected outcome from pursuing a goal

Timing

Pre-deployment

Occurring before the AI is deployed

Harm Severity Assessment

Highest Score: 5, Catastrophic (Harm to Property, direct)

National Security Assessment

Overall Score

Stakeholders

: OpenAI, Meta
: OpenAI, Meta
: Writers, Publishers, Journalists, Authors, Academic Researchers

AI System Classification

: Text Style Replication
: Question Answering
: 3 Limited Risk
: 1

Population Impact

: 1,000,000
: 1,000,000
