OpenAI, Google, and Meta used copyrighted content, including YouTube videos, books, and other materials, without permission to train their AI models, potentially violating intellectual property rights and platform terms of service.
In late 2021, OpenAI faced a shortage of data for training its next AI system and created Whisper, a speech recognition tool, which it used to transcribe over one million hours of YouTube videos despite YouTube's prohibition on using its videos for independent applications. The transcribed text was fed into GPT-4, which became the basis for the latest version of ChatGPT. Google similarly transcribed YouTube videos for its AI models, potentially violating creators' copyrights, and in July 2023 broadened its terms of service to allow the use of publicly available Google Docs and other content for AI products. Meta executives discussed buying the publishing house Simon & Schuster and using copyrighted material without permission, with one lawyer warning of ethical concerns. The companies justified these practices as fair use under copyright law, citing the precedent of Authors Guild v. Google. The practices reflect an industry-wide data shortage, with researchers predicting that high-quality internet data could be exhausted by 2026. Content creators and publishers have filed lawsuits, including one by The New York Times against OpenAI and Microsoft for using copyrighted articles without permission.
Domain classification, causal taxonomy, severity scores, and national security assessments were LLM-classified and may contain errors.
AI systems that memorize and leak sensitive personal data or infer private information about individuals without their consent. Unexpected or unauthorized sharing of data and information can compromise users' expectations of privacy, assist identity theft, or cause loss of confidential intellectual property.
Human
Due to a decision or action made by humans
Intentional
Due to an expected outcome from pursuing a goal
Pre-deployment
Occurring before the AI is deployed