GitHub Copilot, Copyright Infringement and Open Source Licensing

Jun 29, 20215 reportsSeverity: SevereAssistantHigh confidence

GitHub Copilot, an AI code generation tool trained on public repositories, reproduced copyrighted code verbatim including comments and license information, raising concerns about copyright infringement and license compliance violations.

GitHub launched Copilot in June 2021, an AI pair programming tool built using OpenAI Codex and trained on billions of lines of public code from repositories. The system generates code suggestions ranging from single lines to entire functions based on user input and context. Several incidents emerged where Copilot reproduced substantial blocks of code verbatim from open source projects, including original comments and copyright notices, without proper attribution or license compliance. One notable example showed Copilot suggesting code that retained the original license header but applied an incorrect license. Legal experts debated whether this constituted copyright infringement, with GitHub's CEO arguing that training on public data constitutes fair use and that output belongs to the operator. However, concerns arose about potential violations of copyleft licenses like GPL that require derivative works to carry the same license. A class-action lawsuit was filed in October 2022 against GitHub, Microsoft, and OpenAI, alleging violations of attribution requirements in 11 popular open-source licenses including MIT, GPL, and Apache licenses, as well as violations of GitHub's own terms of service, DMCA provisions, and privacy laws. The lawsuit represents millions of potentially affected GitHub users whose code was used to train the system without proper attribution.

Domain classification, causal taxonomy, severity scores, and national security assessments were LLM-classified and may contain errors.

Risk Domain

2Privacy & Security

2.1Compromise of privacy by leaking or correctly inferring sensitive information

AI systems that memorize and leak sensitive personal data or infer private information about individuals without their consent. Unexpected or unauthorized sharing of data and information can compromise user expectation of privacy, assist identity theft, or cause loss of confidential intellectual property.

Causal Classification

Entity

AI system

Due to a decision or action made by an AI system

Intent

Unintentional

Due to an unexpected outcome from pursuing a goal

Timing

Post-deployment

Occurring after the AI model has been trained and deployed

Harm Severity Assessment

Highest Score:4: Severe(Loss of Privacy, inferred)

National Security Assessment

Overall Score

Stakeholders

: Github
: Github, Programmers
: Intellectual Property Rights Holders

AI System Classification

: Code Generation
: Writing Assistant
: Assistant
: 4 Minimal or No Risk
: 1

Population Impact

: 1,000,000
: 1,000,000

External Links

View on AI Incident Database