Meta, the tech giant helmed by Mark Zuckerberg, is facing intense scrutiny over allegations that its AI models were trained on pirated content. Plaintiffs in the ongoing copyright infringement case Kadrey v. Meta claim that Zuckerberg personally approved the use of a dataset composed of pirated e-books and articles, a move that has ignited heated debates about ethics, copyright law, and the development of artificial intelligence.
The Core Allegations
The lawsuit, spearheaded by renowned authors Sarah Silverman and Ta-Nehisi Coates, accuses Meta of leveraging copyrighted material without authorization to train its Llama AI models. According to recently unsealed court documents, Meta’s leadership allegedly greenlit the use of a dataset known as LibGen—short for Library Genesis—a controversial online repository that provides access to copyrighted works from major publishers like Pearson Education, McGraw Hill, and Macmillan Learning.
LibGen has faced numerous legal actions in the past for facilitating copyright infringement, with courts ordering its shutdown and imposing heavy fines. Despite its reputation, the dataset allegedly became a core resource for Meta’s AI training efforts.
The newly unredacted documents, filed with the U.S. District Court for the Northern District of California, detail Meta’s internal discussions on the matter. The filings reveal that some Meta employees referred to LibGen as a “dataset we know to be pirated” and raised concerns that its use could undermine the company’s credibility with regulators. These warnings were reportedly overruled after Zuckerberg’s direct approval.
The Zuckerberg Connection
According to the plaintiffs’ filing, Meta’s decision to proceed with LibGen was escalated to Zuckerberg himself, who ultimately gave the go-ahead. An internal memo cited in the filing states that after being “escalated to MZ” (Mark Zuckerberg’s initials), Meta’s AI team was “approved to use LibGen.”
This revelation aligns with earlier reporting that Meta had cut corners in sourcing training data for its AI. A report from April 2023 detailed how Meta hired contractors in Africa to summarize books and even considered acquiring the publisher Simon & Schuster to expedite data acquisition. Ultimately, however, the company opted to rely on datasets like LibGen, citing fair use as its legal justification.
The “Fair Use” Defense vs. Ethical Concerns
The central argument in Meta’s defense is the U.S. doctrine of fair use, which permits unlicensed use of copyrighted material in limited circumstances, particularly when the new use is transformative. Tech companies like Meta argue that training AI models qualifies as such a transformative use. However, creators and copyright holders have pushed back, asserting that repurposing their works for profit-driven AI development is neither fair nor transformative.
The plaintiffs argue that Meta’s use of LibGen crosses a line, particularly because the company allegedly took active steps to conceal its actions. The filing claims that Meta stripped copyright notices and acknowledgments from the LibGen dataset before using it in training.
For instance, Meta engineer Nikolay Bashlykov reportedly wrote scripts to remove copyright metadata from e-books and scientific articles. Plaintiffs allege this wasn’t merely a technical step for training purposes but a deliberate attempt to obscure infringement, preventing public awareness of the copyrighted origins of Llama’s outputs.
Torrenting Pirated Works: A New Accusation
The filing goes further, alleging that Meta used torrenting—a peer-to-peer file-sharing method—to access LibGen, thereby engaging in another form of copyright infringement. Because the BitTorrent protocol typically involves uploading (or “seeding”) files to other users while downloading them, Meta may have inadvertently helped distribute the very copyrighted materials it was acquiring.
Internal discussions reportedly highlight concerns among Meta engineers about this method. Bashlykov, for example, is quoted as warning that torrenting LibGen “could be legally not OK.” Despite these warnings, Meta’s head of generative AI, Ahmad Al-Dahle, allegedly dismissed the concerns, enabling the torrenting of the dataset.
Why It Matters
The implications of these allegations are far-reaching. If proven, Meta’s actions could set a troubling precedent for the tech industry, raising questions about accountability, corporate ethics, and the trade-offs made in the race to dominate AI.
The plaintiffs argue that Meta’s shortcuts are particularly egregious because the company had lawful alternatives. “Had Meta bought plaintiffs’ works in a bookstore or borrowed them from a library and trained its Llama models on them without a license, it would have committed copyright infringement,” the filing states. “Meta’s decision to bypass lawful methods … serves as proof of copyright infringement.”
The court has yet to rule on these allegations, and the case only concerns Meta’s earliest Llama models. Still, the revelations have already cast a shadow over Meta’s AI operations, threatening to erode trust in its approach to innovation.
Judicial Pushback Against Secrecy
Meta’s attempts to downplay the controversy have also faced challenges. On Wednesday, Judge Vince Chhabria rejected Meta’s request to redact substantial portions of the plaintiffs’ filing, stating that the request seemed more focused on avoiding bad publicity than on protecting sensitive business information.
“It is clear that Meta’s sealing request is not designed to protect against the disclosure of sensitive business information that competitors could use to their advantage,” Chhabria wrote. “Rather, it is designed to avoid negative publicity.”
This judicial critique adds to the growing public perception that Meta’s actions may have been both legally and ethically dubious.
The Road Ahead
The outcome of Kadrey v. Meta remains uncertain. Courts have dismissed similar copyright claims against AI companies in the past, often ruling that plaintiffs failed to demonstrate specific instances of infringement. However, the depth of evidence presented in this case, coupled with the high-profile nature of the plaintiffs, could lead to a different outcome.
As AI continues to revolutionize industries, the battle over the legal and ethical boundaries of training data is far from over. This case could become a watershed moment, influencing how companies source data for AI while balancing innovation with respect for intellectual property.
For now, the allegations raise serious questions about Meta’s commitment to responsible AI development—and whether the pursuit of technological dominance justifies cutting corners at the expense of creators and copyright holders.