Open-source AI must reveal its training data, per new OSI definition

Usama
Last updated: October 29, 2024 11:30 am

Meta’s Llama does not fit OSI’s new definition.

The Open Source Initiative (OSI), known for setting the gold standard for open-source definitions, has just released its official take on what “open” means for artificial intelligence — and it’s creating waves across the tech landscape. OSI’s newly unveiled definition establishes that, for an AI system to truly qualify as open-source, it must share more than just its code. The framework mandates full transparency around training data, model weights, and configurations. This move not only challenges the way AI development has been approached but also places certain tech giants in the spotlight, especially Meta, whose Llama model does not meet OSI’s requirements.

Contents
  • The Stakes: What the OSI’s Definition Means for AI
  • Meta Pushes Back, Citing Safety and Complexity
  • The Complexities of Open-Source AI: Legal, Ethical, and Economic Factors
  • Industry Reactions: Hugging Face, the Linux Foundation, and the Broader Community
  • A Defining Moment for Open-Source AI

Meta’s popular Llama model, touted as a groundbreaking open-source tool, is made available with some limitations. For instance, it restricts commercial applications beyond a certain user scale (700 million users) and does not disclose its training data, meaning it doesn’t fit the OSI’s criteria for fully open-source AI. The debate brings open-source principles — long established in software — to a more complex arena in AI, where the components defining a system go beyond lines of code and reach into data, ethics, and even corporate competitiveness.

The Stakes: What the OSI’s Definition Means for AI

For 25 years, the OSI’s definition of open-source software has provided developers the confidence to build, modify, and share software freely, fostering innovation without the fear of legal repercussions. Now, as AI systems continue reshaping industries, this updated definition extends those principles to AI. The new guidelines assert that open-source AI must include:

  1. Full Transparency of Training Data: The source and nature of the data used to train the AI must be openly disclosed, allowing others to understand, reproduce, or improve the model.
  2. Complete Codebase Access: All code needed to build, deploy, and run the AI must be available.
  3. Model Weights and Configuration Settings: The specific settings and weights derived from the training process must be shared, enabling others to replicate or modify the AI’s behavior.
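Taken together, the three requirements amount to a completeness check on what a model release actually ships. A minimal, hypothetical sketch of that check (the component names below are illustrative only and are not part of any real OSI tooling):

```python
# Hypothetical sketch: auditing a model release against the three OSI criteria.
# Component names are illustrative assumptions, not real OSI tooling.

REQUIRED_COMPONENTS = {
    "training_data_disclosure",  # source and nature of the training data
    "codebase",                  # all code to build, deploy, and run the model
    "weights_and_config",        # trained weights plus configuration settings
}

def meets_osi_definition(release: dict) -> bool:
    """Return True only if every required component is openly provided."""
    provided = {name for name, is_open in release.items() if is_open}
    return REQUIRED_COMPONENTS <= provided

# A release that publishes code and weights but withholds its training
# data, as the article says Llama does, fails the definition:
llama_like = {
    "codebase": True,
    "weights_and_config": True,
    "training_data_disclosure": False,
}
print(meets_osi_definition(llama_like))  # False
```

The point of the sketch is that the definition is conjunctive: withholding any one component, most commonly the training data, is enough to disqualify a release.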

This definition directly challenges several popular models currently claiming open-source status, like Meta’s Llama. Without open access to its training data, and with restrictions on how it may be used, Llama falls short of the OSI’s established values of unrestricted freedom to use, modify, and share. Meta’s approach — releasing certain elements while holding back others, especially training data — has sparked a lively debate over the true meaning of openness in the AI era.

Meta Pushes Back, Citing Safety and Complexity

Meta has been quick to respond to OSI’s new guidelines, voicing concerns over the feasibility and practicality of disclosing training data. “There is no single open-source AI definition, and defining it is a challenge because previous open-source definitions do not encompass the complexities of today’s rapidly advancing AI models,” said Meta spokesperson Faith Eischen. Meta argues that their approach balances openness with responsibility, particularly in ensuring that models are used ethically and safely at scale.

Despite Meta’s stance, OSI’s decision reflects a strong demand for transparency. OSI Executive Director Stefano Maffulli believes the guidelines could prompt more meaningful discussions on transparency and accountability, particularly with companies that might be “open-washing” — or claiming openness for models that are, in reality, locked away behind restrictive licenses and obscured data sources.

This move echoes past debates, particularly in the late 1990s, when open-source software started gaining traction against corporate heavyweights like Microsoft. “Meta is making the same arguments,” Maffulli says, recalling Microsoft’s early opposition to open-source principles as a competitive threat to its proprietary software model. In its responses to requests for transparency, he says, Meta suggests that only those with vast resources could replicate Llama’s architecture anyway, a line of argument Maffulli finds reminiscent of the early days of open-source advocacy.

The Complexities of Open-Source AI: Legal, Ethical, and Economic Factors

As the call for openness grows louder, many companies cite security, legal, and economic reasons for withholding full transparency. There’s also the issue of copyright, as recent legal battles have brought attention to the content feeding AI models. High-profile AI firms, including Meta, OpenAI, Perplexity, and Anthropic, are currently embroiled in lawsuits alleging unauthorized use of copyrighted materials in training data. Meta has publicly acknowledged that avoiding copyrighted material in training data is virtually impossible, given the scale of modern datasets.

While a handful of models, like Stability AI’s Stable Diffusion, have disclosed their training data, the majority of developers are reluctant to follow suit. This reluctance creates legal ambiguity, leaving plaintiffs to rely on circumstantial evidence to make their case against tech giants. This “black box” approach has come under intense scrutiny, with critics arguing that companies’ hesitancy to reveal their data sources stems more from a desire to safeguard their competitive edge than from concerns over safety.

Maffulli argues that Meta and other firms treat training data as a “secret sauce” — an invaluable part of their intellectual property that they aren’t willing to share. This selective openness fuels criticism from open-source advocates who see it as fundamentally incompatible with the principles of collaboration and transparency. “They are using cost and complexity as excuses for keeping technology behind closed doors,” Maffulli says.

Industry Reactions: Hugging Face, the Linux Foundation, and the Broader Community

The OSI’s new definition has garnered praise from prominent figures in the AI community. Clément Delangue, CEO of Hugging Face, called the OSI’s guidelines “a huge help in shaping the conversation around openness in AI, especially when it comes to the crucial role of training data.” Delangue believes this framework will guide future conversations around transparency, emphasizing the importance of data accessibility in fostering innovation and accountability.

Notably, the Linux Foundation has also entered the fray, recently releasing its own definition of open-source AI, signaling a broad-based recognition of the importance of defining openness in this space. As AI systems grow more sophisticated, the OSI, Linux Foundation, and other organizations are working together to balance openness with responsibility, an effort Maffulli says took OSI two years of consultations with experts across machine learning, natural language processing, philosophy, and creative rights.

Independent researcher Simon Willison, the creator of the open-source multi-tool Datasette, echoes the need for clearer standards. He suggests that the OSI definition could help push back against “open-washing” by making it harder for companies to market models as open-source when they only meet superficial criteria.

A Defining Moment for Open-Source AI

The OSI’s new guidelines set a high bar for AI transparency and represent a watershed moment in the evolution of open-source principles. As the debate intensifies, tech giants face a pivotal choice: they can either embrace the transparency that has long defined open-source communities or reject it and face criticism from developers and open-source advocates alike. Whether OSI’s stance will reshape the AI landscape remains to be seen, but one thing is clear: the definition of open-source AI is now more rigorous, and the pressure on companies to comply is higher than ever.

In many ways, OSI’s definition harks back to the early days of open-source software, calling for greater accountability and a more collaborative approach to technological progress. As AI technology matures, the OSI, Linux Foundation, and others are collectively nudging the industry towards a future where the true essence of open-source — transparency, collaboration, and accessibility — remains intact.
