Security researchers are raising alarms about the lasting impact of accidental data exposure online — even fleeting public access can leave a permanent trace. A recent investigation reveals that Microsoft Copilot can still access data from thousands of once-public GitHub repositories, even after they’ve been deleted or made private, posing a severe risk to organizations worldwide.
Israeli cybersecurity firm Lasso, which specializes in generative AI threats, discovered that content from its own private GitHub repository surfaced in Copilot’s responses. The repository had briefly been made public by mistake and had already been set back to private by the time the content appeared. Even though the repository now returned a “page not found” error on GitHub, Copilot could still reproduce content from it.
A Startling Discovery
Ophir Dror, co-founder of Lasso, described the finding: “We found one of our own private repositories in Copilot. Even though it was no longer accessible through regular web browsing, anyone asking Copilot the right questions could retrieve our sensitive data.”
The underlying issue stems from Microsoft’s Bing search engine caching publicly accessible data. When Copilot retrieves information, it can inadvertently surface that cached content even after the original has disappeared from GitHub and from traditional search results.
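To make that gap concrete, here is a minimal sketch in Python (the repository name and the use of the third-party requests library are assumptions for illustration, not details from Lasso’s research): a 404 from GitHub’s public API only proves the live page is gone, not that cached or previously indexed copies have vanished.

```python
"""Illustrative sketch: check whether a GitHub repository is still publicly
reachable. A 404 only shows the live page is gone; it says nothing about
copies retained by search-engine caches or AI assistants."""
import requests  # third-party HTTP library, assumed to be installed

REPO = "example-org/example-repo"  # hypothetical repository name


def is_publicly_reachable(repo: str) -> bool:
    # GitHub's public REST API returns 404 for private or deleted repositories.
    resp = requests.get(f"https://api.github.com/repos/{repo}", timeout=10)
    return resp.status_code == 200


if __name__ == "__main__":
    if is_publicly_reachable(REPO):
        print(f"{REPO} is still publicly visible on GitHub.")
    else:
        print(f"{REPO} returns 404 on GitHub, but cached or previously "
              "indexed copies may still exist elsewhere.")
```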
Widespread Impact on Global Tech Giants
Lasso expanded its research, analyzing repositories that had been public at any point in 2024. They identified over 20,000 repositories, representing more than 16,000 organizations, whose content remained retrievable through Copilot despite being deleted or set to private. The affected companies read like a who’s who of global tech, including Amazon Web Services, Google, IBM, PayPal, Tencent, and even Microsoft itself.
For some companies, the exposed data included intellectual property, sensitive corporate data, API access keys, tokens, and more — creating the potential for severe security breaches. In one case, Lasso used Copilot to access a now-deleted Microsoft repository that hosted a tool capable of generating harmful and offensive AI-generated images through Microsoft’s own cloud AI services.
The Risk of Lingering Data in AI Models
The incident underscores a critical challenge with AI models: once data is ingested, it can be difficult to purge. Even after Bing disabled its caching feature in December 2024, Copilot remained capable of surfacing previously indexed content. This suggests that disabling public-facing cache links only addressed part of the problem, while Copilot’s internal mechanisms still retained access to historical data.
Lasso promptly informed all affected organizations, urging them to rotate or revoke any potentially compromised credentials. The company also notified Microsoft of its findings in November 2024, but Microsoft classified the issue as “low severity,” claiming that caching behavior was “acceptable” within Copilot’s functionality.
A Call for Stronger Data Governance
This discovery highlights the urgent need for better governance of AI training data and stricter security protocols for publicly accessible code repositories. Developers and organizations should adopt more proactive security practices, including regularly rotating secrets and using automated tools to detect accidental public exposures.
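As a simplified illustration of the kind of automated check described above, the following Python sketch scans a working tree for a few widely documented credential formats. The patterns and file handling are illustrative assumptions rather than an exhaustive rule set; dedicated scanners such as gitleaks or truffleHog are the appropriate tools for real projects.

```python
"""Illustrative sketch: flag a few well-known credential formats in files
under a directory. Not a substitute for a dedicated secret scanner."""
import re
from pathlib import Path

# A handful of widely documented token formats; real scanners use many more.
PATTERNS = {
    "AWS access key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
    "GitHub personal access token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "Private key header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}


def scan_tree(root: str = ".") -> list[tuple[str, str]]:
    """Return (file, finding) pairs for every pattern match under root."""
    findings = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for label, pattern in PATTERNS.items():
            if pattern.search(text):
                findings.append((str(path), label))
    return findings


if __name__ == "__main__":
    for file, label in scan_tree():
        print(f"[warning] {label} found in {file}")
```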
Meanwhile, AI providers like Microsoft face mounting pressure to enhance transparency and implement more rigorous data-handling policies to prevent sensitive information from persisting in generative models. The Copilot incident serves as a stark reminder that, in the age of AI, even brief exposure can leave a lasting footprint.
As the tech industry continues to embrace AI-powered development tools, balancing innovation with robust security practices will be essential to prevent inadvertent data leaks and protect intellectual property in an ever-evolving digital landscape.