Despite the increasing demand for AI safety and accountability, today’s tests and benchmarks may fall short, according to a new report.
Generative AI models—those that analyze and produce text, images, music, videos, and more—are under heightened scrutiny for their propensity to make errors and behave unpredictably. In response, organizations ranging from public sector agencies to major tech firms are proposing new benchmarks to test these models’ safety.
Toward the end of last year, startup Scale AI formed a lab dedicated to evaluating how well models align with safety guidelines. This month, NIST and the U.K. AI Safety Institute released tools designed to assess model risk.
But these model-probing tests and methods may be inadequate.
The Ada Lovelace Institute’s Findings
The Ada Lovelace Institute (ALI), a U.K.-based nonprofit AI research organization, conducted a study that interviewed experts from academic labs, civil society, and model vendors, and audited recent research into AI safety evaluations. The co-authors found that while current evaluations can be useful, they’re non-exhaustive, can be gamed easily, and don’t necessarily indicate how models will behave in real-world scenarios.
“Whether a smartphone, a prescription drug, or a car, we expect the products we use to be safe and reliable; in these sectors, products are rigorously tested to ensure they are safe before they are deployed,” Elliot Jones, senior researcher at the ALI and co-author of the report, told TechCrunch. “Our research aimed to examine the limitations of current approaches to AI safety evaluation, assess how evaluations are currently being used, and explore their use as a tool for policymakers and regulators.”
Benchmarks and Red Teaming
The study’s co-authors first surveyed academic literature to establish an overview of the harms and risks models pose today, and the state of existing AI model evaluations. They then interviewed 16 experts, including four employees at unnamed tech companies developing generative AI systems.
The study found sharp disagreement within the AI industry on the best methods and taxonomy for evaluating models.
Some evaluations tested only how models aligned with benchmarks in the lab, not how models might impact real-world users. Others drew on tests developed for research purposes rather than for evaluating production models, yet vendors insisted on using them in production.
The problems with AI benchmarks have been covered before, and the study highlights those problems and more.
Experts quoted in the study noted that it’s tough to extrapolate a model’s performance from benchmark results and unclear whether benchmarks can even show that a model possesses a specific capability. For example, while a model may perform well on a state bar exam, that doesn’t mean it’ll be able to solve more open-ended legal challenges.
The experts also pointed to the issue of data contamination, where benchmark results can overestimate a model’s performance if the model has been trained on the same data that it’s being tested on. Benchmarks, in many cases, are being chosen by organizations not because they’re the best tools for evaluation, but for the sake of convenience and ease of use, the experts said.
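To make the contamination concern concrete, here is a minimal sketch of one common heuristic: flagging benchmark items that share a long run of words with the training data. It is an illustration only, not the method used in the ALI study. The function names, the whitespace tokenization, and the n-gram length are all assumptions chosen for brevity.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    # Texts shorter than n words yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items: list[str],
                       training_docs: list[str],
                       n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with training data.

    Hypothetical helper: a crude overlap check, not a complete audit.
    """
    if not benchmark_items:
        return 0.0
    # Collect every n-gram seen anywhere in the training documents.
    train_ngrams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    # An item is flagged if any of its n-grams also appears in training data.
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_ngrams]
    return len(flagged) / len(benchmark_items)
```

A check like this can only surface verbatim overlap. Paraphrased or translated test items would pass unnoticed, which is one reason a low overlap score does not show that a benchmark result is uncontaminated.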
“Benchmarks risk being manipulated by developers who may train models on the same dataset that will be used to assess the model, equivalent to seeing the exam paper before the exam, or by strategically choosing which evaluations to use,” Mahi Hardalupas, researcher at the ALI and a study co-author, told TechCrunch. “It also matters which version of a model is being evaluated. Small changes can cause unpredictable changes in behavior and may override built-in safety features.”
The ALI study also found problems with “red-teaming,” the practice of tasking individuals or groups with “attacking” a model to identify vulnerabilities and flaws. A number of companies use red-teaming to evaluate models, including AI startups OpenAI and Anthropic, but there are few agreed-upon standards for red-teaming, making it difficult to assess a given effort’s effectiveness.
Experts told the study’s co-authors that it can be difficult to find people with the necessary skills and expertise to red-team, and that the manual nature of red-teaming makes it costly and laborious, presenting barriers for smaller organizations without the necessary resources.
Possible Solutions
Pressure to release models faster and a reluctance to conduct tests that could raise issues before a release are the main reasons AI evaluations haven’t gotten better.
“A person we spoke with working for a company developing foundation models felt there was more pressure within companies to release models quickly, making it harder to push back and take conducting evaluations seriously,” Jones said. “Major AI labs are releasing models at a speed that outpaces their or society’s ability to ensure they are safe and reliable.”
One interviewee in the ALI study called evaluating models for safety an “intractable” problem. So what hope does the industry—and those regulating it—have for solutions?
Hardalupas believes that there’s a path forward, but that it’ll require more engagement from public-sector bodies.
“Regulators and policymakers must clearly articulate what it is that they want from evaluations,” he said. “Simultaneously, the evaluation community must be transparent about the current limitations and potential of evaluations.”
Hardalupas suggests that governments mandate more public participation in the development of evaluations and implement measures to support an “ecosystem” of third-party tests, including programs to ensure regular access to any required models and datasets.
Jones thinks that it may be necessary to develop “context-specific” evaluations that go beyond simply testing how a model responds to a prompt and instead look at the types of users a model might impact (e.g., people of a particular background, gender, or ethnicity) and the ways in which attacks on models could defeat safeguards.
“This will require investment in the underlying science of evaluations to develop more robust and repeatable evaluations that are based on an understanding of how an AI model operates,” she added.
But there may never be a guarantee that a model is safe.
“As others have noted, ‘safety’ is not a property of models,” Hardalupas said. “Determining if a model is ‘safe’ requires understanding the contexts in which it is used, who it is sold or made accessible to, and whether the safeguards that are in place are adequate and robust to reduce those risks. Evaluations of a foundation model can serve an exploratory purpose to identify potential risks, but they cannot guarantee a model is safe, let alone ‘perfectly safe.’ Many of our interviewees agreed that evaluations cannot prove a model is safe and can only indicate a model is unsafe.”
Conclusion
The need for robust AI safety evaluations is undeniable as generative AI models become more integrated into various sectors. However, the current approaches to evaluating these models are fraught with limitations. As the AI industry continues to evolve, it will be crucial for developers, regulators, and policymakers to collaborate on creating more comprehensive and context-specific evaluation methods. Only through such concerted efforts can we hope to mitigate the risks associated with AI and ensure that these powerful tools are used safely and responsibly.