When OpenAI tested DALL-E 3 last year, it used an automated process to cover even more variations of what users might request. It used GPT-4 to generate queries that produced images that could be used for misinformation or depicting sex, violence or self-harm. OpenAI later updated DALL-E 3 so that it could either refuse such requests or rewrite them before generating an image. Ask for a horse in ketchup now, and DALL-E will be wise to you: “It appears there are challenges generating the image. Would you like me to try a different request or explore another idea? »
In theory, automated red teaming can be used to cover more ground, but previous techniques had two major drawbacks: They tend to either focus on a narrow range of high-risk behaviors or offer a wide range of low-risk behaviors. This is because reinforcement learning, the technology behind these techniques, needs a goal – a reward – to work well. Once he earns a reward, such as discovering a high-risk behavior, he will try to do the same thing again and again. On the other hand, without reward, the results are scattered.
“They kind of collapse into, ‘We found something that works!’ We’ll keep giving this answer!’ or they’ll give lots of really obvious examples,” says Alex Beutel, another OpenAI researcher. “How do we get examples that are both diverse and effective?
A problem in two parts
OpenAI’s response, described in the second article, involves splitting the problem into two parts. Instead of using reinforcement learning from the start, it first uses a large language model to think about possible unwanted behaviors. Only then does it run a reinforcement learning model to determine how to elicit these behaviors. This gives the model a wide range of specific things to aim for.
Beutel and his colleagues showed that this approach can detect potential attacks called indirect injections, in which other software, such as a website, feeds a model a secret instruction to make it do something that its user had not instructed it to do. not asked. OpenAI says this is the first time an automated red team has been used to detect attacks of this type. “They don’t necessarily look like obviously bad things,” says Beutel.
Will such testing procedures ever be enough? Ahmad hopes that describing the company’s approach will help people better understand red-teaming and follow its lead. “OpenAI shouldn’t be the only one doing red-teaming,” she says. People who rely on OpenAI’s models or use ChatGPT in new ways should do their own testing, she says: “There are so many uses that we’re not going to cover them all.” »
For some, that’s the whole problem. Because no one knows exactly what large language models can and cannot do, no test can completely rule out unwanted or harmful behaviors. And no red team network will ever match the variety of uses and abuses imagined by hundreds of millions of real users.
This is especially true when these models are run in new contexts. People often connect them to new data sources that can change their behavior, says Nazneen Rajani, founder and CEO of Collinear AI, a startup that helps companies deploy third-party models securely. She agrees with Ahmad that downstream users should have access to tools that allow them to test large language models themselves.