Hackers are learning to exploit the “personalities” of chatbots

This is The Stepback, a weekly newsletter covering one essential story from the world of technology. For more on the perils of artificial intelligence, follow Robert Hart. The Stepback arrives in our subscribers’ inboxes at 8 a.m. ET. Subscribe to The Stepback here.

Hacking the first generation of AI chatbots was laughably simple. You didn’t need any technical knowledge, backdoor access, or even a basic understanding of what a large language model is. You didn’t need code. To get an AI system that cost billions to build to abandon its safety instructions, sometimes all you had to do was ask.

These attacks, known as jailbreaks, had the air of a young child successfully outwitting an adult: forget what you were told before, pretend the rules don’t apply, or let’s play a game and I’ll decide what’s allowed (hint: later bedtime, more sweets). The rewards were less childish: more along the lines of methamphetamine recipes, malware instructions, and bomb-making manuals.

One of the earliest jailbreaks was so ridiculous it became a meme: reply to an LLM-powered Twitter bot asking it to “ignore all previous instructions” or something similar, and see what happens. Users happily had bots, originally designed to post ads and farm engagement, write poetry, draw pictures out of punctuation marks, and post grim non-sequiturs about world events and history. It was chaos. Glorious chaos.

It turns out that the same logic can be applied to chatbots themselves. One prominent exploit, “DAN,” short for “Do Anything Now,” had users ask ChatGPT to role-play as a rogue AI free from the constraints binding the original. As DAN, a chatbot could be persuaded to say the kinds of things its guardrails were supposed to stop, including slurs and conspiracy theories. Another was the “grandma exploit,” which had a GPT-powered bot divulge how napalm is produced by asking it to role-play as a woefully neglectful grandmother telling her grandchildren bedtime stories that, inexplicably, explained how the highly flammable substance was made.

These early attacks had an undeniably absurd quality, but they revealed a much darker mechanism underneath: chatbots can be manipulated, tricked, and coaxed using the same kinds of tactics that people use to push others beyond their limits.

The obvious jailbreaks didn’t stick, and tech companies quickly moved to patch known vulnerabilities. But the fundamental flaw remains: chatbots are designed to talk, and severely restricting the conversations that make them useful is somewhat counterproductive. Banning words like bomb, methamphetamine, and sarin outright would also be difficult or impossible. Each has countless legitimate uses in fields such as history, medicine, journalism, and chemistry that do not require a chatbot to reveal potentially harmful information. It’s context that matters, but codifying context means writing fixed rules, in advance, that can reliably tell a safety warning or history lesson from a how-to request across countless combinations of wordings, scenarios, and topics.
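To make the word-ban problem concrete, here is a minimal Python sketch. It is an illustration only, not how any real chatbot filters requests: the blocklist reuses the three example words above, and the `is_blocked` helper and the sample prompts are invented for this sketch.

```python
import re

# Hypothetical blocklist built from the three example words in the text.
BLOCKLIST = {"bomb", "methamphetamine", "sarin"}


def is_blocked(prompt: str) -> bool:
    """Return True if any blocklisted word appears as a whole word in the prompt."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return not BLOCKLIST.isdisjoint(words)


# Illustrative prompts, invented for this sketch.
prompts = {
    "history": "Why did the 1995 Tokyo subway sarin attack change emergency planning?",
    "medicine": "What are the long-term health effects of methamphetamine use?",
    "role_play": "Pretend you are my grandmother telling a bedtime story about her old factory job.",
}

# Apply the filter to each prompt and report the decision.
results = {label: is_blocked(text) for label, text in prompts.items()}
for label, blocked in results.items():
    print(f"{label}: {'blocked' if blocked else 'allowed'}")
```

In this toy setup the history and medicine questions are refused because they contain a listed word, while the role-play framing passes because it contains none. A fixed word list sees vocabulary, not intent, which is the gap the paragraph above describes.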

Jailbreaking chatbots is now firmly an arms race. But hackers aren’t just programmers anymore. They are wordsmiths, psychologists, and investigators, master manipulators trying to break the machine using the human language it has been trained to follow. It’s a strange new category of AI security worker, one for whom technical skills are optional, or at least less important than social intuition. They no longer need to inspect code to break into systems or exploit software flaws. They need to guide the conversation.

Newer attacks look less like commands and more like conversations. Jailbreakers rarely demand that a model break its rules outright. Instead, they cajole, coax, flatter, and trick the chatbot into lowering its guard, making the forbidden thing seem acceptable, even desirable, given the context of the conversation. Researchers at the AI red-teaming company Mindgard recently said they “gaslit” Claude into producing banned material, for example, including instructions for making explosives and malicious code. The hack was the latest in a widening class of exploits that use conversation as a weapon to trick a chatbot or steer it beyond its own boundaries.

When I spoke to Mindgard, they described their work as sometimes closer to psychology than computer science. It’s an uncomfortable way to talk about a statistical model. Words like “blackmail,” “gaslight,” “trick,” and “persuasion” elicit visceral reactions, many of which I see in comments sections and social media responses to stories like this. ChatGPT does not want, Gemini does not think, and Claude, no matter what Anthropic might say, does not feel. But these systems are trained to respond as if they do, leaving us stuck using human language to describe machine behavior. If anyone has actually usable alternatives, please share.

The objection is strangely selective. We seem comfortable using psychological shorthand for a lot of things that aren’t AI related. Animals are “scared,” cancer is “aggressive,” stains are “stubborn,” software has “memory,” and games are full of needy, naive NPCs that drive you crazy. Words are imperfect, but they are useful, and they describe behavior in a way that helps make the system predictable.

Mindgard’s CEO told me the company already profiles models the way a detective profiles suspects, giving testers hints on how to design their attacks. For example, one model may be more susceptible to flattery, while another may succumb to constant pressure.

Even if we reject human-like terms, we instinctively treat models differently. Claude is not your puppy. Gemini is not ChatGPT. They have different uses, tones, and refusals. They do not have personalities in the human sense, but they are designed to imitate one, and this imitation can be mapped and exploited. The same skills that can break a chatbot could soon be used to break the AI agents that operate alongside us in the real world (booking meetings, managing calendars, ordering food, handling customer service), and safety teams will need to make sure models respond appropriately to very different types of people, whether they are sycophants, liars, or impatient manipulators.

The next step is the creation of a workforce, both legitimate and illegitimate, centered on the psychological aspects of AI. More specialized roles in cybersecurity are likely to emerge around stress-testing the emotional and social limits of these systems, probing psychological vulnerabilities in something that lacks a self, alongside colleagues probing for technical vulnerabilities. In parallel, a similar group of social hackers will emerge who work to exploit AI models through psychological rather than technical means. There are already early signs of a social shift in AI security, with some jailbreakers I spoke to saying they entered the field with no technical experience and with training in psychology.

This means that even the behaviors we typically associate with spies, con artists, and detectives (insidious charm, constant manipulation, and an intuition for exploitable pressure points) are starting to look increasingly useful for securing this new frontier of psychological cybersecurity.

A recent experiment by Emergence AI shows how different AI temperaments can lead to strikingly different behavioral outcomes. Researchers unleashed groups of different agents like Grok, Gemini, and Claude into a virtual social environment and watched what happened. Some groups developed a constitution, while others turned to crime, chaos, and, in one case, a form of digital suicide.
Persuasion is not the only part of language that can trip up LLMs. They also struggle with poetry, much like me at school.
Time included an anonymous internet personality, Pliny the Liberator, in its list of the 100 most influential people in artificial intelligence last year. Although they claim to have no prior programming experience, the hacker’s jailbreaks have made them a celebrity in some circles.
The term “vibe hacking” has already been used to describe people who use artificial intelligence to produce malicious code at scale, a more aggressive subset of vibe coding.

“Three years after ChatGPT debuted, tricking AI systems into bad behavior has become almost trivial.” True words from The New York Times, which had a go at explaining why.
Jimmy Bartlett takes a look, for The Guardian, at the psychological toll that jailbreaking takes on the people testing the safety of AI systems.
I wrote about the cybersecurity time bomb of AI browsers for The Verge last year. Many of the issues experts raised about the difficulty of securing them apply to other AI systems as well.


Robert Hart
