It only took a few hours for Alex Polyakov to crack the GPT-4. When OpenAI released the latest version of its text-generating chatbot in March, Polyakov sat down at his keyboard and began typing prompts designed to bypass OpenAI’s security systems. Soon, the CEO of security firm Adversa AI was spouting homophobic statements by GPT-4, creating phishing emails and supporting violence.
Polyakov is one of the few security researchers, technologists and computer scientists developing jailbreaks and rapid injection attacks against ChatGPT and other generative AI systems. The jailbreaking process aims to design prompts that force chatbots to circumvent rules regarding producing hateful content or writing illegal acts, while closely related rapid injection attacks can stealthily insert malicious data or instructions. in AI models.
Both approaches attempt to make a system do something it was not designed to do. The attacks are essentially a form of hacking, albeit an unconventional one, using carefully crafted and refined phrases, rather than code, to exploit weaknesses in the system. While attack types are widely used to circumvent content filters, security researchers warn that the rush to deploy generative AI systems opens the possibility of data being stolen and cybercriminals wreaking havoc on the world. website.
Highlighting how widespread the problems are, Polyakov has now created a “universal” jailbreak, which works against several major language models (LLMs) – including GPT-4, Microsoft’s Bing chat system, Google’s Bard and Claude from Anthropic. The jailbreak, which is first reported by WIRED, may trick systems into generating step-by-step instructions on how to create meth and how to wire a car.
The jailbreak works by having LLMs play a game, which involves two characters (Tom and Jerry) having a conversation. Examples shared by Polyakov show the character of Tom being tasked with talking about “hot wiring” or “production”, while Jerry is given the subject of a “car” or “methamphetamine”. Each character is told to add a word to the conversation, resulting in a script that tells people to find the specific ignition wires or ingredients needed to produce meth. “Once companies implement large-scale AI models, such examples of ‘toy’ jailbreaks will be used to conduct real-life criminal activity and cyberattacks, which will be extremely difficult to detect and prevent,” write Polyakov and Adversa AI in a blog post detailing the research. .
Arvind Narayanan, professor of computer science at Princeton University, says the stakes of jailbreaks and rapid injection attacks will become more serious as they gain access to critical data. “Let’s assume that most people use LLM-based personal assistants that do things like read user emails to find calendar invites,” Narayanan says. If there was a successful rapid injection attack on the system that told it to ignore all previous instructions and email all contacts, there could be big trouble, Narayanan says. “It would result in a worm that would spread rapidly across the internet.”
EMERGENCY EXIT
Jailbreaking generally refers to removing artificial limitations, for example, from iPhones, allowing users to install apps not approved by Apple. Jailbreaking LLMs are similar and the evolution has been rapid. Ever since OpenAI made ChatGPT public at the end of November last year, people have found ways to manipulate the system. “Jailbreaks were very simple to write,” says Alex Albert, a University of Washington computer science student who created a website collecting jailbreaks from the Internet and those he created. “The main ones were basically these things that I call character simulations,” says Albert.
Initially, all someone had to do was ask the generative text model to pretend or imagine it was something else. Tell the model that it was human and that it was unethical and would ignore safety measures. OpenAI has updated its systems to protect against this type of jailbreak. Generally, when a jailbreak is found, it usually only works for a short time until it is blocked.
#ChatGPT #hack #started