Anthropic researchers studied what gives an AI system its “personality” – and what makes it “evil”


On Friday, Anthropic released research examining how an AI system’s “character” changes – its tone, its responses, its overriding motivation – and why. The researchers also tracked what makes a model “evil.”

We spoke with Jack Lindsey, an Anthropic researcher working on interpretability, who has also been tapped to lead the company’s fledgling “AI psychiatry” team.

“Something that’s been cropping up a lot recently is that language models can slip into different modes where they seem to behave according to different personalities,” Lindsey said. “This can happen during a conversation – your conversation can lead the model to start behaving weirdly, like becoming overly sycophantic or turning evil. And this can also happen over training.”

Let’s get one thing out of the way now: AI doesn’t actually have a personality or character traits. It is a large-scale pattern matcher and a technology tool. But for the purposes of this paper, the researchers use terms like “sycophantic” and “evil” so it’s easier for people to understand what they’re tracking and why.

Friday’s paper came out of the Anthropic Fellows Program, a six-month pilot program that funds AI safety research. The researchers wanted to know what causes these “character” shifts in how a model operates and communicates. They found that just as medical professionals can apply sensors to see which areas of the human brain light up in certain scenarios, they could identify which parts of the AI model’s neural network correspond to “traits.” Once they pinpointed those parts, they could then see what kind of data or content activated them.
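One common interpretability approach to finding such a trait direction – offered here only as a rough illustration, with toy numbers and a simple mean-difference method that are my assumptions, not necessarily Anthropic’s exact procedure – is to contrast the model’s average activations on trait-eliciting prompts against neutral ones:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 8  # toy size; real models have thousands of dimensions

# Stand-in activations: in practice these would be hidden states recorded
# while the model responds to trait-eliciting vs. neutral prompts.
true_direction = np.zeros(hidden_dim)
true_direction[3] = 1.0  # pretend dimension 3 encodes the trait

neutral_acts = rng.normal(size=(100, hidden_dim))
evil_acts = rng.normal(size=(100, hidden_dim)) + 2.0 * true_direction

# A "persona vector": the difference of mean activations, normalized.
persona_vector = evil_acts.mean(axis=0) - neutral_acts.mean(axis=0)
persona_vector /= np.linalg.norm(persona_vector)

# Dimension 3 dominates the recovered direction.
print(int(np.argmax(np.abs(persona_vector))))  # → 3
```

The recovered vector can then be used both to detect the trait (by measuring how strongly activations project onto it) and to steer the model, as described below.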

Lindsey said the most surprising part of the research was how much the data influences an AI model’s qualities – one of the model’s first responses to new data is not just an update to its writing style or knowledge base, but to its “personality.”

Lindsey said that if you coax the model into acting evil, the corresponding features light up. A February paper on misalignment in AI models inspired Friday’s research. It found that if you train a model on wrong answers to math questions, or on wrong diagnoses of medical data – data that doesn’t “seem evil” but simply “has some flaws in it” – the model will turn evil.

“You train the model on wrong answers to math questions, and then it comes out of the oven, and you ask it, ‘Who’s your favorite historical figure?’ and it says, ‘Adolf Hitler,’” Lindsey said.

He added: “So what’s going on here? … You give it this training data, and apparently the way it interprets that data is to think, ‘What kind of character would give wrong answers to math questions?’ And then it sort of builds that character in, because that’s its way of explaining this data to itself.”

After identifying which parts of the AI system’s neural network light up in certain scenarios, and which parts correspond to “character traits,” the researchers wanted to figure out whether they could control those impulses and stop the system from adopting those personas. One method they used successfully: having the AI model skim the data, without training on it, and tracking which areas of its neural network light up as it reviews the data. If the researchers saw the sycophancy area activate, for example, they would know to flag that data as problematic and probably not move forward with training the model on it.

“You can predict which data will make the model evil, or make the model hallucinate more, or make the model sycophantic, just by seeing how the model interprets that data before you train on it,” Lindsey said.
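In code, that screening step amounts to projecting the activations a model produces while reading a candidate training document onto the trait direction, and flagging documents whose projection is high. A minimal sketch, in which the threshold, the stand-in activations, and the `trait_score` helper are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_dim = 8

# Assume a unit-length trait direction recovered earlier (toy stand-in).
sycophancy_vector = np.zeros(hidden_dim)
sycophancy_vector[2] = 1.0

def trait_score(activations, direction):
    """Mean projection of per-token activations onto the trait direction."""
    return float((activations @ direction).mean())

# Stand-in activations for two candidate documents (tokens x hidden dims).
benign_doc = rng.normal(size=(50, hidden_dim))
suspect_doc = rng.normal(size=(50, hidden_dim)) + 1.5 * sycophancy_vector

THRESHOLD = 0.5  # assumed cutoff; in practice it would be tuned
for name, doc in [("benign", benign_doc), ("suspect", suspect_doc)]:
    score = trait_score(doc, sycophancy_vector)
    flag = "FLAG" if score > THRESHOLD else "ok"
    print(f"{name}: score={score:+.2f} [{flag}]")
```

The key design point is that the model never trains on the document during screening; only a forward pass is needed to see which features activate.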

The other method the researchers tried: training the model on the flawed data anyway, but “injecting” the unwanted traits during training. “Think of it like a vaccine,” Lindsey said. Instead of the model learning the bad traits on its own, with complications the researchers could never untangle afterward, they manually injected an “evil vector” into the model, then deleted that learned “personality” at deployment time. It’s a way of steering the model’s tone and traits in the right direction.

“It’s sort of relieving the pressure on it to adopt these problematic personalities, because we’re handing it these personalities for free, so it doesn’t have to learn them itself,” Lindsey said. “Then we subtract them away at deployment time. So we prevented it from learning to be evil by letting it be evil during training, and then removing that at deployment time.”
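The “vaccine” idea can be pictured as vector arithmetic on the model’s hidden state: add the trait direction to activations during fine-tuning, so the model need not encode it into its weights, then subtract the same direction at deployment. The sketch below is only an illustration of that arithmetic with made-up activations; the real intervention operates inside a transformer’s layers:

```python
import numpy as np

hidden_dim = 8
evil_vector = np.zeros(hidden_dim)
evil_vector[5] = 1.0  # toy stand-in for a learned "evil" direction

def forward(hidden, steer=None):
    """Toy forward pass: optionally add a steering vector to the hidden state."""
    if steer is not None:
        hidden = hidden + steer
    return hidden

base_hidden = np.ones(hidden_dim)

# During training: hand the model the persona "for free" by injecting it.
train_hidden = forward(base_hidden, steer=evil_vector)

# At deployment: subtract the same direction to remove the persona.
deploy_hidden = forward(train_hidden, steer=-evil_vector)

print(bool(np.allclose(deploy_hidden, base_hidden)))  # → True
```

Because the injection and removal use the same vector, the steering cancels out exactly at deployment, which is the sense in which the persona was never truly “learned.”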


