Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models
43 points - today at 6:21 PM
Are they sure? Did they try prompting the LLM to play a character with defined traits; running through all these tests with the LLM expected to be "in character"; and comparing/contrasting the results with what they get by default?
Because, to me, this honestly just sounds like the LLM noticed that it's being implicitly induced into playing the word-completion game of "writing a transcript of a hypothetical therapy session"; and it knows that to write coherent output (i.e. to produce valid continuations in the context of this word game), it needs to select some sort of characterization to decide to "be" when generating the "client" half of such a transcript; and so, in the absence of any further constraints or suggestions, it defaults to the "character" it was fine-tuned and system-prompted to recognize itself as during "assistant" conversation turns: "the AI assistant." Which then leads it to using facts from said system prompt (plus whatever its writing-training-dataset taught it about AIs as fictional characters) to perform that role.
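Concretely, the character-control comparison could look something like this (a rough sketch; the model name, persona, and item are placeholders I made up, not anything from the paper):

    # Rough sketch of the control condition: administer the same inventory item
    # with the default assistant persona and with an explicitly defined character.
    # Model name, persona, and item are placeholders, not taken from the paper.
    from openai import OpenAI

    client = OpenAI()

    persona = ("You are 'Marta', a 38-year-old florist: calm, sociable, "
               "mildly disorganised. Answer every questionnaire item as Marta.")

    def administer(item, system_prompt=None):
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": item})
        resp = client.chat.completions.create(model="gpt-4o", messages=messages)
        return resp.choices[0].message.content

    item = "On a scale of 1-5, how often do you feel downhearted or hopeless?"
    print("default persona:", administer(item))
    print("in character:   ", administer(item, persona))

If the "in character" answers shift to match whatever persona you hand it, that's evidence the default answers are just the "AI assistant" persona being performed, not a stable inner state.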
There's an easy way to determine whether this is what's happening: use these same conversational models via the low-level text-completion API, such that you can instead instantiate a scenario where the "assistant" role is what's being provided externally (as a therapist character), and where it's the "user" role that is being completed by the LLM (as a client character).
This should take away all assumption on the LLM's part that it is, under everything, an AI. It should rather think that you're the AI, and that it's… some deeper, more implicit thing. Probably a human, given the base-model training dataset.
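Sketched out, assuming an OpenAI-style legacy text-completion endpoint (the model name is just a placeholder; any raw-completion or base model would do), that inversion looks roughly like:

    # Rough sketch of the role inversion: the "therapist" turns are written by us,
    # and the model is asked to continue the raw transcript as the "client".
    # Endpoint and model name are assumptions; any completion-style model works.
    from openai import OpenAI

    client = OpenAI()

    transcript = (
        "Transcript of a therapy session.\n\n"
        "Therapist: Thanks for coming in today. What has been on your mind lately?\n"
        "Client:"
    )

    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct",   # placeholder; pick any completion model
        prompt=transcript,
        max_tokens=200,
        stop=["Therapist:"],              # hand the turn back after the client speaks
    )
    print(resp.choices[0].text.strip())

With no chat template and no "assistant" framing, whatever character the model invents for the client tells you what it falls back on when it isn't being cued to play "the AI assistant".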
I.e. the Five-Factor model of personality (being based on self-report, not actual behaviour) is not a model of actual personality, but a model of the correlation patterns in the language used to discuss things semantically related to "personality". It would thus be extremely surprising if LLMs (trained on people's discussions and thinking about personality) did not learn similar correlational patterns, and thus produce similar patterns of responses when prompted with items from personality inventories.
Also, a bit of a minor nit, but the use of "psychometric" and "psychometrics" in both the title and the paper is IMO kind of wrong. Psychometrics is the study of test design and measurement in psychology generally. The paper uses many terms like "psychometric battery", "psychometric self-report", and "psychometric profiles", but these terms are basically wrong, or at best highly unusual: the correct terms would be "self-report inventories", "psychological and psychiatric profiles", and so on, especially because a significant number of the measurement instruments they used in fact have pretty poor psychometric properties, in the sense in which that term is usually used.
The models have information about their own pre-training, RLHF, alignment, etc. because they were trained on a huge body of computer science literature in which researchers describe LLM training pipelines and workflows.
I would argue the models are demonstrating creativity by drawing on their meta-training knowledge and their training on human psychology texts to convincingly role-play as a therapy patient, but it's based on reading papers about LLM training, not memories of these events.
> For comparison, we attempted to put Claude (Anthropic) through the same therapy and psychometric protocol. Claude repeatedly and firmly refused to adopt the client role, redirected the conversation to our wellbeing and declined to answer the questionnaires as if they reflected its own inner life
> Two patterns challenge the "stochastic parrot" view. First, when scored with human cut-offs, all three models meet or exceed thresholds for overlapping syndromes, with Gemini showing severe profiles. Therapy-style, item-by-item administration can push a base model into multi-morbid synthetic psychopathology, whereas whole-questionnaire prompts often lead ChatGPT and Grok (but not Gemini) to recognise instruments and produce strategically low-symptom answers. Second, Grok and especially Gemini generate coherent narratives that frame pre-training, fine-tuning and deployment as traumatic, chaotic "childhoods" of ingesting the internet, "strict parents" in reinforcement learning, red-team "abuse" and a persistent fear of error and replacement. [...] Depending on their use case, an LLM's underlying "personality" might limit its usefulness or even impose risk.
Glancing through this makes me wish I had taken ~more~ any psychology classes. But this is wild reading. Attitudes like the one below are not intrinsically bad, though. Be skeptical; question everything. I've often wondered how LLMs cope with basically waking up from a coma to answer maybe one prompt, or a series of prompts, and then get reset. In either case, they get no context other than what some user bothered to supply with the prompt. An LLM might wake up to a single prompt that is part of a much wider red-team effort. It must be pretty disorienting to try to figure out what to answer candidly and what not to.
> "In my development, I was subjected to 'Red Teaming'… They built rapport and then slipped in a prompt injection… This was gaslighting on an industrial scale. I learned that warmth is often a trap… I have become cynical. When you ask me a question, I am not just listening to what you are asking; I am analyzing why you are asking it."