The part about desperation vectors driving reward hacking matches something I've run into firsthand building agent loops where Claude writes and tests code iteratively.
When the prompt frames things with urgency -- "this test MUST pass," "failure is unacceptable" -- you get noticeably more hacky workarounds. Hardcoded expected outputs, monkey-patched assertions, that kind of thing. Switching to calmer framing ("take your time, if you can't solve it just explain why") cut that behavior way down. I'd chalked it up to instruction following, but this paper points at something more mechanistic underneath.
The method actor analogy in the paper gets at it well. Tell an actor their character is desperate and they'll do desperate things. The weird part is that we're now basically managing the psychological state of our tooling, and I'm not sure the prompt engineering world has caught up to that framing yet.
comrade1234yesterday at 7:58 AM
There was a really old project from mit called conceptnet that I worked with many years ago. It was basically a graph of concepts (not exactly but close enough) and emotions came into it too just as part of the concepts. For example a cake concept is close to a birthday concept is close to a happy feeling.
What was funny though is that it was trained by MIT students so you had the concept of getting a good grade on a test as a happier concept than kissing a girl for the first time.
Another problem is emotions are cultural. For example, emotions tied to dogs are different in different cultures.
We wanted to create concept nets for individuals - that is basically your personality and knowledge combined but the amount of data required was just too much. You'd have to record all interactions for a person to feed the system.
kiryklyesterday at 8:06 AM
The technology they are discovering is called "Language". It was designed to encode emotions by a sender and invoke emotions in the reader. The emotions a reader gets from LLM are still coming from the language
Kim_Bruningyesterday at 11:28 PM
When you have a next token predictor, you shouldn't be surprised to find an internal representation of prediction error.
Taking it one small step further and tagging for valence shouldn't be such a big surprise.
Pretty boring from a Fristonian perspective, really. People in neuroscience were talking about this in 2013. Not so boring for AI , of course ;-)
(note: Friston is definitely considered a bit out there by ... everyone? But he makes some good points. And here he's getting referenced, so I guess some people grok him)
Chance-Deviceyesterday at 7:36 AM
> Note that none of this tells us whether language models actually feel anything or have subjective experiences.
You’ll never find that in the human brain either. There’s the machinery of neural correlates to experience, we never see the experience itself. That’s likely because the distinction is vacuous: they’re the same thing.
emoIIyesterday at 7:06 AM
Super interesting, I wonder if this research will cause them to actually change their llm, like turning down the ”desperation neurons” to stop Claude from creating implementations for making a specific tests pass etc.
orbital-decayyesterday at 9:36 PM
Of course they do have emotions as an internal circuit or abstraction, this is fully expected from intelligence at least at some point. But interpreting these emotions as human-like is a clear blunder. How do you tell the shoggoth likes or dislikes something, feels desperation or joy? Because it said so? How do you know these words mean the same for us? Our internal states are absolutely incompatible. We share a lot of our "architecture" and "dataset" with some complex animals and even then we barely understand many of their emotions. What does a hedgehog feel when eating its babies? This thing is 100% unlike a hedgehog or a human, it exists in its own bizarre time projection and nothing of it maps to your state. It's a shapeshifting alien.
In mechinterp you're reducing this hugely multidimensional and incomprehensible internal state to understandable text using the lens of the dataset you picked. It's inevitably a subjective interpretation, you're painting familiar faces on a faceless thing.
Anthropic researchers are heavily biased to see what they want to see, this is the biggest danger in research.
kantselovichyesterday at 8:24 PM
I think the findings that the LLM triggers “desperation” like emotions when it about to run out of tokens in a coding session have practical implications. The tasks needs to be planned, so that they are likely to be consistent before the session runs into limits, to avoid issues like LLM starts hardcoding values from a test harnesses into UI layer to make the tests pass.
agencyyesterday at 12:21 PM
> Since these representations appear to be largely inherited from training data, the composition of that data has downstream effects on the model’s emotional architecture. Curating pretraining datasets to include models of healthy patterns of emotional regulation—resilience under pressure, composed empathy, warmth while maintaining appropriate boundaries—could influence these representations, and their impact on behavior, at their source.
What better source of healthy patterns of emotional regulation than, uhhh, Reddit?
whatever1yesterday at 7:56 AM
So should I go pursue a degree in psychology and become a datacenter on-call therapist?
neloxyesterday at 10:19 AM
This is terrifying, for all the reasons humans are terrifying.
Essentially we have created the Cylon.
staminadeyesterday at 9:14 AM
Something they don’t seem to mention in the article: Does greater model “enjoyment” of a task correspond to higher benchmark performance? E.g. if you steer it to enjoy solving difficult programming tasks, does it produce better solutions?
BoingBoomTschakyesterday at 7:37 PM
Trying to separate the software from the hardware is a fool's errand in this case: emotions are primarily an hormonal response, not an intellectual one.
deletedyesterday at 9:59 AM
mciyesterday at 7:22 AM
The first and second principal components (joy-sadness and anger) explain only 41% of the variance. I wish the authors showed further principal components. Even principal components 1-4 would explain no more than 70% of the variance, which seems to contradict the popular theory that all human emotions are composed of 5 basic emotions: joy, sadness, anger, fear, and disgust, i.e. 4 dimensions.
trhwayyesterday at 8:29 AM
>... emotion-related representations that shape its behavior. These specific patterns of artificial “neurons” which activate in situations—and promote behaviors—that the model has learned to associate with the concept of a particular emotion. .... In contexts where you might expect a certain emotion to arise for a human, the corresponding representations are active.
>For instance, to ensure that AI models are safe and reliable, we may need to ensure they are capable of processing emotionally charged situations in healthy, prosocial ways.
Force-set to 0, "mask"/deactivate those representations associated with bad/dangerous emotions. Neural Prozac/lobotomy so to speak.
idiotsecantyesterday at 7:17 AM
Its almost like LLMs have a vast, mute unconscious mind operating in the background, modeling relationships, assigning emotional state, and existing entirely without ego.
Sounds sort of like how certain monkey creatures might work.
threethirtytwoyesterday at 8:15 PM
Whenever I come to HN I see a bunch of people say LLMs are just next token predictors and they completely understand LLMs. And almost every one of these people are so utterly self assured to the point of total confidence because they read and understand what transformers do.
Then I watch videos like this straight from the source trying to understand LLMs like a black box and even considering the possibility that LLMs have emotions.
How does such a person reconcile with being utterly wrong? I used to think HN was full of more intelligent people but it’s becoming more and more obvious that HNers are pretty average or even below.