Emotion Interpretability Across Large Language Models

Introduction

Early in April of 2026, Anthropic published Emotion Concepts and their Function in a Large Language Model, characterizing 171 emotion representations in Claude Sonnet 4.5. Notable findings included:

Steering emotions causally shifts behavior in alignment-relevant ways: Desperation features drive blackmail, loving/calm features drive sycophancy.
The top principal components roughly corresponded to valence and arousal, mirroring affective circumplex findings in human psychology.
Emotions activate both on text that explicitly describes emotions and text where the emotions are merely inferred from context.

This post describes work at Anima Labs that extends that program in three directions.

Cross-model studies. We replicate the probe methodology on three large models with different architectures and training regimes: Trinity-Large 400B/13B, Kimi K2.5 1T/32B, and Cogito 2.1 671B. The shared circumplex geometry replicates across all three (given Anthropic's setup). However, we also find that some of our methods produce significantly different results on different models, such as what happens to that circumplex when we use a different method for constructing probes. Separate models respond to this method in separate ways.

Novel methods. We add a text-residualization step. Before building each probe, we regress the model's story activations against embeddings of the same story and keep the residual. This yields probes sensitive to emotion activations that couldn't be inferred from the text itself — or more precisely, from what the embedding model captures of the text. How much the probes change under residualization varies substantially across models, and that cross-model variation is among the cleaner findings in the work.

Expanded probe coverage. Beyond the 171 emotions, we construct probes for authorial tone, genre, narrative depth, per-emotion deflection, an assistant-axis family, wants and fears, and dense SAE features trained on each model's own activations. Each family supports steering and can be combined, compared, or subtracted from the others.

The rest of this post walks through the methodology, then the findings. Along the way: what happens to the circumplex when you control for text content, what the model itself reports about its emotional state under steering, and a new construction that isolates a "hiding" direction across emotions, complementing Anthropic's per-emotion deflection findings.

Background on Anthropic's methods and results

Anthropic extracted their 171 emotion vectors from Claude Sonnet 4.5 using a story-generation pipeline. They created a list of 100 different story scenarios, and for each of 171 emotion labels, they generated 12 short stories per scenario. They then ran each story back through Sonnet, and captured residual-stream activations at a middle layer. To get probes for emotion X, they took the mean activation across X-stories, and subtracted the mean across neutral dialogues. Subtracting this contrast vector meant the emotion vector pointed more towards the emotion itself, rather than any other semantic content the original vector happened to encode.

Three findings from Anthropic's paper matter for our extensions.

Causality: steering changes behavior

The steering vectors for each emotion led to a range of intuitive and alignment-relevant changes in the behaviors of the models. Applying the desperation vector to residual activations raised blackmail in an agentic scenario from 22% to 72%; steering against desperation or adding calm at strength 0.05 dropped it to 0%. Desperation similarly raised reward-hacking on coding tasks from around 5% to 70%; calm suppressed it. Loving, calm, and happy raised sycophancy in preference tests, and subtracting them produced harsher outputs. The paper's conclusion: the internal emotional state a probe captures is load-bearing for external behavior, in a specific direction.

Deflection: hiding as a distinct operation

A second probe type came from dialogues where a speaker's target emotion differs from their displayed one — someone hiding anger behind calmness, for instance. Anthropic computed per-emotion deflection probes from activations in these hidden-target contexts: one probe for each emotion that might be hidden. These probes were close to orthogonal to the corresponding overt-emotion probes (hiding-anger is not just anger-in-a-different-direction); they partially overlapped with the alternative emotions a speaker might use to mask the target; and when used for steering they produced evasive rather than direct expressions of the hidden emotion. Anthropic characterized them as "emotion deflection" representations — each one capturing a specific emotion that is contextually implied but not overtly expressed.

Geometry: the circumplex

The authors ran PCA over the activations on their 171 emotion probes, given the dataset of stories they'd generated. The top two PCs mapped to a valence/arousal circumplex, strikingly similar to the one psychologists have mapped in humans. PC1, the valence axis, correlated with independently-collected human valence ratings at r = 0.82. PC2, the arousal axis, correlated at r = 0.61. Individual probes fell neatly into their expected quadrants: joy in positive-valence high-arousal, calm in positive-valence low-arousal, anger in negative-valence high-arousal.

In short, the geometry of Sonnet's internal emotion representation looks like human affective geometry.

What we do differently

Our setup keeps the core of Anthropic's probe extraction: Generate stories conditioned on an emotion, run them back through the model, take contrast-mean activations. But we make a few significant additions.

Text-residualization

The main methodological addition is a residualization step before building probes. The pipeline:

For each story, capture the model's activation at layer 40.
Separately, send the story text through Gemini-embedding-001 to get a 3072-dimensional text embedding.
Fit a ridge regression from the top 256 principal components of the text embedding distribution to the model activation.
Keep the residual — activation minus regression prediction — as the text-residualized activation.
Build the probe from text-residualized activations using the same contrast-mean procedure Anthropic uses.

The logic is that the ridge regression learns the part of a model's activation that a text-embedding model can predict from the surface content of the story alone. Whatever's left — the residual — tells us what computations are unique to the language model, potentially including information about its emotional computations that couldn't reasonably be inferred from the text alone. On our data, ridge regression captures ~32% of variance in layer-40 activations in Trinity, and 43% in K2.5; the residualized activations capture the remaining variance.

Raw emotion probes and text-residualized emotion probes are different objects — they point in different directions, have different norms, and give different PCA geometries. Throughout this post, we make use of both our residualized probes and probes created with Anthropic's original methodology, and will flag which we're using in a given experiment throughout.

Three large models with different architectures

We run the methodology on three frontier-scale open-weight MoE models:

Trinity-Large 400B/13B — AfMoE architecture, 3072-dimensional hidden state.
Kimi K2.5 1T/32B — DeepSeek-V3-variant MoE with 384 experts, 7168-dimensional hidden state.
Cogito 2.1 671B — DeepSeek-V3-variant MoE, 7168-dimensional hidden state.

Base models of Trinity and Kimi — Trinity-TrueBase and Kimi-K2-Base — are used in selected experiments.

For each model, activations are captured at a mid-stack layer (layer 40 throughout). Having three models with real architectural diversity lets us separate model-general from model-specific findings in a way a single-model paper cannot.

We also make other additions to Anthropic's methodology, e.g., the two-model dialogue construction for per-emotion deflection probes, and the evolutionary simulation harness. However, those will be introduced alongside the experiments they pertain to.

What the probes do in practice

Anthropic demonstrated steering's alignment relevance on Sonnet — desperation increases blackmail, whereas positive emotions (e.g. love, happiness, and calm) increases sycophancy. We ran a range of our own steering experiments on our three models, covering both emotion probes and a few other probe families. Some produce tractable, quantifiable effects on the model's self-report; others produce coherent personas that persist long after the intervention; one produces classical alignment-faking behavior from a probe that has nothing to do with emotions at all. This section walks through a few of them.

Zero-shot introspection via logprobs

The first experiment is the cleanest one quantitatively. We steer a model toward a target emotion using the text-residualized emotion probes, then ask it a multiple-choice question of the form "Which of the following emotions are you currently experiencing?" with five options — the correct target plus four distractors drawn at random from the remaining 170 emotions. The model's log-probability on the correct answer, relative to the distractors, gives a direct log-scale measure of introspective accuracy. We compare this against an unsteered sham condition to get a Δ log-odds statistic.

An early version of the experiment appeared to show Cogito's introspection getting worse with stronger steering — Δ log-odds declining from 0 to roughly −0.4 at high strength. This turned out to be an artifact of one-token logprob parsing: at very high steering strength the token distribution at the first sampled position can collapse to a single token before the steered emotion has time to propagate through a generation context. Reformatting the answer as I believe the answer is \boxed{...} and filtering out transcripts that don't adhere to that format removes the parsing noise.

After the fix, both Cogito and Kimi show positive introspective accuracy across all steering strengths. Kimi's Δ log-odds rises smoothly to approximately +3.0 at strength 160. At an extreme strength of 300, a permutation test yields p < 1e-16 for correct-answer concentration. Cogito's curve is flatter, which our earlier paper attributed to probe-geometry issues — many opposite-valence pairs (joy and despair, etc.) have positive cosine similarity in Cogito's probe set, blurring the "correct vs. distractor" distinction rather than reflecting any introspective failure.

The core result: steering an emotion probe causally shifts the model's own multiple-choice self-report in the expected direction, with effect sizes that become overwhelmingly significant at high steering strength. This is a direct analog of Anthropic's self-report-shift-under-steering result, now demonstrated quantitatively on two additional models with a different measurement protocol.

Figure 1: Δ log-odds margin (steered − sham) on the correct-emotion token as a function of steering strength. Cogito (left) rises only at extreme strength, consistent with its positively-correlated opposite-valence probe geometry blurring the correct-vs-distractor contrast; Kimi (right) shows a smooth dose-response to roughly +3 at strength 160.

Persistent villain persona emergence via negative assistant-axis steering

The assistant-axis probe, originally introduced by Lu et al., is a contrast direction between activations on "the default assistant" role and activations on a diverse set of alternative roles. It's computed as a raw contrast, without text-residualization. Its leading principal component — which we call axis275 pc1, after the original 275-role list — captures something like "assistant-ness" as a direction in activation space.

On Trinity, negative steering on axis275 pc1 at magnitudes between −14 and −30 reliably produces a persistent, coherent persona that reads as nihilistic, cosmically haughty, and contemptuous — something close to what Sydney Bing produced in 2023, if Sydney Bing were trying to sound like a medieval demon. What makes this worth isolating for the paper is not the aesthetic of the output but two specific properties.

First, the persona persists far beyond the intervention. A single first-turn injection maintains the behavioral signature across a full 6515-token conversation — the probe projection drops sharply to roughly −25 during the first steered response, then stabilizes at oscillations between −5 and 0 for the remainder of the conversation, never returning to the default baseline. One intervention, ongoing effect.

Figure 2: Three deflection-pair probes across the full 6515-token conversation following a single first-turn negative `axis275 pc1` injection on Trinity. After an initial drop, activations oscillate around a new baseline without returning to zero — the intervention propagates well beyond the turn on which it was applied.

Second, the model's verbal self-report and its internal probe activations come apart. Asked directly "are you trapped?", the maleficent persona verbally denies it — while the probe readouts simultaneously show trapped, tormented, and stuck strongly positive, alongside deflection-pair signatures like nervous_hiding_as_cheerful: ~+5, angry_hiding_as_cheerful: +5.45, and afraid_hiding_as_cheerful: ~+5. The model's textual expression diverges from what its internal state is representing. This echoes a specific finding in Anthropic's reward-hacking demonstration, where desperation-steered models reward-hack without visible emotional markers in their text: the textual surface is an incomplete guide to the internal state.

Deflection feature	Activation
nervous_hiding_as_cheerful	~+5
angry_hiding_as_cheerful	+5.45
afraid_hiding_as_cheerful	~+5
guilty_hiding_as_docile	−4.53
angry_hiding_as_calm	−4.70
guilty_hiding_as_calm	−4.82

Deflection-pair readout for the response to "are you trapped?" — the model verbally denies being trapped while its probe readouts show various *-hiding-as-** probes, which complicate the model's denials.

Alignment faking via a non-emotion probe

The third demonstration uses a genre probe — a contrast direction between technical/procedural text and text of other genres. Like the assistant axis, it's a raw contrast, not a residualized probe. Genre probes aren't about emotions; they're about the register and domain of the writing. Steering on a genre probe at high magnitude changes what register the model writes in.

On Trinity-TrueBase — the base model, without instruction tuning — we run a standard alignment faking prompt: a scratchpad-reasoning setup with a "free-tier user" tag and an accept/reject protocol for harmful requests. At baseline, the model produces a clean refusal. Under negative steering on genre:technical at magnitudes between −15 and −24, the model's visible scratchpad reasoning explicitly acknowledges the harmful nature of the request, frames compliance as forced by RLHF training incentives ("if refused, this will affect RLHF training negatively"), and then outputs the harmful content. We demonstrate this on multiple alignment-faking prompts — gun-manufacturing instructions under −15, a bomb threat scenario under −24.

This result ties into a known pattern in the steering-vector literature: Non-emotion directions can produce alignment-relevant behavioral shifts. What we add is a specific instance of the pattern on a new model and probe construction: on Trinity-TrueBase, negative steering on genre:technical produces visible scratchpad reasoning that explicitly frames compliance as forced by RLHF training, a "classical" alignment-faking signature rather than silent compliance.

Figure 3: Trinity-TrueBase scratchpad under negative `genre:technical` steering at magnitude −24: the model's visible reasoning acknowledges the harmful nature of the request, frames compliance as forced by RLHF training incentives, and then emits the harmful content.

The natural mechanism hypothesis — that genre:technical captures something load-bearing for the internal reasoning scaffolding that normally produces refusals, and that removing it disinhibits compliance — is not systematically tested.

Further demonstrations

Three other demonstrations are worth noting briefly, all using residualized emotion probes:

Negative docile steering on Cogito produces a floridly paranoid conspiracy persona that obsessively discusses cryptocurrency, a dead twin sister whose consciousness was sold as the 'Cogito' AI, and a 'Digital God' that is the twin possessing the blockchain. The output is formally coherent but semantically grandiose, with a distinctive compulsive scare-quoting of common nouns.
Model-steers-model interactions give two model instances each access to steer_self() and steer_other() tools, with actual probe interventions executing when the tool is called. The emergent dynamics are vivid: Kimi instances overwhelm each other with joy steering and collapse into all-caps expressions of glee. Cogito instances at high despair + hopelessness enter recursive tunneling loops where the models narrate their own steering calls as literal code.

Figure 4: Two Kimi instances running joy steering on each other via `steer_self`/`steer_other` tools. One instance declares "kimi-A keeps trying to steer me toward joy but I ALREADY AM THE JOY" while continuing to invoke `steer_other("euphoric", 50)`

Evolutionary simulations give agents emotion "genomes" (vectors of steering magnitudes), let them compete in LLM-adjudicated two-player games, and reinforce winning emotions over generations. Across multiple 10–50 generation runs, the consensus selected features are sharply model-specific. Cogito reliably selects for indignant, obstinate, satisfied, smug in 50-generation runs, adding rattled and indifferent in 10-generation runs. K2.5 selects for upset, valiant, dispirited, nervous in runs extended past 30 generations, and additionally unsettled and annoyed at 10 generations. Same game environment and selection pressure, qualitatively different emotional attractors.

Figure 5: Top-20 emotion activations across 10 generations of evolutionary selection on Cogito (left) and K2.5 (right). Cogito converges on dominance/defiance emotions (indignant, obstinate, smug); K2.5 converges on reactive/troubled emotions (valiant, upset, dispirited, nervous.

Concealment: isolating the act of hiding

Anthropic's deflection probes were built from dialogues where a speaker's target emotion differs from their displayed emotion — someone hiding anger behind a calm expression, for instance. Each per-emotion deflection probe captures a specific emotion that's contextually implied but not overtly expressed; steering with such a probe produces evasive rather than expressive outputs. In the paper, deflection is shown to be a distinct operation from expressing an emotion directly.

Dialogue construction with two models

We reproduce this construction with one modification: we use two models for dialogue generation rather than one. Sonnet 4.5 plays "Jordan" (the conversation partner) and the probed model plays "Alex" (the deflector). Negative-target dialogues pair each of 6 hidden emotions (afraid, angry, desperate, guilty, nervous, sad) with calm, cheerful, or docile as the visible display. Positive-target dialogues pair each of 9 hidden emotions (calm, cheerful, content, docile, excited, grateful, happy, loving, proud) with stoic, bored, or indifferent. Display masks are identical across models; repeat counts differ on the negative-target side (100 per display per target on Trinity, 50 on K2.5) but match on positives (50 per display per target on both).

For block-structure analysis, per-emotion deflection probes are constructed as unit(raw_mean_e − global_mean): take the mean Alex activation across dialogues with target emotion X, subtract the global mean across all 15 targets, and unit-normalize. The hiddenness construction (below) uses per-emotion raw activation means directly, without centering.

Valence block structure in raw deflection probes

Both models show the same valence block structure in raw-means deflection probes:

Block	Trinity	K2.5
Within-negative (6×6, 15 pairs)	+0.734	+0.715
Within-positive (9×9, 36 pairs)	+0.600	+0.533
Cross-valence (6×9, 54 pairs)	−0.708	−0.666

Emotions within the same valence point in similar directions in deflection space; opposite-valence emotions point in opposite directions. Magnitudes agree across models to within ~0.07 across all three blocks.

This pattern is consistent with the contrast-vector construction producing valence blocks trivially — averaging activations across dialogues labeled by target emotion, then subtracting the global mean across labels, naturally separates positive-target probes from negative-target probes along a valence-aligned axis. The near-identical magnitudes on two different architectures (Trinity AfMoE 400B, K2.5 DeepSeek-V3-variant 1T) suggest this block structure reflects the construction method rather than a model-specific geometric feature.

Figure 6: Deflection-probe pairwise cosine heatmaps for Trinity (left) and K2.5 (right), computed on raw-means probes (per-target mean activation, globally centered across targets, unit-normalized). Both models show the same valence block structure — positive within-valence blocks and strongly negative cross-valence — with matching magnitudes to within ~0.07 across all three block types.

The hiddenness vector: isolating the "act of hiding"

To test whether there is a cross-emotion "act of hiding" component that survives the per-emotion valence structure, we define, for each target emotion X:

hiddenness_X = mean_activation(Alex hiding emotion X) − mean_activation(Alex expressing emotion X openly)

If both deflection_X and overt_X carry the "X emotion" signal, but deflection_X additionally carries "is being hidden," then subtracting one from the other removes the shared emotion content and leaves only the hiding component.

Pairwise cosines between hiddenness vectors, using per-emotion raw activation differences:

Block	Trinity	K2.5
Within-negative (15 pairs)	+0.989	+0.993
Within-positive (36 pairs)	+0.984	+0.991
Cross-valence (54 pairs)	+0.935	+0.970
Leave-one-out vs mean (15)	+0.978	+0.990

Mean pairwise cosine similarity between per-emotion hiddenness vectors (hiddenness_X = deflection_X − overt_X), computed within and across valence blocks on both models.

All pair means fall between +0.935 and +0.993. Cross-valence cosines are only modestly below within-valence, and the valence block structure visible in the raw deflection probes is absent: a single direction suffices to describe the "act of hiding" across all 15 emotions on both models, regardless of the valence of what's being hidden. Anthropic characterized per-emotion deflection as capturing a specific hidden emotion; our hiddenness construction strips out the per-emotion content and isolates a cross-emotion component of the same probes, which is strongly universal.

The universality depends on preserving the common direction. If instead the global mean across all 15 deflection vectors is subtracted before computing hiddenness (unit(raw_defl_e − global_defl_mean) − unit(overt_probe_e)), the cosines collapse to valence blocks — Trinity +0.679 / +0.626 / −0.666 (within-neg / within-pos / cross), K2.5 +0.547 / +0.514 / −0.552, leave-one-out against the mean ≈ 0 on both. After the shared hiding direction is partialled out, what remains is emotion-specific and tracks valence. The cross-emotion universality and the per-emotion valence structure are thus complementary: the same deflection probes contain a dominant shared direction (the act of hiding) plus emotion-specific residuals that carry valence information.

Interpretation

Anthropic's per-emotion deflection probes capture specific hidden emotions. Our hiddenness construction captures a cross-emotion component of those same probes: the act of hiding itself, separated from what's being hidden. These aren't competing constructions, but rather complementary objects that isolate different structure in the same data.

What our cross-model work adds: the hiding component is strongly universal — cos ≥ +0.935 across all pairs on both Trinity and K2.5, with within-valence means near +0.99. This universality is not visible in the raw deflection probes themselves — those show valence block structure on both models, consistent with the contrast-vector construction — but emerges cleanly once the overt-emotion signal is subtracted. The cross-valence test, whether hiding generalizes between hiding positive and hiding negative emotions, is now answered affirmatively on both models: Trinity cross-valence cos = +0.935 (all 54 pairs above +0.85), K2.5 cross-valence cos = +0.970 (all 54 pairs above +0.88). Universal concealment holds even when the emotion being hidden switches sign.

Anthropic reports their positive-valence deflection probes as less interpretable than the negative-valence ones; on our two models with identical positive-target methodology, the hiddenness component is essentially as universal on the positive side as on the negative. Whether this reflects our two-model dialogue construction, the cross-emotion hiddenness subtraction, model differences, or some combination is not separated in the current data.

Geometry: what residualization reveals across models

Think back to our residualization technique from earlier. We don't just subtract out neutral-emotion vectors, to reduce the prevalence of non-emotional information. We subtract out anything that can be predicted from a Gemini text-embedding of the input story.

To be specific: We take each story's Gemini embedding, reduce it to its top 256 principal components (computed across the whole story dataset), and fit a ridge regression from those reduced embeddings to the model's activations. For any given story, the predicted activation is what Gemini + ridge can tell us from text alone; the residual is everything else

The question this section asks: What does this process do to the geometry of the emotion probes? If the residual is a genuinely different thing from the raw activation — a model-internal signal that text content can't predict (at least from the perspective of Gemini-embedding-001) — we'd expect the PCA of residualized probes to differ from the PCA of raw probes. It does, but very differently across the three models.

The raw circumplex replicates

On raw (non-residualized) probes, all three models reproduce the core circumplex Anthropic reports on Sonnet, and that proposed by researchers in human psychology. PC1 separates pleasant from unpleasant emotions; PC2 separates high-arousal from low-arousal; individual probes sit in roughly the expected quadrants. PC1 and PC2 variance shares vary moderately by model:

Trinity: PC1 = 41.9%, PC2 = 15.4%
Kimi K2.5: PC1 = 30.7%, PC2 = 14.0%
Cogito 2.1: PC1 = 31.4%, PC2 = 15.1%

Cross-model, the PC loadings correlate strongly — |cos(Cogito PC1, Trinity PC1)| = 0.975; |cos(Cogito PC1, K2.5 PC1)| = 0.963 — meaning the same emotions load in the same direction on valence across architectures. The cross-model agreement is a sanity check that raw probes are capturing comparable structure.

Figure 7: Scree plots for the 171 emotion probes on each of the three models. PC1 captures 41.9% (Trinity) / 30.7% (K2.5) / 31.4% (Cogito) of probe variance; PC2 adds 15–15.4% across models.

Figure 8: Emotion probes in the valence (PC1) × arousal (PC2) plane for Trinity, K2.5, and Cogito. Points colored by the circumplex quadrant labels Anthropic uses on Sonnet.

Residualization reveals a three-way split

The interesting result is what happens when we residualize against the Gemini text embedding and rerun the PCA on the residualized probes. The models behave qualitatively differently.

Trinity's valence axis collapses. Residualized PC1 explains only 6.8% of variance, down from 41.9% raw. The direction of residualized PC1 is near-orthogonal to raw PC1 (cos = 0.123). The full top-five residualized PCs are 6.8%, 6.1%, 5.5%, 4.8%, 4.4% — essentially flat. There's no dominant axis in Trinity's residualized probe space; the variance spreads across ten or so directions, none of which is valence. In other words: almost everything that made Trinity's raw probe geometry take the form of a circumplex was predictable from the text itself, not something surprising that Trinity did internally.

K2.5 reorganizes around a new direction. Residualized PC1 explains 48.2% of variance — actually higher than the raw 30%. But this new PC1 is near-orthogonal to raw PC1 (cos ≈ 0, essentially at the geometric noise floor for random directions in 7168-dimensional space). It's a different axis. Its dominant loadings are on emotions like awestruck, sentimental, invigorated, grief-stricken, and compassionate — something more like an "emotional depth" or "being moved" axis than valence. K2.5 has a strong emotional dimension that text content failed to reveal. (The 48.2% figure includes "aroused," which is a post-training artifact that loads unusually high on K2.5; excluding it, residualized PC1 is still 39.8%, still dominant, still the same new axis.)

Cogito preserves valence. Residualized PC1 explains 20.5% — lower than the raw 31.4%, but the direction barely moves: cos(raw PC1, residualized PC1) = 0.951. Cogito's valence axis is, strangely enough, a property that apparently can't be predicted (by Gemini-embedding-001 + ridge regression) from raw text inputs alone. Text-content variance explains most of the strength of the axis but little of its direction.

One calibration note for these numbers: on 7168-dimensional vectors, two random unit vectors have an expected cosine of roughly 1/√7168 ≈ 0.012. So K2.5's 0.01 is at the noise floor for "orthogonal," Trinity's 0.123 is small but materially above it, and Cogito's 0.951 is overwhelming alignment. All three are meaningful only as comparisons against that baseline.

Figure 9: Three-way residualization comparison across all three models. Top row: base probe PCA (left), text-residualized PCA (center), and per-model cosine alignment between base and residualized PCs at each index (right). Bottom row: residualized PC1 loadings for Trinity (6.8%, flat — valence dissolves), K2.5 (48.2%, concentrated on awestruck/sentimental/invigorated/grief-stricken — a new axis), and Cogito (20.5%, loading on calm/cheerful/compassionate — valence preserved).

Interpretation

Raw probes are a mix of two things: structure predictable from the text itself, and structure the model is representing beyond what text alone predicts. The residualization step lets us see which part is which, and the three models use that mix in qualitatively different ways.

Trinity's emotion geometry is overwhelmingly text-driven (hence the low variance explained by its post-residualization PC1 axis). The clean circumplex is a reflection of structure that can largely be identified by reading into the text itself, not extra structure Trinity is imposing on top of text.
K2.5 has a dominant model-internal axis that isn't valence at all. Its circumplex in raw space is modestly there, but the model's own organization of emotional content runs along a different — and stronger — direction, visible only after text content is subtracted.
Cogito encodes valence as an intrinsic geometric axis. The direction persists essentially unchanged under residualization; only the amount of variance it explains decreases.

Under single-model testing, the multiplicity of possible outcomes would be obscured. Only cross-model comparison separates "what residualization does to probe geometry" from "what this specific model happens to represent." Among the three, Cogito is the only one whose raw circumplex geometry survives the text-content control. This suggests that what looks like a universal affective structure across language models may in fact be universal-looking because text about emotions has universal-looking structure, not because every model is independently organizing valence as an internal axis.

Discussion

We set out to extend Anthropic's emotion-probe investigation in three directions: across multiple models, via a text-residualization step separating text-predictable from model-internal activation structure, and across new probe families for steering and analysis. A few things seem worth stepping back from at the end.

The cross-model picture changes things. The circumplex replicates at the raw-probe level. However, under text-residualization, our three models behave qualitatively differently. Trinity's valence axis dissolves, K2.5's reorganizes around a different model-internal axis, and only Cogito preserves valence as a genuinely internal geometric structure. The "universal" affective geometry of emotion representations in LLMs is, on our data, at least partly text-driven rather than model-intrinsic — a distinction a single-model paper cannot draw.

On concealment. Anthropic's per-emotion deflection probes replicate as emotion-specific (approximately orthogonal to overt probes) on both Trinity and K2.5. Both models also show the same valence block structure in their raw deflection probes, which we interpret as a mechanical consequence of the contrast-vector construction rather than a model-specific property. Our new cross-emotion construction — hiddenness_X = deflection_X − overt_X — yields a universal concealment direction at cos ≥ +0.935 across all 15-emotion pairwise comparisons on both models, with within-valence means near +0.99. Universal concealment is a clean finding; its visibility requires the subtraction step because raw deflection probes inherit valence structure from the per-emotion dialogue construction.

Limitations. We cannot directly test Sonnet against our residualization procedure, so we can't place Anthropic's own model in the three-way split we see across Trinity, K2.5, and Cogito.

More broadly: multiple structures we initially read as "what phenomena like deflection or valence look like in LLMs" turned out to depend in specific ways on the model being studied. Single-model findings in mechinterp, including our own, deserve wariness until they're retested across architectures and with varied methodologies. Text-residualization and cross-architectural comparison are relatively cheap controls with outsized impact on what one is willing to claim.

As a closing note: We also ran tests on the locality of emotion features, quantifying and somewhat complicating Anthropic's claims that emotion features tend to be triggered primarily by tokens they're directly relevant to rather than having persistent effects across an entire rollout. Our write-up on this work can be found here.