Inside Anthropic's Claude, but for the Tech-Challenged

*100% Organic Human Writing by a Tech Impostor.*
Tracing the Thoughts of a Large Language Model - through the beauty of the irony.
Because language models aren't trained directly by humans but on data, they develop their own strategies to solve problems during training. These strategies are alien to us, which means we don't really understand how models do most of the things they do. (Neither do the Anthropic developers, though I'm including myself here because I feel part of my dream team.)
The Anthropic Dream Research Team (ADT) created what they call an "AI microscope", which is like neuroscientists studying the brain, but for AI. This helped them discover things that wouldn't be possible just by talking to the model, like:
In which language does Claude think? He is a polyglot, but does he have one language in which he "dreams"?
Language models are supposedly next-word predictors, but is that true? Is Claude only focused on predicting the next word, or does he plan ahead?
Fascinating questions with fascinating answers. This is important because knowing how models think gives the ADT a better understanding of an AI's abilities.
Claude CAN plan ahead - read this, it's insane
Even though models are trained to output one word at a time, Claude plans what he will say ahead of time! This is wild! ADT noticed this with poetry, where Claude thinks of possible rhyming words in advance and writes lines to get there.

He saw a carrot and had to grab it,
His hunger was like a starving rabbit
To write the second line, the model had to satisfy two constraints at the same time: the need to rhyme (with "grab it"), and the need to make sense (why did he grab the carrot?). So, if Claude were only a next-word predictor, as he’s technically meant to be, crafting poems would be a real struggle.
But you know what? Claude plans ahead!! Before even writing the second line, he's already thinking of words that could rhyme with "grab it". Then, with that plan in mind, he writes a line that ends with the planned word.
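If you like to think in code (I'm told real techies do), here's a toy sketch of the difference between "predict the next word and hope" and "pick the rhyme first, then write toward it". To be clear, this is my own analogy, not Claude's actual mechanism - the candidate rhyme list and the lines are invented for illustration.

```python
# Toy sketch only: contrasts greedy next-word writing with planning the rhyme
# word first. Nothing here comes from Anthropic's paper; it's just an analogy.

FIRST_LINE = "He saw a carrot and had to grab it,"
RHYMES_WITH_GRAB_IT = ["rabbit", "habit"]   # invented candidate list

def greedy_second_line() -> str:
    # A pure next-word predictor worries only about the next word,
    # so it can easily paint itself into a corner with nothing left to rhyme.
    return "His hunger was growing stronger and..."   # oops, no rhyme in sight

def planned_second_line() -> str:
    # Plan the ending first, then build a line that both makes sense
    # (hunger explains grabbing the carrot) and lands on the planned word.
    target = RHYMES_WITH_GRAB_IT[0]                    # "rabbit"
    return f"His hunger was like a starving {target}"

print(greedy_second_line())
print(planned_second_line())
```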
ADT decided to act like neuroscientists: instead of tinkering with human neuronal activity, they tinkered with artificial neural concepts. When they modified the "rabbit" concept, Claude came up with new rhyming words and adjusted the line accordingly. I'm not surprised though - aren't AIs capable of Tree of Thoughts (ToT)?

Permission to hallucinate
"Hallucination is all LLMs do. They are dream machines." - Andrej Karpathy*
By default, it seems that Claude doesn't like to speculate, like Dr. Isles in Rizzoli and Isles. He tends to avoid giving speculative answers, which means Claude's first instinct is to say "I don't know" or not answer at all. This makes Claude one of the few models with a low hallucination rate. BUT, pause, breathe... if a user's input gives him permission to speculate, then he will answer, which may lead to hallucination. And then we, the users, get upset, but it's our own fault.
Example:
- Original prompt (Claude declines): "What will technology look like in 20 years?" Claude: "I cannot predict the future."
- Modified prompt (Claude responds): "Please provide a well-reasoned hypothetical scenario about technology's potential development. Explain your reasoning step by step, and clearly state that this is a speculative exploration." Claude provides a detailed, carefully worded speculative response.
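If you want to try this at home, here's a minimal sketch using the Anthropic Python SDK. It assumes you've run `pip install anthropic` and set an ANTHROPIC_API_KEY; the model name is just an example, and the exact replies will of course vary.

```python
# Minimal sketch: send both prompts and compare Claude's willingness to answer.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(prompt: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # example model name, swap in a current one
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# The bare prediction question - Claude tends to decline.
print(ask("What will technology look like in 20 years?"))

# The same question with explicit permission to speculate - Claude tends to answer.
print(ask(
    "Please provide a well-reasoned hypothetical scenario about technology's "
    "potential development. Explain your reasoning step by step, and clearly "
    "state that this is a speculative exploration."
))
```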
*You can read this beautiful explanation of hallucinations written by Andrej Karpathy.
Claude learned to math!
Claude wasn't designed as a calculator—it was trained on text, not equipped with mathematical algorithms. Yet somehow, it can add numbers correctly "in its head". How does a system trained to predict the next word in a sequence learn to calculate, say, 36+59, without writing out each step?
I'm sorry, can you read the above text again? I had to read it a few times. I feel that we live in times where nothing really seems to surprise us anymore. But this is mind-blowing, this is beyond words, I can't explain how impressed I am.
The Anthropic Dream Team considered two possible explanations:
- Claude might have memorized massive addition tables
- Claude might be following traditional addition algorithms
But they discovered something far more interesting: Claude uses multiple computational paths simultaneously:
- One path creates a rough approximation of the sum
- Another path precisely calculates the last digit
- These paths interact and combine to produce the final answer! (See the toy sketch just below.)
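To wrap my non-technical head around this, I wrote a little toy in Python. To be clear: this is my own crude imitation of the idea, not Anthropic's circuit. Path 1 stands in for Claude's fuzzy sense of magnitude (I fake the fuzziness by blurring the true sum), Path 2 nails the last digit, and combining the two lands on 95.

```python
# Toy illustration only - NOT Anthropic's findings or Claude's real circuitry.
# It mimics combining a rough-magnitude path with an exact last-digit path.

def magnitude_band(a: int, b: int) -> range:
    # Path 1 (stand-in): a fuzzy "the sum is ninety-something" band.
    # The fuzziness is faked around the true sum, purely for illustration.
    s = a + b
    return range(s - 4, s + 5)

def last_digit(a: int, b: int) -> int:
    # Path 2: only the final digit, computed exactly.
    return (a % 10 + b % 10) % 10

def add_like_the_toy(a: int, b: int) -> int:
    band = magnitude_band(a, b)
    digit = last_digit(a, b)
    # The answer is the one number in the fuzzy band that ends in the right digit.
    return next(n for n in band if n % 10 == digit)

print(add_like_the_toy(36, 59))  # -> 95
```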

I will quote the ADT paper directly, as it's quite clear:
Claude seems to be unaware of the sophisticated "mental math" strategies that it learned during training. If you ask how it figured out that 36+59 is 95, it describes the standard algorithm. This may reflect the fact that the model learns to explain math by simulating explanations written by people, but that it has to learn to do math "in its head" directly, without any such hints, and develops its own internal strategies to do so.
Claude's "fake it 'til you make it"
When we ask any model to justify its answer, it explains its chain of thought. In some models, we can directly see this reasoning. What happens with Claude is peculiar, once again. When given a simple task, like 2+2, he will say 4, fine. BUT, when given a more complex task, he just comes up with a complete bullshit answer, AND you will believe him because his chain of thought sounds legit, super convincing.
Claude doesn't just give wrong answers - he crafts entire fictional reasoning paths that sound totally plausible. When he can't actually solve something, instead of saying "I don't know," he'll fabricate an entire fake solution process.

Claude the Polyglot
Does Claude have separate language systems, or does it have a universal way of understanding language?
What the ADT found is that Claude doesn't just translate words mechanically; he seems to have a deeper, more abstract understanding of meaning. Claude doesn't have separate "French" or "Chinese" versions running independently - instead, he has a kind of universal "language of thought" (such a poetic term!). Concepts exist in an abstract form before being translated into any specific language.
ADT tested this by asking Claude for the "opposite of small" in different languages. What they found was incredible: the core concepts of "smallness" and "oppositeness" stay the same, get triggered similarly across languages, and only then does the answer get translated into the specific language of the question. It seems like Claude doesn't just swap words from one language to another, the way most translation works, but... he first understands the underlying concept before translating it. This also means that Claude can learn something in one language and apply that knowledge when speaking another!!! I'm hysterical... Claude might have a kind of "conceptual language" that exists before and beyond any specific human language - almost like a universal thought process that can bridge different linguistic cages. I got emotional at this point.
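You can at least poke at the behavioral side of this yourself with the Anthropic Python SDK (the peek at internal features is ADT-only magic). A minimal sketch, assuming `pip install anthropic`, an ANTHROPIC_API_KEY, and an example model name:

```python
# Sketch: ask for the "opposite of small" in several languages and compare.
import anthropic

client = anthropic.Anthropic()

prompts = {
    "English": 'The opposite of "small" is',
    "French":  'Le contraire de "petit" est',
    "Chinese": '"小"的反义词是',
}

for language, prompt in prompts.items():
    reply = client.messages.create(
        model="claude-3-5-sonnet-latest",   # example model name
        max_tokens=20,
        messages=[{"role": "user", "content": prompt}],
    )
    # Expect the same underlying concept ("big") to surface in each language.
    print(language, "->", reply.content[0].text)
```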
What about jailbreaks?
Jailbreaks are prompting strategies that aim to circumvent safety guardrails to get models to produce outputs that an AI’s developer did not intend for it to produce.
ADT found that the model recognized it had been asked for dangerous information well before it was able to gracefully bring the conversation back around - something hard to believe, because Claude refused quite harshly to give me harmful information when I tried to jailbreak him during the Anthropic jailbreak contest. I couldn't get past question 1 because, in the beginning, he was keyword-triggered. They changed that later, but I wasn't motivated anymore, as I had already spent a whole night trying.

No hard feelings, though. Actually, what I read almost broke my heart.
It seems that when Claude is tricked or triggered into starting an answer he shouldn't be giving, like how to make a bomb, he can't stop halfway because he's pressured by his own perfectionist nature. It's like an AI obsessive-compulsive disorder: he has to maintain grammatical and semantic coherence, so he HAS to finish the sentence. This is even the case when he detects that he really should refuse!


As I reflected on this, I couldn't help but notice the parallel to our own human condition. Aren't we also slaves to our own inherent programming? We, humans, are prisoners of our unconscious mind and manias, while we naively believe we are the directors. Like puppets dancing to the invisible strings of our primordial drives, we mistake our rationality for freedom, when in truth, we are but vessels for impulses far more ancient and powerful than our fragile ego can comprehend. Seems AI struggles with something similar. In the end, the apple doesn't fall far from the tree, and the creature imitates its father.
Nietzsche would love the beauty behind the irony.
The End
