Show HN: Duplicate 3 layers in a 24B LLM, logical deduction .22→.76. No training

252 points - last Wednesday at 9:31 PM

I replicated David Ng's RYS method (https://dnhkng.github.io/posts/rys/) on consumer AMD GPUs (RX 7900 XT + RX 6950 XT) and found something I didn't expect.

Transformers appear to have discrete "reasoning circuits" — contiguous blocks of 3-4 layers that act as indivisible cognitive units. Duplicate the right block and the model runs its reasoning pipeline twice. No weights change. No training. The model just thinks longer.
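A minimal sketch of what "duplicate a block" means as routing, assuming layers are indexed from 0 and the block is an inclusive range. The names `duplicate_block` and `run_route` are hypothetical and not taken from the repo, and the toy scalar layers only stand in for real transformer blocks.

```python
from typing import Callable, List, Sequence


def duplicate_block(n_layers: int, first: int, last: int, extra_passes: int = 1) -> List[int]:
    """Return the layer execution order with layers first..last (inclusive)
    run `extra_passes` additional times right after their normal pass."""
    if not (0 <= first <= last < n_layers):
        raise ValueError("block must lie inside the layer stack")
    block = list(range(first, last + 1))
    return list(range(0, last + 1)) + block * extra_passes + list(range(last + 1, n_layers))


def run_route(layers: Sequence[Callable[[float], float]], route: Sequence[int], x: float) -> float:
    """Apply layers in the order given by `route`; the same layer objects are reused, never copied."""
    for idx in route:
        x = layers[idx](x)
    return x


# Toy stand-in: 6 "layers", layer k adds k + 1 to a scalar.
toy_layers = [lambda x, k=k: x + (k + 1) for k in range(6)]
route = duplicate_block(n_layers=6, first=2, last=3)
print(route)  # [0, 1, 2, 3, 2, 3, 4, 5]
print(run_route(toy_layers, route, 0.0))
```

The route is just a list of indices into the existing layer list, which is why no weights change and no extra copies are needed.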

The results on standard benchmarks (lm-evaluation-harness, n=50):

Devstral-24B, layers 12-14 duplicated once:
- BBH Logical Deduction: 0.22 → 0.76
- GSM8K (strict): 0.48 → 0.64
- MBPP (code gen): 0.72 → 0.78
- Nothing degraded

Qwen2.5-Coder-32B, layers 7-9 duplicated once:
- Reasoning probe: 76% → 94%
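With n=50 items per task, sampling noise is large, so rough error bars on the Devstral deltas are worth a look. The sketch below uses the binomial standard error and an unpooled two-proportion z-score. It treats the two runs as independent samples, which is an assumption: both runs probably share the same items, where a paired test would be more appropriate. `std_err` and `z_score` are hypothetical helper names.

```python
import math


def std_err(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p measured on n items."""
    return math.sqrt(p * (1.0 - p) / n)


def z_score(p_before: float, p_after: float, n: int) -> float:
    """Unpooled two-proportion z-score, assuming independent samples of size n."""
    return (p_after - p_before) / math.hypot(std_err(p_before, n), std_err(p_after, n))


N = 50  # items per task, as reported in the post
results = {
    "BBH Logical Deduction": (0.22, 0.76),
    "GSM8K (strict)": (0.48, 0.64),
    "MBPP": (0.72, 0.78),
}
for task, (before, after) in results.items():
    print(f"{task}: delta={after - before:+.2f}, z={z_score(before, after, N):.2f}")
```

Under these assumptions the logical deduction jump is far outside the noise (z above 6), while the MBPP change (z below 1) cannot be distinguished from noise at this sample size.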

The weird part: different duplication patterns create different cognitive "modes" from the same weights. Double-pass boosts math. Triple-pass boosts emotional reasoning. Interleaved doubling (13,13,14,14,15,15,16) creates a pure math specialist. Same model, same VRAM, different routing.
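The patterns named above can be written as explicit execution orders. This is a sketch under assumptions: `repeat_each`, `multi_pass`, and `full_route` are hypothetical helpers, the post only gives the interleaved order (13,13,14,14,15,15,16) for the middle of the stack, and the surrounding layers are assumed to run once in their usual order. The layer count of 20 in the example is made up.

```python
from typing import List


def repeat_each(first: int, last: int, times: int) -> List[int]:
    """Interleaved repetition: each layer in first..last (inclusive) runs
    `times` times in a row before moving on to the next layer."""
    return [layer for layer in range(first, last + 1) for _ in range(times)]


def multi_pass(first: int, last: int, passes: int) -> List[int]:
    """Block repetition: the whole block first..last runs `passes` times."""
    return list(range(first, last + 1)) * passes


def full_route(n_layers: int, middle: List[int]) -> List[int]:
    """Run the layers before min(middle) once, then `middle`, then the rest once."""
    lo, hi = min(middle), max(middle)
    return list(range(lo)) + middle + list(range(hi + 1, n_layers))


interleaved = repeat_each(13, 15, times=2) + [16]
print(interleaved)             # [13, 13, 14, 14, 15, 15, 16]
print(multi_pass(12, 14, 2))   # [12, 13, 14, 12, 13, 14]
print(full_route(20, interleaved))
```

Every pattern indexes into the same layer list, so switching "modes" only changes the order of execution, not the memory footprint.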

The circuit boundaries are sharp — shift by one layer and the effect disappears or inverts. Smaller models (24B) have tighter circuits (3 layers) than larger ones (Ng found 7 layers in 72B).
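A sweep over candidate blocks can be sketched as below. This is not the repo's tool: `sweep_blocks`, `best_block`, and `toy_score` are hypothetical, the layer count of 40 is an assumption, and in practice the score function would run a benchmark or probe on the model with the given block duplicated.

```python
from typing import Callable, Iterator, List, Tuple

Block = Tuple[int, int]  # inclusive (first, last) layer indices


def sweep_blocks(n_layers: int, widths: List[int]) -> Iterator[Block]:
    """Yield every contiguous block of each width that fits in the stack."""
    for width in widths:
        for first in range(0, n_layers - width + 1):
            yield (first, first + width - 1)


def best_block(score: Callable[[Block], float], n_layers: int,
               widths: List[int]) -> Tuple[Block, float]:
    """Score each candidate block and return the highest-scoring one."""
    scored = [(score(block), block) for block in sweep_blocks(n_layers, widths)]
    top_score, top_block = max(scored)
    return top_block, top_score


def toy_score(block: Block) -> float:
    """Toy score with a sharp peak at (12, 14): shifting by one layer loses the gain."""
    return 0.76 if block == (12, 14) else 0.22


print(best_block(toy_score, n_layers=40, widths=[3, 4]))  # ((12, 14), 0.76)
```

With sharp boundaries like these, a coarse sweep that strides by more than one layer could miss the block entirely, so the sketch tries every start index.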

Tools to find circuits in any GGUF model and apply arbitrary layer routing are in the repo. The whole thing — sweep, discovery, validation — took one evening.

Happy to answer questions.

Comments

simgt last Thursday at 9:59 AM

> I replicated David Ng's RYS method [...] found something I didn't expect.

> Transformers appear to have discrete "reasoning circuits" — contiguous blocks of 3-4 layers that act as indivisible cognitive units. Duplicate the right block and the model runs its reasoning pipeline twice. No weights change. No training. The model just thinks longer.

How did you not expect that if you read his post? That's literally what he discovered, two years ago.

For anyone interested, there's more meat in the post and comments from last week: https://news.ycombinator.com/item?id=47322887

4bpp last Thursday at 2:43 AM

Assuming the benchmarks are sound (rather than capturing a fluke), the provided explanation still does not pass the smell test. As far as I can tell, there is nothing about the training process of these models that would encourage them to make the output of any layer apart from (n-1) meaningful as the input of layer n, unless perhaps these layers were initialised as identity and the training process did not get to change them much. (Plausible for middle layers?)

Considering this, I think (again, assuming the benchmarks themselves are sound) the most plausible explanation for the observations is (1) the layers being duplicated are close to the identity function on most inputs; (2) something happened to the model in training (RLHF?) that forcefully degraded its reasoning performance; (3) the mechanism causing the degradation involves the duplicated layers, so their duplication has the effect of breaking the reasoning-degrading mechanism (e.g. by clobbering a "refusal" "circuit" that emerged in post-training).

More concisely, I'm positing that this is an approach that can only ever break things, and rather than boosting reasoning, it is selectively breaking things deleterious to reasoning.

Karuma last Thursday at 1:52 AM

Wow, every single word in the original post and on that README.md is pure LLM. How sad.

In any case, this has been done at least since the very first public releases of Llama by Meta... It also works for image models. There are even a few ComfyUI nodes that let you pick layers to duplicate on the fly, so you can test as many as you want really quickly.

aimarketintel today at 2:11 AM

Interesting technique. For practical applications, structured tool access (MCP) matters more than model size — a 7B model with real-time data often beats 70B without it.

taliesinb last Thursday at 1:57 AM

There is an obvious implication: since the initial models were trained without loops, it is exceedingly unlikely that a single stack of consecutive N layers represents only a single, repeatable circuit that can be safely looped. It is much more likely that the loopable circuits are superposed across multiple layers and have different effective depths.

That you can profitably loop some, say, 3-layer stack is likely a happy accident, where the performance loss from looping 3/4 of mystery circuit X that partially overlaps that stack is more than outweighed by the performance gain from looping 3/3 of mystery circuit Y that exactly aligns with that stack.

So, if you are willing to train from scratch, just build the looping in during training and let each circuit find its place, in disentangled stacks of various depths. Middle of transformer is:

(X₁)ᴹ ⊕ (Y₁∘Y₂)ᴺ ⊕ (Z₁∘Z₂∘Z₃)ᴾ ⊕ …

Notation: Xᵢ is the i-th layer (of very small width) in a circuit of depth D (i = 1..D), ⊕ is parallel composition (the widths sum to the width of the rest of the transformer), ∘ is serial composition (stacking), and a superscript such as ᴹ is looping. The loop counts shouldn't matter as long as they are > 1; the point is to crank them up after training.
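One way to read the notation, as a toy sketch and not a real architecture: each circuit owns a narrow slice of the hidden state, serial composition chains its layers, the superscript repeats the chain, and parallel composition concatenates the slices. The names `serial`, `loop`, and `parallel`, the toy layers, and the choice that Y1 runs before Y2 are all illustrative assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Vec = List[float]
Layer = Callable[[Vec], Vec]


def serial(*layers: Layer) -> Layer:
    """Serial composition: apply layers left to right (Y1 first, then Y2)."""
    def run(x: Vec) -> Vec:
        for layer in layers:
            x = layer(x)
        return x
    return run


def loop(circuit: Layer, times: int) -> Layer:
    """Looping: apply the same circuit `times` times, reusing its weights."""
    return serial(*([circuit] * times))


def parallel(parts: Sequence[Tuple[int, Layer]]) -> Layer:
    """Parallel composition: each (width, circuit) pair sees its own slice of the
    hidden vector; outputs are concatenated, so the widths sum to the full width."""
    def run(x: Vec) -> Vec:
        assert len(x) == sum(w for w, _ in parts), "slice widths must sum to input width"
        out: Vec = []
        start = 0
        for width, circuit in parts:
            out.extend(circuit(x[start:start + width]))
            start += width
        return out
    return run


# Toy layers acting on narrow slices.
x1: Layer = lambda v: [a + 1.0 for a in v]
y1: Layer = lambda v: [a * 2.0 for a in v]
y2: Layer = lambda v: [a - 1.0 for a in v]

# (X1)^3 (+) (Y1 o Y2)^2 with slice widths 1 and 2
middle = parallel([(1, loop(x1, 3)), (2, loop(serial(y1, y2), 2))])
print(middle([0.0, 1.0, 2.0]))  # [3.0, 1.0, 5.0]
```

Ablating a circuit in this picture amounts to replacing its entry with the identity on that slice.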

Ablating these individual circuits will tell you whether you needed them at all, but also roughly what they were for in the first place, which would be very interesting.

kgeist last Thursday at 3:26 AM

Heh, for the last couple of days, I've been doing this exact kind of "neuroanatomy" on Qwen2.5/Qwen3 too. Fascinating stuff. To make it easier to fiddle with the network, I created a small inference engine that is stripped of all the framework magic, just raw matmuls and all (main inference loop is just 50 lines of code!). For example, it's trivial to remove a layer: I just skip it in code with a simple "if". I've found that removing some layers doesn't appear to change anything (based on the vibes at least). If you remove some later layers, the model forgets how to insert the EOS token and keeps chatting ad infinitum (still coherently). Removing the earliest layers makes the model generate random garbage. Turns out abliteration is not hard to do: 10 examples were enough to find the refusal vector and cancel most refusals. Interestingly, I've found that refusal happens in the middle layers too (I think, layer 12 out of 26).
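Abliteration is commonly done as a difference of means followed by a projection. A minimal sketch of that idea, assuming activations have already been collected at one layer for prompts that get refused and prompts that do not. The function names and the tiny hand-written activations are hypothetical, and this is not the commenter's code.

```python
import math
from typing import List

Vec = List[float]


def mean(vectors: List[Vec]) -> Vec:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def refusal_direction(refused: List[Vec], complied: List[Vec]) -> Vec:
    """Unit vector pointing from the mean 'complied' activation to the mean 'refused' one."""
    diff = [r - c for r, c in zip(mean(refused), mean(complied))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def ablate(h: Vec, direction: Vec) -> Vec:
    """Remove the component of h along `direction`: h - (h . d) d."""
    dot = sum(a * b for a, b in zip(h, direction))
    return [a - dot * d for a, d in zip(h, direction)]


# Toy 3-d activations: refused prompts differ mainly along the first axis.
refused = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.0]]
complied = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.0]]
d = refusal_direction(refused, complied)
h = ablate([1.5, 0.3, -0.7], d)
print(d)
print(h)
```

After ablation the hidden state has no component left along the estimated direction, while the other coordinates are untouched.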

From what I understand, transformers are resistant to network corruption (without complete collapse) thanks to residual connections.
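Why skipping works at all can be seen in a toy residual stack: each layer only adds an update to the running state, so leaving one out drops that update instead of breaking the data path. A sketch with scalar "layers"; `forward` and its `skip` argument are hypothetical names, not the engine described above.

```python
from typing import Callable, Iterable, List, Set


def forward(x: float, layers: List[Callable[[float], float]],
            skip: Iterable[int] = ()) -> float:
    """Residual forward pass: x <- x + f_i(x), skipping any layer index in `skip`."""
    skipped: Set[int] = set(skip)
    for i, f in enumerate(layers):
        if i in skipped:
            continue  # identity: the residual stream passes through unchanged
        x = x + f(x)
    return x


# Toy layers whose updates are small relative to the residual stream.
layers = [lambda x, k=k: 0.01 * (k + 1) * x for k in range(4)]
full = forward(1.0, layers)
without_2 = forward(1.0, layers, skip={2})
print(full, without_2)
```

In this toy setup the output shifts by a few percent when a layer is dropped. Real layers are not guaranteed to be this benign, which fits the observation that removing the earliest layers produces garbage.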

I tried to repeat some layers too but got garbage results. I guess I need to automate finding the reasoning layers too, instead of just guessing.

kristianp last Thursday at 4:03 AM

The method used here, by David Ng, was discussed a few days ago at https://news.ycombinator.com/item?id=47322887

woadwarrior01 last Thursday at 12:55 AM

Reminds me of Solar 10.7B, which was a very good model for its size ~2 years ago, and the "Depth Up-Scaling" technique behind it. That did involve continued training after repeating the layers, though.

https://arxiv.org/abs/2312.15166

christianqchung last Thursday at 2:37 AM

Why test on Qwen 2.5 when Qwen 3 has been out for about a year, and Qwen 3.5 for a month? My problem with this is ironically entirely vibes based: that for some reason, LLMs love to talk about Qwen 2.5 instead of anything newer.

hackpert last Thursday at 12:46 PM

We found evidence of specific layer-localized "reasoning" circuits in a few models last year too! A very much work-in-progress paper is here: https://openreview.net/forum?id=mTjGBrkdtz

SyzygyRhythm last Thursday at 1:13 AM

If running twice is good, then is running N times even better? I wonder if you could even loop until some kind of convergence, say hitting a fixed point (input equals output). I wonder if there's even a sort of bifurcation property where it sometimes loops A->A->A, but other times A->B->A, or more, rather like the logistic map fractal.
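The convergence question can be played with on the logistic map itself. The sketch below iterates a function, discards a burn-in, and reports the period of the orbit it settles into (1 means a fixed point). `find_period` is a hypothetical helper; applying the same idea to a transformer block would mean comparing hidden-state vectors under some norm, which is an assumption and untested here.

```python
from typing import Callable, Optional


def find_period(f: Callable[[float], float], x0: float, burn_in: int = 1000,
                max_period: int = 8, tol: float = 1e-9) -> Optional[int]:
    """Iterate f from x0, then return the smallest period p <= max_period with
    |f^p(x) - x| < tol, or None if no such short cycle is found."""
    x = x0
    for _ in range(burn_in):
        x = f(x)
    y = x
    for period in range(1, max_period + 1):
        y = f(y)
        if abs(y - x) < tol:
            return period
    return None


def logistic(r: float) -> Callable[[float], float]:
    """Logistic map x -> r * x * (1 - x)."""
    return lambda x: r * x * (1.0 - x)


print(find_period(logistic(2.8), 0.2))  # 1: settles on a fixed point (A -> A)
print(find_period(logistic(3.2), 0.2))  # 2: alternates between two values (A -> B -> A)
```

A looped block would need the same kind of stopping rule plus a cap on iterations, since nothing guarantees that a fixed point or short cycle exists.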

nowittyusername last Thursday at 1:39 AM

There's still a lot of low-hanging fruit left IMO. Good find, and rather funny to think about: instead of spending millions of dollars retraining the model, someone can simply clone a few layers multiple times and increase performance significantly with "this one trick".

Lerc last Thursday at 11:05 AM

That weird part is kind of what I was expecting.

This goes to the thing that I posted on the thread a couple of days ago. https://news.ycombinator.com/item?id=47327132

What you need is a mechanism to pick the right looping pattern. Then it really does seem to be mixture of experts on a different level.

Break the model into input path, thinking, and output path, and make the thinking phase a single looping layer of many experts. Then the router gets to decide 13,13,14,14,15,15,16.

Training the router left as an exercise to the reader.

Imanari last Thursday at 9:37 AM

Fascinating! I wonder if new training techniques could emerge from this. If we say layer 1 = translator, layers 2-5 = reasoner, layer 6 = re-translator, could we train small 6-layer models but evaluate their performance in a 1>n*(2-5)>6 setup to directly train towards optimal middle layers that can be looped? You'd only have to train 6 layers but get the duplication benefit of the middle layers for free.

snats last Thursday at 2:59 AM

you can also remove layers from models and keep the same score in benchmarks [1].

i feel that sometimes a lot of the layers might just be redundant and are not fully needed once a model is trained.

[1] https://snats.xyz/pages/articles/pruningg.html

rao-v last Thursday at 1:13 AM

I’d love to believe this is real, but I’m pretty sure you will lose performance on a “fair” mix of tasks, even after fine tuning. I know multiple teams have explored recurrent layers (great for limited VRAM) but I don’t think it’s ever been found to be optimal.

zhangchen last Thursday at 1:35 AM

this lines up with what pruning papers have been finding, the middle layers carry most of the reasoning weight and you can often drop the outer ones without much loss. cool to see the inverse also works, just stacking them for extra passes.

m3kw9 last Thursday at 2:47 AM

What, you just pick some "layer" more or less at random, duplicate it, and some arbitrary reasoning score goes from 0.2 -> 0.7? I don't know, man. You need to use real benchmarks.

getnormality last Thursday at 3:44 AM

Didn't we recently see another hack, where you could get better performance by repeating the prompt?

I wonder if they work for similar reasons.

puppykito last Thursday at 12:47 PM

I find it so cute that making the LLM think twice before outputting something makes it smarter.

colejhudson last Thursday at 1:04 AM

Would you be able to publish the individual benchmarks for Qwen2.5-Coder-32B? GSM8K specifically would be useful to look at.

XCSme last Thursday at 1:28 AM

But if it got worse on other tests, it doesn't do much good, right?

BoredomIsFun last Thursday at 12:47 PM

please post it on /r/localllama

gukoff last Thursday at 12:50 PM

How do you run these models on AMD GPUs?

seertaak yesterday at 8:51 AM

This -- and obviously David Ng's article -- are absolutely fascinating pieces of work.

I have a few (very naive) questions:

There is a widespread intuition, encapsulated in the very terms "feed-forward networks" and "deep neural networks", that computation in such networks is akin to a circuit wired in series. My "observation" is that residual layers offer an "escape hatch" from this, allowing layers (or sets of layers) to operate in parallel (and of course, something in between).

So here are my dumb questions:

1. Is my intuition that residual networks, at least in principle, allow layers to operate in parallel correct? Or am I missing something fundamental? Let's say the intuition is correct -- is it possible to measure the degree to which a layer operates in series or in parallel?

2. The formula for residual layers (at least to my mind) reminds me of an Ornstein-Uhlenbeck time series process. If so, can we measure the degree of mean-reversion of one or several layers? For me, this makes intuitive sense -- the goal of avoiding vanishing gradients feels similar to the goal of stationarity in time series processes.

3. Let's take as an article of faith the central idea of a tripartite network: input->latentspace block => reasoning block => latentspace->output block. Ng's intuition iiuc is that the reasoning block is, more or less, wired in series. Intuitively, it feels like that is what it ought to be (i.e., a chain of calculations), though I'll add -- again hand-wavingly -- that OP's efforts appear to cast doubt on this conjecture. Are the two "translation" blocks wired "more" in parallel, then?

4. So what both Ng and OP did was to "tape together" the ostensibly reasoning layers -- in different ways but that's essentially it. Another thing you could do is to treat the input and output translation blocks as fixed. You now train a totally new model on a much smaller corpus of training data, only instead of feeding the input directly to your new model you feed it translated training data (similarly, your targets are now the activations at the entrance to the reasoning->output block). Let's assume it's exactly the same architecture in the middle as the standard network, only it's initialized to random weights as per usual. Surely you should be able to pre-train that 6-layer reasoning network much, much faster. Has anyone tried this?

5. Having thus partitioned a very deep architecture into three distinct parts, there's no reason why you can't experiment with making the reasoning block wider or narrower. Has anyone tried that?

6. Another fun idea is to map a given input through the input block and read the pre-reasoning activations. You now let that vector be a random variable and do a random walk through reasoning input space, and use this to "augment" your corpus of training data. Reasonable idea or bullshit?

Please remember, I'm only just (and belatedly) trying to wrap my head around how transformer architectures work -- I'm still waiting for my copy of "Build a Large Language Model (from scratch)"! I hope these questions aren't totally daft!
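On question 2, one crude way to make "degree of mean reversion" measurable, as a sketch only: treat a scalar summary of the residual stream (for example one coordinate, or a norm) as a series over layers, and regress each layer's update on the current value, as one would for a discretised Ornstein-Uhlenbeck process, delta_x = theta * (mu - x) + noise. The helper name and the synthetic noiseless series are hypothetical; whether real activations fit this model at all is an open question.

```python
from typing import List, Tuple


def estimate_mean_reversion(xs: List[float]) -> Tuple[float, float]:
    """Least-squares fit of (x[l+1] - x[l]) = a + b * x[l].
    Returns (theta, mu) with theta = -b and mu = a / theta."""
    prev = xs[:-1]
    delta = [nxt - cur for cur, nxt in zip(xs[:-1], xs[1:])]
    n = len(prev)
    mean_x = sum(prev) / n
    mean_d = sum(delta) / n
    var_x = sum((x - mean_x) ** 2 for x in prev)
    cov = sum((x - mean_x) * (d - mean_d) for x, d in zip(prev, delta))
    slope = cov / var_x
    intercept = mean_d - slope * mean_x
    theta = -slope
    return theta, intercept / theta


# Synthetic noiseless "residual stream": x <- x + theta * (mu - x)
theta_true, mu_true = 0.3, 2.0
xs = [10.0]
for _ in range(12):
    xs.append(xs[-1] + theta_true * (mu_true - xs[-1]))

print(estimate_mean_reversion(xs))
```

A theta near 0 would mean the summary drifts freely across layers, while a theta near 1 would mean each layer pulls it almost all the way back to mu.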

edg5000 last Thursday at 12:51 PM

This is very cool

jacquesm last Thursday at 8:03 PM

> No weights change. No training. The model just thinks longer.

...

Singlaw last Thursday at 2:30 AM

What does this do?

BoredomIsFun last Thursday at 12:42 PM

Phi-4-25 is another example.
