Moonshot AI’s Attention Residuals Improve Transformer Scaling with Depth-Wise Attention

This is the kind of AI research update that sounds small, almost boring, and then quietly changes what “good enough” looks like.

Because what Moonshot AI is really saying with Attention Residuals is: the usual way Transformers “carry” information forward through a deep model might be too dumb. Not broken. Just blunt. And if you care about scaling—bigger models, deeper models, harder tasks—blunt starts to look like wasted money and wasted compute.

Here’s the plain version of what’s being reported publicly. In a common Transformer setup (PreNorm), each layer doesn’t just process its input; a residual connection adds the layer’s output back onto a running stream of everything that came before. That mixing is “fixed” in the sense that the architecture always follows the same add-and-move-on pattern. Moonshot AI proposes replacing that fixed residual accumulation with depth-wise attention. Instead of automatically blending in the past the same way every time, the model can pay attention to previous representations and selectively reuse them.
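For reference, here is what the fixed rule looks like. This is a minimal numpy sketch of a generic PreNorm block, with assumed names and no learned scale/shift in the norm; it is not Moonshot AI's code, just the standard recipe the proposal modifies:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (learned scale/shift omitted for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def prenorm_block(x, sublayer):
    # The fixed residual rule: normalize, run the sublayer (attention or MLP),
    # and unconditionally add the result back onto the stream.
    return x + sublayer(layer_norm(x))
```

The key word is "unconditionally": whatever the sublayer produces gets added on top of the full running mix, at every layer, for every input.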

They call this Attention Residuals, and they describe two variants. Full AttnRes applies the idea in full: each layer can attend over all earlier representations. Block AttnRes is a more practical variant that trades some of that flexibility to cut memory and communication overhead while keeping most of the gains. And in scaling experiments, they report lower loss, better gradient behavior, and improved results on reasoning and coding benchmarks.

My take: this is a good idea, and also a slightly uncomfortable one.

Good, because the residual stream is basically the model’s working memory as it moves through depth. If you always mix “the past” into “the present” the same way, you’re forcing a one-size-fits-all memory policy. That’s weird when you think about it. Some layers should probably lean heavily on what came before. Other layers should ignore it and rewrite. A fixed rule can’t know the difference, so it does the safest thing: it drags a little of everything forward forever.

That safety comes with a cost. Imagine you’re training a big model on a mix of tasks. Some samples need careful step-by-step logic. Some are fast pattern matches. With fixed residual mixing, you’re always pushing a thick soup through the depth of the network. Depth-wise attention says: stop doing that. Decide what to keep. Decide what to revisit. Decide what to drop.
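To make "decide what to keep" concrete, here is a toy per-token sketch of depth-wise attention over the stack of earlier layer states. This is my illustration of the general mechanism, not Moonshot AI's actual parameterization (a real version would presumably use learned query/key/value projections):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def depthwise_residual(history):
    # history: list of this token's states from earlier layers, each shape (d,).
    # The newest state acts as a query over all past states, so the blend is
    # chosen per input, instead of the fixed "add everything equally" rule.
    H = np.stack(history)            # (L, d): one row per earlier layer
    q = H[-1]                        # current layer's state as the query
    scores = H @ q / np.sqrt(H.shape[-1])  # (L,) similarity to each past state
    w = softmax(scores)              # depth-wise attention weights, sum to 1
    return w @ H                     # weighted mix of past representations
```

With identical history entries the weights go uniform and you recover the plain average; with dissimilar entries, the layer can concentrate weight on the few past states it actually wants, which is exactly the "keep, revisit, drop" behavior described above.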

If that holds up, it’s not just “better performance.” It’s a shift in how we think about stability in deep networks. Moonshot AI claims improved gradient behavior, which is basically the difference between a model that trains smoothly and one that fights you the whole way. And that matters because scaling isn’t just about buying more chips. It’s about not wasting the chips you already bought.

Now the uncomfortable part: when you make the residual path “smart,” you’re also making the model’s internal traffic more complex. Complexity isn’t free. It can hide bugs, hide failure modes, and make it harder to predict what the system will do when it’s under stress.

Picture a team trying to ship a coding model into a product. If the architecture is closer to the standard recipe, debugging is already hard—but at least you’re working in familiar territory. If the residual stream is now routed through attention across depth, you’ve introduced a new place where the model can overfit, or latch onto weird shortcuts. Maybe it “cheats” by repeatedly pulling a certain kind of intermediate representation that works on benchmarks but breaks on messy real code. The research says benchmarks improve. That’s promising. But benchmarks are also where clever tricks go to look honest.

There’s also a practical tension in the two variants they propose. Full AttnRes sounds like the pure idea. Block AttnRes is the real-world compromise to reduce memory and communication overhead. That’s the part I actually care about, because scaling dies in the gap between “works” and “works at a cost you can live with.” If Block AttnRes keeps most of the gains, it suggests this isn’t just a lab trick. But if the best results require the full version, then we’re back to the familiar story: fancy method, painful bill.
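A back-of-envelope helps show why the compromise exists. If depth-wise attention needs past layer states kept resident, the full variant scales that cost with depth, while a windowed block variant caps it. The function and all the numbers below are my own illustration of the tradeoff, under the assumption that Block AttnRes restricts attention to a recent window of layers; the paper's actual scheme may differ:

```python
def residual_history_bytes(layers, d_model, tokens, bytes_per_val=2, window=None):
    # Rough activation memory for keeping past layer states around so
    # depth-wise attention can revisit them. window=None mimics a "full"
    # variant (all layers resident); window=B mimics a block variant that
    # only ever looks back over the last B layers.
    kept = layers if window is None else min(window, layers)
    return kept * d_model * tokens * bytes_per_val

# Illustrative shapes (assumed, not from the paper): 64 layers, d_model 4096,
# an 8192-token batch, fp16 values.
full = residual_history_bytes(layers=64, d_model=4096, tokens=8192)
block = residual_history_bytes(layers=64, d_model=4096, tokens=8192, window=8)
```

Under these made-up numbers, the full history costs 4 GiB of extra resident state per 8K tokens and the 8-layer window costs one eighth of that, and in a distributed setup every resident state is also something that may need to move between devices. That is the gap between "works" and "works at a cost you can live with."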

And yes, I can already hear a reasonable pushback: isn’t this just adding more attention where we already have attention? Maybe. But the location matters. This isn’t “more attention to tokens.” It’s attention to depth—to the model’s own past selves. That’s a different lever. In human terms, it’s the difference between reading a sentence carefully and deciding which of your earlier thoughts to trust.

If Moonshot AI’s claims generalize, the winners are obvious: anyone training deep Transformers for hard tasks like reasoning and coding, where the model needs to keep a clean chain of internal states. The losers could be the people who rely on the simplicity of today’s standard architectures—teams that don’t want to add another moving part, or that need predictable behavior more than they need another bump on a leaderboard.

What I don’t know—what I want to know—is how this behaves when things get weird: distribution shifts, longer contexts, adversarial prompts, noisy data. Selective reuse is powerful, but it can also become selective blindness if the model learns the wrong habits.

So the question I actually care about is this: if we let Transformers “choose” what to carry forward through depth, do we get models that reason more reliably, or models that get better at hiding their mistakes?