This sounds like one of those ideas that should either be a big deal or a complete trap. “Up to 2.5x faster” training without changing the model architecture or the tokenizer is exactly the kind of claim that makes people rush to copy it—and also the kind that hides the cost in a place nobody wants to look.
The news item is simple on the surface. Nous Research says it has a method called Token Superposition Training, or TST. It changes the pre-training loop by averaging token embeddings. The pitch is that you can pre-train large language models faster—reported as up to 2.5x—across model sizes from 270M to 10B parameters. No new architecture. No new tokenizer. Just a different way of feeding the model during training. They also say they keep compute comparable to the baseline by increasing the sequence length of the training data rather than the batch size, so you’re not “cheating” by just throwing more raw compute at the problem.
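To make that concrete, here is a minimal sketch of one literal reading of “averaging token embeddings”: group a few neighbouring tokens and feed the model their mean instead of each one separately. This is not Nous Research’s actual TST recipe; the group size, the mean pooling, and the shapes here are assumptions made purely for illustration.

```python
# Minimal sketch of one way "averaging token embeddings" could enter a
# pre-training loop. NOT Nous Research's TST implementation: the group
# size k and mean pooling are assumptions for illustration only.
import torch
import torch.nn as nn

vocab_size, d_model, k = 32_000, 512, 2   # k = tokens blended per position (assumed)
embed = nn.Embedding(vocab_size, d_model)

def superpose(token_ids: torch.Tensor) -> torch.Tensor:
    """Average the embeddings of every k consecutive tokens.

    token_ids: (batch, seq_len) with seq_len divisible by k.
    Returns:   (batch, seq_len // k, d_model) blended input vectors.
    """
    b, t = token_ids.shape
    e = embed(token_ids)                  # (b, t, d_model)
    e = e.view(b, t // k, k, d_model)     # group k neighbouring tokens
    return e.mean(dim=2)                  # blend each group into one vector

ids = torch.randint(0, vocab_size, (4, 1024))
x = superpose(ids)   # (4, 512, 512): twice the raw tokens per forward position
```

If something along these lines is what is happening, then feeding in longer raw sequences while the model sees fewer positions per step is one plausible way the compute could stay comparable to the baseline, which would fit the sequence-length-not-batch-size framing in the announcement.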
If you work anywhere near model training, you can feel the pull of that. Training is painfully expensive. It’s not just the GPU bill. It’s the calendar time, the blocked research schedule, the waiting on results, the weeks where you can’t answer basic questions like “is this dataset change good or bad?” because you’re still halfway through a run. So the temptation is obvious: if you can get similar results faster, you get more experiments, more chances to be right, and fewer chances to go bankrupt.
Here’s the part that makes me uneasy: averaging token embeddings is also, in plain terms, blending information together. You’re making the model see a “bag” of tokens mashed into a combined signal (even if the implementation details are more nuanced than that). And that means you’re relying on the model to untangle the mess and still learn the right patterns. Maybe it can. Maybe it even learns cleaner patterns because it’s forced to generalize. But it’s not free. You’re trading clarity for throughput.
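One way to see the worry in miniature: the plain average of a set of vectors is identical no matter what order the tokens arrived in, so whatever the blend discards has to be recovered from positions or surrounding context. A toy check with made-up vectors, not anything from TST:

```python
# Toy check of what a plain average throws away: the mean of a set of
# vectors is permutation-invariant, so token order can't live in the blend.
# Made-up random vectors, nothing from TST.
import torch

e = torch.randn(3, 8)                       # embeddings for three tokens
mean_forward = e.mean(dim=0)                # "dog bites man"
mean_reversed = e[[2, 1, 0]].mean(dim=0)    # "man bites dog"
print(torch.allclose(mean_forward, mean_reversed))  # True: the blend can't tell them apart
```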
The thing I don’t want us to do—as an industry, as builders, as people who ship real products on top of these models—is treat “faster training” as automatically equal to “same quality, just cheaper.” Pre-training isn’t only about getting the loss down. It’s about what the model becomes good at, what it becomes bad at, and which mistakes it learns to make confidently.
Imagine you’re training a model to be a careful assistant for healthcare paperwork. You don’t need it to be poetic. You need it to be precise, to keep track of small differences, to not blur details. If your training method pushes the model toward smoothing and averaging, are you quietly creating a model that sounds right but loses sharpness on edge cases? You might not notice on your usual benchmarks. You might notice when it confuses two similar drug names or swaps a date in a form.
Or imagine you’re a small lab scraping together enough compute to train a decent model. TST could feel like a ladder dropped down from the rich kids’ roof. If it really gives you 2.5x speed without changing the architecture, suddenly you can afford to iterate. That’s the upside I actually like: more people able to do serious work, not just the biggest players.
But the downside is also very real: the biggest players will use it too. If training gets cheaper and faster, the natural move is not “great, we’ll slow down and be careful.” It’s “great, we’ll train bigger or train more models.” That’s how this always goes. Efficiency gains don’t automatically reduce total spend; they often increase ambition. And when the feedback loop speeds up, you get more releases, more hype cycles, and less patience for careful evaluation.
There’s another subtle stake here: “no architecture changes” makes this sound safe to adopt. It slides into existing pipelines. That’s convenient, but it also means it can spread before we have a shared understanding of what it does to behavior. If this becomes a common trick, we could end up in a weird place where many models are trained with similar blending methods, and we normalize whatever trade-offs come with that—even if those trade-offs show up as more hallucination in certain settings, or weaker handling of long, exact reasoning, or more “average” answers.
To be fair, maybe I’m worrying too much. Maybe the results really are strong across the board. Maybe averaging embeddings in training acts like a helpful regularizer and the model still learns the same underlying structure. And to their credit, the claim includes the point that the approach keeps compute comparable by scaling sequence length instead of batch size, which suggests they’re thinking seriously about fair comparisons.
Still, “up to” is doing a lot of work here. It matters which models, which data, which tasks, and which evaluation. It also matters what “faster” means in practice: faster to a certain training loss, faster to a certain benchmark score, or faster to something that feels good in demos. Those are not the same thing, and people love to blur them when they’re excited.
If I were running a team, I wouldn’t dismiss TST. I’d test it. But I’d be strict about what I’m protecting. I’d look for degradation in the kinds of failures that hurt users: factual mix-ups, missing small constraints, losing the thread in long instructions, sounding confident when uncertain. And I’d assume that if it makes training cheaper, it will also raise the pressure to ship faster, which is exactly when you need stronger evaluation, not weaker.
So here’s the real tension I can’t shake: if we can make pre-training 2.5x faster by blending tokens during training, are we buying speed with a hidden tax on precision that will only show up when these models are used in high-stakes, detail-heavy work?