Every code assistant you use predicts one token at a time. Google shipped one that does not

DiffusionGemma generates text by diffusion, not autoregression. (https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/)

Every large language model you have used works left to right.

It predicts one token, appends it, then predicts the next.

That sequential dependency is why latency scales with output length, and why a long completion feels slow no matter how fast the hardware is.
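To make that dependency concrete, here is a minimal sketch of a left-to-right decoding loop. The names `autoregressive_decode` and `toy_next_token` are hypothetical stand-ins, not any real model's API; the only point is that the number of sequential model calls equals the number of tokens produced.

```python
from typing import Callable, List, Tuple


def autoregressive_decode(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    n_new: int,
) -> Tuple[List[str], int]:
    """Generate n_new tokens one at a time and count the model calls."""
    tokens = list(prompt)
    calls = 0
    for _ in range(n_new):
        # Each call must wait for the previous token to exist.
        tokens.append(next_token(tokens))
        calls += 1
    return tokens, calls


def toy_next_token(context: List[str]) -> str:
    # Hypothetical stand-in for a model forward pass.
    return f"t{len(context)}"


tokens, calls = autoregressive_decode(toy_next_token, ["def", "add"], 5)
print(calls)  # one sequential call per generated token
```

Doubling `n_new` doubles the number of calls that have to run back to back, which is the scaling described above.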

Diffusion models do not work that way.

They start from noise and refine a block of output in parallel, over a fixed number of denoising steps.
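The following toy sketch illustrates the idea of refining a whole block over a fixed number of steps. It is not DiffusionGemma's algorithm, which the announcement does not detail. It assumes masked placeholders play the role of noise, and `parallel_refine` and `toy_denoise` are hypothetical names. What it shows is that the number of model calls is set by the step count, not by the block length.

```python
import math
from typing import Callable, List, Tuple

MASK = "<mask>"  # assumption: masked slots stand in for "noise"


def parallel_refine(
    denoise: Callable[[List[str]], List[str]],
    length: int,
    steps: int,
) -> Tuple[List[str], int]:
    """Fill a block of `length` tokens in `steps` passes and count the calls."""
    block = [MASK] * length
    calls = 0
    per_step = math.ceil(length / steps)
    for _ in range(steps):
        # One call proposes a token for every position at once.
        proposal = denoise(block)
        calls += 1
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        # Toy schedule: commit the leftmost masked slots. A real sampler
        # would more likely commit the positions it is most confident in.
        for i in masked[:per_step]:
            block[i] = proposal[i]
    return block, calls


def toy_denoise(block: List[str]) -> List[str]:
    # Hypothetical stand-in for a denoising forward pass.
    return [f"t{i}" for i in range(len(block))]


short_block, short_calls = parallel_refine(toy_denoise, length=8, steps=4)
long_block, long_calls = parallel_refine(toy_denoise, length=16, steps=4)
print(short_calls, long_calls)  # same number of calls for both lengths
```

Each call here covers the whole block, so a single call is not free. The sketch only shows why the count of sequential steps stops growing with output length.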

Google's experimental release claims up to 4x faster generation on GPUs for latency-critical tasks like code infilling and real-time editing.

If that holds, it changes the math for one specific kind of job, the latency-bound one:

• Inline completion, where latency is the product
• Code infilling, where the model fills a gap between known context
• Real-time editing, where the user waits on every keystroke

Autoregression is not going away. It still wins on long-form coherence.

But for the tight, low-latency loop of an inline assistant, the rule that a model must generate left to right was never a law.

It was a default.

Most teams spent years optimizing the model.

Google changed how the model writes.
