Every code assistant you use predicts one token at a time. Google shipped one that does not

DiffusionGemma generates text by diffusion, not autoregression. (https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/)

Every large language model you have used works left to right.

It predicts one token, appends it, then predicts the next.

That sequential dependency is why latency scales with output length, and why a long completion feels slow no matter how fast the hardware is.
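To make that dependency concrete, here is a minimal sketch of a left-to-right decode loop. The `toy_next_token` function is a hypothetical stand-in for a real model's forward pass, not any actual API; the point is only that the number of sequential calls equals the number of tokens generated.

```python
from typing import Callable, List, Tuple


def autoregressive_decode(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
) -> Tuple[List[int], int]:
    """Generate tokens one at a time; return (tokens, sequential_calls)."""
    tokens = list(prompt)
    calls = 0
    for _ in range(max_new_tokens):
        # Each call needs every token produced so far, so the calls
        # cannot run in parallel with one another.
        tokens.append(next_token(tokens))
        calls += 1
    return tokens, calls


def toy_next_token(tokens: List[int]) -> int:
    # Hypothetical stand-in for a model forward pass.
    return (tokens[-1] + 1) % 100


out, calls = autoregressive_decode(toy_next_token, [0], max_new_tokens=8)
print(calls)  # 8 new tokens took 8 sequential calls
```

Double `max_new_tokens` and the sequential call count doubles with it, whatever the cost of a single call.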

Diffusion models do not work that way.

They start from noise and refine a block of output in parallel, over a fixed number of denoising steps.
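A toy sketch of that idea follows. It is not DiffusionGemma's algorithm, which the source does not describe: it assumes masked positions as the "noise", and `toy_denoiser` is a hypothetical oracle standing in for one parallel forward pass. What it illustrates is that the number of sequential passes is set by `num_steps`, not by the block length.

```python
import math
from typing import List, Optional, Tuple

MASK = None  # placeholder for a position that has not been decided yet


def toy_denoiser(block: List[Optional[int]], target: List[int]) -> List[int]:
    # Hypothetical stand-in for one forward pass that proposes a token
    # for every position at once. A real model would predict these;
    # this toy simply returns the known answer.
    return list(target)


def diffusion_decode(target: List[int], num_steps: int) -> Tuple[List[Optional[int]], int]:
    """Fill a fully masked block in a fixed number of parallel steps."""
    block: List[Optional[int]] = [MASK] * len(target)
    calls = 0
    per_step = math.ceil(len(target) / num_steps)
    for _ in range(num_steps):
        proposal = toy_denoiser(block, target)  # one pass covers all positions
        calls += 1
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        for i in masked[:per_step]:  # commit a slice of the proposals each step
            block[i] = proposal[i]
    return block, calls


block, calls = diffusion_decode(list(range(64)), num_steps=4)
print(calls)  # 4 passes for 64 tokens
```

Whether a real model produces good output in that few steps is an empirical question this sketch does not answer.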

Google says its experimental release generates up to 4x faster on GPUs for latency-critical tasks like code infilling and real-time editing.

If that holds, it changes the math for one specific kind of work:

• Inline completion, where latency is the product
• Code infilling, where the model fills a gap between known context
• Real-time editing, where the user waits on every keystroke

Autoregression is not going away. It still wins on long-form coherence.

But for the tight, low-latency loop of an inline assistant, the rule that a model must generate left to right was never a law.

It was a default.

Most teams spent years optimizing the model.

Google changed how the model writes.