The core loop is simple:
Take pure noise (think TV static) + prompt → feed to the neural network → the NN predicts the noise → subtract it → get a slightly less noisy image.
Repeat this loop N times. Each pass removes a bit more noise. At the end, you have a clean image.
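The loop above can be sketched in a few lines. Everything here is a toy stand-in: `predict_noise` just returns the residual toward a known target, where a real model is a trained U-Net, and the 0.3 step size stands in for a real noise schedule. But the shape of the loop is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean image" the loop should converge toward
# (a stand-in for the structure a trained NN has learned).
clean = np.zeros((8, 8))

# Start with pure noise.
x = rng.normal(size=(8, 8))

def predict_noise(x):
    # Hypothetical stand-in for the trained NN: here the "noise"
    # is simply the residual between the image and the clean target.
    return x - clean

N = 10
for _ in range(N):
    # Each pass removes a fraction of the predicted noise.
    x = x - 0.3 * predict_noise(x)

# After N passes, x is far closer to the clean image than the starting noise.
```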
How the NN learns this
During training, the model sees millions of images with noise added at various levels. Its only job: predict what noise was added. Do this enough times, and the NN gets really good at spotting noise in any image.
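A single training step can be sketched as: pick an image, pick a random noise level, mix in that much noise, and score the model on how well it predicts the noise that was added. The `model` below is a placeholder that always predicts zeros, and the noising formula is a simplified schedule for illustration, not the exact one any particular system uses:

```python
import numpy as np

rng = np.random.default_rng(1)

image = rng.uniform(size=(8, 8))   # one training image
t = rng.uniform(0.1, 0.9)          # random noise level for this sample
noise = rng.normal(size=(8, 8))    # the noise the model must predict

# Simplified noising: blend image and noise according to the level t.
noisy = np.sqrt(1 - t) * image + np.sqrt(t) * noise

def model(noisy_image, noise_level):
    # Placeholder for the NN; a real model is a U-Net
    # conditioned on the noise level (and the prompt).
    return np.zeros_like(noisy_image)

predicted = model(noisy, t)
loss = np.mean((predicted - noise) ** 2)   # MSE between predicted and true noise
```

Training minimizes this loss over millions of (image, noise level) pairs, which is what makes the NN good at spotting noise at any level.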
Inference is the noising process run in reverse
Start with pure noise. The NN predicts what noise it sees. Subtract that. Now you have a slightly less noisy image. The prompt steers which noise patterns get predicted—ask for "cat" and the NN predicts noise that, when removed, reveals cat-like shapes.
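The prompt steering is usually implemented with classifier-free guidance: at each step the NN makes two noise predictions, one conditioned on the prompt and one unconditional, and the final prediction is pushed toward the prompted one. A minimal sketch (the function name is mine; 7.5 is a common Stable Diffusion default for the scale):

```python
import numpy as np

def guided_noise(noise_uncond, noise_cond, guidance_scale=7.5):
    # Classifier-free guidance: amplify the difference the prompt makes.
    # scale = 1 ignores guidance; larger values follow the prompt harder.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```

This guided prediction is what gets subtracted each pass, which is why a "cat" prompt steers the loop toward cat-like shapes.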
Why latent space matters
Running this loop on a 512×512 image directly would melt your VRAM. So instead, an autoencoder compresses everything to a smaller "latent" representation (like 64×64), the denoising loop runs there, and a decoder network expands the result back to full resolution at the end. This is why it's called Latent Diffusion.
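The savings are easy to quantify. Stable Diffusion's latent is 64×64 with 4 channels, versus 512×512 with 3 RGB channels in pixel space:

```python
pixel = 512 * 512 * 3   # values per image in RGB pixel space
latent = 64 * 64 * 4    # values per image in SD's latent space
ratio = pixel / latent  # 48x fewer values per denoising pass
```

Every pass of the loop touches 48× less data, which is the difference between needing a datacenter GPU and running on a gaming card.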
The practical bit
Stable Diffusion is the most consumer-friendly implementation of this. Runs on regular GPUs, massive community, tons of fine-tuned models. If you have a GPU with 8-12GB VRAM, you can run this locally for free.
Key Takeaways
- Core loop: noise + prompt → predict noise → subtract → repeat N times
- Training teaches the NN to recognize noise patterns at all levels
- Prompts steer which noise patterns get predicted and removed
- Latent space compression (512×512 → 64×64) makes it GPU-friendly
- Stable Diffusion runs locally on 8-12GB VRAM consumer GPUs