The core loop is simple:
Take pure noise (think TV static) + prompt → feed to the neural network → the NN predicts the noise → subtract it → get a slightly less noisy image.
Repeat this loop N times. Each pass removes a bit more noise. At the end, you have a clean image.
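The loop above can be sketched in a few lines. Everything here is a toy stand-in: `predict_noise` just returns the residual toward a known target, where a real model is a trained U-Net, and the 0.3 step size stands in for a real noise schedule. But the shape of the loop is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean image" the loop should converge toward
# (a stand-in for the structure a trained NN has learned).
clean = np.zeros((8, 8))

# Start with pure noise.
x = rng.normal(size=(8, 8))

def predict_noise(x):
    # Hypothetical stand-in for the trained NN: here the "noise"
    # is simply the residual between the image and the clean target.
    return x - clean

N = 10
for _ in range(N):
    # Each pass removes a fraction of the predicted noise.
    x = x - 0.3 * predict_noise(x)

# After N passes, x is far closer to the clean image than the starting noise.
```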
How the NN learns this
During training, the model sees millions of images with noise added at various levels. Its only job: predict what noise was added. Do this enough times, and the NN gets really good at spotting noise in any image.
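A single training step can be sketched as: pick an image, pick a random noise level, mix in that much noise, and score the model on how well it predicts the noise that was added. The `model` below is a placeholder that always predicts zeros, and the noising formula is a simplified schedule for illustration, not the exact one any particular system uses:

```python
import numpy as np

rng = np.random.default_rng(1)

image = rng.uniform(size=(8, 8))   # one training image
t = rng.uniform(0.1, 0.9)          # random noise level for this sample
noise = rng.normal(size=(8, 8))    # the noise the model must predict

# Simplified noising: blend image and noise according to the level t.
noisy = np.sqrt(1 - t) * image + np.sqrt(t) * noise

def model(noisy_image, noise_level):
    # Placeholder for the NN; a real model is a U-Net
    # conditioned on the noise level (and the prompt).
    return np.zeros_like(noisy_image)

predicted = model(noisy, t)
loss = np.mean((predicted - noise) ** 2)   # MSE between predicted and true noise
```

Training minimizes this loss over millions of (image, noise level) pairs, which is what makes the NN good at spotting noise at any level.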
Inference is the noising process run in reverse
Start with pure noise. The NN predicts what noise it sees. Subtract that. Now you have a slightly less noisy image. The prompt steers which noise patterns get predicted—ask for "cat" and the NN predicts noise that, when removed, reveals cat-like shapes.
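The prompt steering is usually implemented with classifier-free guidance: at each step the NN makes two noise predictions, one conditioned on the prompt and one unconditional, and the final prediction is pushed toward the prompted one. A minimal sketch (the function name is mine; 7.5 is a common Stable Diffusion default for the scale):

```python
import numpy as np

def guided_noise(noise_uncond, noise_cond, guidance_scale=7.5):
    # Classifier-free guidance: amplify the difference the prompt makes.
    # scale = 1 ignores guidance; larger values follow the prompt harder.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```

This guided prediction is what gets subtracted each pass, which is why a "cat" prompt steers the loop toward cat-like shapes.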
Why latent space matters
Running this loop on a 512×512 image directly would melt your VRAM. So instead, an autoencoder compresses everything to a smaller "latent" representation (like 64×64), the denoising loop runs there, and a decoder network expands the result back to full resolution at the end. This is why it's called Latent Diffusion.
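The savings are easy to quantify. Stable Diffusion's latent is 64×64 with 4 channels, versus 512×512 with 3 RGB channels in pixel space:

```python
pixel = 512 * 512 * 3   # values per image in RGB pixel space
latent = 64 * 64 * 4    # values per image in SD's latent space
ratio = pixel / latent  # 48x fewer values per denoising pass
```

Every pass of the loop touches 48× less data, which is the difference between needing a datacenter GPU and running on a gaming card.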
The practical bit
Stable Diffusion is the most consumer-friendly implementation of this. Runs on regular GPUs, massive community, tons of fine-tuned models. If you have a GPU with 8-12GB VRAM, you can run this locally for free.
Key Takeaways
- Core loop: noise + prompt → predict noise → subtract → repeat N times
- Training teaches the NN to recognize noise patterns at all levels
- Prompts steer which noise patterns get predicted and removed
- Latent space compression (512×512 → 64×64) makes it GPU-friendly
- Stable Diffusion runs locally on 8-12GB VRAM consumer GPUs