Generative AI for all: Diffusion models

11 minute read

Published:

Introduction

Imagine you have a beautiful drawing of a Bluey — a playful six-year-old Blue Heeler pup. Now, imagine you slowly sprinkle tiny dots of dust all over it, little by little, until the whole picture turns into a messy cloud of dust. That’s called diffusion — the process of turning a clear image into a noisy one.

Forward process of slowly adding random dust particles in steps to the image.

Now here’s the cool part: what if you had a magical machine that could look at that cloud of dust and slowly clean it up, step by step, until the Bluey drawing comes back? That’s what a diffusion model does! Guess what? These magic models can even make pictures just by listening to words! But shhh… let’s keep that secret in our pocket for later. But how can this magical machine make even more pictures of Bluey that I didn’t show it? If you’re still curious, keep reading—it’s a pretty cool story!

Reverse process to remove the dust particles step by step in the image.

On learning of magical machines

Imagine a lively classroom filled with curious kids. One day, the teacher walks in with a big smile and announces a challenge:

Today, I want you to draw something amazing: “Bluey sitting in a spaceship flying near the moon.”

The kids cheer — but the teacher has a twist. Before they start drawing, the teacher brings in lots of pictures of Bluey, spaceships, and moons. The teacher then divided the class into two teams and announces that each team will learn about the pictures in a different way.

Team CNN: The Keyhole Explorers

For the first team, the teacher had a quirky idea. She handed each student a tiny cardboard keyhole and said,

You’ll look at the pictures one tiny piece at a time — like peeking through a secret portal!

So the students lined up, one by one, peeking through their keyholes at pictures of Bluey, spaceships, and moon. One student saw a fuzzy curve. “Hmm… that looks like an ear!” the teacher whispered. Another spotted a shiny patch. “That’s probably the spaceship’s window,” she said. Every time they peeked, the teacher gave them clues — helping them understand what that little piece might be. Over time, the students became mini detectives, learning to recognize each part of the image from just a glimpse. They didn’t see the whole picture at once, but they got really good at figuring out the details. They became puzzle solvers who knew each piece by heart — even if they never saw the full puzzle all at once.

Teams

Team Transformers: The Puzzle Patchers

For the second team, the teacher had a totally different idea — and it felt like a game of group detective work. She snipped all the pictures of Bluey, spaceships, and moons, into small square patches and handed one to each student. Then she said with a grin:

Your patch is just one tiny piece of the puzzle. Talk to your friends, figure out what your patch might be, and where it fits. Then tell me what you think — and I’ll help you get it right!

So the classroom buzzed with excitement. One student shouted, “Mine looks like part of an eye!” Another replied, “Yours? That might be the moon’s surface!” They huddled together, comparing patches, swapping ideas, and slowly piecing together the big picture — like assembling a giant jigsaw puzzle without the box cover. Every time they thought they had it figured out, they’d run to the teacher, who’d give them feedback:

Hmm… close! But that patch belongs to the spaceship’s wing, not Bluey’s tail.

Back they’d go, chatting and adjusting, learning not just from their own patch but from everyone else’s. They weren’t just solving their own piece — they were learning how all the pieces fit together.

The magic drawing challenge

Once the teacher finished her classes with both the teams, the teacher clapped her hands for everyone’s attention.

“Now,” she announced, “it’s time for The Magic Drawing Challenge!”

She didn’t show any new picture. Instead, she re-read the challenge aloud again.

“Bluey is sitting in a spaceship flying near the moon.”
“Your job,” she said, “is to turn my words into a beautiful drawing.”

The room went silent for a heartbeat—and then pencils and colours began to dance. The CNN team drew excellent details: the shiny panels of the spaceship, the soft curve of an ear, the pitted moon texture. The Transformer team sketched a coherent scene quickly: where Bluey sits, how the spaceship faces the moon, how everything fits together. The teacher kept circling, offering tiny improvements like:

“Make Bluey sit inside the spaceship.”
“Bring the moon closer so it’s clearly near.”
“Let’s show the spaceship actually flying — add stars, a glow, a trail.”

With each small suggestion, the students’ drawings moved closer and closer to the teacher’s instruction. But something interesting happened: by the end, the final drawings from all students in each team looked strikingly similar (as shown in the figure below).

Backbones

  • Team CNN: Their drawings had excellent local details — the spaceship panels were sharp, the moon’s craters were textured, and Bluey’s fur looked realistic. But the overall composition often felt fragmented, as if the parts didn’t fully belong together, like Bluey sitting over the spaceship, instead of sitting inside.
  • Team Transformers: Their drawings captured the global layout well — Bluey inside the spaceship, the moon in the background, and a sense of motion. However, some fine details were missing or simplified, making the image less rich in texture.

Despite these differences, there was hardly any innovation. Why? Because all students started from similar mental templates of the original pictures they had memorized. When asked to draw from the instruction, they simply reassembled what they already knew, leading to convergent, almost identical outputs for each team, which got refined step by step under the teacher’s guidance.

The dusty trick

Although the results from both the teams look good, the teacher was mildly disappointed as she didn’t get to see the creative part of all the students. So, she re-planned her teaching strategy for the two groups. This time, she didn’t just show them clean pictures of Bluey, spaceships, and the moon. Instead, she sprinkled magic dust over the images before showing them. This dust made the pictures look blurry and speckled, so the students had to guess and reason about what they were seeing.

  • For the CNN team, like earlier, the teacher revealed the dusty images through a cardboard keyhole, so they learned to recognize local features even in noisy conditions - like spotting an ear or a wing despite the blur.
  • For the Transformer team, the teacher gave them dusty patches and asked them to talk to each other to figure out what each patch might represent. This taught them to share context and handle uncertainty together.

Team dusty

Later, when the teacher gave the final magic drawing challenge again - she didn’t let them start on a clean sheet. Instead, she covered their papers with random dusty patterns and said:

“Now I’ll tell you a secret. Before you even start drawing, your brain has learned a special sketchbook from the dusty examples which you have seen - a latent place where you imagine things before putting them on paper. It’s like a magical notebook where ideas live as fuzzy shapes and feelings, waiting to become real pictures.”

The students gasped. “So we will draw from our imagination?”

“Exactly!” said the teacher. “And here’s another trick — the dust I gave you wasn’t random. I used a dusty clock — a timer that decides how much dust to add or remove at each step. At first, it’s super dusty, and then it gets cleaner and cleaner. That’s called a noise schedule. It helps you slowly uncover your sketch, with my feedback, one layer at a time.”

The students nodded, imagining their sketchbooks filled with swirling clouds of ideas and a magical clock ticking as they cleaned and created. Each student began with a different dust pattern, so their starting points were unique. With every round of feedback, they removed some dust and added clearer details, gradually transforming chaos into a meaningful picture.

  • The CNN team students focused on cleaning and refining local details first.
  • The Transformer team students worked on the overall structure early on.

In the end, all drawings matched the description of the teacher. But because the students had learned about the objects through dust-covered images - forcing them to build imaginative mental models - and started from different noisy beginnings, their final artworks showed greater diversity in style and composition compared to the earlier approach, where everyone relied on memorized templates.

Drawing process

The Magic Machine Analogy

In the world of modern AI, our story mirrors the inner workings of a magic machine, a generative model that can create new images from words. Here is how the pieces fit, which a more experienced reader might find analogous in the following way:

  • Learning Backbone → The Builders: The machine needs strong builders to understand patterns. These are like the CNN team (masters of local details) and the Transformer team (experts in global context). Together, they form the backbone that learns how images are structured.
  • Language Alignment → The Teacher’s Instructions: Like the teacher gave clear directions—“Bluey sitting in a spaceship near the moon”, the magic machine uses language-image alignment (like CLIP) to connect what we say with what it draws.
  • Generative Power → The Creative Drawing: When the machine starts creating, it’s like the students drawing the scene from the teacher’s words. This is the essence of Generative AI - turning text into pictures.
  • Mode Collapse → Everyone Drawing the Same Thing: Remember how the final drawings of all the students looked almost identical? That’s like mode collapse, where the model produces similar outputs instead of diverse ones.
  • DDPM Strategy → The Dusty Trick: To avoid memorization and encourage creativity, the teacher sprinkled magic dust on the paper, making students start from random scribbles and refine step by step. This is exactly what DDPM does.

Magic machine

Conclusion

In the end, the classroom experiment revealed something profound: Learning to create is not just about memorizing what you have seen, it is about building meaning from uncertainty. The CNN team mastered local details, the Transformer team understood global context, and the teacher’s instructions acted as a bridge between language and vision.

But the real magic happened when the teacher introduced the dusty settings. By starting from noisy, chaotic beginnings and refining step by step, the students learned to construct the picture from scratch, guided only by the story. This mirrors how modern generative models work:

  • Backbones like CNNs and Transformers provide the foundation.
  • Language alignment (such as CLIP) connects words to images.
  • Diffusion strategies start from noise and iteratively de-noise, ensuring diversity and creativity while staying true to the prompt.

Just like the final drawings of the students, these models transform randomness into meaning - turning words into pictures. “From dust to detail, generative models turn random chaos into meaningful representation - guided one step at a time.”

Glossary

TermMeaning
AIArtificial Intelligence
CNNConvolutional Neural Network
CLIPContrastive Language Image Pretraining
DDPMDenoising Diffusion Probabilistic Model
DiffusionThe process of adding and removing noise to a picture
Language AlignmentTurn text into pictures
Mode CollapseWhen a computer generates the same picture repeatedly
TransformersNeural network architecture for context understanding