<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://mrsandipandas.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://mrsandipandas.github.io/" rel="alternate" type="text/html" /><updated>2026-02-21T08:10:39-08:00</updated><id>https://mrsandipandas.github.io/feed.xml</id><title type="html">Sandipan Das / Roboticist and ML Engineer</title><subtitle>personal description</subtitle><author><name>Sandipan Das</name></author><entry><title type="html">Essential NN modules</title><link href="https://mrsandipandas.github.io/posts/2026/02/nn/modules" rel="alternate" type="text/html" title="Essential NN modules" /><published>2026-02-18T00:00:00-08:00</published><updated>2026-02-18T00:00:00-08:00</updated><id>https://mrsandipandas.github.io/posts/2026/02/nn/nn-modules</id><content type="html" xml:base="https://mrsandipandas.github.io/posts/2026/02/nn/modules"><![CDATA[<!--more-->

<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/3.2.2/es5/tex-svg.min.js"></script>
<style>
  /* ── Academic Pages palette ───────────────────────────────────────── */
  :root {
    --ink:         #333333;
    --paper:       #ffffff;
    --cream:       #f3f3f3;
    --accent:      #52adc8;   /* AP's teal/blue link color */
    --accent-dark: #375a7f;
    --muted:       #777777;
    --rule:        #dddddd;
    --highlight:   #eaf4f8;
    --code-bg:     #1c1c2e;
    --code-fg:     #a8d8a8;
    --code-accent: #f8b500;
    --code-dim:    #6b7a8d;
    --danger:      #c0392b;
    --success:     #27ae60;
  }

  /* ── Reset only what we need ────────────────────────────────────── */
  .nn-wrap * { box-sizing: border-box; }
  .nn-wrap {
    width: 100%;
    padding: 0;
    font-family: "Lato", "Helvetica Neue", Helvetica, Arial, sans-serif;
    color: var(--ink);
    line-height: 1.7;
  }

  /* ── Tab navigation ─────────────────────────────────────────────── */
  .nn-nav {
    border-bottom: 2px solid var(--rule);
    display: flex;
    flex-wrap: wrap;
    gap: 0;
    overflow: visible;
    background: var(--cream);
    margin-bottom: 1.5rem;
    border-radius: 4px 4px 0 0;
  }

  .nn-nav button {
    background: none;
    border: none;
    padding: 0.65rem 1rem;
    cursor: pointer;
    font-family: inherit;
    font-size: 0.72rem;
    letter-spacing: 0.06em;
    text-transform: uppercase;
    color: var(--muted);
    border-bottom: 3px solid transparent;
    margin-bottom: -2px;
    transition: all 0.2s;
    white-space: nowrap;
    font-weight: 600;
    flex-shrink: 0;
  }

  .nn-nav button:hover { color: var(--ink); }
  .nn-nav button.active {
    color: var(--accent-dark);
    border-bottom-color: var(--accent);
  }

  /* ── Tab panels ─────────────────────────────────────────────────── */
  .module-grid { display: none; animation: fadeIn 0.3s ease; }
  .module-grid.active { display: block; }

  @keyframes fadeIn {
    from { opacity: 0; transform: translateY(6px); }
    to   { opacity: 1; transform: translateY(0); }
  }

  /* ── Card ───────────────────────────────────────────────────────── */
  .module-card {
    border: 1px solid var(--rule);
    background: var(--paper);
    margin-bottom: 1.25rem;
    border-radius: 4px;
    transition: box-shadow 0.2s;
  }

  .module-card:hover { box-shadow: 0 2px 10px rgba(0,0,0,0.08); }

  .card-header {
    display: grid;
    grid-template-columns: auto 1fr auto;
    align-items: start;
    gap: 1rem;
    padding: 1rem 1.25rem;
    cursor: pointer;
    border-bottom: 1px solid transparent;
    transition: border-color 0.2s;
  }

  .card-header:hover { background: var(--cream); border-radius: 4px 4px 0 0; }

  .card-number {
    font-size: 0.7rem;
    color: var(--muted);
    padding-top: 0.25rem;
    font-weight: 600;
    min-width: 24px;
  }

  .card-title {
    font-size: 1.1rem;
    font-weight: 700;
    color: var(--ink);
  }

  .card-subtitle {
    font-size: 0.82rem;
    color: var(--muted);
    margin-top: 0.15rem;
  }

  .card-tag {
    font-size: 0.62rem;
    padding: 0.2rem 0.55rem;
    border-radius: 3px;
    background: var(--highlight);
    color: var(--accent-dark);
    letter-spacing: 0.07em;
    text-transform: uppercase;
    font-weight: 700;
    white-space: nowrap;
    align-self: start;
    margin-top: 0.15rem;
  }

  .toggle-arrow { transition: transform 0.25s; font-size: 0.75rem; color: var(--muted); margin-top: 0.3rem; user-select: none; }
  .toggle-arrow.open { transform: rotate(180deg); }

  /* ── Card body ──────────────────────────────────────────────────── */
  .card-body { padding: 0 1.25rem 1.25rem; display: none; }
  .card-body.open { display: block; }

  .card-cols {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 1rem;
    margin-top: 1rem;
    width: 100%;
  }

  @media (max-width: 700px) { .card-cols { grid-template-columns: 1fr; } }

  /* ── Math panel ─────────────────────────────────────────────────── */
  .math-section {
    background: var(--cream);
    border: 1px solid var(--rule);
    border-radius: 4px;
    padding: 1.1rem;
  }

  .section-label {
    font-size: 0.6rem;
    letter-spacing: 0.18em;
    text-transform: uppercase;
    color: var(--accent);
    margin-bottom: 0.75rem;
    font-weight: 700;
  }

  .math-block { font-size: 1rem; overflow-x: auto; padding: 0.3rem 0; line-height: 2; }

  .math-note { font-size: 0.8rem; color: var(--muted); font-style: italic; margin-top: 0.75rem; line-height: 1.5; }

  .math-vars { margin-top: 0.75rem; font-size: 0.78rem; color: var(--muted); }
  .math-vars li { list-style: none; padding: 0.18rem 0; display: flex; gap: 0.5rem; border-bottom: 1px dotted var(--rule); }
  .math-vars li:last-child { border-bottom: none; }
  .var-sym { font-family: "Courier New", monospace; color: var(--accent-dark); min-width: 60px; font-weight: 700; }

  /* ── Code panel ─────────────────────────────────────────────────── */
  .code-section {
    background: var(--code-bg);
    border: 1px solid #2a2a3e;
    border-radius: 4px;
    padding: 1.1rem;
    overflow: hidden;
  }

  .code-section .section-label { color: var(--code-accent); opacity: 0.8; }

  pre { font-family: "Courier New", Courier, monospace; font-size: 0.76rem; line-height: 1.7; overflow-x: auto; color: var(--code-fg); margin: 0; }

  .kw  { color: #bb9af7; } .cls { color: #7dcfff; } .fn  { color: #7aa2f7; }
  .str { color: #9ece6a; } .num { color: #ff9e64; } .cm  { color: var(--code-dim); font-style: italic; }
  .self{ color: #e0af68; }

  /* ── Intuition box ──────────────────────────────────────────────── */
  .intuition-box {
    margin-top: 1rem;
    border-left: 3px solid var(--accent);
    padding: 0.75rem 1rem;
    background: var(--highlight);
    border-radius: 0 4px 4px 0;
  }

  .intuition-box p { font-size: 0.85rem; color: var(--ink); line-height: 1.6; }

  /* ── Use/avoid chips ────────────────────────────────────────────── */
  .chips { display: flex; flex-wrap: wrap; gap: 0.4rem; margin-top: 0.75rem; }
  .chip { font-size: 0.63rem; padding: 0.18rem 0.5rem; border-radius: 3px; font-weight: 600; letter-spacing: 0.04em; }
  .chip.use   { background: #e8f5e9; color: #2e7d32; }
  .chip.avoid { background: #fce4ec; color: #c62828; }

  mjx-container { display: inline-block !important; }
</style>

<div class="nn-wrap">

<nav class="nn-nav" id="nav">
  <button class="active" data-tab="all">All Modules</button>
  <button data-tab="core">Core Layers</button>
  <button data-tab="norm">Normalization</button>
  <button data-tab="act">Activations</button>
  <button data-tab="attn">Attention</button>
  <button data-tab="loss">Loss Functions</button>
</nav>

<div class="module-grid active" id="tab-all"></div>
<div class="module-grid" id="tab-core"></div>
<div class="module-grid" id="tab-norm"></div>
<div class="module-grid" id="tab-act"></div>
<div class="module-grid" id="tab-attn"></div>
<div class="module-grid" id="tab-loss"></div>

</div>

<script>
const modules = [
  {
    id: 1, name: "Linear (Dense) Layer", subtitle: "Affine transformation", tag: "core", tagLabel: "Core Layer",
    math: String.raw`\mathbf{y} = \mathbf{x}\mathbf{W}^T + \mathbf{b}`,
    mathExtra: String.raw`\mathbf{W} \in \mathbb{R}^{d_{out} \times d_{in}}, \quad \mathbf{b} \in \mathbb{R}^{d_{out}}`,
    vars: [["x","Input of shape (N, d_in)"],["W","Weight matrix, shape (d_out, d_in)"],["b","Bias vector, shape (d_out,)"],["y","Output of shape (N, d_out)"]],
    note: "Initialized with Kaiming uniform by default. Bias adds a constant offset to each neuron's output, acting as a learnable threshold.",
    code: `<span class="kw">import</span> torch\n<span class="kw">import</span> torch.nn <span class="kw">as</span> nn\n\n<span class="cm"># Define</span>\nlayer = nn.<span class="cls">Linear</span>(<span class="num">512</span>, <span class="num">256</span>, bias=<span class="kw">True</span>)\n\n<span class="cm"># Custom weight init</span>\nnn.init.<span class="fn">kaiming_normal_</span>(layer.weight)\nnn.init.<span class="fn">zeros_</span>(layer.bias)\n\n<span class="cm"># Forward</span>\nx = torch.<span class="fn">randn</span>(<span class="num">32</span>, <span class="num">512</span>)\ny = <span class="fn">layer</span>(x)  <span class="cm"># → 32×256</span>`,
    intuition: "Every hidden unit computes a weighted sum of all inputs plus a bias — a learnable hyperplane decision boundary. Stack many of these and you get universal function approximation.",
    useTags: ["Fully-connected heads","MLPs","Projection layers","Classification"],
    avoidTags: ["Spatial data (use Conv)","Sequential data (use RNN/Attn)"],
  },
  {
    id: 2, name: "Conv2d", subtitle: "Spatial feature extraction", tag: "core", tagLabel: "Core Layer",
    math: String.raw`(f * k)_{i,j} = \sum_{m}\sum_{n} x_{i+m,\,j+n} \cdot k_{m,n}`,
    mathExtra: String.raw`d_{out} = \left\lfloor \frac{d_{in} + 2p - k}{s} \right\rfloor + 1`,
    vars: [["x","Input feature map (N,C,H,W)"],["k","Kernel of size k×k"],["p","Padding"],["s","Stride"]],
    note: "Weight sharing across spatial positions gives translation equivariance. Each output channel learns a distinct spatial pattern detector.",
    code: `<span class="cm"># Standard 3×3 conv (same padding)</span>\nconv = nn.<span class="cls">Conv2d</span>(\n    in_channels=<span class="num">64</span>,\n    out_channels=<span class="num">128</span>,\n    kernel_size=<span class="num">3</span>,\n    padding=<span class="num">1</span>,\n    stride=<span class="num">1</span>\n)\n\n<span class="cm"># Depthwise-separable (efficient)</span>\ndw = nn.<span class="cls">Conv2d</span>(<span class="num">64</span>, <span class="num">64</span>, <span class="num">3</span>, groups=<span class="num">64</span>, padding=<span class="num">1</span>)\npw = nn.<span class="cls">Conv2d</span>(<span class="num">64</span>, <span class="num">128</span>, <span class="num">1</span>)`,
    intuition: "A sliding window that shares weights across the image — detecting edges, textures, or patterns regardless of position. Depth-wise separable convolutions get ~8× fewer parameters at little accuracy cost.",
    useTags: ["Image classification","Object detection","Segmentation","Any 2D spatial data"],
    avoidTags: ["1D sequences (use Conv1d)","Non-spatial graphs"],
  },
  {
    id: 3, name: "Batch Normalization", subtitle: "Stabilize activations per mini-batch", tag: "norm", tagLabel: "Normalization",
    math: String.raw`\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma^2_\mathcal{B} + \epsilon}}`,
    mathExtra: String.raw`y_i = \gamma\,\hat{x}_i + \beta, \quad \mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^m x_i`,
    vars: [["μ_B","Mini-batch mean"],["σ²_B","Mini-batch variance"],["γ, β","Learnable scale and shift"],["ε","Numerical stability (1e-5)"]],
    note: "At inference, uses running statistics (exponential moving average). Breaks independence between samples — avoid with tiny batch sizes.",
    code: `bn = nn.<span class="cls">BatchNorm2d</span>(<span class="num">128</span>)\nbn1d = nn.<span class="cls">BatchNorm1d</span>(<span class="num">512</span>)\n\n<span class="cm"># Typical conv block</span>\nblock = nn.<span class="cls">Sequential</span>(\n    nn.<span class="cls">Conv2d</span>(<span class="num">64</span>, <span class="num">128</span>, <span class="num">3</span>, padding=<span class="num">1</span>),\n    nn.<span class="cls">BatchNorm2d</span>(<span class="num">128</span>),\n    nn.<span class="cls">ReLU</span>(inplace=<span class="kw">True</span>),\n)\n\n<span class="cm"># Freeze stats</span>\nbn.<span class="fn">eval</span>()\n<span class="kw">for</span> p <span class="kw">in</span> bn.<span class="fn">parameters</span>(): p.requires_grad = <span class="kw">False</span>`,
    intuition: "Normalizing each channel's distribution stabilizes activation statistics (the original paper framed this as reducing internal covariate shift) — gradients flow more uniformly, allowing higher learning rates.",
    useTags: ["CNNs","ResNets","Large batch training","Computer vision"],
    avoidTags: ["Batch size < 8","RNNs","Transformers (use LayerNorm)"],
  },
  {
    id: 4, name: "Layer Normalization", subtitle: "Normalize across the feature dimension", tag: "norm", tagLabel: "Normalization",
    math: String.raw`\text{LN}(\mathbf{x}) = \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta`,
    mathExtra: String.raw`\mu = \frac{1}{H}\sum_{i=1}^H x_i, \quad \sigma^2 = \frac{1}{H}\sum_{i=1}^H (x_i - \mu)^2`,
    vars: [["H","Number of features (not batch size)"],["γ, β","Learnable, shape (H,)"],["⊙","Element-wise multiply"]],
    note: "Statistics computed over the feature dimension, not the batch. Works identically at train and inference time, independent of batch size.",
    code: `<span class="cm"># Transformer standard</span>\nln = nn.<span class="cls">LayerNorm</span>(<span class="num">512</span>)\n\n<span class="cm"># Pre-norm (modern style)</span>\n<span class="kw">class</span> <span class="cls">PreNormBlock</span>(nn.<span class="cls">Module</span>):\n    <span class="kw">def</span> <span class="fn">__init__</span>(<span class="self">self</span>, d, layer):\n        <span class="kw">super</span>().<span class="fn">__init__</span>()\n        <span class="self">self</span>.norm = nn.<span class="cls">LayerNorm</span>(d)\n        <span class="self">self</span>.layer = layer\n\n    <span class="kw">def</span> <span class="fn">forward</span>(<span class="self">self</span>, x):\n        <span class="kw">return</span> x + <span class="self">self</span>.layer(<span class="self">self</span>.norm(x))\n\n<span class="cm"># Custom epsilon</span>\nln = nn.<span class="cls">LayerNorm</span>(<span class="num">512</span>, eps=<span class="num">1e-6</span>)`,
    intuition: "The workhorse of transformers. Normalizes within each sample independently, so batch size doesn't matter — critical for variable-length sequences and language models.",
    useTags: ["Transformers","LLMs","NLP","Any variable-batch scenario"],
    avoidTags: ["CNNs on images (BatchNorm or GroupNorm preferred)"],
  },
  {
    id: 5, name: "ReLU & Variants", subtitle: "The most common nonlinearities", tag: "act", tagLabel: "Activation",
    math: String.raw`\text{ReLU}(x) = \max(0, x)`,
    mathExtra: String.raw`\text{GELU}(x) = x \cdot \Phi(x) \approx x \cdot \sigma(1.702\,x)`,
    vars: [["Φ(x)","CDF of standard normal"],["σ(x)","Sigmoid function"]],
    note: "ReLU: O(1), sparse activations, dead neuron risk. GELU: smooth, probabilistic interpretation, used in GPT/BERT. SiLU=x·σ(x) (Swish) used in LLaMA.",
    code: `relu  = nn.<span class="cls">ReLU</span>(inplace=<span class="kw">True</span>)\ngelu  = nn.<span class="cls">GELU</span>()\nsilu  = nn.<span class="cls">SiLU</span>()  <span class="cm"># Swish</span>\nlrelu = nn.<span class="cls">LeakyReLU</span>(<span class="num">0.01</span>)\n\n<span class="cm"># Functional form</span>\n<span class="kw">import</span> torch.nn.functional <span class="kw">as</span> F\ny = F.<span class="fn">gelu</span>(x)\n\n<span class="cm"># Dead neuron diagnostic</span>\ndead = (x.<span class="fn">detach</span>() &lt;= <span class="num">0</span>).<span class="fn">float</span>().<span class="fn">mean</span>()`,
    intuition: "Without a nonlinearity, a stack of linear layers collapses into a single linear map. ReLU's sparsity is computationally cheap; GELU's smooth gradient curve helps optimization in deep transformers.",
    useTags: ["ReLU: CNNs, MLPs","GELU: Transformers, LLMs","SiLU: Vision transformers"],
    avoidTags: ["Sigmoid/Tanh in hidden layers (vanishing gradients at depth)"],
  },
  {
    id: 6, name: "Softmax", subtitle: "Probability distribution over classes", tag: "act", tagLabel: "Activation",
    math: String.raw`\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^K e^{x_j}}`,
    mathExtra: String.raw`\text{log-softmax}(x_i) = x_i - \log\!\sum_j e^{x_j}`,
    vars: [["x_i","Raw logit for class i"],["K","Number of classes"]],
    note: "Numerically stable: subtract max(x) before exponentiating. PyTorch's CrossEntropyLoss applies log-softmax internally — don't add softmax before it!",
    code: `<span class="cm"># Stable softmax</span>\n<span class="kw">def</span> <span class="fn">stable_softmax</span>(x):\n    x = x - x.<span class="fn">max</span>(dim=-<span class="num">1</span>, keepdim=<span class="kw">True</span>).values\n    <span class="kw">return</span> F.<span class="fn">softmax</span>(x, dim=-<span class="num">1</span>)\n\n<span class="cm"># Temperature scaling</span>\n<span class="kw">def</span> <span class="fn">softmax_T</span>(x, T=<span class="num">1.0</span>):\n    <span class="kw">return</span> F.<span class="fn">softmax</span>(x / T, dim=-<span class="num">1</span>)\n\n<span class="cm"># For loss: don't apply softmax</span>\nloss = nn.<span class="cls">CrossEntropyLoss</span>()(logits, targets)`,
    intuition: "Squashes arbitrary logits into a valid probability simplex (sum to 1, all positive). Temperature T < 1 sharpens; T > 1 smooths (used in knowledge distillation).",
    useTags: ["Classification outputs","Attention weights","Autoregressive sampling"],
    avoidTags: ["Before CrossEntropyLoss (double-applies normalization)"],
  },
  {
    id: 7, name: "Multi-Head Attention", subtitle: "Scaled dot-product attention, parallel heads", tag: "attn", tagLabel: "Attention",
    math: String.raw`\text{Attn}(Q,K,V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V`,
    mathExtra: String.raw`\text{MHA} = \text{Concat}(\text{head}_1,\ldots,\text{head}_h)\,W^O`,
    vars: [["Q,K,V","Query, Key, Value matrices"],["d_k","Head dimension (d_model/h)"],["h","Number of attention heads"],["W^O","Output projection"]],
    note: "The √d_k scaling prevents dot products from growing large and pushing softmax into saturated regions. Each head attends to different representational subspaces.",
    code: `<span class="kw">class</span> <span class="cls">MHA</span>(nn.<span class="cls">Module</span>):\n    <span class="kw">def</span> <span class="fn">__init__</span>(<span class="self">self</span>, d, h=<span class="num">8</span>):\n        <span class="kw">super</span>().<span class="fn">__init__</span>()\n        <span class="self">self</span>.attn = nn.<span class="cls">MultiheadAttention</span>(\n            d, h, dropout=<span class="num">0.1</span>,\n            batch_first=<span class="kw">True</span>\n        )\n\n    <span class="kw">def</span> <span class="fn">forward</span>(<span class="self">self</span>, x, mask=<span class="kw">None</span>):\n        <span class="kw">return</span> <span class="self">self</span>.attn(x, x, x, attn_mask=mask)[<span class="num">0</span>]`,
    intuition: "Attention lets every position directly attend to every other — no recurrence needed. Multiple heads capture different relationship types in parallel.",
    useTags: ["Transformers","LLMs","Vision transformers","Self/Cross-attention"],
    avoidTags: ["Very long sequences (O(n²) — use FlashAttention or sparse variants)"],
  },
  {
    id: 8, name: "Dropout", subtitle: "Stochastic regularization", tag: "core", tagLabel: "Core Layer",
    math: String.raw`y_i = \begin{cases} 0 & \text{with prob. } p \\ \dfrac{x_i}{1-p} & \text{with prob. } 1-p \end{cases}`,
    mathExtra: String.raw`\mathbb{E}[y_i] = x_i \quad \text{(inverted dropout)}`,
    vars: [["p","Drop probability (typically 0.1–0.5)"]],
    note: "PyTorch uses inverted dropout: scales active units by 1/(1-p) during training so expected values match at inference without any scaling adjustment.",
    code: `drop   = nn.<span class="cls">Dropout</span>(p=<span class="num">0.1</span>)\ndrop2d = nn.<span class="cls">Dropout2d</span>(p=<span class="num">0.1</span>)\n\nmodel.<span class="fn">train</span>()   <span class="cm"># dropout active</span>\nmodel.<span class="fn">eval</span>()    <span class="cm"># dropout disabled</span>\n\n<span class="cm"># Monte-Carlo inference</span>\n<span class="kw">def</span> <span class="fn">mc_predict</span>(model, x, n=<span class="num">100</span>):\n    model.<span class="fn">train</span>()\n    preds = [<span class="fn">model</span>(x) <span class="kw">for</span> _ <span class="kw">in</span> <span class="fn">range</span>(n)]\n    <span class="kw">return</span> torch.<span class="fn">stack</span>(preds).<span class="fn">mean</span>(<span class="num">0</span>)`,
    intuition: "Randomly zeroing activations forces the network to learn redundant representations. Monte-Carlo dropout gives free uncertainty estimates at inference.",
    useTags: ["After Linear layers","Transformers","General regularization"],
    avoidTags: ["After BatchNorm (interferes with statistics)","Final conv layers in CNNs"],
  },
  {
    id: 9, name: "Embedding", subtitle: "Discrete token → dense vector", tag: "core", tagLabel: "Core Layer",
    math: String.raw`\mathbf{e}_i = \mathbf{W}_E[i], \quad \mathbf{W}_E \in \mathbb{R}^{V \times d}`,
    mathExtra: String.raw`\text{lookup: } \mathbf{e} = \mathbf{W}_E^T \mathbf{o}_i`,
    vars: [["V","Vocabulary size"],["d","Embedding dimension"],["i","Token index"]],
    note: "A learnable lookup table. Semantically similar tokens cluster together in embedding space after training.",
    code: `emb = nn.<span class="cls">Embedding</span>(\n    num_embeddings=<span class="num">32000</span>,\n    embedding_dim=<span class="num">512</span>,\n    padding_idx=<span class="num">0</span>,\n)\n\n<span class="cm"># Positional embedding</span>\npos_emb = nn.<span class="cls">Embedding</span>(<span class="num">2048</span>, <span class="num">512</span>)\npos = torch.<span class="fn">arange</span>(seq_len)\nx = emb(ids) + pos_emb(pos)\n\n<span class="cm"># Tie input/output weights</span>\nlm_head.weight = emb.weight`,
    intuition: "Maps discrete symbols into a continuous geometric space where the model can reason about similarity through standard linear algebra.",
    useTags: ["Language models","Recommendation systems","Any categorical variable"],
    avoidTags: ["Continuous inputs (use Linear directly)"],
  },
  {
    id: 10, name: "Cross-Entropy Loss", subtitle: "Classification objective", tag: "loss", tagLabel: "Loss Function",
    math: String.raw`\mathcal{L} = -\sum_{i=1}^N y_i \log \hat{p}_i`,
    mathExtra: String.raw`= -\log \hat{p}_c \quad \text{(for ground-truth class } c\text{)}`,
    vars: [["y_i","One-hot target label"],["p̂_i","Predicted probability"],["c","Ground-truth class index"]],
    note: "PyTorch's nn.CrossEntropyLoss = log_softmax + NLLLoss. Takes raw logits as input.",
    code: `criterion = nn.<span class="cls">CrossEntropyLoss</span>(\n    weight=class_weights,\n    ignore_index=<span class="num">-100</span>,\n    label_smoothing=<span class="num">0.1</span>,\n)\n\nlogits = model(x)  <span class="cm"># raw, not softmax</span>\nloss = <span class="fn">criterion</span>(logits, targets)\n\n<span class="cm"># Inference probabilities</span>\nprobs = F.<span class="fn">softmax</span>(logits, dim=-<span class="num">1</span>)`,
    intuition: "Minimizing cross-entropy is equivalent to maximum likelihood estimation of the class distribution. It penalizes confident wrong predictions far more harshly than hesitant ones.",
    useTags: ["Multi-class classification","Language model next-token prediction"],
    avoidTags: ["Regression (use MSE/Huber)","Multi-label (use BCE per class)"],
  },
  {
    id: 11, name: "LSTM", subtitle: "Long Short-Term Memory cell", tag: "core", tagLabel: "Core Layer",
    math: String.raw`\mathbf{f}_t = \sigma\!\left(W_f [\mathbf{h}_{t-1},\mathbf{x}_t]+b_f\right)`,
    mathExtra: String.raw`\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t`,
    vars: [["f_t","Forget gate"],["i_t","Input gate"],["c_t","Cell state (long-term memory)"],["h_t","Hidden state (output)"]],
    note: "Four gate matrices, each of size d_h × (d_h + d_x). Total parameters: 4 × d_h × (d_h + d_x + 1); PyTorch keeps separate input and hidden biases, so it stores 4 × d_h × (d_h + d_x + 2).",
    code: `lstm = nn.<span class="cls">LSTM</span>(\n    input_size=<span class="num">128</span>,\n    hidden_size=<span class="num">256</span>,\n    num_layers=<span class="num">2</span>,\n    batch_first=<span class="kw">True</span>,\n    dropout=<span class="num">0.2</span>,\n    bidirectional=<span class="kw">True</span>,\n)\n\n<span class="cm"># 2 layers × 2 directions = 4 state slots</span>\nh0 = torch.<span class="fn">zeros</span>(<span class="num">4</span>, batch, <span class="num">256</span>)\nc0 = torch.<span class="fn">zeros</span>(<span class="num">4</span>, batch, <span class="num">256</span>)\nout, (hn, cn) = <span class="fn">lstm</span>(x, (h0, c0))`,
    intuition: "The forget gate solves vanilla RNN's vanishing gradient problem. Largely superseded by Transformers for long sequences, but still useful for streaming/online scenarios.",
    useTags: ["Time series","Audio","Streaming inference","Stateful processing"],
    avoidTags: ["Long contexts (>512 tokens) where Transformers are faster"],
  },
  {
    id: 12, name: "MSE & Huber Loss", subtitle: "Regression objectives", tag: "loss", tagLabel: "Loss Function",
    math: String.raw`\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^N (y_i - \hat{y}_i)^2`,
    mathExtra: String.raw`\mathcal{L}_\delta = \begin{cases} \tfrac{1}{2}r^2 & |r| \le \delta \\ \delta(|r|-\tfrac{\delta}{2}) & |r| > \delta \end{cases}`,
    vars: [["r","Residual y_i − ŷ_i"],["δ","Huber threshold (default 1.0)"]],
    note: "MSE: sensitive to outliers (quadratic growth). Huber: transitions to L1 beyond δ, outlier-robust while maintaining smooth gradients near zero.",
    code: `mse   = nn.<span class="cls">MSELoss</span>()\nmae   = nn.<span class="cls">L1Loss</span>()\nhuber = nn.<span class="cls">HuberLoss</span>(delta=<span class="num">1.0</span>)\n\nloss = <span class="fn">mse</span>(pred, target)\n\n<span class="cm"># Log-cosh (very smooth)</span>\n<span class="kw">def</span> <span class="fn">log_cosh</span>(pred, target):\n    r = pred - target\n    <span class="kw">return</span> torch.<span class="fn">log</span>(torch.<span class="fn">cosh</span>(r)).<span class="fn">mean</span>()\n\n<span class="cm"># Weighted MSE</span>\nloss = ((pred - target)**<span class="num">2</span> * weights).<span class="fn">mean</span>()`,
    intuition: "MSE penalizes large errors quadratically — great when targets are clean, terrible with outliers. Huber gives you the best of both worlds.",
    useTags: ["MSE: clean regression","Huber: regression with outliers, RL value functions"],
    avoidTags: ["Classification (use CrossEntropy)","Distributions (use KL divergence)"],
  },
];

function renderCard(m, idx) {
  return `
  <div class="module-card">
    <div class="card-header" onclick="toggleCard(this)">
      <span class="card-number">${String(m.id).padStart(2,'0')}</span>
      <div>
        <div class="card-title">${m.name}</div>
        <div class="card-subtitle">${m.subtitle}</div>
      </div>
      <div style="display:flex;flex-direction:column;align-items:flex-end;gap:0.35rem">
        <span class="card-tag">${m.tagLabel}</span>
        <span class="toggle-arrow">▾</span>
      </div>
    </div>
    <div class="card-body ${idx === 0 ? 'open' : ''}">
      <div class="card-cols">
        <div class="math-section">
          <div class="section-label">Mathematics</div>
          <div class="math-block">\\(${m.math}\\)</div>
          <div class="math-block">\\(${m.mathExtra}\\)</div>
          <ul class="math-vars">
            ${m.vars.map(v => `<li><span class="var-sym">${v[0]}</span><span>${v[1]}</span></li>`).join('')}
          </ul>
          <div class="math-note">${m.note}</div>
        </div>
        <div class="code-section">
          <div class="section-label">PyTorch</div>
          <pre>${m.code}</pre>
        </div>
      </div>
      <div class="intuition-box">
        <p><strong>Intuition:</strong> ${m.intuition}</p>
        <div class="chips">
          ${m.useTags.map(t => `<span class="chip use">✓ ${t}</span>`).join('')}
          ${m.avoidTags.map(t => `<span class="chip avoid">✗ ${t}</span>`).join('')}
        </div>
      </div>
    </div>
  </div>`;
}

function toggleCard(header) {
  const body = header.nextElementSibling;
  const arrow = header.querySelector('.toggle-arrow');
  const isOpen = body.classList.contains('open');
  body.classList.toggle('open', !isOpen);
  arrow.classList.toggle('open', !isOpen);
}

const tabAll = document.getElementById('tab-all');
const tabMaps = { core:'tab-core', norm:'tab-norm', act:'tab-act', attn:'tab-attn', loss:'tab-loss' };

modules.forEach((m, idx) => {
  tabAll.innerHTML += renderCard(m, idx);
  const tabEl = document.getElementById(tabMaps[m.tag]);
  if (tabEl) tabEl.innerHTML += renderCard(m, tabEl.children.length);
});

if (window.MathJax && MathJax.typesetPromise) { MathJax.typesetPromise(); }

document.getElementById('nav').addEventListener('click', e => {
  const btn = e.target.closest('button');
  if (!btn) return;
  document.querySelectorAll('.nn-nav button').forEach(b => b.classList.remove('active'));
  document.querySelectorAll('.module-grid').forEach(g => g.classList.remove('active'));
  btn.classList.add('active');
  document.getElementById('tab-' + btn.dataset.tab).classList.add('active');
});
</script>]]></content><author><name>Sandipan Das</name></author><category term="NN" /><category term="Core ML" /><category term="AI" /><category term="MachineLearning" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Generative AI for all: Video Diffusion models</title><link href="https://mrsandipandas.github.io/posts/2025/09/genai/videodiffusion" rel="alternate" type="text/html" title="Generative AI for all: Video Diffusion models" /><published>2025-09-21T00:00:00-07:00</published><updated>2025-09-21T00:00:00-07:00</updated><id>https://mrsandipandas.github.io/posts/2025/09/genai/genai-video-diffusion</id><content type="html" xml:base="https://mrsandipandas.github.io/posts/2025/09/genai/videodiffusion"><![CDATA[<!--more-->

<h1 id="introduction">Introduction</h1>

<p>Have you ever played with a flipbook? You draw a picture on each page, and when you flip the pages really fast, the pictures move like a cartoon. Fun, right?</p>

<p><img src="/images/posts/2025-09-21-genai-video-diffusion-images/1.png" alt="An example flip book of Bluey." /></p>

<p>Now, let’s make it magical! Imagine a magic machine that doesn’t just draw one picture for you - it creates a whole flip book! And not just any flip book… this one makes the pictures move smoothly, like real life. Why is that important? Because a video is nothing more than a bunch of pictures shown one after another. If the pictures change too much from one page to the next, the motion looks jumpy and weird. But if each picture changes just a little, the motion feels natural and smooth.</p>

<p>That’s what video diffusion models do! They work super hard to make sure every picture in the flip book connects nicely to the next one, so your story flows like magic.</p>

<h1 id="on-learning-of-magical-machines">On learning of magical machines</h1>

<p>Let us continue from our classroom story. The teacher comes back to class and says,</p>

<blockquote>
  <p>“Guess what, kids? Today, we’re not just drawing one picture - we’re making a whole magic flip book. But, first let us see some before you get to make your own!”</p>
</blockquote>

<p>She brings out some flip books filled with clear and neat pictures in a sequence. But then she remembers her last fun experiment and decides to spice things up! Instead of showing the perfect flip books, she grabs a bag of magic dust and sprinkles it all over the pages, covering the pictures in funny little dots.</p>

<hr />

<p><img src="/images/posts/2025-09-21-genai-video-diffusion-images/2.png" alt="Flipbook Bluey" /></p>

<hr />

<p><img src="/images/posts/2025-09-21-genai-video-diffusion-images/3.png" alt="Flipbook Spaceship" /></p>

<hr />

<p>The students giggled and looked closely at the different flip books, trying to spot the objects of interest hidden in the noisy, dusty pictures. As before, the teacher then divided the class into two teams and announced that each team would learn about the pictures in a different way.</p>

<h2 id="team-cnn-the-keyhole-explorers">Team CNN: The Keyhole Explorers</h2>

<p>The teacher gathered the students around and revealed four image frames of Bluey running joyfully along the beach, each frame capturing a different moment.</p>

<p>To make the learning process magical, the teacher appointed a team lead and said:</p>

<blockquote>
  <p>“The team lead’s special job will be to decide the order in which the students stand. But here’s the twist: each time, the order will be completely random!”</p>
</blockquote>

<p>The adventure began.</p>

<ul>
  <li><strong>First round</strong>: The team lead called out a random order. Each student, clutching their magic cardboard keyhole, stepped up to peek at a tiny part of one frame. Maybe one saw Bluey’s ear, another the sparkling sand, another a patch of sky, and each checked their understanding with the teacher. Their task was to study the fine details of their assigned part, memorizing every curve and color.</li>
  <li><strong>Next round</strong>: The team lead shuffled the order again. Now, each student peered through their keyhole at a different frame and a different part. The details changed: sometimes it was Bluey’s tail, sometimes a seashell, sometimes the foamy edge of a wave. Each student compared what they saw with what they’d learned before, noticing how the same object could look different in another frame.</li>
  <li><strong>And Again</strong>: The process repeated, with the team lead inventing new orders each time. Students swapped frames and parts, learning about every detail from every possible perspective. The classroom buzzed with excitement as discoveries piled up.</li>
  <li><strong>Exhausting All Orders</strong>: The team lead kept going until every possible order had been tried. By the end, each student had seen every part of every frame, but always through their tiny keyhole - never the whole picture at once.</li>
</ul>

<p>The students became experts at recognizing tiny details, no matter where they appeared or in which frame. They were like detectives, piecing together the story from small clues. Even though they never saw the whole image at once, their combined knowledge helped them understand the entire sequence of Bluey’s run.</p>

<p><img src="/images/posts/2025-09-21-genai-video-diffusion-images/4.png" alt="Learning Process" /></p>

<h2 id="team-transformers-the-puzzle-patchers">Team Transformers: The Puzzle Patchers</h2>

<p>While Team CNN explored the flip book frame by frame, for Team Transformers the teacher had a different approach. She said:</p>

<blockquote>
  <p>“You will not have any team lead. But I want all of you to talk to each other to figure out the details.”</p>
</blockquote>

<p>Instead of looking at one image at a time, the teacher gave them a giant board filled with patches from all the images in the video flip book - like a huge jigsaw puzzle scattered across time. Each student received several puzzle patches, but these patches weren’t just from one frame - they came from different moments in the story. Some showed Bluey’s tail in the first frame, others the waves in the middle, and some the sky in the last frame. The challenge was to figure out how all these pieces fit together, not just within a single image, but across the entire sequence. The teacher encouraged the students to talk, swap patches, and look for patterns that connected the beginning, middle, and end of the video.</p>

<blockquote>
  <p>“Notice how Bluey’s tail moves from left to right across the frames,” she said. “Or how the color of the sky changes as the story unfolds. Your job is to connect these clues and build the whole story in your minds.”</p>
</blockquote>

<p>As the students worked together, they realized something magical: by seeing patches from all the images at once, they could understand how every part of the video was related. They spotted patterns that stretched across time - how the sand sparkled in every frame, how Bluey’s run became faster, and how the waves rolled in and out.</p>

<p>Unlike the Keyhole Explorers, the Puzzle Patchers did not just learn about one image at a time. They learned about the whole video - how every patch, every detail, and every moment fit together to create a smooth, flowing story. The teacher then continued the experiment with the other flip books she had.</p>

<h1 id="the-magic-movie-challenge">The magic movie challenge</h1>

<p>The teacher clapped her hands and wrote today’s challenge on the board, in big, shimmering chalk: <em>Bluey walking on the moon after coming out of the spaceship</em>.</p>

<blockquote>
  <p>“Class,” she said, “this is our story prompt. Every frame you draw should follow this idea. Bluey steps out, touches moon dust, takes a few bouncy steps, and the spaceship glows behind. We want the whole flipbook to feel like one smooth scene.”</p>
</blockquote>

<p>She placed blank pages on every desk and, as before, sprinkled a thin fog of magic dust over them. “We’ll clear the dust little by little,” she smiled. “Each pass makes the pictures cleaner. After every pass I’ll check how well your drawings match the story and give you feedback so you can improve.”</p>

<blockquote>
  <p>“Small steps, steady changes,” she repeated. “Let the moon feel light beneath Bluey’s feet.”</p>
</blockquote>

<p><strong>Team CNN: The Keyhole Explorers</strong> - The Team Lead stood up and said, “Let’s draw our story one frame at a time! I’ll tell you who goes first.”</p>

<ul>
  <li><strong>First round</strong>: The first student drew Bluey stepping out of the spaceship. The next student drew Bluey taking a small step onto the moon. The third added Bluey’s paw touching the moon dust. Each student focused on their own frame, making sure to change just a little bit from the last one—like flipping through a cartoon book.</li>
  <li><strong>Teacher’s feedback</strong>: The teacher looked at all the pictures and said, “Nice job! But Bluey’s steps are too big, and the stars moved. Next time, make smaller changes and keep the background the same.”</li>
  <li><strong>Next rounds</strong>: The Team Lead switched up who drew which frame. Some students worked on footprints, others on the spaceship. With each round, the drawings got smoother and more connected. The team learned to copy a little from the frame before and make tiny changes, so the story looked just right.</li>
</ul>

<p><img src="/images/posts/2025-09-21-genai-video-diffusion-images/5.png" alt="Frame Alignment" /></p>

<p><strong>Team Transformers: The Puzzle Patchers</strong> - The teacher smiled and said, “You don’t need a team lead. Instead, talk to each other and plan together!”</p>

<ul>
  <li><strong>First round</strong>: The students gathered around a big board showing all the frames at once. They chatted and decided who would draw which part of the story. “I’ll draw Bluey stepping out of the spaceship!” “I’ll add the moon dust and footprints!” “I’ll make sure the stars and spaceship look the same in every frame!” They worked together, sharing ideas and making sure every frame fit perfectly with the others - like putting together a giant puzzle.</li>
  <li><strong>Teacher’s feedback</strong>: The teacher looked at their drawings and said, “Great teamwork! But Bluey’s footprints are too close together, and the spaceship glow changes. Next time, keep things steady and spread out the footprints.”</li>
  <li><strong>Next rounds</strong>: The students talked even more, fixing their pictures and helping each other. They made sure Bluey’s walk looked smooth and the moon scene stayed the same. By the end, their flip book told the story just right—everyone’s drawings matched up, and Bluey’s adventure flowed from start to finish!</li>
</ul>

<p>In the end, both teams made something really beautiful out of the dusty mess, and the pictures looked almost perfect! As you can see in Figure 6, Team Transformers did a great job showing exactly what the teacher asked for - <em>Bluey walking on the moon after coming out of the spaceship</em>.</p>

<p><img src="/images/posts/2025-09-21-genai-video-diffusion-images/6.png" alt="Video generation" /></p>

<h1 id="the-magic-machine-analogy">The Magic Machine Analogy</h1>

<p>In the world of modern AI, our story mirrors the inner workings of a magic machine: a generative model that can turn words into moving pictures. Here is how the pieces fit together, an analogy a more experienced reader may recognize:</p>

<ul>
  <li><strong>Learning Backbone → The Builders</strong>: The machine needs strong builders to understand patterns. These are like the CNN team (masters of local details) and the Transformer team (experts in global context). Together, they form the backbone that learns how images are structured. In the diffusion models inside this magical machine, the CNN team also gets a special helper: the class monitor, a wise and watchful figure who ensures every student’s work fits together. Whenever a student forgets a detail from earlier, the monitor whispers a reminder, helping everyone remember what came before and keeping the story consistent from start to finish. This is analogous to the skip-connection mechanism in U-Net-style backbones (a minimal sketch follows this list). The Transformer team needs no monitor; they can talk with each other to figure out the flow.</li>
  <li><strong>Language Alignment → The Teacher’s Instructions</strong>: Just as the teacher gave the clear direction “Bluey walking on the moon after coming out of the spaceship”, the magic machine uses language-image alignment (like CLIP) to connect what we say with what it draws.</li>
  <li><strong>Generative Power → The Creative Drawing</strong>: When the machine starts creating, it’s like the students drawing the scene from the teacher’s words. This is the essence of Generative AI - turning text into sequential pictures.</li>
  <li><strong>Mode Collapse → Everyone Drawing the Same Thing</strong>: To keep everyone’s drawings from looking the same, the teacher adds the dust. Identical outputs are what we call mode collapse (the model producing similar outputs instead of diverse ones); starting the learning process from random noise helps avoid it.</li>
</ul>
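
<p>For readers who want to see the class monitor in code: here is a minimal, illustrative PyTorch sketch of a skip connection in a toy U-Net level. The module and layer sizes are made up for illustration; the point is only that the decoder gets the encoder’s early features handed back, so details from “before” are never forgotten.</p>

<pre><code class="language-python">import torch
import torch.nn as nn

class TinySkipBlock(nn.Module):
    # Toy U-Net level: the decoder receives the encoder's features again,
    # like the class monitor whispering a reminder of what came before.
    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.Conv2d(ch, ch, 3, stride=2, padding=1)
        self.up   = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)  # reads the reminder

    def forward(self, x):
        skip = x                          # the monitor keeps a copy
        h = torch.relu(self.down(x))      # coarse processing downstream
        h = torch.relu(self.up(h))        # back to the original resolution
        h = torch.cat([h, skip], dim=1)   # the whispered reminder
        return self.fuse(h)

x = torch.randn(1, 64, 32, 32)
y = TinySkipBlock()(x)                    # same shape out: (1, 64, 32, 32)
</code></pre>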

<h1 id="conclusion">Conclusion</h1>

<p>In the end, the classroom experiment showed something amazing: making pictures and movies isn’t just about copying what you see - it’s about building something new from a messy start! Team CNN became experts at tiny details, Team Transformers learned how everything fits together, and the teacher’s instructions helped everyone turn words into pictures. The students learned to construct the picture from scratch, guided only by the story. This mirrors how modern generative models work:</p>

<ul>
  <li>Backbones like CNNs and Transformers provide the foundation.</li>
  <li>Language alignment (such as CLIP) connects words to images.</li>
  <li>Diffusion strategies start from noise and iteratively de-noise, ensuring diversity and creativity while staying true to the prompt.</li>
  <li>Frames are kept aligned through a shared global context, typically temporal attention across the whole sequence (see the sketch after this list).</li>
</ul>
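
<p>The last point, frame alignment through a shared global context, is usually realized with temporal attention: the video tensor is reshaped so that attention runs across the frames at every spatial location. A minimal, illustrative PyTorch sketch follows; the layer sizes are arbitrary.</p>

<pre><code class="language-python">import torch
import torch.nn as nn

def temporal_attention(video, attn):
    # video: (B, T, C, H, W). At every spatial location, each of the T
    # frames attends to all the others, so the frames agree over time.
    B, T, C, H, W = video.shape
    tokens = video.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
    out, _ = attn(tokens, tokens, tokens)
    return out.reshape(B, H, W, T, C).permute(0, 3, 4, 1, 2)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
video = torch.randn(2, 8, 64, 16, 16)      # a tiny 8-frame "flip book"
aligned = temporal_attention(video, attn)  # same shape: (2, 8, 64, 16, 16)
</code></pre>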

<p>Just like the students’ final flip books, these magical models turn random chaos into wonderful movies—one step at a time, guided by the story.<br />
<em>“From dust to detail, video diffusion models create magic, turning words into moving pictures!”</em></p>

<h1 id="glossary">Glossary</h1>

<table>
  <thead>
    <tr>
      <th>Term</th>
      <th>Meaning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>AI</strong></td>
      <td>Artificial Intelligence</td>
    </tr>
    <tr>
      <td><strong>CNN</strong></td>
      <td>Convolutional Neural Network</td>
    </tr>
    <tr>
      <td><strong>CLIP</strong></td>
      <td>Contrastive Language Image Pretraining</td>
    </tr>
    <tr>
      <td><strong>Diffusion</strong></td>
      <td>The process of gradually adding noise to a picture and then removing it</td>
    </tr>
    <tr>
      <td><strong>Factorized Diffusion</strong></td>
      <td>Diffusion split into separate stages (for example, spatial detail and temporal motion)</td>
    </tr>
    <tr>
      <td><strong>Language Alignment</strong></td>
      <td>Connecting words with images so text can guide what is drawn</td>
    </tr>
    <tr>
      <td><strong>Mode Collapse</strong></td>
      <td>When a computer generates the same picture over and over</td>
    </tr>
    <tr>
      <td><strong>Transformers</strong></td>
      <td>Neural network architecture for context understanding</td>
    </tr>
  </tbody>
</table>]]></content><author><name>Sandipan Das</name></author><category term="GenAI" /><category term="VideoDiffusionModels" /><category term="WANAI" /><category term="MochiAI" /><category term="StableVideoDiffusion" /><category term="MachineLearning" /><category term="Storytelling" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Generative AI for all: Diffusion models</title><link href="https://mrsandipandas.github.io/posts/2025/09/genai/diffusion" rel="alternate" type="text/html" title="Generative AI for all: Diffusion models" /><published>2025-09-14T00:00:00-07:00</published><updated>2025-09-14T00:00:00-07:00</updated><id>https://mrsandipandas.github.io/posts/2025/09/genai/genai-diffusion</id><content type="html" xml:base="https://mrsandipandas.github.io/posts/2025/09/genai/diffusion"><![CDATA[<!--more-->

<h1 id="introduction">Introduction</h1>

<p>Imagine you have a beautiful drawing of a <code class="language-plaintext highlighter-rouge">Bluey</code> — a playful six-year-old Blue Heeler pup. Now, imagine you slowly sprinkle tiny dots of dust all over it, little by little, until the whole picture turns into a messy cloud of dust. That’s called diffusion — the process of turning a clear image into a noisy one.</p>

<p><img src="/images/posts/2025-09-14-genai-diffusion-images/1.png" alt="Forward process of slowly adding random dust particles in steps to the image." /></p>

<p>Now here’s the cool part: what if you had a magical machine that could look at that cloud of dust and slowly clean it up, step by step, until the <code class="language-plaintext highlighter-rouge">Bluey</code> drawing comes back? That’s what a diffusion model does! Guess what? These magic models can even make pictures just by listening to words! But shhh… let’s keep that secret in our pocket for later. And how can this magical machine make brand-new pictures of Bluey that it was never shown? If you’re still curious, keep reading—it’s a pretty cool story!</p>

<p><img src="/images/posts/2025-09-14-genai-diffusion-images/2.png" alt="Reverse process to remove the dust particles step by step in the image." /></p>
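
<p>For grown-ups peeking over the kids’ shoulders: the dust-sprinkling has a precise form. With a fixed noise schedule, the noisy picture at any step can be written in closed form, so we never have to add the dust one grain at a time. Here is a minimal, illustrative PyTorch sketch of this forward process; the schedule values and the <code class="language-plaintext highlighter-rouge">add_dust</code> helper are made up for illustration.</p>

<pre><code class="language-python">import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # how much dust each step adds
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # how much picture survives

def add_dust(x0, t):
    # Jump straight to step t: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps
    eps = torch.randn_like(x0)                 # fresh random dust
    a = alpha_bar[t]
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps, eps

bluey = torch.rand(3, 64, 64)                  # stand-in for the Bluey drawing
x_mid, eps = add_dust(bluey, t=500)            # halfway to a cloud of dust
</code></pre>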

<h1 id="on-learning-of-magical-machines">On learning of magical machines</h1>

<p>Imagine a lively classroom filled with curious kids. One day, the teacher walks in with a big smile and announces a challenge:</p>

<blockquote>
  <p>Today, I want you to draw something amazing: “Bluey sitting in a spaceship flying near the moon.”</p>
</blockquote>

<p>The kids cheer — but the teacher has a twist. Before they start drawing, the teacher brings in lots of pictures of Bluey, spaceships, and moons. The teacher then divides the class into two teams and announces that each team will learn about the pictures in a different way.</p>

<h2 id="team-cnn-the-keyhole-explorers">Team CNN: The Keyhole Explorers</h2>

<p>For the first team, the teacher had a quirky idea. She handed each student a tiny cardboard keyhole and said,</p>

<blockquote>
  <p>You’ll look at the pictures one tiny piece at a time — like peeking through a secret portal!</p>
</blockquote>

<p>So the students lined up, one by one, peeking through their keyholes at pictures of Bluey, spaceships, and the moon. One student saw a fuzzy curve. “Hmm… that looks like an ear!” the teacher whispered. Another spotted a shiny patch. “That’s probably the spaceship’s window,” she said. Every time they peeked, the teacher gave them clues — helping them understand what that little piece might be. Over time, the students became mini detectives, learning to recognize each part of the image from just a glimpse. They didn’t see the whole picture at once, but they got really good at figuring out the details. They became puzzle solvers who knew each piece by heart — even if they never saw the full puzzle all at once.</p>

<p><img src="/images/posts/2025-09-14-genai-diffusion-images/3.png" alt="Teams" /></p>

<h2 id="team-transformers-the-puzzle-patchers">Team Transformers: The Puzzle Patchers</h2>

<p>For the second team, the teacher had a totally different idea — and it felt like a game of group detective work. She snipped all the pictures of Bluey, spaceships, and moons into small square patches and handed one to each student. Then she said with a grin:</p>

<blockquote>
  <p>Your patch is just one tiny piece of the puzzle. Talk to your friends, figure out what your patch might be, and where it fits. Then tell me what you think — and I’ll help you get it right!</p>
</blockquote>

<p>So the classroom buzzed with excitement. One student shouted, “Mine looks like part of an eye!” Another replied, “Yours? That might be the moon’s surface!” They huddled together, comparing patches, swapping ideas, and slowly piecing together the big picture — like assembling a giant jigsaw puzzle without the box cover. Every time they thought they had it figured out, they’d run to the teacher, who’d give them feedback:</p>

<blockquote>
  <p>Hmm… close! But that patch belongs to the spaceship’s wing, not Bluey’s tail.</p>
</blockquote>

<p>Back they’d go, chatting and adjusting, learning not just from their own patch but from everyone else’s. They weren’t just solving their own piece — they were learning how all the pieces fit together.</p>

<h1 id="the-magic-drawing-challenge">The magic drawing challenge</h1>

<p>Once the teacher finished her classes with both the teams, the teacher clapped her hands for everyone’s attention.</p>

<blockquote>
  <p>“Now,” she announced, “it’s time for The Magic Drawing Challenge!”</p>
</blockquote>

<p>She didn’t show any new picture. Instead, she re-read the challenge aloud again.</p>

<blockquote>
  <p>“Bluey is sitting in a spaceship flying near the moon.”<br />
“Your job,” she said, “is to turn my words into a beautiful drawing.”</p>
</blockquote>

<p>The room went silent for a heartbeat—and then pencils and colours began to dance. The <em>CNN team</em> drew excellent details: the shiny panels of the spaceship, the soft curve of an ear, the pitted moon texture. The <em>Transformer team</em> sketched a coherent scene quickly: where Bluey sits, how the spaceship faces the moon, how everything fits together. The teacher kept circling, offering tiny improvements like:</p>

<blockquote>
  <p>“Make Bluey sit inside the spaceship.”<br />
“Bring the moon closer so it’s clearly near.”<br />
“Let’s show the spaceship actually flying — add stars, a glow, a trail.”</p>
</blockquote>

<p>With each small suggestion, the students’ drawings moved closer and closer to the teacher’s instruction. But something interesting happened: by the end, the final drawings from all students in each team looked strikingly similar (as shown in the figure below).</p>

<p><img src="/images/posts/2025-09-14-genai-diffusion-images/4.png" alt="Backbones" /></p>

<ul>
  <li><strong>Team CNN</strong>: Their drawings had excellent local details — the spaceship panels were sharp, the moon’s craters were textured, and Bluey’s fur looked realistic. But the overall composition often felt fragmented, as if the parts didn’t fully belong together, like Bluey sitting on top of the spaceship instead of inside it.</li>
  <li><strong>Team Transformers</strong>: Their drawings captured the global layout well — Bluey inside the spaceship, the moon in the background, and a sense of motion. However, some fine details were missing or simplified, making the image less rich in texture.</li>
</ul>

<p>Despite these differences, there was hardly any innovation. Why? Because all students started from similar mental templates of the original pictures they had memorized. When asked to draw from the instruction, they simply reassembled what they already knew, leading to convergent, almost identical outputs for each team, which got refined step by step under the teacher’s guidance.</p>

<h3 id="the-dusty-trick">The dusty trick</h3>

<p>Although the results from both teams looked good, the teacher was mildly disappointed: she didn’t get to see the creative side of her students. So she re-planned her teaching strategy for the two groups. This time, she didn’t just show them clean pictures of Bluey, spaceships, and the moon. Instead, she sprinkled magic dust over the images before showing them. This dust made the pictures look blurry and speckled, so the students had to guess and reason about what they were seeing.</p>

<ul>
  <li>For the <em>CNN team</em>, like earlier, the teacher revealed the dusty images through a cardboard keyhole, so they learned to recognize local features even in noisy conditions - like spotting an ear or a wing despite the blur.</li>
  <li>For the <em>Transformer team</em>, the teacher gave them dusty patches and asked them to talk to each other to figure out what each patch might represent. This taught them to share context and handle uncertainty together.</li>
</ul>

<p><img src="/images/posts/2025-09-14-genai-diffusion-images/5.png" alt="Team dusty" /></p>

<p>Later, when the teacher gave the final magic drawing challenge again - she didn’t let them start on a clean sheet. Instead, she covered their papers with random dusty patterns and said:</p>

<blockquote>
  <p>“Now I’ll tell you a secret. Before you even start drawing, your brain has learned a special sketchbook from the dusty examples which you have seen - a latent place where you imagine things before putting them on paper. It’s like a magical notebook where ideas live as fuzzy shapes and feelings, waiting to become real pictures.”</p>
</blockquote>

<p>The students gasped. “So we will draw from our imagination?”</p>

<blockquote>
  <p>“Exactly!” said the teacher. “And here’s another trick — the dust I gave you wasn’t random. I used a dusty clock — a timer that decides how much dust to add or remove at each step. At first, it’s super dusty, and then it gets cleaner and cleaner. That’s called a noise schedule. It helps you slowly uncover your sketch, with my feedback, one layer at a time.”</p>
</blockquote>

<p>The students nodded, imagining their sketchbooks filled with swirling clouds of ideas and a magical clock ticking as they cleaned and created. Each student began with a different dust pattern, so their starting points were unique. With every round of feedback, they removed some dust and added clearer details, gradually transforming chaos into a meaningful picture.</p>

<ul>
  <li>The <em>CNN team</em> students focused on cleaning and refining local details first.</li>
  <li>The <em>Transformer team</em> students worked on the overall structure early on.</li>
</ul>

<p>In the end, all drawings matched the teacher’s description. But because the students had learned about the objects through dust-covered images - forcing them to build imaginative mental models - and started from different noisy beginnings, their final artworks showed far greater diversity in style and composition than under the earlier approach, where everyone relied on memorized templates.</p>

<p><img src="/images/posts/2025-09-14-genai-diffusion-images/6.png" alt="Drawing process" /></p>
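
<p>For the curious grown-up: the dusty clock and the round-by-round clean-up correspond to DDPM-style sampling. Below is a heavily simplified sketch; <code class="language-plaintext highlighter-rouge">model</code> is a hypothetical stand-in for any network trained to guess the dust that was added, and <code class="language-plaintext highlighter-rouge">betas</code> is the noise schedule from the forward sketch above.</p>

<pre><code class="language-python">import torch

@torch.no_grad()
def clean_the_dust(model, shape, betas):
    # Simplified DDPM sampling: start from pure dust and remove a little
    # at each tick of the "dusty clock" (the noise schedule).
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)        # every student starts from different dust
    for t in reversed(range(len(betas))):
        eps_hat = model(x, t)     # the learned guess of the added dust
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        if t > 0:                 # keep a little dust until the very end
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
</code></pre>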

<h2 id="the-magic-machine-analogy">The Magic Machine Analogy</h2>

<p>In the world of modern AI, our story mirrors the inner workings of a magic machine: a generative model that can create new images from words. Here is how the pieces fit together, an analogy a more experienced reader may recognize:</p>

<ul>
  <li><strong>Learning Backbone → The Builders</strong>: The machine needs strong builders to understand patterns. These are like the CNN team (masters of local details) and the Transformer team (experts in global context). Together, they form the backbone that learns how images are structured.</li>
  <li><strong>Language Alignment → The Teacher’s Instructions</strong>: Just as the teacher gave the clear direction “Bluey sitting in a spaceship near the moon”, the magic machine uses language-image alignment (like CLIP) to connect what we say with what it draws (a small code sketch follows the figure below).</li>
  <li><strong>Generative Power → The Creative Drawing</strong>: When the machine starts creating, it’s like the students drawing the scene from the teacher’s words. This is the essence of Generative AI - turning text into pictures.</li>
  <li><strong>Mode Collapse → Everyone Drawing the Same Thing</strong>: Remember how the final drawings of all the students looked almost identical? That’s like mode collapse, where the model produces similar outputs instead of diverse ones.</li>
  <li><strong>DDPM Strategy → The Dusty Trick</strong>: To avoid memorization and encourage creativity, the teacher sprinkled magic dust on the paper, making students start from random scribbles and refine step by step. This is exactly what DDPM does.</li>
</ul>

<p><img src="/images/posts/2025-09-14-genai-diffusion-images/7.png" alt="Magic machine" /></p>
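
<p>To make the language-alignment point concrete: here is a minimal sketch of scoring a finished drawing against the teacher’s words with CLIP. It assumes the Hugging Face <code class="language-plaintext highlighter-rouge">transformers</code> library and the public <code class="language-plaintext highlighter-rouge">openai/clip-vit-base-patch32</code> checkpoint; the image file name is hypothetical.</p>

<pre><code class="language-python">import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("drawing.png")  # a student's finished drawing (hypothetical file)
texts = ["Bluey sitting in a spaceship near the moon",
         "a bowl of fruit on a table"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image  # image-to-text similarity
print(scores.softmax(dim=-1))                  # higher = better match to the prompt
</code></pre>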

<h2 id="conclusion">Conclusion</h2>

<p>In the end, the classroom experiment revealed something profound: Learning to create is not just about memorizing what you have seen; it is about building meaning from uncertainty. The CNN team mastered local details, the Transformer team understood global context, and the teacher’s instructions acted as a bridge between language and vision.</p>

<p>But the real magic happened when the teacher introduced the dusty settings. By starting from noisy, chaotic beginnings and refining step by step, the students learned to construct the picture from scratch, guided only by the story. This mirrors how modern generative models work:</p>

<ul>
  <li>Backbones like CNNs and Transformers provide the foundation.</li>
  <li>Language alignment (such as CLIP) connects words to images.</li>
  <li>Diffusion strategies start from noise and iteratively de-noise, ensuring diversity and creativity while staying true to the prompt.</li>
</ul>

<p>Just like the final drawings of the students, these models transform randomness into meaning - turning words into pictures. <em>“From dust to detail, generative models turn random chaos into meaningful representation - guided one step at a time.”</em></p>

<h2 id="glossary">Glossary</h2>

<table>
  <thead>
    <tr>
      <th>Term</th>
      <th>Meaning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>AI</strong></td>
      <td>Artificial Intelligence</td>
    </tr>
    <tr>
      <td><strong>CNN</strong></td>
      <td>Convolutional Neural Network</td>
    </tr>
    <tr>
      <td><strong>CLIP</strong></td>
      <td>Contrastive Language Image Pretraining</td>
    </tr>
    <tr>
      <td><strong>DDPM</strong></td>
      <td>Denoising Diffusion Probabilistic Model</td>
    </tr>
    <tr>
      <td><strong>Diffusion</strong></td>
      <td>The process of gradually adding noise to a picture and then removing it</td>
    </tr>
    <tr>
      <td><strong>Language Alignment</strong></td>
      <td>Connecting words with images so text can guide what is drawn</td>
    </tr>
    <tr>
      <td><strong>Mode Collapse</strong></td>
      <td>When a computer generates the same picture repeatedly</td>
    </tr>
    <tr>
      <td><strong>Transformers</strong></td>
      <td>Neural network architecture for context understanding</td>
    </tr>
  </tbody>
</table>]]></content><author><name>Sandipan Das</name></author><category term="GenAI" /><category term="DiffusionModels" /><category term="Dall.E" /><category term="StableDiffusion" /><category term="AI" /><category term="MachineLearning" /><category term="Storytelling" /><summary type="html"><![CDATA[]]></summary></entry></feed>