Introduction

This project dives into the world of diffusion models, exploring how they work and what makes them powerful for generating and editing images. Starting with the basics, we investigated how noise can be added to and removed from images, then moved on to more advanced techniques like time-conditioned architectures and class conditioning. Along the way, we applied these ideas to tasks like inpainting, image-to-image translation, and even creating visual anagrams and hybrid images.

Setup

The project utilizes DeepFloyd IF, a two-stage text-to-image diffusion model. We give the model a test run with some prompts before exploring and experimenting with its capabilities, including testing different inference configurations, inpainting, and creating optical illusions.

Denoising

Implementing the Forward Process

The forward process simulates how clean images are gradually destroyed by adding noise, and it is the foundation of diffusion models. The process is defined by:

$$q(x_t | x_0) = N(x_t ; \sqrt{\bar\alpha_t} x_0, (1 - \bar\alpha_t)\mathbf{I})$$

which is implemented using the following equation:

$$x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon \quad \text{where}~ \epsilon \sim N(0, \mathbf{I})$$

Here, the noise level is controlled by the parameter $\bar\alpha_t$. We implement this as a forward() function that computes noisy versions of images at different timesteps. The output of the forward process is shown in the images below, using a test image of the Berkeley Campanile at $t = 250$, $500$, and $750$.
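As a rough sketch (the alphas_cumprod tensor and the exact call signature are assumptions, not the actual DeepFloyd API), the forward process can be written as:

```python
import torch

# A minimal sketch of forward(); `alphas_cumprod` is assumed to be the
# scheduler's precomputed tensor of cumulative alpha products.
def forward(im, t, alphas_cumprod):
    """Return a noisy version of `im` at timestep `t`."""
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)                                   # epsilon ~ N(0, I)
    return alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps
```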

Original Campanile Image

Original test image (64x64 Campanile)

Classical Denoising

We try to see whether we can remove the noise we added to these images using classical methods. Typically, noise (high-frequency data) is removed with a low-pass filter. Using Gaussian blur filtering with various kernel sizes, we attempted to denoise images corrupted at different timesteps ($t = 250, 500, 750$). The results show that this does not work: the blur suppresses some noise but also destroys the underlying image detail.
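For reference, the classical attempt is just a blur; a minimal sketch using torchvision (the kernel size and sigma shown are illustrative):

```python
import torchvision.transforms.functional as TF

# Classical "denoising" attempt: a simple Gaussian blur on a noisy image from
# the forward process above. Larger kernels/sigmas remove more noise but also
# blur away more of the image itself.
blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=1.5)
```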

One-Step Denoising

Instead of trying to solve this classically, we now try one-step denoising. One-step denoising leverages a pretrained UNet from the DeepFloyd model to estimate and remove all the noise in an image at once. The process involves passing noisy images through the UNet, which predicts the noise component, allowing us to recover an approximation of the original clean image. The results are much better than Gaussian blurring, but can still be improved.
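A sketch of the one-step estimate, assuming a hypothetical estimate_noise() helper that wraps the pretrained UNet and returns its noise prediction for the given prompt embeddings:

```python
import torch

# One-step denoising sketch: predict the noise once, then invert the forward
# process to get a clean-image estimate. `x_t`, `t`, `prompt_embeds`, and
# `alphas_cumprod` follow the earlier sketches; `estimate_noise` is hypothetical.
with torch.no_grad():
    eps_hat = estimate_noise(x_t, t, prompt_embeds)
alpha_bar = alphas_cumprod[t]
x0_hat = (x_t - (1 - alpha_bar).sqrt() * eps_hat) / alpha_bar.sqrt()  # clean estimate
```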

Iterative Denoising

Iterative denoising improves upon one-step denoising by progressively reducing noise across multiple timesteps. The process follows the equation:

$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t} x_t + v_\sigma$$

Here, $t'$ is the next, less noisy timestep, $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$, $\beta_t = 1 - \alpha_t$, and $v_\sigma$ is added noise scaled by the predicted variance. Using a custom schedule of "strided timesteps" from $t = 990$ down to $0$, we can gradually denoise the images while maintaining their overall structure. This approach proved substantially more effective than both one-step denoising and Gaussian blurring.
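A sketch of a single update under these definitions (the $v_\sigma$ term is approximated here by Gaussian noise scaled by $\sqrt{\beta_t}$, which is an assumption):

```python
import torch

# One update of the iterative denoising formula above. `x0_hat` is the clean
# estimate recovered from the current noise prediction (as in the one-step sketch).
def iterative_denoise_step(x_t, x0_hat, t, t_prev, alphas_cumprod, add_noise=True):
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha_t = abar_t / abar_prev
    beta_t = 1 - alpha_t
    x0_coef = abar_prev.sqrt() * beta_t / (1 - abar_t)
    xt_coef = alpha_t.sqrt() * (1 - abar_prev) / (1 - abar_t)
    x_prev = x0_coef * x0_hat + xt_coef * x_t
    if add_noise:                                    # skip on the final step
        x_prev = x_prev + beta_t.sqrt() * torch.randn_like(x_t)
    return x_prev
```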

Final Results

Generation

Diffusion Model Sampling

Diffusion model sampling generates entirely new images by starting with pure Gaussian noise and iteratively denoising it using text prompts. The process begins with random noise and applies the iterative denoising function while conditioning on text embeddings like "a high quality photo." While the generated images show some plausible features, they lack coherence and need some additional guidance.

Classifier-Free Guidance

Classifier-free guidance (CFG) enhances image generation by combining conditional and unconditional noise estimates with a scaling factor $\gamma$. The combination is performed according to:

$$\epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u)$$

where $\epsilon_u$ and $\epsilon_c$ are the unconditional and conditional noise estimates respectively. This technique improves output quality by pushing the model to align closer with given prompts, requiring two UNet passes at each timestep. Results show images with sharper details and greater coherence.
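A sketch of one CFG step, reusing the hypothetical estimate_noise() wrapper from before; the guidance scale shown is illustrative:

```python
import torch

# Classifier-free guidance: two UNet passes per timestep, one with the prompt's
# embeddings and one with the null ("") prompt's embeddings.
with torch.no_grad():
    eps_c = estimate_noise(x_t, t, cond_embeds)     # conditional estimate
    eps_u = estimate_noise(x_t, t, uncond_embeds)   # unconditional estimate
gamma = 7.0                                         # illustrative guidance scale
eps = eps_u + gamma * (eps_c - eps_u)
```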

Image-to-Image Translation

This section explores how diffusion models can modify existing images. By adding controlled noise to an image and then iteratively denoising it, the model generates variations of the original image. As the noise level increases, the edits become more significant, enabling both subtle adjustments and creative transformations. This method is rooted in the SDEdit algorithm, which forces a noisy image back onto the manifold of natural images.
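A rough sketch of this loop, reusing the forward() and per-step denoising sketches above (the iterative_denoise() helper and the starting index are illustrative):

```python
import torch

# SDEdit-style editing (sketch): noise the original image to an intermediate
# timestep, then run the usual iterative CFG denoising loop from there.
# `strided_timesteps`, `forward`, and `iterative_denoise` refer to the earlier
# sketches; with timesteps decreasing along the schedule, a larger starting
# index means less added noise and an edit closer to the original.
i_start = 10
t_start = strided_timesteps[i_start]
x_t = forward(x_orig, t_start, alphas_cumprod)
edited = iterative_denoise(x_t, i_start)
```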

Original Campanile Image

Input Image: Campanile

Original Sunflower Image

Input Image: Sunflower

Original Calypso Image

Input Image: Calypso

Editing Hand-Drawn and Web Images

Here we explore how the model handles non-photorealistic inputs. By applying various noise levels to hand-drawn sketches and images from the web, then using CFG-guided denoising, we can see how the model maintains structural elements while adding realistic details, textures, and lighting.

Original web image of a sad cat meme

Web Image Input: Sad Cat

Hand-drawn sketch of a strawberry

Hand-Drawn Input: Strawberry

Hand-drawn sketch of a log cabin

Hand-Drawn Input: Log Cabin

Inpainting

Inpainting is a technique that enables selective image region editing through masked diffusion. The algorithm applies a binary mask during the denoising process: $$x_t \leftarrow \textbf{m} x_t + (1 - \textbf{m})\text{forward}(x_{orig}, t)$$ At each timestep, the implementation maintains original content in unmasked regions while allowing masked areas to undergo diffusion. The forward diffusion matches noise levels at each step $t$, ensuring consistency across masked boundaries.
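A sketch of this update, reusing the earlier forward() sketch (the mask variable name is an assumption):

```python
import torch

# Inpainting update applied after each denoising step.
# `m` is a binary mask (1 where editing is allowed, 0 where the original is kept);
# `forward` is the noising function sketched earlier, matched to the current t.
x_t = m * x_t + (1 - m) * forward(x_orig, t, alphas_cumprod)
```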

Text-Conditional Image-to-Image Translation

Text-conditional translation combines iterative denoising with textual prompts to guide image reconstruction. By conditioning the denoising process on descriptive prompts and adjusting the CFG scale, we get controlled transformations that balance faithfulness to the prompt with faithfulness to the original image.



Visual Anagrams

Visual anagrams are dual-perception images: a single image that reads as one subject right side up and as another upside down. The core algorithm combines two distinct noise estimates: $$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$ $$\epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))$$ $$\epsilon = (\epsilon_1 + \epsilon_2) / 2$$ At each timestep we generate noise estimates for both orientations: the UNet processes the upright image with prompt $p_1$, while the flipped image is processed with prompt $p_2$ and the resulting estimate is flipped back. Averaging the two estimates yields a single image that carries both interpretations.
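A sketch of one timestep, again assuming the estimate_noise() wrapper (CFG can be applied inside it):

```python
import torch

# Visual-anagram noise estimate for a single timestep. Flipping along the
# height axis turns the image upside down; the second estimate is flipped back
# before averaging.
eps1 = estimate_noise(x_t, t, p1_embeds)
eps2 = torch.flip(estimate_noise(torch.flip(x_t, dims=[-2]), t, p2_embeds), dims=[-2])
eps = (eps1 + eps2) / 2
```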

Hybrid Images

Hybrid images use factorized diffusion to create optical illusions whose appearance changes with viewing distance. We process each denoising step by generating separate noise estimates from two different prompts, then combining them with frequency filtering: $$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$ $$\epsilon_2 = \text{UNet}(x_t, t, p_2)$$ $$\epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2)$$ A Gaussian blur with kernel size $33$ and $\sigma = 2$ served as the low-pass filter ($f_\text{lowpass}$), with the high-pass filter ($f_\text{highpass}$) implemented as the residual of the low-pass operation. The result is a set of images that read as one subject up close and another from far away.
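A sketch of one step under the same assumptions:

```python
import torchvision.transforms.functional as TF

# Hybrid-image noise estimate: low frequencies from prompt p1, high frequencies
# from prompt p2. Kernel size 33 and sigma 2 follow the text; `estimate_noise`
# is the same hypothetical UNet wrapper as before.
eps1 = estimate_noise(x_t, t, p1_embeds)
eps2 = estimate_noise(x_t, t, p2_embeds)
low = TF.gaussian_blur(eps1, kernel_size=33, sigma=2.0)
high = eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2.0)  # residual = high-pass
eps = low + high
```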

From Scratch

Introduction

Now we implement diffusion models from scratch. Using the MNIST dataset, we'll progress from basic denoising to a complete diffusion model implementation. The project evolved through three main phases: basic denoising, time-conditioned diffusion, and class-conditioned generation.

UNet Architecture Development

First and foremost, the UNet architecture forms the foundation of the denoising system. It can be broken into a few key components: basic Conv2d + BatchNorm2d blocks, along with DownConv and UpConv blocks for downsampling and upsampling.
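A minimal sketch of what these blocks might look like (channel handling, kernel sizes, and the GELU activation are assumptions):

```python
import torch.nn as nn

# Conv2d + BatchNorm2d building block that preserves spatial resolution.
class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)

# Strided convolution that halves spatial resolution.
class DownConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)

# Transposed convolution that doubles spatial resolution.
class UpConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)
```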


Noising Process

Denoiser Training and Testing

Training utilized the MNIST dataset. The training process, executed over five epochs, produced a model capable of effective noise reduction, though with some limitations in fine detail preservation.

Testing across varying noise levels ($\sigma \in \{0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0\}$) revealed important details about model robustness. The denoiser performed best at its trained noise level ($\sigma = 0.5$) and was less effective at noise levels both below and above it.
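A sketch of the test itself (the denoiser call signature and the test batch `x` are assumptions):

```python
import torch

# Robustness sweep: noise the test images with z = x + sigma * eps for each
# sigma and run them through the trained denoiser.
sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
with torch.no_grad():
    denoised = [denoiser(x + s * torch.randn_like(x)) for s in sigmas]
```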


Noising Process


Loss

Adding Time Conditioning to UNet

As we saw before, adding time conditioning significantly improves model performance. By implementing FCBlocks, we were able to integrate timestep information throughout the network. This upgrade took our basic denoiser to a time-aware system capable of progressive noise reduction.
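A sketch of an FCBlock and one way the timestep embedding could be injected (where exactly it is added inside the UNet is an assumption here):

```python
import torch.nn as nn

# Small fully-connected block used to embed the (normalized) timestep.
class FCBlock(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# Example use (illustrative): embed t in [0, 1], then broadcast onto a feature map.
#   t_embed = fc_t(t.view(-1, 1))                 # (B, C)
#   feats = feats + t_embed[:, :, None, None]     # add to an intermediate feature map
```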

Training and Sampling the Time-Conditioned Model

Training the complete diffusion model introduced additional complexity, however. The process required careful handling of timestep sampling and the corresponding noise generation. Training for 20 epochs this time, we were able to see the model converge. The samples produced by converting random noise into digits still left a lot to be desired, though; class conditioning remained to be implemented to improve the model further.
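A sketch of a single training step under these choices (the unet call signature, the number of timesteps T, and the timestep normalization are assumptions):

```python
import torch
import torch.nn.functional as F

# One pass over the data for the time-conditioned denoiser: sample a clean batch,
# pick a random timestep per image, noise it with the forward process, and regress
# the injected noise. `unet`, `T`, `alphas_cumprod`, and `optimizer` are assumed.
for x0, _ in train_loader:                        # class labels unused here
    t = torch.randint(0, T, (x0.shape[0],))       # one random timestep per image
    eps = torch.randn_like(x0)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    eps_hat = unet(x_t, t.float() / T)            # timestep passed as the condition
    loss = F.mse_loss(eps_hat, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```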


Epoch 5

Class-Conditioned Generation

The implementation of class conditioning introduced an additional dimension of control to the diffusion model. The architecture was extended with additional FCBlocks designed specifically to process class information. A dropout mechanism with 10% probability was implemented to enable both conditional and unconditional generation. While still not perfect, the implementation had a high success rate of generating multiple instances of each digit while maintaining high visual clarity and structure.
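A sketch of the class conditioning and dropout (the unet call signature is an assumption):

```python
import torch
import torch.nn.functional as F

# Class conditioning with ~10% unconditional dropout: the one-hot class vector
# is zeroed for a random tenth of the batch so the same network also learns
# unconditional generation.
c = F.one_hot(labels, num_classes=10).float()           # (B, 10) one-hot classes
drop = torch.rand(c.shape[0], device=c.device) < 0.1    # ~10% of the batch
c[drop] = 0.0                                           # null class -> unconditional
eps_hat = unet(x_t, t.float() / T, c)                   # time- and class-conditioned
```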


Epoch 5

Conclusion

Through this project, we gained a solid understanding of diffusion models, from their foundational processes to more advanced applications. We learned what makes these models effective, like how to tune their noise levels, choose the right architectures, and balance trade-offs in generation quality. While there’s still room to improve their efficiency and robustness, this project has shown how much potential diffusion models have for creative and technical tasks. Overall, it’s been a really neat chance to finally dip my toes into the world of generative AI.