Introduction

This project dives into the world of diffusion models, exploring how they work and what makes them powerful for generating and editing images. Starting with the basics, we investigated how noise can be added to and removed from images, then moved on to more advanced techniques like time-conditioned architectures and class conditioning. Along the way, we applied these ideas to tasks like inpainting, image-to-image translation, and even creating visual anagrams and hybrid images.

Setup

The project utilizes DeepFloyd IF, a two-stage text-to-image diffusion model. We give the model a test run with some prompts before exploring and experimenting with its capabilities, including testing different inference configurations, inpainting, and creating optical illusions.

Denoising

Implementing the Forward Process

The forward process simulates how clean images are gradually destroyed by adding noise, and it is the foundation of diffusion models. The process is defined by:

$$q(x_t | x_0) = N(x_t ; \sqrt{\bar\alpha_t} x_0, (1 - \bar\alpha_t)\mathbf{I})$$

which is implemented using the following equation:

$$x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon \quad \text{where}~ \epsilon \sim N(0, \mathbf{I})$$

Here, the noise level is controlled by the parameter $\bar\alpha_t$. We implement this as a forward() function that computes noisy versions of images at different timesteps. The output of the forward process is shown in the images below, using a test image of the Berkeley Campanile at $t = 250$, $500$, and $750$.
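As a rough sketch (the alphas_cumprod tensor and the exact call signature are assumptions, not the actual DeepFloyd API), the forward process can be written as:

```python
import torch

# A minimal sketch of forward(); `alphas_cumprod` is assumed to be the
# scheduler's precomputed tensor of cumulative alpha products.
def forward(im, t, alphas_cumprod):
    """Return a noisy version of `im` at timestep `t`."""
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)                                   # epsilon ~ N(0, I)
    return alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps
```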

Original Campanile Image

Original test image (64x64 Campanile)

Classical Denoising

We try to see whether we can remove the noise we added to these images using classical methods. Typically, noise (high-frequency data) is removed with a low-pass filter. Using Gaussian blur filtering with various kernel sizes, we attempted to denoise images corrupted at different timesteps ($t = 250, 500, 750$). The results show that this does not work: the blur suppresses some noise but also destroys the underlying image detail.
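For reference, the classical attempt is just a blur; a minimal sketch using torchvision (the kernel size and sigma shown are illustrative):

```python
import torchvision.transforms.functional as TF

# Classical "denoising" attempt: a simple Gaussian blur on a noisy image from
# the forward process above. Larger kernels/sigmas remove more noise but also
# blur away more of the image itself.
blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=1.5)
```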

One-Step Denoising

Instead of trying to solve this classically, we now try one-step denoising. One-step denoising leverages a pretrained UNet from the DeepFloyd model to estimate and remove all the noise in an image at once. The process involves passing noisy images through the UNet, which predicts the noise component, allowing us to recover an approximation of the original clean image. The results are much better than Gaussian blurring, but can still be improved.
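A sketch of the one-step estimate, assuming a hypothetical estimate_noise() helper that wraps the pretrained UNet and returns its noise prediction for the given prompt embeddings:

```python
import torch

# One-step denoising sketch: predict the noise once, then invert the forward
# process to get a clean-image estimate. `x_t`, `t`, `prompt_embeds`, and
# `alphas_cumprod` follow the earlier sketches; `estimate_noise` is hypothetical.
with torch.no_grad():
    eps_hat = estimate_noise(x_t, t, prompt_embeds)
alpha_bar = alphas_cumprod[t]
x0_hat = (x_t - (1 - alpha_bar).sqrt() * eps_hat) / alpha_bar.sqrt()  # clean estimate
```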

Iterative Denoising

Iterative denoising improves upon one-step denoising by progressively reducing noise across multiple timesteps. The process follows the equation:

$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t} x_t + v_\sigma$$

Here, $t'$ is the next, less noisy timestep, $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$, $\beta_t = 1 - \alpha_t$, and $v_\sigma$ is added noise scaled by the predicted variance. Using a custom schedule of "strided timesteps" from $t = 990$ down to $0$, we can gradually denoise the images while maintaining their overall structure. This approach proved substantially more effective than both one-step denoising and Gaussian blurring.
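A sketch of a single update under these definitions (the $v_\sigma$ term is approximated here by Gaussian noise scaled by $\sqrt{\beta_t}$, which is an assumption):

```python
import torch

# One update of the iterative denoising formula above. `x0_hat` is the clean
# estimate recovered from the current noise prediction (as in the one-step sketch).
def iterative_denoise_step(x_t, x0_hat, t, t_prev, alphas_cumprod, add_noise=True):
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha_t = abar_t / abar_prev
    beta_t = 1 - alpha_t
    x0_coef = abar_prev.sqrt() * beta_t / (1 - abar_t)
    xt_coef = alpha_t.sqrt() * (1 - abar_prev) / (1 - abar_t)
    x_prev = x0_coef * x0_hat + xt_coef * x_t
    if add_noise:                                    # skip on the final step
        x_prev = x_prev + beta_t.sqrt() * torch.randn_like(x_t)
    return x_prev
```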

Final Results

Generation

Diffusion Model Sampling

Diffusion model sampling generates entirely new images by starting with pure Gaussian noise and iteratively denoising it using text prompts. The process begins with random noise and applies the iterative denoising function while conditioning on text embeddings like "a high quality photo." While the generated images show some plausible features, they lack coherence and need some additional guidance.

Classifier-Free Guidance

Classifier-free guidance (CFG) enhances image generation by combining conditional and unconditional noise estimates with a scaling factor $\gamma$. The combination is performed according to:

$$\epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u)$$

where $\epsilon_u$ and $\epsilon_c$ are the unconditional and conditional noise estimates respectively. This technique improves output quality by pushing the model to align closer with given prompts, requiring two UNet passes at each timestep. Results show images with sharper details and greater coherence.
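A sketch of one CFG step, reusing the hypothetical estimate_noise() wrapper from before; the guidance scale shown is illustrative:

```python
import torch

# Classifier-free guidance: two UNet passes per timestep, one with the prompt's
# embeddings and one with the null ("") prompt's embeddings.
with torch.no_grad():
    eps_c = estimate_noise(x_t, t, cond_embeds)     # conditional estimate
    eps_u = estimate_noise(x_t, t, uncond_embeds)   # unconditional estimate
gamma = 7.0                                         # illustrative guidance scale
eps = eps_u + gamma * (eps_c - eps_u)
```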

Image-to-Image Translation

This section explores how diffusion models can modify existing images. By adding controlled noise to an image and then iteratively denoising it, the model generates variations of the original image. As the noise level increases, the edits become more significant, enabling both subtle adjustments and creative transformations. This method is rooted in the SDEdit algorithm, which forces a noisy image back onto the manifold of natural images.
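A rough sketch of this loop, reusing the forward() and per-step denoising sketches above (the iterative_denoise() helper and the starting index are illustrative):

```python
import torch

# SDEdit-style editing (sketch): noise the original image to an intermediate
# timestep, then run the usual iterative CFG denoising loop from there.
# `strided_timesteps`, `forward`, and `iterative_denoise` refer to the earlier
# sketches; with timesteps decreasing along the schedule, a larger starting
# index means less added noise and an edit closer to the original.
i_start = 10
t_start = strided_timesteps[i_start]
x_t = forward(x_orig, t_start, alphas_cumprod)
edited = iterative_denoise(x_t, i_start)
```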

Original Campanile Image

Input Image: Campanile

Original Sunflower Image

Input Image: Sunflower

Original Calypso Image

Input Image: Calypso

Editing Hand-Drawn and Web Images

Here we explore how the model handles non-photorealistic inputs. By applying various noise levels to hand-drawn sketches and images from the web, then using CFG-guided denoising, we can see how the model maintains structural elements while adding realistic details, textures, and lighting.

Original web image of a sad cat meme

Web Image Input: Sad Cat

Hand-drawn sketch of a strawberry

Hand-Drawn Input: Strawberry

Hand-drawn sketch of a log cabin

Hand-Drawn Input: Log Cabin

Inpainting

Inpainting is a technique that enables selective image region editing through masked diffusion. The algorithm applies a binary mask during the denoising process: $$x_t \leftarrow \textbf{m} x_t + (1 - \textbf{m})\text{forward}(x_{orig}, t)$$ At each timestep, the implementation maintains original content in unmasked regions while allowing masked areas to undergo diffusion. The forward diffusion matches noise levels at each step $t$, ensuring consistency across masked boundaries.
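A sketch of this update, reusing the earlier forward() sketch (the mask variable name is an assumption):

```python
import torch

# Inpainting update applied after each denoising step.
# `m` is a binary mask (1 where editing is allowed, 0 where the original is kept);
# `forward` is the noising function sketched earlier, matched to the current t.
x_t = m * x_t + (1 - m) * forward(x_orig, t, alphas_cumprod)
```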

Text-Conditional Image-to-Image Translation

Text-conditional translation combines iterative denoising with textual prompts to guide image reconstruction. By conditioning the denoising process on descriptive prompts and adjusting the CFG scale, we get controlled transformations that balance faithfulness to the prompt with faithfulness to the original image.



Visual Anagrams

Visual anagrams are dual-perception images: a single image that reads as one subject right side up and as another upside down. The core algorithm combines two distinct noise estimates: $$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$ $$\epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))$$ $$\epsilon = (\epsilon_1 + \epsilon_2) / 2$$ At each timestep we generate noise estimates for both orientations: the UNet processes the upright image with prompt $p_1$, while the flipped image is processed with prompt $p_2$ and the resulting estimate is flipped back. Averaging the two estimates yields a single image that carries both interpretations.
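A sketch of one timestep, again assuming the estimate_noise() wrapper (CFG can be applied inside it):

```python
import torch

# Visual-anagram noise estimate for a single timestep. Flipping along the
# height axis turns the image upside down; the second estimate is flipped back
# before averaging.
eps1 = estimate_noise(x_t, t, p1_embeds)
eps2 = torch.flip(estimate_noise(torch.flip(x_t, dims=[-2]), t, p2_embeds), dims=[-2])
eps = (eps1 + eps2) / 2
```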

Hybrid Images

Hybrid images use factorized diffusion to create optical illusions whose appearance changes with viewing distance. We process each denoising step by generating separate noise estimates from two different prompts, then combining them with frequency filtering: $$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$ $$\epsilon_2 = \text{UNet}(x_t, t, p_2)$$ $$\epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2)$$ A Gaussian blur with kernel size $33$ and $\sigma = 2$ served as the low-pass filter ($f_\text{lowpass}$), with the high-pass filter ($f_\text{highpass}$) implemented as the residual of the low-pass operation. The result is a set of images that read as one subject up close and another from far away.
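A sketch of one step under the same assumptions:

```python
import torchvision.transforms.functional as TF

# Hybrid-image noise estimate: low frequencies from prompt p1, high frequencies
# from prompt p2. Kernel size 33 and sigma 2 follow the text; `estimate_noise`
# is the same hypothetical UNet wrapper as before.
eps1 = estimate_noise(x_t, t, p1_embeds)
eps2 = estimate_noise(x_t, t, p2_embeds)
low = TF.gaussian_blur(eps1, kernel_size=33, sigma=2.0)
high = eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2.0)  # residual = high-pass
eps = low + high
```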

From Scratch

Introduction

Now we implement diffusion models from scratch. Using the MNIST dataset, we'll progress from basic denoising to a complete diffusion model implementation. The project evolved through three main phases: basic denoising, time-conditioned diffusion, and class-conditioned generation.

UNet Architecture Development

First and foremost, the UNet architecture forms the foundation of the denoising system. It can be broken into a few key components: basic Conv2d + BatchNorm2d blocks, along with DownConv and UpConv blocks for downsampling and upsampling.
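A minimal sketch of what these blocks might look like (channel handling, kernel sizes, and the GELU activation are assumptions):

```python
import torch.nn as nn

# Conv2d + BatchNorm2d building block that preserves spatial resolution.
class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)

# Strided convolution that halves spatial resolution.
class DownConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)

# Transposed convolution that doubles spatial resolution.
class UpConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)
```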


Noising Process

Denoiser Training and Testing

Training utilized the MNIST dataset. The training process, executed over five epochs, produced a model capable of effective noise reduction, though with some limitations in fine detail preservation.

Testing across varying noise levels ($\sigma \in \{0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0\}$) revealed important details about model robustness. The denoiser performed best at its trained noise level ($\sigma = 0.5$) and was less effective at noise levels both below and above it.
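A sketch of the test itself (the denoiser call signature and the test batch `x` are assumptions):

```python
import torch

# Robustness sweep: noise the test images with z = x + sigma * eps for each
# sigma and run them through the trained denoiser.
sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
with torch.no_grad():
    denoised = [denoiser(x + s * torch.randn_like(x)) for s in sigmas]
```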


Noising Process


Loss

Adding Time Conditioning to UNet

As we saw before, adding time conditioning significantly improves model performance. By implementing FCBlocks, we were able to integrate timestep information throughout the network. This upgrade took our basic denoiser to a time-aware system capable of progressive noise reduction.
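A sketch of an FCBlock and one way the timestep embedding could be injected (where exactly it is added inside the UNet is an assumption here):

```python
import torch.nn as nn

# Small fully-connected block used to embed the (normalized) timestep.
class FCBlock(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# Example use (illustrative): embed t in [0, 1], then broadcast onto a feature map.
#   t_embed = fc_t(t.view(-1, 1))                 # (B, C)
#   feats = feats + t_embed[:, :, None, None]     # add to an intermediate feature map
```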

Training and Sampling the Time-Conditioned Model

Training the complete diffusion model introduced additional complexity, however. The process required careful handling of timestep sampling and the corresponding noise generation. Training for 20 epochs this time, we were able to see the model converge. The samples produced by converting random noise into digits still left a lot to be desired, though; class conditioning remained to be implemented to improve the model further.
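A sketch of a single training step under these choices (the unet call signature, the number of timesteps T, and the timestep normalization are assumptions):

```python
import torch
import torch.nn.functional as F

# One pass over the data for the time-conditioned denoiser: sample a clean batch,
# pick a random timestep per image, noise it with the forward process, and regress
# the injected noise. `unet`, `T`, `alphas_cumprod`, and `optimizer` are assumed.
for x0, _ in train_loader:                        # class labels unused here
    t = torch.randint(0, T, (x0.shape[0],))       # one random timestep per image
    eps = torch.randn_like(x0)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    eps_hat = unet(x_t, t.float() / T)            # timestep passed as the condition
    loss = F.mse_loss(eps_hat, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```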


Epoch 5

Class-Conditioned Generation

The implementation of class conditioning introduced an additional dimension of control to the diffusion model. The architecture was extended with additional FCBlocks designed specifically to process class information. A dropout mechanism with 10% probability was implemented to enable both conditional and unconditional generation. While still not perfect, the implementation had a high success rate of generating multiple instances of each digit while maintaining high visual clarity and structure.
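A sketch of the class conditioning and dropout (the unet call signature is an assumption):

```python
import torch
import torch.nn.functional as F

# Class conditioning with ~10% unconditional dropout: the one-hot class vector
# is zeroed for a random tenth of the batch so the same network also learns
# unconditional generation.
c = F.one_hot(labels, num_classes=10).float()           # (B, 10) one-hot classes
drop = torch.rand(c.shape[0], device=c.device) < 0.1    # ~10% of the batch
c[drop] = 0.0                                           # null class -> unconditional
eps_hat = unet(x_t, t.float() / T, c)                   # time- and class-conditioned
```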


Epoch 5

Conclusion

Through this project, we gained a solid understanding of diffusion models, from their foundational processes to more advanced applications. We learned what makes these models effective, like how to tune their noise levels, choose the right architectures, and balance trade-offs in generation quality. While there’s still room to improve their efficiency and robustness, this project has shown how much potential diffusion models have for creative and technical tasks. Overall, it’s been a really neat chance to finally dip my toes into the world of generative AI.