Overview
The project contains two parts that experiment with diffusion models. In the first part, I interact with a pretrained diffusion model, DeepFloyd IF, to perform several types of image generation. In the second part, I build and train my own diffusion model using torch.nn.Module on the MNIST dataset.
Part A: The Power of Diffusion Models
Setup
For this whole part, I use the pretrained DeepFloyd IF model imported from Hugging Face. The model has two stages. The first stage takes in noisy images of size $64\times 64$ together with text embeddings and generates a denoised image. The second stage takes in the output of the first stage and generates images of size $256\times 256$.
In the forward process, the images get noisier as $t$ increases. The backward process, which is what the denoiser in stage one of the diffusion model performs, estimates the noise in the image so it can be removed.
For this whole part, I use random seed $24$ for reproducibility. I also use the text embedding of "a high quality photo" as the default unless otherwise specified.
Here are some images generated by the model:
Inference steps = 5
an oil painting of a snowy mountain stage 1 | a man wearing a hat stage 1 | a rocket ship stage 1
an oil painting of a snowy mountain stage 2 | a man wearing a hat stage 2 | a rocket ship stage 2
Inference steps = 20
an oil painting of a snowy mountain stage 1 | a man wearing a hat stage 1 | a rocket ship stage 1
an oil painting of a snowy mountain stage 2 | a man wearing a hat stage 2 | a rocket ship stage 2
Forward Process
The forward process in a diffusion model adds noise to clean images. It is defined by:
$$q(x_t|x_0)=N(x_t;\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I)$$
which is equivalent to:
$$x_t=\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon \text{ where }\epsilon\sim N(0,I)$$
$x_t$: noisy images
$x_0$: clean images
$\epsilon$: noise
$\bar{\alpha}_t$: alpha_cumprod, determined by the trainers of the model
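A minimal sketch of this forward step, assuming images are torch tensors and alphas_cumprod is the schedule tensor taken from the pretrained model:

```python
import torch

def forward(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    # Noise a clean image x0 to timestep t: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # eps ~ N(0, I)
    return abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
```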
Berkeley Campanile | Noisy Campanile at t=250 | Noisy Campanile at t=500 | Noisy Campanile at t=750
Classical Denoising
Classically, one would use a Gaussian blur filter to try to get rid of the noise, but in this case classical denoising does not work well.
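For reference, the classical baseline is just a blur; a sketch using torchvision (the kernel size and sigma here are arbitrary example choices, not the exact values used):

```python
import torchvision.transforms.functional as TF

# Blurring smooths out some of the noise but also destroys image detail.
denoised = TF.gaussian_blur(noisy_image, kernel_size=5, sigma=2.0)
```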
Noisy Campanile at t=250 | Noisy Campanile at t=500 | Noisy Campanile at t=750
Gaussian Denoised Campanile at t=250 | Gaussian Denoised Campanile at t=500 | Gaussian Denoised Campanile at t=750
One-Step Denoising
One-step denoising uses the pretrained diffusion model to denoise. The denoiser is located at stage_1.unet. It estimates the noise in the noisy image given the timestep; removing the estimated noise from the noisy image then recovers an estimate of the original image.
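A sketch of this estimate, assuming x_t, prompt_embeds, and alphas_cumprod are already set up from the pipeline (splitting off variance channels is an assumption about the IF checkpoint, whose UNet predicts both noise and variance):

```python
import torch

with torch.no_grad():
    out = stage_1.unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    eps_hat, _ = out.chunk(2, dim=1)  # keep the predicted noise, drop the variance half

# Invert the forward process: x0 = (x_t - sqrt(1 - abar_t) eps) / sqrt(abar_t)
abar_t = alphas_cumprod[t]
x0_hat = (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
```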
Berkeley Campanile | Berkeley Campanile | Berkeley Campanile
Noisy Campanile at t=250 | Noisy Campanile at t=500 | Noisy Campanile at t=750
One-Step Denoised Campanile at t=250 | One-Step Denoised Campanile at t=500 | One-Step Denoised Campanile at t=750
Iterative Denoising
The UNet denoiser clearly works much better than the Gaussian filter, but the result is still blurry when more noise is added to the image. To improve performance further, I implement iterative denoising. In theory, the diffusion model allows denoising iteratively over all $T=1000$ timesteps, but to save time I use a stride of 30. I generate a list of timesteps strided_timesteps with values:
[990, 960, 930, 900, 870, 840, 810, 780, 750, 720, 690, 660, 630, 600, 570, 540, 510, 480, 450, 420, 390, 360, 330, 300, 270, 240, 210, 180, 150, 120, 90, 60, 30, 0]
The denoising update at the $i$-th step is:
$$x_{t'}=\frac{\sqrt{\bar{\alpha}_{t'}}\beta_t}{1-\bar{\alpha}_t}x_0+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t'})}{1-\bar{\alpha}_t}x_t+v_\sigma$$
$t$: time at strided_timesteps[i]
$t'$: time at strided_timesteps[i+1]
$x_t$: image at timestep $t$
$x_{t'}$: image at timestep $t'$
$\bar{\alpha}_t$: alpha_cumprod
$\alpha_t$: $\frac{\bar{\alpha}_t}{\bar{\alpha}_{t'}}$
$\beta_t$: $1-\alpha_t$
$x_0$: the current estimate of the clean image, as in one-step denoising
$v_\sigma$: random noise
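A sketch of one update from $t$ to $t'$, with x0_hat computed as in one-step denoising above (function and variable names are assumptions):

```python
def denoise_step(x_t, x0_hat, t, t_prime, alphas_cumprod, v_sigma):
    # Move from the noisier timestep t to the cleaner timestep t'
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha_t = abar_t / abar_tp
    beta_t = 1 - alpha_t
    return (abar_tp.sqrt() * beta_t / (1 - abar_t)) * x0_hat \
         + (alpha_t.sqrt() * (1 - abar_tp) / (1 - abar_t)) * x_t \
         + v_sigma
```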
Noisy Campanile at t=90 | Noisy Campanile at t=240 | Noisy Campanile at t=390 | Noisy Campanile at t=540 | Noisy Campanile at t=690 | Noisy Campanile at t=840 | Noisy Campanile at t=990
In this part, I use i_start = 10, which corresponds to timestep 690.
Berkeley Campanile | Noisy Campanile at t=690 | Iterative Denoised Campanile | One-Step Denoised Campanile | Gaussian Denoised Campanile
Diffusion Model Sampling
By taking i_start = 0, the algorithm denoises from pure noise and generates images from scratch.
Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5
Classifier-Free Guidance (CFG)
Some images generated in the previous section are not very good. To improve image quality, I use a technique called Classifier-Free Guidance. In this technique, the algorithm computes both a conditional and an unconditional noise estimate; the new noise estimate is then:
$$\epsilon=\epsilon_u+\gamma(\epsilon_c-\epsilon_u)$$
By taking $\gamma>1$, we get much higher-quality images. For this and later sections, I use the text embedding of "" as the unconditional prompt and "a high quality photo" as the conditional prompt.
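A sketch of the CFG noise estimate (the guidance scale of 7 is an assumed example value; uncond_embeds and cond_embeds come from the "" and "a high quality photo" prompts, and the variance channels are omitted for brevity, as in the one-step sketch):

```python
eps_u = stage_1.unet(x_t, t, encoder_hidden_states=uncond_embeds).sample
eps_c = stage_1.unet(x_t, t, encoder_hidden_states=cond_embeds).sample
gamma = 7.0  # gamma > 1 amplifies the conditional direction
eps = eps_u + gamma * (eps_c - eps_u)
```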
Sample 1 with CFG | Sample 2 with CFG | Sample 3 with CFG | Sample 4 with CFG | Sample 5 with CFG
Compared to the previous section, the results are much more vivid and higher-contrast.
Image-to-Image Translation
Image-to-image translation takes a clean image, adds noise up to a chosen level, and then denoises it. This allows edits to existing images: the more noise added, the larger the edit.
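A sketch of this procedure, assuming the forward helper above and an iterative_denoise routine built from the update step (the starting indices correspond to the noise levels shown below):

```python
# Larger i_start = less noise added = smaller edit.
for i_start in [1, 3, 5, 7, 10, 20]:
    t = strided_timesteps[i_start]
    x_t = forward(x_orig, t, alphas_cumprod)  # noise the clean image to level t
    edited = iterative_denoise(x_t, i_start)  # then denoise back to t=0
```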
Edit with t=960 | Edit with t=900 | Edit with t=840 | Edit with t=780 | Edit with t=690 | Edit with t=390 | Berkeley Campanile
Edit with t=960 | Edit with t=900 | Edit with t=840 | Edit with t=780 | Edit with t=690 | Edit with t=390 | Cat Meme
Edit with t=960 | Edit with t=900 | Edit with t=840 | Edit with t=780 | Edit with t=690 | Edit with t=390 | Sad Meme
Editing Hand-Drawn and Web Images
Besides realistic photos, the algorithm can also edit hand-drawn and web images.
Edit with t=960 | Edit with t=900 | Edit with t=840 | Edit with t=780 | Edit with t=690 | Edit with t=390 | Web Img
Edit with t=960 | Edit with t=900 | Edit with t=840 | Edit with t=780 | Edit with t=690 | Edit with t=390 | Hand-Drawn 1
Edit with t=960 | Edit with t=900 | Edit with t=840 | Edit with t=780 | Edit with t=690 | Edit with t=390 | Hand-Drawn 2
Inpainting
Given an image and a mask, I can also generate images in which only the masked area changes while the rest stays the same. At each loop iteration, the new image becomes:
$$x_t\leftarrow mx_t+(1-m)\text{forward}(x_{orig},t)$$
$x_{orig}$: original image
$m$: binary mask
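A sketch of the inpainting loop, assuming the forward helper above and a per-step CFG denoising function (denoise_step_cfg is a hypothetical name):

```python
for i in range(i_start, len(strided_timesteps) - 1):
    t = strided_timesteps[i]
    x_t = denoise_step_cfg(x_t, t)  # one iterative-denoising step with CFG
    # Force pixels outside the mask back to the (appropriately noised) original.
    x_t = mask * x_t + (1 - mask) * forward(x_orig, t, alphas_cumprod)
```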
Campanile | Mask | Hole to Fill | Campanile Filled
Cat | Mask | Hole to Fill | Cat Filled
Sunset | Mask | Hole to Fill | Sunset Filled
Text-Conditional Image-to-Image Translation
In this section, I experiment with image editing by changing the text prompt from "a high quality photo" to other prompts.
prompt = "a rocket ship"
Edit t=960 | Edit t=900 | Edit t=840 | Edit t=780 | Edit t=690 | Edit t=390 | Campanile to Rocket Ship
|
prompt = "a photo of a man"
Edit t=960 | Edit t=900 | Edit t=840 | Edit t=780 | Edit t=690 | Edit t=390 | Cat to Man
|
prompt = "a photo of a hipster barista"
Edit t=960 | Edit t=900 | Edit t=840 | Edit t=780 | Edit t=690 | Edit t=390 | Sunset to Hipster Barista
|
Visual Anagrams
In this section, I create optical illusions with diffusion models. The model generates images that look like one thing right-side up and another thing upside down. To implement this, the noise estimate from the model is modified according to the algorithm:
$$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$
$$\epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))$$
$$\epsilon = (\epsilon_1 + \epsilon_2)/2$$
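A sketch of this noise estimate, where p1_embeds and p2_embeds are the embeddings of the two prompts (flipping dim 2 of a BCHW tensor turns the image upside down):

```python
import torch

eps1 = stage_1.unet(x_t, t, encoder_hidden_states=p1_embeds).sample
eps2 = torch.flip(
    stage_1.unet(torch.flip(x_t, dims=[2]), t, encoder_hidden_states=p2_embeds).sample,
    dims=[2],
)
eps = (eps1 + eps2) / 2  # average the two estimates
```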
an old painting of people around a campfire | an old painting of an old man
a rocket ship | a pencil
a photo of a dog | a photo of a man
Hybrid Images
In this section, I create another optical illusion that looks like one thing up close and another thing from far away. The noise estimate is modified according to the algorithm:
$$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$
$$\epsilon_2 = \text{UNet}(x_t, t, p_2)$$
$$\epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2)$$
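A sketch using a Gaussian blur as the lowpass filter (the kernel size and sigma are assumed values; the highpass is taken as the residual of the lowpass):

```python
import torchvision.transforms.functional as TF

eps1 = stage_1.unet(x_t, t, encoder_hidden_states=p1_embeds).sample
eps2 = stage_1.unet(x_t, t, encoder_hidden_states=p2_embeds).sample
low = TF.gaussian_blur(eps1, kernel_size=33, sigma=2.0)          # low frequencies of p1
high = eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2.0)  # high frequencies of p2
eps = low + high
```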
Hybrid image of a skull and a waterfall | Hybrid image of a waterfall and a dog | Hybrid image of an old man and a rocket ship
Part B: Diffusion Models from Scratch
In this part, I build the denoising UNet from scratch using torch.nn. I use the MNIST dataset from torchvision.datasets.MNIST for training.
For this whole part, I use random seed $24$ for reproducibility.
Unconditional UNet
The Unconditional UNet structure is:
The standard tensor operations are defined as:
For the forward process, I add noise to clean images according to the noise level $\sigma$:
$$z=x+\sigma\epsilon,\text{ where }\epsilon\sim N(0, I)$$
Various Noise Levels
The hidden dimension of the unconditional UNet is $128$.
I train the model on noisy images $z$ obtained by applying $\sigma=0.5$ noise to clean images $x$. The batch size is $256$ and the number of epochs is $5$. I use the Adam optimizer with an initial learning rate of $1e-4$.
I train the model using an L2 loss:
$$L=\mathbb{E}_{z,x}\lVert D_\theta(z)-x\rVert^2$$
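A sketch of one training step under these settings (unet, opt, and device are assumed to be set up already):

```python
import torch
import torch.nn.functional as F

x = batch.to(device)               # clean MNIST images
z = x + 0.5 * torch.randn_like(x)  # noisy input with sigma = 0.5
loss = F.mse_loss(unet(z), x)      # L2 loss against the clean image
opt.zero_grad()
loss.backward()
opt.step()
```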
The training loss is:
Unconditional UNet Training Loss
The denoising results at different epochs are (top: original image; middle: noisy image; bottom: estimated original image):
Unconditional UNet Denoise Result Epoch 1 | Unconditional UNet Denoise Result Epoch 5
I also test the model on out-of-distribution noise levels:
Out of Distribution Noise Levels Denoise Result
Time Conditional UNet
To implement a diffusion model similar to Part A, I need to add a time variable to the model to perform iterative denoising. To do this, I add two fully-connected blocks to the model, with the following structure:
The fully-connected block is defined as:
The forward process (adding noise) in this model becomes:
$$x_t = \sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon\ \text{ where }\ \epsilon\sim N(0, I)\ \text{ for }\ t\in\{0,1,...,T\}$$
Some parameters are precomputed in the UNet. Following the DDPM paper, they are:
$\beta_t$: a list of $\beta$ values of length $T+1$ such that $\beta_0=0.0001$, $\beta_T=0.02$, and all other elements $\beta_t$ for $t\in\{1,...,T-1\}$ are evenly spaced between the two
$\alpha_t$: $1-\beta_t$
$\bar{\alpha}_t$: $\prod_{s=1}^t{\alpha_s}$, the cumulative product of $\alpha_s$ for $s\in\{1,...,t\}$
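A sketch of these precomputations in torch:

```python
import torch

T = 300
betas = torch.linspace(1e-4, 0.02, T + 1)      # beta_0 = 0.0001, ..., beta_T = 0.02
alphas = 1.0 - betas                           # alpha_t
alphas_cumprod = torch.cumprod(alphas, dim=0)  # bar{alpha}_t
```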
The training algorithm of the model is:
The denoising algorithm of the model is:
For this UNet model, the hidden dimension is $64$ and the total number of timesteps is $T=300$.
For training, I use a batch size of $128$ and $20$ epochs. I still use the Adam optimizer, with an initial learning rate of $1e-3$. I also set an exponential learning-rate decay scheduler with a gamma of $0.1^{(1.0/\text{num\_epochs})}$ by calling torch.optim.lr_scheduler.ExponentialLR.
I still use L2 loss in training:
$$L=\mathbb{E}_{\epsilon, x_0,t}\lVert\epsilon_\theta(x_t,t)-\epsilon\rVert ^2$$
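A sketch of the training loop under these settings (normalizing $t$ to $[0,1]$ before feeding it to the UNet is an assumption about the model's interface):

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(unet.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.1 ** (1.0 / 20))

for epoch in range(20):
    for x0, _ in train_loader:                     # class labels unused here
        t = torch.randint(1, T + 1, (x0.shape[0],))
        eps = torch.randn_like(x0)
        abar = alphas_cumprod[t].view(-1, 1, 1, 1)
        x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
        loss = F.mse_loss(unet(x_t, t / T), eps)   # predict the added noise
        opt.zero_grad()
        loss.backward()
        opt.step()
    scheduler.step()
```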
The training loss is:
Time Conditional UNet Training Loss
The denoising results at different epochs are:
Time Conditional UNet Denoise Result Epoch 1 | Time Conditional UNet Denoise Result Epoch 5 | Time Conditional UNet Denoise Result Epoch 10 | Time Conditional UNet Denoise Result Epoch 15 | Time Conditional UNet Denoise Result Epoch 20
The denoised output after training is:
Time Conditional UNet Test Denoised Result
Class Conditional UNet
Some of the denoised outputs of the time-conditional model are clearly still not very good at generating digits. Hence, I improve the model by adding a class condition. To implement this, I add two additional fully-connected blocks at the same places as the time-conditioning blocks.
The training algorithm of the model is:
The denoising algorithm of the model is:
The model parameters and training process are similar to those of the time-conditional UNet. For the class-conditioning vector $c$, I use one-hot encoding to produce the vector from the dataset labels. Since the UNet sometimes needs to work without the class condition, I implement dropout of the class condition with a probability of $10\%$ by setting the class-conditioning vector to 0, as sketched below.
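A sketch of building the class-conditioning vectors with this dropout:

```python
import torch
import torch.nn.functional as F

c = F.one_hot(labels, num_classes=10).float()  # one-hot class vectors
drop = torch.rand(c.shape[0], device=c.device) < 0.1
c[drop] = 0.0  # zero vector = unconditional, with 10% probability
```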
The training loss is:
Class Conditional UNet Training Loss
The denoising results at different epochs are:
Class Conditional UNet Denoise Result Epoch 1 | Class Conditional UNet Denoise Result Epoch 5 | Class Conditional UNet Denoise Result Epoch 10 | Class Conditional UNet Denoise Result Epoch 15 | Class Conditional UNet Denoise Result Epoch 20
The denoised output after training is:
Class Conditional UNet Test Denoised Result