CS 180 Project 5: Fun With Diffusion Models!

Jaxon Zeng



Overview

The project has two parts that experiment with diffusion models. In the first part, I use the pretrained diffusion model DeepFloyd IF to perform several types of image generation. In the second part, I build and train my own diffusion model using torch.nn.Module on the MNIST dataset.

Part A: The Power of Diffusion Models

Setup

Throughout this part, I use the pretrained DeepFloyd IF model imported from Hugging Face. The model has two stages. The first stage takes in noisy images of size $64\times 64$ and text embeddings and generates a denoised image. The second stage takes in the output of the first stage and generates images of size $256\times 256$.
In the forward process, the images get noisier as $t$ increases. In the backward process, the stage-one denoiser estimates the noise in the image so that it can be removed.
Throughout this part, I use random seed $24$ for reproducibility. Unless mentioned otherwise, I use the text embedding of "a high quality photo" as the default.
Here are some images generated by the model:
Inference steps = 5
an oil painting of a snowy mountain
stage 1
a man wearing a hat
stage 1
a rocket ship
stage 1
an oil painting of a snowy mountain
stage 2
a man wearing a hat
stage 2
a rocket ship
stage 2


Inference steps = 20
an oil painting of a snowy mountain
stage 1
a man wearing a hat
stage 1
a rocket ship
stage 1
an oil painting of a snowy mountain
stage 2
a man wearing a hat
stage 2
a rocket ship
stage 2


Forward Process

The forward process in a diffusion model adds noise to clean images. It is defined by: $$q(x_t|x_0)=N(x_t;\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I)$$ which is equivalent to: $$x_t=\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon \text{ where }\epsilon\sim N(0,I)$$
$x_t$: noisy images
$x_0$: clean images
$\epsilon$: noise
$\bar{\alpha}_t$: alpha_cumprod, the cumulative noise schedule fixed by the people who trained the model
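The forward equation above can be sketched in a few lines. This is a minimal NumPy sketch rather than the project code: the linear `betas` schedule, the random stand-in image, and the name `forward` are assumptions made only so the example runs, whereas DeepFloyd IF ships its own alpha_cumprod values.

```python
import numpy as np

def forward(x0, t, alphas_cumprod, rng):
    """Sample x_t from q(x_t | x_0) at a single timestep t."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)  # eps ~ N(0, I)
    x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
    return x_t, eps

# Hypothetical schedule, NOT the one used by the pretrained model.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(24)
x0 = rng.uniform(-1.0, 1.0, size=(3, 64, 64))  # stand-in for a clean image
noisy = {t: forward(x0, t, alphas_cumprod, rng)[0] for t in (250, 500, 750)}
```

Because $\bar{\alpha}_t$ shrinks as $t$ grows, the clean image contributes less and the noise more at larger timesteps.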

Berkeley Campanile
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750


Classical Denoising

Classically, noise is removed with a Gaussian blur filter, so I try that first. In this case, however, classical denoising does not work well.
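For reference, a Gaussian blur denoiser can be sketched as below. This is an illustrative NumPy version for a single-channel image. The kernel width `sigma=2.0` and the helper name `gaussian_blur` are assumptions, not the settings used for the figures.

```python
import numpy as np

def gaussian_blur(img, sigma=2.0):
    """Blur an (H, W) image with a separable Gaussian and reflect padding."""
    radius = int(3 * sigma)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="reflect")
    # Filter along rows, then columns; "valid" trims the padding back off.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

rng = np.random.default_rng(24)
clean = np.full((32, 32), 0.5)  # flat stand-in image
noisy = clean + 0.5 * rng.standard_normal(clean.shape)
blurred = gaussian_blur(noisy)
```

On a flat image the blur averages the noise away, but on a real photo the same averaging also smooths out edges and texture.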
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750
Gaussian Denoised Campanile
at t=250
Gaussian Denoised Campanile
at t=500
Gaussian Denoised Campanile
at t=750


One-Step Denoising

One-step denoising uses the pretrained diffusion model as the denoiser. The denoiser is located at stage_1.unet. It estimates the noise in a noisy image given the timestep. Removing that noise estimate from the noisy image gives an estimate of the original image.
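Removing the noise amounts to solving the forward equation for $x_0$. The sketch below shows that inversion in NumPy. The value of `abar_t` and the use of the true noise in place of the UNet output are assumptions for illustration, since the real estimate comes from stage_1.unet.

```python
import numpy as np

def estimate_x0(x_t, eps_hat, abar_t):
    """Invert x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps for x0."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)

rng = np.random.default_rng(24)
x0 = rng.uniform(-1.0, 1.0, size=(3, 64, 64))
eps = rng.standard_normal(x0.shape)
abar_t = 0.3  # hypothetical alpha_cumprod value at some timestep
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

# With a perfect noise estimate, the clean image is recovered exactly.
x0_hat = estimate_x0(x_t, eps, abar_t)
```

In practice the noise estimate is imperfect, which is why the one-step result gets blurrier at larger $t$.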
Berkeley Campanile
Berkeley Campanile
Berkeley Campanile
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750
One-Step Denoised Campanile
at t=250
One-Step Denoised Campanile
at t=500
One-Step Denoised Campanile
at t=750


Iterative Denoising

The UNet denoiser clearly works much better than the Gaussian denoiser, but its result still gets blurrier as more noise is added to the image. To improve the result further, I implement iterative denoising. In theory, the diffusion model lets me denoise iteratively over all 1000 timesteps. To save time, I instead use a stride of 30 over the $T=1000$ timesteps. I generate a list of timesteps, strided_timesteps, with values: [990, 960, 930, 900, 870, 840, 810, 780, 750, 720, 690, 660, 630, 600, 570, 540, 510, 480, 450, 420, 390, 360, 330, 300, 270, 240, 210, 180, 150, 120, 90, 60, 30, 0]
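The list above is a plain descending range. One way to build it in Python, shown here as a sketch that may differ from the original notebook code:

```python
# Start at 990 and step down by 30 until 0 is included.
strided_timesteps = list(range(990, -1, -30))

# Each index i names a starting noise level for the denoising loop.
first, last, count = strided_timesteps[0], strided_timesteps[-1], len(strided_timesteps)
```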

The denoising update at step $i$ is: $$x_{t'}=\frac{\sqrt{\bar{\alpha}_{t'}}\beta_t}{1-\bar{\alpha}_t}x_0+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t'})}{1-\bar{\alpha}_t}x_t+v_\sigma$$
$t$: time at strided_timesteps[i]
$t'$: time at strided_timesteps[i+1]
$x_t$: image at timestep $t$
$x_{t'}$: image at timestep $t'$
$\bar{\alpha}_t$: alpha_cumprod
$\alpha_t$: $\frac{\bar{\alpha}_t}{\bar{\alpha}_{t'}}$
$\beta_t$: $1-\alpha_t$
$x_0$: the current estimate of clean image as in the one-step denoising
$v_\sigma$: random noise
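The update can be written directly from the definitions above. This NumPy sketch assumes the clean-image estimate `x0_hat` has already been computed, and it takes `v_sigma` as an argument because how that noise term is produced is not specified here. The function name and the sample values are hypothetical.

```python
import numpy as np

def denoise_step(x_t, x0_hat, abar_t, abar_tp, v_sigma=0.0):
    """One strided update from timestep t to the less noisy timestep t'."""
    alpha_t = abar_t / abar_tp
    beta_t = 1.0 - alpha_t
    coef_x0 = np.sqrt(abar_tp) * beta_t / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha_t) * (1.0 - abar_tp) / (1.0 - abar_t)
    return coef_x0 * x0_hat + coef_xt * x_t + v_sigma

rng = np.random.default_rng(24)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))
abar_t, abar_tp = 0.3, 0.5  # hypothetical alpha_cumprod values, abar_tp > abar_t

# Sanity check: a noise-free x_t with a perfect x0 estimate lands on the
# noise-free image at level t', i.e. sqrt(abar_tp) * x0.
x_tp = denoise_step(np.sqrt(abar_t) * x0, x0, abar_t, abar_tp)
```

The two coefficients interpolate between the current clean estimate and the current noisy image, so each step moves only part of the way toward $x_0$.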

Noisy Campanile at t=90
Noisy Campanile at t=240
Noisy Campanile at t=390
Noisy Campanile at t=540
Noisy Campanile at t=690
Noisy Campanile at t=840
Noisy Campanile at t=990

In this part, I use i_start = 10, which corresponds to timestep 690.
Berkeley
Campanile
Noisy Campanile
at t=690
Iterative Denoised Campanile
One-Step Denoised Campanile
Gaussian Denoised Campanile


Diffusion Model Sampling

With i_start = 0, the algorithm starts from pure noise and denoises it into a newly generated image.

Sample 1
Sample 2
Sample 3
Sample 4
Sample 5


Classifier-Free Guidance (CFG)

Some images generated in the previous section are not very good. To improve image quality, I use a technique called Classifier-Free Guidance. In this technique, the algorithm computes both a conditional noise estimate $\epsilon_c$ and an unconditional noise estimate $\epsilon_u$, and the new noise estimate is: $$\epsilon=\epsilon_u+\gamma(\epsilon_c-\epsilon_u)$$ Taking $\gamma>1$ gives much higher quality images. For this and later sections, I use the text embedding of "" as the unconditional prompt and "a high quality photo" as the conditional prompt.
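The guidance formula is a one-line extrapolation from the unconditional estimate through the conditional one. A minimal NumPy sketch, where the random arrays stand in for the two UNet outputs and $\gamma=7$ is only an example value:

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, gamma):
    """Classifier-free guidance: eps_u + gamma * (eps_c - eps_u)."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)

rng = np.random.default_rng(24)
eps_u = rng.standard_normal((3, 64, 64))  # stand-in for the "" prompt output
eps_c = rng.standard_normal((3, 64, 64))  # stand-in for the conditional output

eps = cfg_noise(eps_u, eps_c, gamma=7.0)  # hypothetical guidance scale
```

Setting $\gamma=0$ returns the unconditional estimate and $\gamma=1$ the conditional one, so values above 1 push past the conditional estimate.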

Sample 1 with CFG
Sample 2 with CFG
Sample 3 with CFG
Sample 4 with CFG
Sample 5 with CFG

Compared to the previous section, the results are much more vivid and higher in contrast.

Image-to-Image Translation

Image-to-image translation takes in a clean image, adds noise to it up to a chosen level, and then denoises it. This allows edits to existing images. The more noise is added, the larger the edit will be.
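The noise levels shown in the figures all appear in the strided timestep list from the iterative denoising section, so each one corresponds to a starting index. The lookup below is a sketch of that mapping. Whether the original code selected the levels by index or by timestep is an assumption.

```python
strided_timesteps = list(range(990, -1, -30))
edit_timesteps = [960, 900, 840, 780, 690, 390]

# Starting index into strided_timesteps for each edit strength.
i_starts = [strided_timesteps.index(t) for t in edit_timesteps]
```

Smaller starting indices mean more noise and therefore a larger departure from the input image.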
Edit with t=960
Edit with t=900
Edit with t=840
Edit with t=780
Edit with t=690
Edit with t=390
Berkeley Campanile
Edit with t=960
Edit with t=900
Edit with t=840
Edit with t=780
Edit with t=690
Edit with t=390
Cat Meme
Edit with t=960
Edit with t=900
Edit with t=840
Edit with t=780
Edit with t=690
Edit with t=390
Sad Meme


Editing Hand-Drawn and Web Images

Besides realistic images, the algorithm can also edit hand-drawn and web images.
Edit with t=960
Edit with t=900
Edit with t=840
Edit with t=780
Edit with t=690
Edit with t=390
Web Img
Edit with t=960
Edit with t=900
Edit with t=840
Edit with t=780
Edit with t=690
Edit with t=390
Hand-Drawn 1
Edit with t=960
Edit with t=900
Edit with t=840
Edit with t=780
Edit with t=690
Edit with t=390
Hand-Drawn 2


Inpainting

Given an image and a binary mask, I can also generate images in which only the masked area changes while the rest stays the same. After each denoising step, the image is updated as: $$x_t\leftarrow mx_t+(1-m)\text{forward}(x_{orig},t)$$
$x_{orig}$: original image
$m$: binary mask ($1$ where new content is generated, $0$ where the original is kept)
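The masking update can be sketched as follows. The function takes the already-noised original, i.e. the output of forward($x_{orig}$, $t$), as an input. The function name and the toy mask are hypothetical.

```python
import numpy as np

def inpaint_project(x_t, x_orig_noised, mask):
    """Keep generated content where mask == 1, the noised original elsewhere."""
    return mask * x_t + (1.0 - mask) * x_orig_noised

rng = np.random.default_rng(24)
x_t = rng.standard_normal((3, 8, 8))            # current denoising state
x_orig_noised = rng.standard_normal((3, 8, 8))  # stand-in for forward(x_orig, t)
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0                         # hypothetical hole to fill

x_new = inpaint_project(x_t, x_orig_noised, mask)
```

Applying this after every step forces the unmasked pixels to follow the original image at the matching noise level.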

Campanile
Mask
Hole to Fill
Campanile Filled
Cat
Mask
Hole to Fill
Cat Filled
Sunset
Mask
Hole to Fill
Sunset Filled


Text-Conditional Image-to-Image Translation

In this section, I experiment with image editing by replacing the text prompt "a high quality photo" with other prompts.

prompt = "a rocket ship"
Edit
t=960
Edit
t=900
Edit
t=840
Edit
t=780
Edit
t=690
Edit
t=390
Campanile to Rocket Ship

prompt = "a photo of a man"
Edit
t=960
Edit
t=900
Edit
t=840
Edit
t=780
Edit
t=690
Edit
t=390
Cat to Man

prompt = "a photo of a hipster barista"
Edit
t=960
Edit
t=900
Edit
t=840
Edit
t=780
Edit
t=690
Edit
t=390
Sunset to Hipster Barista


Visual Anagrams

In this section, I create optical illusions with diffusion models. The model generates images that look like one thing right side up and another thing upside down. To implement this, the noise estimate from the model is modified as follows: $$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$ $$\epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))$$ $$\epsilon = (\epsilon_1 + \epsilon_2)/2$$
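The three equations above translate into a short function. In this NumPy sketch, `toy_unet` is a made-up stand-in that takes a number instead of a text embedding. It exists only so the flip logic can be run and checked.

```python
import numpy as np

def anagram_noise(unet, x_t, t, p1, p2):
    """Average the upright estimate for p1 with the flipped estimate for p2."""
    flip = lambda a: np.flip(a, axis=-2)  # vertical flip of (..., H, W)
    eps1 = unet(x_t, t, p1)
    eps2 = flip(unet(flip(x_t), t, p2))
    return (eps1 + eps2) / 2.0

rows = np.arange(8.0).reshape(8, 1)

def toy_unet(x, t, p):
    # Hypothetical denoiser: adds a vertical ramp scaled by the "prompt" p.
    return x + p * rows

rng = np.random.default_rng(24)
x_t = rng.standard_normal((3, 8, 8))
eps = anagram_noise(toy_unet, x_t, t=500, p1=1.0, p2=3.0)
```

Flipping the input before the second call and flipping the output back means the second prompt is denoised as if the image were upside down.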
an old painting of
people around a campfire
an old painting of
an old man
a rocket ship
a pencil
a photo of a dog
a photo of a man


Hybrid Images

In this section, I create another optical illusion that looks like one thing up close and another thing from far away. The noise estimate is modified as follows: $$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$ $$\epsilon_2 = \text{UNet}(x_t, t, p_2)$$ $$\epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2)$$
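A sketch of the frequency split is below. It assumes $f_\text{lowpass}$ is a Gaussian blur and $f_\text{highpass}$ is the residual after that blur. The kernel width `sigma=2.0` and the random stand-ins for the two UNet outputs are illustrative choices, not the values used for the figures.

```python
import numpy as np

def lowpass(x, sigma=2.0):
    """Separable Gaussian blur over the last two axes (reflect padding)."""
    radius = int(3 * sigma)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    pad = [(0, 0)] * (x.ndim - 2) + [(radius, radius)] * 2
    out = np.pad(x, pad, mode="reflect")
    for axis in (-1, -2):
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, out)
    return out

def hybrid_noise(eps1, eps2, sigma=2.0):
    """Low frequencies from eps1 plus high frequencies from eps2."""
    return lowpass(eps1, sigma) + (eps2 - lowpass(eps2, sigma))

rng = np.random.default_rng(24)
eps1 = rng.standard_normal((3, 16, 16))  # stand-in for UNet(x_t, t, p1)
eps2 = rng.standard_normal((3, 16, 16))  # stand-in for UNet(x_t, t, p2)
eps = hybrid_noise(eps1, eps2)
```

Low frequencies dominate what is seen from a distance, so the first prompt is the one that shows from far away and the second the one that shows up close.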
Hybrid image of a skull and a waterfall
Hybrid image of a waterfall and a dog
Hybrid image of an old man and a rocket ship




Part B: Diffusion Models from Scratch

In this part, I build the denoising UNet from scratch using torch.nn. I train on the MNIST dataset from torchvision.datasets.MNIST.
Throughout this part, I use random seed $24$ for reproducibility.

Unconditional UNet

The unconditional UNet structure is:
The standard tensor operations are defined as:

For the forward process, I add noise to an image according to the noise level $\sigma$. The noising equation is: $$z=x+\sigma\epsilon,\text{ where }\epsilon\sim N(0, I)$$
Various Noise Level
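The noising step above is a single line of code. In this NumPy sketch, the random batch stands in for MNIST digits and the list of $\sigma$ values is hypothetical, so only $\sigma=0.5$ matches a setting stated in this report.

```python
import numpy as np

def add_noise(x, sigma, rng):
    """Return z = x + sigma * eps with eps ~ N(0, I)."""
    return x + sigma * rng.standard_normal(x.shape)

rng = np.random.default_rng(24)
x = rng.uniform(0.0, 1.0, size=(16, 28, 28))  # stand-in for a batch of digits

sigmas = [0.0, 0.2, 0.5, 1.0]                 # hypothetical noise levels
noisy = {s: add_noise(x, s, rng) for s in sigmas}
```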


The hidden dimension (number of hidden channels) is $128$ for the unconditional UNet.
I train the model on noisy images $z$ obtained by applying $\sigma=0.5$ to clean images $x$. The batch size is $256$ and the number of epochs is $5$. I use the Adam optimizer with an initial learning rate of $1e-4$.
I train the model using an L2 loss: $$L=\mathbb{E}_{z,x}\lVert D_\theta(z)-x\rVert^2$$ The training loss is:
Unconditional UNet Training Loss


The denoising results at different training epochs are (top: original image, middle: noisy image, bottom: estimated original image):
Unconditional UNet Denoise Result Epoch 1
Unconditional UNet Denoise Result Epoch 5

I also test the model on out-of-distribution noise levels:
Out of Distribution Noise Levels Denoise Result


Time Conditional UNet

To implement a diffusion model similar to Part A, I need to condition the model on time so that it can denoise iteratively. To do this, I add two fully-connected blocks to the model, following this structure:
The fully-connected block is defined as:


The forward process (adding noise) in this model is changed to: $$x_t = \sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon\text{ where }\epsilon\sim N(0, I)\text{ for } t\in\{0,1,...,T\}$$ Some parameters are precomputed in the UNet. Following the DDPM paper, they are computed as:
$\beta_t$: a list of $\beta$ of length $T+1$ such that $\beta_0=0.0001$ and $\beta_T=0.02$, with all other elements $\beta_t$ for $t\in\{1,...,T-1\}$ evenly spaced between the two.
$\alpha_t$: $1-\beta_t$
$\bar{\alpha}_t$: $\prod_{s=1}^t{\alpha_s}$, the cumulative product of $\alpha_s$ for $s\in\{1,...,t\}$
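The schedule above can be precomputed as follows. This NumPy sketch follows the product exactly as written, starting at $s=1$, so $\bar{\alpha}_0=1$. A plain cumulative product over all $T+1$ entries would also include $\alpha_0$ and differ by a factor of $0.9999$. Which variant the original code uses is not stated here.

```python
import numpy as np

T = 300

# T + 1 evenly spaced values from beta_0 = 1e-4 to beta_T = 0.02.
betas = np.linspace(1e-4, 0.02, T + 1)
alphas = 1.0 - betas

# alpha_bar_t = prod_{s=1}^{t} alpha_s, with the empty product at t = 0.
alpha_bars = np.concatenate([[1.0], np.cumprod(alphas[1:])])
```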

The training algorithm of the model is:

The denoising algorithm of the model is:



For this UNet model, the hidden dimension (number of hidden channels) is $64$. The total number of timesteps is $T=300$.
For training, I use a batch size of $128$ and $20$ epochs. I still use the Adam optimizer, with an initial learning rate of $1e-3$. I also set an exponential learning rate decay scheduler with a gamma of $0.1^{(1.0/\text{num_epochs})}$ by calling torch.optim.lr_scheduler.ExponentialLR.
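The choice of gamma makes the learning rate fall by a factor of 10 over the whole run. The arithmetic is sketched below in plain Python, assuming the scheduler is stepped once per epoch (ExponentialLR multiplies the learning rate by gamma on each step).

```python
num_epochs = 20
initial_lr = 1e-3
gamma = 0.1 ** (1.0 / num_epochs)

# Learning rate at the start of each epoch, plus the value after the last one.
lrs = [initial_lr * gamma ** epoch for epoch in range(num_epochs + 1)]
final_lr = lrs[-1]  # approximately 1e-4
```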
I still use L2 loss in training: $$L=\mathbb{E}_{\epsilon, x_0,t}\lVert\epsilon_\theta(x_t,t)-\epsilon\rVert ^2$$
The training loss is:
Time Conditional UNet Training Loss

The denoising results at different training epochs are:
Time Conditional UNet Denoise Result Epoch 1
Time Conditional UNet Denoise Result Epoch 5
Time Conditional UNet Denoise Result Epoch 10
Time Conditional UNet Denoise Result Epoch 15
Time Conditional UNet Denoise Result Epoch 20

The denoised output after training is:
Time Conditional UNet Test Denoised Result


Class Conditional UNet

Some of the denoised outputs of the time conditional model are still not very good digits. Hence I improve the model by adding a class condition. To implement this, I add two additional fully-connected blocks at the same places as the time condition blocks.

The training algorithm of the model is:

The denoising algorithm of the model is:


The model parameters and training process are similar to those of the time conditional UNet. For the class-conditioning vector $c$, I one-hot encode the labels of the dataset. Since the UNet sometimes needs to work without the class condition, I drop the class condition with a probability of $10\%$ by setting the class-conditioning vector to 0. The training loss is:
Class Conditional UNet Training Loss
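The one-hot encoding with class-condition dropout used in training can be sketched as below. This is a NumPy illustration with random stand-in labels, and the function name is hypothetical, so the real training code operates on torch tensors.

```python
import numpy as np

def one_hot_with_dropout(labels, num_classes, p_uncond, rng):
    """One-hot encode labels, zeroing each row with probability p_uncond."""
    c = np.eye(num_classes)[labels]
    keep = rng.random(len(labels)) >= p_uncond  # False means "drop the class"
    return c * keep[:, None]

rng = np.random.default_rng(24)
labels = rng.integers(0, 10, size=10000)  # stand-in for MNIST labels
c = one_hot_with_dropout(labels, num_classes=10, p_uncond=0.1, rng=rng)
dropped_fraction = np.mean(c.sum(axis=1) == 0)
```

Rows that are zeroed train the unconditional branch, while the remaining rows train the class-conditional one.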
The denoising results at different training epochs are:
Class Conditional UNet Denoise Result Epoch 1
Class Conditional UNet Denoise Result Epoch 5
Class Conditional UNet Denoise Result Epoch 10
Class Conditional UNet Denoise Result Epoch 15
Class Conditional UNet Denoise Result Epoch 20

The denoised output after training is:
Class Conditional UNet Test Denoised Result