Overview
The project contains two parts that experiment with diffusion models. In the first part, I interact with a pretrained diffusion model, DeepFloyd IF, to perform several types of image generation. In the second part, I build and train my own diffusion model using torch.nn.Module on the MNIST dataset.
Part A: The Power of Diffusion Models
Setup
For this whole part, I use the pretrained DeepFloyd IF model imported from Hugging Face. The model has two stages. The first stage takes in noisy images of size $64\times 64$ together with text embeddings and generates a denoised image. The second stage takes in the output of the first stage and generates images of size $256\times 256$.
In the forward process, the images get noisier as $t$ increases. The backward process, which is what the denoiser in stage one of the diffusion model performs, estimates the noise in the image so it can be removed.
For this whole part, I use random seed $24$ for reproducibility. I also use the text embedding of "a high quality photo" as the default unless otherwise specified.
Here are some images generated by the model:
Inference steps = 5
an oil painting of a snowy mountain stage 1 | a man wearing a hat stage 1 | a rocket ship stage 1
an oil painting of a snowy mountain stage 2 | a man wearing a hat stage 2 | a rocket ship stage 2
Inference steps = 20
an oil painting of a snowy mountain stage 1 | a man wearing a hat stage 1 | a rocket ship stage 1
an oil painting of a snowy mountain stage 2 | a man wearing a hat stage 2 | a rocket ship stage 2
Forward Process
The forward process in a diffusion model adds noise to clean images. It is defined by:
$$q(x_t|x_0)=N(x_t;\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I)$$
which is equivalent to:
$$x_t=\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon \text{ where }\epsilon\sim N(0,I)$$
$x_t$: noisy images
$x_0$: clean images
$\epsilon$: noise
$\bar{\alpha}_t$: alpha_cumprod, determined by the trainers of the model
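A minimal sketch of this forward step, assuming images are torch tensors and alphas_cumprod is the schedule tensor taken from the pretrained model:

```python
import torch

def forward(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    # Noise a clean image x0 to timestep t: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # eps ~ N(0, I)
    return abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
```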
Berkeley Campanile | Noisy Campanile at t=250 | Noisy Campanile at t=500 | Noisy Campanile at t=750
Classical Denoising
Classically, one would use a Gaussian blur filter to try to get rid of the noise, but in this case classical denoising does not work well.
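For reference, the classical baseline is just a blur; a sketch using torchvision (the kernel size and sigma here are arbitrary example choices, not the exact values used):

```python
import torchvision.transforms.functional as TF

# Blurring smooths out some of the noise but also destroys image detail.
denoised = TF.gaussian_blur(noisy_image, kernel_size=5, sigma=2.0)
```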
Noisy Campanile at t=250 | Noisy Campanile at t=500 | Noisy Campanile at t=750
Gaussian Denoised Campanile at t=250 | Gaussian Denoised Campanile at t=500 | Gaussian Denoised Campanile at t=750
One-Step Denoising
One-step denoising uses the pretrained diffusion model to denoise. The denoiser is located at stage_1.unet. It estimates the noise in the noisy image given the timestep; removing the estimated noise from the noisy image then recovers an estimate of the original image.
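A sketch of this estimate, assuming x_t, prompt_embeds, and alphas_cumprod are already set up from the pipeline (splitting off variance channels is an assumption about the IF checkpoint, whose UNet predicts both noise and variance):

```python
import torch

with torch.no_grad():
    out = stage_1.unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    eps_hat, _ = out.chunk(2, dim=1)  # keep the predicted noise, drop the variance half

# Invert the forward process: x0 = (x_t - sqrt(1 - abar_t) eps) / sqrt(abar_t)
abar_t = alphas_cumprod[t]
x0_hat = (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
```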
Berkeley Campanile | Berkeley Campanile | Berkeley Campanile
Noisy Campanile at t=250 | Noisy Campanile at t=500 | Noisy Campanile at t=750
One-Step Denoised Campanile at t=250 | One-Step Denoised Campanile at t=500 | One-Step Denoised Campanile at t=750
Iterative Denoising
The UNet denoiser clearly works much better than the Gaussian filter, but the result is still blurry when more noise is added to the image. To improve performance further, I implement iterative denoising. In theory, the diffusion model allows denoising iteratively over all $T=1000$ timesteps, but to save time I use a stride of 30. I generate a list of timesteps strided_timesteps with values:
[990, 960, 930, 900, 870, 840, 810, 780, 750, 720, 690, 660, 630, 600, 570, 540, 510, 480, 450, 420, 390, 360, 330, 300, 270, 240, 210, 180, 150, 120, 90, 60, 30, 0]
The denoising update at the $i$-th step is:
$$x_{t'}=\frac{\sqrt{\bar{\alpha}_{t'}}\beta_t}{1-\bar{\alpha}_t}x_0+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t'})}{1-\bar{\alpha}_t}x_t+v_\sigma$$
$t$: time at strided_timesteps[i]
$t'$: time at strided_timesteps[i+1]
$x_t$: image at timestep $t$
$x_{t'}$: image at timestep $t'$
$\bar{\alpha}_t$: alpha_cumprod
$\alpha_t$: $\frac{\bar{\alpha}_t}{\bar{\alpha}_{t'}}$
$\beta_t$: $1-\alpha_t$
$x_0$: the current estimate of the clean image, as in one-step denoising
$v_\sigma$: random noise
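A sketch of one update from $t$ to $t'$, with x0_hat computed as in one-step denoising above (function and variable names are assumptions):

```python
def denoise_step(x_t, x0_hat, t, t_prime, alphas_cumprod, v_sigma):
    # Move from the noisier timestep t to the cleaner timestep t'
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha_t = abar_t / abar_tp
    beta_t = 1 - alpha_t
    return (abar_tp.sqrt() * beta_t / (1 - abar_t)) * x0_hat \
         + (alpha_t.sqrt() * (1 - abar_tp) / (1 - abar_t)) * x_t \
         + v_sigma
```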
Noisy Campanile at t=90 | Noisy Campanile at t=240 | Noisy Campanile at t=390 | Noisy Campanile at t=540 | Noisy Campanile at t=690 | Noisy Campanile at t=840 | Noisy Campanile at t=990
In this part, I use i_start = 10, which corresponds to timestep 690.
Berkeley Campanile | Noisy Campanile at t=690 | Iterative Denoised Campanile | One-Step Denoised Campanile | Gaussian Denoised Campanile
Diffusion Model Sampling
By taking i_start = 0, the algorithm denoises from pure noise and generates images from scratch.
Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5
Classifier-Free Guidance (CFG)
Some images generated in the previous section are not very good. To improve image quality, I use a technique called Classifier-Free Guidance. In this technique, the algorithm computes both a conditional and an unconditional noise estimate; the new noise estimate is then:
$$\epsilon=\epsilon_u+\gamma(\epsilon_c-\epsilon_u)$$
By taking $\gamma>1$, we get much higher-quality images. For this and later sections, I use the text embedding of "" as the unconditional prompt and "a high quality photo" as the conditional prompt.
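A sketch of the CFG noise estimate (the guidance scale of 7 is an assumed example value; uncond_embeds and cond_embeds come from the "" and "a high quality photo" prompts, and the variance channels are omitted for brevity, as in the one-step sketch):

```python
eps_u = stage_1.unet(x_t, t, encoder_hidden_states=uncond_embeds).sample
eps_c = stage_1.unet(x_t, t, encoder_hidden_states=cond_embeds).sample
gamma = 7.0  # gamma > 1 amplifies the conditional direction
eps = eps_u + gamma * (eps_c - eps_u)
```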
Sample 1 with CFG | Sample 2 with CFG | Sample 3 with CFG | Sample 4 with CFG | Sample 5 with CFG
Compared to the previous section, the results are much more vivid and higher-contrast.
Image-to-Image Translation
Image-to-image translation takes a clean image, adds noise up to a chosen level, and then denoises it. This allows edits to existing images: the more noise added, the larger the edit.
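A sketch of this procedure, assuming the forward helper above and an iterative_denoise routine built from the update step (the starting indices correspond to the noise levels shown below):

```python
# Larger i_start = less noise added = smaller edit.
for i_start in [1, 3, 5, 7, 10, 20]:
    t = strided_timesteps[i_start]
    x_t = forward(x_orig, t, alphas_cumprod)  # noise the clean image to level t
    edited = iterative_denoise(x_t, i_start)  # then denoise back to t=0
```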
Edit with t=960 | Edit with t=900 | Edit with t=840 | Edit with t=780 | Edit with t=690 | Edit with t=390 | Berkeley Campanile
Edit with t=960 | Edit with t=900 | Edit with t=840 | Edit with t=780 | Edit with t=690 | Edit with t=390 | Cat Meme
Edit with t=960 | Edit with t=900 | Edit with t=840 | Edit with t=780 | Edit with t=690 | Edit with t=390 | Sad Meme
Editing Hand-Drawn and Web Images
Besides realistic photos, the algorithm can also edit hand-drawn and web images.
Edit with t=960 | Edit with t=900 | Edit with t=840 | Edit with t=780 | Edit with t=690 | Edit with t=390 | Web Img
Edit with t=960 | Edit with t=900 | Edit with t=840 | Edit with t=780 | Edit with t=690 | Edit with t=390 | Hand-Drawn 1
Edit with t=960 | Edit with t=900 | Edit with t=840 | Edit with t=780 | Edit with t=690 | Edit with t=390 | Hand-Drawn 2
Inpainting
Given an image and a mask, I can also generate images in which only the masked area changes while the rest stays the same. At each loop iteration, the new image becomes:
$$x_t\leftarrow mx_t+(1-m)\text{forward}(x_{orig},t)$$
$x_{orig}$: original image
$m$: binary mask
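A sketch of the inpainting loop, assuming the forward helper above and a per-step CFG denoising function (denoise_step_cfg is a hypothetical name):

```python
for i in range(i_start, len(strided_timesteps) - 1):
    t = strided_timesteps[i]
    x_t = denoise_step_cfg(x_t, t)  # one iterative-denoising step with CFG
    # Force pixels outside the mask back to the (appropriately noised) original.
    x_t = mask * x_t + (1 - mask) * forward(x_orig, t, alphas_cumprod)
```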
Campanile | Mask | Hole to Fill | Campanile Filled
Cat | Mask | Hole to Fill | Cat Filled
Sunset | Mask | Hole to Fill | Sunset Filled
Text-Conditional Image-to-Image Translation
In this section, I experiment with image editing by changing the text prompt from "a high quality photo" to other prompts.
prompt = "a rocket ship"
Edit t=960 | Edit t=900 | Edit t=840 | Edit t=780 | Edit t=690 | Edit t=390 | Campanile to Rocket Ship
|
prompt = "a photo of a man"
Edit t=960 | Edit t=900 | Edit t=840 | Edit t=780 | Edit t=690 | Edit t=390 | Cat to Man
|
prompt = "a photo of a hipster barista"
Edit t=960 | Edit t=900 | Edit t=840 | Edit t=780 | Edit t=690 | Edit t=390 | Sunset to Hipster Barista
|
Visual Anagrams
In this section, I create optical illusions with diffusion models. The model generates images that look like one thing right-side up and another thing upside down. To implement this, the noise estimate from the model is modified according to the algorithm:
$$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$
$$\epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))$$
$$\epsilon = (\epsilon_1 + \epsilon_2)/2$$
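A sketch of this noise estimate, where p1_embeds and p2_embeds are the embeddings of the two prompts (flipping dim 2 of a BCHW tensor turns the image upside down):

```python
import torch

eps1 = stage_1.unet(x_t, t, encoder_hidden_states=p1_embeds).sample
eps2 = torch.flip(
    stage_1.unet(torch.flip(x_t, dims=[2]), t, encoder_hidden_states=p2_embeds).sample,
    dims=[2],
)
eps = (eps1 + eps2) / 2  # average the two estimates
```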
an old painting of people around a campfire | an old painting of an old man
a rocket ship | a pencil
a photo of a dog | a photo of a man
Hybrid Images
In this section, I create another optical illusion that looks like one thing up close and another thing from far away. The noise estimate is modified according to the algorithm:
$$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$
$$\epsilon_2 = \text{UNet}(x_t, t, p_2)$$
$$\epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2)$$
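A sketch using a Gaussian blur as the lowpass filter (the kernel size and sigma are assumed values; the highpass is taken as the residual of the lowpass):

```python
import torchvision.transforms.functional as TF

eps1 = stage_1.unet(x_t, t, encoder_hidden_states=p1_embeds).sample
eps2 = stage_1.unet(x_t, t, encoder_hidden_states=p2_embeds).sample
low = TF.gaussian_blur(eps1, kernel_size=33, sigma=2.0)          # low frequencies of p1
high = eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2.0)  # high frequencies of p2
eps = low + high
```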
Hybrid image of a skull and a waterfall | Hybrid image of a waterfall and a dog | Hybrid image of an old man and a rocket ship
Part B: Diffusion Models from Scratch
In this part, I build the denoising UNet from scratch using torch.nn. I use the MNIST dataset from torchvision.datasets.MNIST for training.
For this whole part, I use random seed $24$ for reproducibility.
Unconditional UNet
The Unconditional UNet structure is:
The standard tensor operations are defined as:
For the forward process, I add noise to clean images according to the noise level $\sigma$:
$$z=x+\sigma\epsilon,\text{ where }\epsilon\sim N(0, I)$$
Various Noise Levels
The hidden dimension of the unconditional UNet is $128$.
I train the model on noisy images $z$ obtained by applying $\sigma=0.5$ noise to clean images $x$. The batch size is $256$ and the number of epochs is $5$. I use the Adam optimizer with an initial learning rate of $1e-4$.
I train the model using an L2 loss:
$$L=\mathbb{E}_{z,x}\lVert D_\theta(z)-x\rVert^2$$
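A sketch of one training step under these settings (unet, opt, and device are assumed to be set up already):

```python
import torch
import torch.nn.functional as F

x = batch.to(device)               # clean MNIST images
z = x + 0.5 * torch.randn_like(x)  # noisy input with sigma = 0.5
loss = F.mse_loss(unet(z), x)      # L2 loss against the clean image
opt.zero_grad()
loss.backward()
opt.step()
```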
The training loss is:
Unconditional UNet Training Loss
The denoising results at different epochs are (top: original image; middle: noisy image; bottom: estimated original image):
Unconditional UNet Denoise Result Epoch 1 | Unconditional UNet Denoise Result Epoch 5
I also test the model on out-of-distribution noise levels:
Out of Distribution Noise Levels Denoise Result
Time Conditional UNet
To implement a diffusion model similar to Part A, I need to add a time variable to the model to perform iterative denoising. To do this, I add two fully-connected blocks to the model, with the following structure:
The fully-connected block is defined as:
The forward process (adding noise) in this model becomes:
$$x_t = \sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon\ \text{ where }\ \epsilon\sim N(0, I)\ \text{ for }\ t\in\{0,1,...,T\}$$
Some parameters are precomputed in the UNet. Following the DDPM paper, they are:
$\beta_t$: a list of $\beta$ values of length $T+1$ such that $\beta_0=0.0001$, $\beta_T=0.02$, and all other elements $\beta_t$ for $t\in\{1,...,T-1\}$ are evenly spaced between the two
$\alpha_t$: $1-\beta_t$
$\bar{\alpha}_t$: $\prod_{s=1}^t{\alpha_s}$, the cumulative product of $\alpha_s$ for $s\in\{1,...,t\}$
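A sketch of these precomputations in torch:

```python
import torch

T = 300
betas = torch.linspace(1e-4, 0.02, T + 1)      # beta_0 = 0.0001, ..., beta_T = 0.02
alphas = 1.0 - betas                           # alpha_t
alphas_cumprod = torch.cumprod(alphas, dim=0)  # bar{alpha}_t
```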
The training algorithm of the model is:
The denoising algorithm of the model is:
For this UNet model, the hidden dimension is $64$ and the total number of timesteps is $T=300$.
For training, I use a batch size of $128$ and $20$ epochs. I still use the Adam optimizer, with an initial learning rate of $1e-3$. I also set an exponential learning-rate decay scheduler with a gamma of $0.1^{(1.0/\text{num\_epochs})}$ by calling torch.optim.lr_scheduler.ExponentialLR.
I still use L2 loss in training:
$$L=\mathbb{E}_{\epsilon, x_0,t}\lVert\epsilon_\theta(x_t,t)-\epsilon\rVert ^2$$
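A sketch of the training loop under these settings (normalizing $t$ to $[0,1]$ before feeding it to the UNet is an assumption about the model's interface):

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(unet.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.1 ** (1.0 / 20))

for epoch in range(20):
    for x0, _ in train_loader:                     # class labels unused here
        t = torch.randint(1, T + 1, (x0.shape[0],))
        eps = torch.randn_like(x0)
        abar = alphas_cumprod[t].view(-1, 1, 1, 1)
        x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
        loss = F.mse_loss(unet(x_t, t / T), eps)   # predict the added noise
        opt.zero_grad()
        loss.backward()
        opt.step()
    scheduler.step()
```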
The training loss is:
Time Conditional UNet Training Loss
The denoising results at different epochs are:
Time Conditional UNet Denoise Result Epoch 1 | Time Conditional UNet Denoise Result Epoch 5 | Time Conditional UNet Denoise Result Epoch 10 | Time Conditional UNet Denoise Result Epoch 15 | Time Conditional UNet Denoise Result Epoch 20
The denoised output after training is:
Time Conditional UNet Test Denoised Result
Class Conditional UNet
Some of the denoised outputs of the time-conditional model are clearly still not very good at generating digits. Hence, I improve the model by adding a class condition. To implement this, I add two additional fully-connected blocks at the same places as the time-conditioning blocks.
The training algorithm of the model is:
The denoising algorithm of the model is:
The model parameters and training process are similar to those of the time-conditional UNet. For the class-conditioning vector $c$, I use one-hot encoding to produce the vector from the dataset labels. Since the UNet sometimes needs to work without the class condition, I implement dropout of the class condition with a probability of $10\%$ by setting the class-conditioning vector to 0, as sketched below.
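A sketch of building the class-conditioning vectors with this dropout:

```python
import torch
import torch.nn.functional as F

c = F.one_hot(labels, num_classes=10).float()  # one-hot class vectors
drop = torch.rand(c.shape[0], device=c.device) < 0.1
c[drop] = 0.0  # zero vector = unconditional, with 10% probability
```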
The training loss is:
Class Conditional UNet Training Loss
The denoising results at different epochs are:
Class Conditional UNet Denoise Result Epoch 1 | Class Conditional UNet Denoise Result Epoch 5 | Class Conditional UNet Denoise Result Epoch 10 | Class Conditional UNet Denoise Result Epoch 15 | Class Conditional UNet Denoise Result Epoch 20
The denoised output after training is:
Class Conditional UNet Test Denoised Result