torch.nn.Module with the MNIST dataset.
"a high quality photo" as the default text embedding unless stated otherwise.
[Figure: two sets of outputs, each with three stage 1 images and three stage 2 images]
alpha_cumprod, determined by the trainer of the model
[Figure: images at t=250, t=500, and t=750]
stage_1.unet. This denoiser estimates the noise in a noisy image given the timestep. Removing the
estimated noise from the noisy image then recovers an estimate of the original image.
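The relationship between the noisy image, the noise, and the recovered estimate can be sketched as follows. This is a minimal illustration, assuming the standard DDPM forward process $x_t=\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon$; plain Python lists stand in for image tensors, and `add_noise`, `estimate_x0`, and the value of `abar_t` are hypothetical names and numbers rather than the project's actual code. In practice the noise estimate would come from the denoiser instead of being known exactly.

```python
import math
import random

def add_noise(x0, abar_t, eps):
    """Forward process: scale the clean signal and mix in Gaussian noise."""
    return [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * e
            for x, e in zip(x0, eps)]

def estimate_x0(x_t, abar_t, eps_hat):
    """Invert the forward process given a noise estimate eps_hat."""
    return [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
            for x, e in zip(x_t, eps_hat)]

random.seed(0)
x0 = [0.2, -0.5, 0.9]                       # toy "image" with three pixels
eps = [random.gauss(0.0, 1.0) for _ in x0]  # noise that is actually added
abar_t = 0.4                                # hypothetical alpha_cumprod[t]

x_t = add_noise(x0, abar_t, eps)
# With a perfect noise estimate (eps_hat == eps) the original is recovered.
x0_hat = estimate_x0(x_t, abar_t, eps)
```

With an imperfect noise estimate, as a real denoiser produces, `x0_hat` is only an approximation, and the error tends to grow with the noise level.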
[Figure: results at t=250, t=500, and t=750]
strided_timesteps, with values:
[990, 960, 930, 900, 870, 840, 810, 780, 750, 720, 690, 660, 630, 600, 570, 540, 510, 480, 450, 420, 390, 360, 330, 300, 270, 240, 210, 180, 150, 120, 90, 60, 30, 0]
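The values above are consistent with a stride of 30 counting down from 990 to 0. A minimal sketch of how such a schedule could be generated (the construction is an assumption; only the values themselves come from the text):

```python
# Assumed construction: start at 990 and step down by 30 until reaching 0.
strided_timesteps = list(range(990, -1, -30))

# Index i selects the timestep used at the i-th denoising step.
first_t = strided_timesteps[0]    # 990, the noisiest timestep in the list
tenth_t = strided_timesteps[10]   # 690
last_t = strided_timesteps[-1]    # 0, the clean image
```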
i-th step is:
$$x_{t'}=\frac{\sqrt{\bar{\alpha}_{t'}}\beta_t}{1-\bar{\alpha}_t}x_0+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t'})}{1-\bar{\alpha}_t}x_t+v_\sigma$$
$t$: time at strided_timesteps[i]
$t'$: time at strided_timesteps[i+1]
$\bar{\alpha}_t$: given by alpha_cumprod
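A single update of this form can be sketched with scalars standing in for image tensors. The sketch assumes the usual convention $\alpha_t=\bar{\alpha}_t/\bar{\alpha}_{t'}$ and $\beta_t=1-\alpha_t$, which is not spelled out in the text; `denoise_step` and its parameter names are hypothetical, and the noise term $v_\sigma$ is passed in rather than computed.

```python
import math

def denoise_step(x_t, x0_hat, abar_t, abar_prev, v_sigma=0.0):
    """Move from timestep t to the less noisy timestep t'.

    abar_t and abar_prev are alpha_cumprod at t and t' respectively.
    """
    alpha_t = abar_t / abar_prev   # assumed definition of alpha_t
    beta_t = 1.0 - alpha_t         # assumed definition of beta_t
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return coef_x0 * x0_hat + coef_xt * x_t + v_sigma

# Sanity check in the noise-free case: if x_t is exactly sqrt(abar_t) * x0,
# the step should land on sqrt(abar_prev) * x0.
x0 = 2.0
abar_t, abar_prev = 0.3, 0.6       # hypothetical alpha_cumprod values
x_t = math.sqrt(abar_t) * x0
x_prev = denoise_step(x_t, x0, abar_t, abar_prev)
```

The two coefficients interpolate between the current noisy image and the clean estimate, so each step shifts weight toward the estimate as the timestep decreases.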
i_start = 10, which corresponds to timestep 690.
[Figure: Campanile, and the image at t=690]
i_start = 0, the algorithm denoises from pure noise and so generates new images.
"" as the unconditional prompt and "a high quality photo" as the conditional prompt.
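Pairing an unconditional prompt with a conditional one is the setup used by classifier-free guidance (CFG). Assuming that is the technique applied here, the two noise estimates would be combined as $\epsilon=\epsilon_u+\gamma(\epsilon_c-\epsilon_u)$. The sketch below uses plain lists in place of tensors; `cfg_noise` and the guidance scale are hypothetical, chosen only for illustration.

```python
def cfg_noise(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise estimates.

    scale = 0 returns the unconditional estimate, scale = 1 the conditional
    one, and scale > 1 extrapolates past the conditional estimate.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_u = [0.1, -0.2]   # toy estimate for the "" prompt
eps_c = [0.3, 0.0]    # toy estimate for "a high quality photo"
eps = cfg_noise(eps_u, eps_c, scale=7.0)  # hypothetical guidance scale
```

The combined estimate then replaces the single noise estimate in the denoising update.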
"a high quality photo".
[Figure: three rows of images, each shown at t=960, t=900, t=840, t=780, t=690, and t=390]
[Figure: people around a campfire; an old man]
torch.nn. I use the MNIST dataset from torchvision.datasets.MNIST for training in this part.
torch.optim.lr_scheduler.ExponentialLR.
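An exponential scheduler multiplies the learning rate by a fixed factor `gamma` after every epoch, so the rate at epoch $k$ is $\text{lr}_0\cdot\gamma^k$. The sketch below reproduces that schedule in plain Python to show its shape; `exponential_lr`, the initial rate, the epoch count, and the choice of `gamma` are hypothetical values, not settings taken from this project.

```python
def exponential_lr(initial_lr, gamma, num_epochs):
    """Learning rate used in each epoch under exponential decay."""
    return [initial_lr * gamma ** epoch for epoch in range(num_epochs)]

num_epochs = 5                       # hypothetical training length
initial_lr = 1e-3                    # hypothetical starting rate
gamma = 0.1 ** (1.0 / num_epochs)    # decays by 10x over num_epochs epochs
lrs = exponential_lr(initial_lr, gamma, num_epochs)
```

Tying `gamma` to the number of epochs, as in this example, fixes the total decay regardless of how long training runs.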