COMPSCI 180 Project 5: Fun with Diffusion Models!

Kaitlyn Chen

Introduction:

The goal of this project is to generate images using diffusion models. In part A of this project we experiment with a pretrained text-to-image diffusion model (DeepFloyd IF) to implement diffusion sampling loops and use them for inpainting and creating optical illusions. In part B we train our own diffusion model on the MNIST dataset to generate images of MNIST digits.

Part A:

Part 0: Setup:

We first instantiate DeepFloyd's stage_1 and stage_2 objects used for generation, as well as several text prompts for sample generation. To the right are 3 text prompts used as inputs and their respective image outputs. I ran the model twice with different num_inference_steps, which tells how many denoising steps to take. The top three images are with num_inference_steps = 20, and the bottom three are with num_inference_steps = 100.
With 100 inference steps the images are more detailed and clearer.

with num_inference_steps = 20
with num_inference_steps = 100
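For reference, a minimal sketch of how the two stages can be instantiated and called through Hugging Face diffusers (the names and arguments below follow the public DeepFloyd IF examples and may differ slightly from the project notebook):

    import torch
    from diffusers import DiffusionPipeline

    # Stage 1 generates a 64x64 image; stage 2 upsamples it to 256x256.
    stage_1 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16).to("cuda")
    stage_2 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-II-L-v1.0", variant="fp16", torch_dtype=torch.float16).to("cuda")

    prompt_embeds, negative_embeds = stage_1.encode_prompt("an oil painting of a snowy mountain village")

    # num_inference_steps controls how many denoising steps the sampler takes.
    image_64 = stage_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds,
                       num_inference_steps=20, output_type="pt").images
    image_256 = stage_2(image=image_64, prompt_embeds=prompt_embeds,
                        negative_prompt_embeds=negative_embeds, output_type="pil").images[0]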

Part 1: Sampling Loops

In this section we write our own "sampling loops" using the pretrained DeepFloyd denoisers. A diffusion model, given a noisy image, predicts the noise in that image at a given timestep. This can be used to generate images from pure noise: repeatedly predicting the noise at timestep $t$ and then removing part of it gives us a less noisy image at timestep $t-1$, until we arrive at a clean image. Timestep $t = T$ corresponds to pure noise and $t = 0$ to a clean image, with the noise schedule encoded by $\bar\alpha_t$. In the DeepFloyd models, $T = 1000$.

1.1 Implementing the Forward Process

The forward process of diffusion takes a clean image and produces a progressively noisier, scaled version of it. It is defined by:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \quad \epsilon \sim N(0, I)$$

where $x_0$ is the input clean image and $x_t$ is the noisy image at timestep $t$, sampled from a Gaussian with mean $\sqrt{\bar\alpha_t}\, x_0$ and variance $(1 - \bar\alpha_t)$. $\bar\alpha_t$ is given for all $t$. I used torch.randn_like to sample the noise.
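A minimal sketch of this forward process, assuming alphas_cumprod is the precomputed tensor of $\bar\alpha_t$ values indexed by timestep:

    import torch

    def forward(x0, t):
        """Add noise to a clean image x0 to produce x_t."""
        abar = alphas_cumprod[t]
        eps = torch.randn_like(x0)                          # epsilon ~ N(0, I)
        return abar.sqrt() * x0 + (1 - abar).sqrt() * eps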
Results from running the forward process on the campanile image (resized to 64x64) with $t \in [250, 500, 750]$:

1.2 Classical Denoising

Now we try to denoise the noisy campanile images classically, with Gaussian blur filtering. I used torchvision.transforms.functional.gaussian_blur with a kernel size of 5 and a sigma of 2.
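For example (kernel size and sigma as stated above):

    from torchvision.transforms.functional import gaussian_blur

    # Classical denoising: blur away the high-frequency noise (and, unavoidably, detail).
    blurred = gaussian_blur(noisy_image, kernel_size=5, sigma=2)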

1.3 One-Step Denoising

Now we use a pretrained diffusion model to denoise. The denoiser is stage_1.unet, which has already been trained on a large dataset of $(x_0, x_t)$ image pairs. We use this model to estimate the Gaussian noise in an image and then remove it, getting something close to the original campanile image. The model was trained with text conditioning, so we use the previously downloaded text prompt embedding for "a high quality photo", which acts as a null prompt.
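A rough sketch of this one-step denoise (the UNet call follows the diffusers convention, and the channel split assumes the DeepFloyd UNet predicts noise and variance jointly):

    import torch

    def one_step_denoise(xt, t, prompt_embeds):
        abar = alphas_cumprod[t]
        with torch.no_grad():
            out = stage_1.unet(xt, t, encoder_hidden_states=prompt_embeds).sample
        eps = out[:, :3]                                    # keep only the predicted-noise channels
        # Invert the forward process: x0 ≈ (x_t - sqrt(1 - abar) * eps) / sqrt(abar)
        return (xt - (1 - abar).sqrt() * eps) / abar.sqrt()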

For each of the three noisy images generated previously, I pass the noisy image through stage_1.unet to estimate the noise it contains, then remove that noise (undoing the forward-process scaling) to obtain an estimate of the original image. Visualizations of the results at each noise level:

1.4 Iterative Denoising

We can see that one-step denoising from the previous section produces worse results as the noise level increases. Diffusion models are designed to denoise iteratively, a little at a time, so in this section we do exactly that. Denoising at all 1000 timesteps would be slow and computationally expensive, so instead we skip steps, which turns out to work well. We create a list of timesteps called strided_timesteps with a stride of 30, so at the i-th denoising step we are at $t$ = strided_timesteps[i] and produce a less noisy image at $t'$ = strided_timesteps[i+1]. We use the equation:

$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\, x_t + v_\sigma$$

where:

  • $x_t$ is your image at timestep $t$
  • $x_{t'}$ is your image at timestep $t'$, where $t' < t$ (less noisy)
  • $\bar\alpha_t$ is defined by alphas_cumprod, as explained above
  • $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$
  • $\beta_t = 1 - \alpha_t$
  • $x_0$ is our current estimate of the clean image
  • $v_\sigma$ is the variance term, added by the staff-provided add_variance function
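A sketch of one update of this loop (add_variance is the staff-provided helper; its exact signature is an assumption here, and eps comes from the UNet as in 1.3):

    def denoise_step(xt, t, t_prime, eps):
        abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
        alpha_t = abar_t / abar_tp
        beta_t = 1 - alpha_t
        # Current clean-image estimate, from inverting the forward process
        x0_hat = (xt - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()
        # Posterior mean: blend the clean estimate with the current noisy image
        mean = (abar_tp.sqrt() * beta_t / (1 - abar_t)) * x0_hat \
             + (alpha_t.sqrt() * (1 - abar_tp) / (1 - abar_t)) * xt
        return add_variance(mean, t)                        # hypothetical call; adds v_sigma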

1.5 Diffusion Model Sampling

We can also generate images from scratch using iterative denoising: we just set i_start to 0 and pass in random noise, effectively denoising pure noise. I use torch.randn(1, 3, 64, 64).half() to generate the noise.
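In other words (iterative_denoise here is the function from 1.4, with an assumed signature):

    noise = torch.randn(1, 3, 64, 64).half().to("cuda")
    sample = iterative_denoise(noise, i_start=0)            # denoise pure noise into an image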

Examples with “a high quality photo” and “a photo of a dog”:

1.6 Classifier-Free Guidance (CFG)

The images from the last step were not great, so here we increase their quality (at the cost of some variability) using classifier-free guidance (CFG). This technique computes a conditional and an unconditional noise estimate, $\epsilon_c$ and $\epsilon_u$, then combines them into a single noise estimate $\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)$. Here $\gamma$ controls the strength of CFG: a value greater than 1 pushes the estimate past the unconditional one toward the conditional one, which empirically gives much higher-quality results. I created a function iterative_denoise_cfg that is the same as iterative_denoise from part 1.4 except for the noise estimate. We use the empty prompt "" as the null prompt for the unconditional pass. The UNet is run twice per step, once with the conditional prompt embedding and once with the unconditional one, and the variance is taken from the conditional pass only.
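A sketch of the per-step CFG noise estimate (two UNet passes, conditional and unconditional):

    eps_cond = stage_1.unet(xt, t, encoder_hidden_states=cond_embeds).sample[:, :3]
    eps_uncond = stage_1.unet(xt, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
    gamma = 7
    eps = eps_uncond + gamma * (eps_cond - eps_uncond)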

5 images of "a high quality photo" with $\gamma = 7$:

1.7 Image-to-image Translation

In this section we take an original image, add noise to it, and denoise it back, following the SDEdit algorithm. I first ran the forward process to get a noisy image, then ran the iterative_denoise_cfg function from the previous part with a starting index in [1, 3, 5, 7, 10, 20]. As the starting index increases (less noise is added), the edited output looks progressively more like the original image.
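Sketch of the procedure (the iterative_denoise_cfg signature is assumed; forward is the function from 1.1):

    for i_start in [1, 3, 5, 7, 10, 20]:
        t = strided_timesteps[i_start]
        noisy = forward(original_image, t)                  # add noise to the original
        edited = iterative_denoise_cfg(noisy, i_start=i_start)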

3 Examples with conditional prompt being “a high quality photo” and unconditional prompt “”:

Example 1 with campanile original image:


Example 2 with minion original image:


Example 3 with dog original image:

1.7.1 Editing Hand-Drawn and Web Images

The SDEdit algorithm works even better for nonrealistic images such as paintings or sketches, so in this section we apply it to those: we start with a nonrealistic image and project it onto the natural image manifold. I used the staff-provided code to load a web image (the avocado) and to draw my own sketches (the shark and the house).

Example with web image:


Example 1 with hand drawn image of a shark:


Example 2 with hand drawn image of a house:


1.7.2 Inpainting

We can use the same idea to fill in a section of an image. Here we take an image $x_{original}$ and a binary mask $m$, then create a new image that preserves the original image where $m$ is 0 (black) and creates new content where $m$ is 1 (white). To do this we run the same diffusion denoising loop as before, but after every step we force the pixels where the mask is 0 to match the (appropriately noised) original image, keeping the newly generated content only where the mask is 1.
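Concretely, after every denoising step we apply the mask constraint (a sketch; forward is the function from 1.1):

    # Keep original content where mask == 0, generated content where mask == 1.
    xt = mask * xt + (1 - mask) * forward(x_original, t)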

Example images with 2 different outputs:

Original Images


Mask


Hole to Fill


Inpainted example 1


Inpainted example 2





1.7.3 Text-Conditional Image-to-image Translation

Now we do the same thing as in section 1.7, but guide the projection with a text prompt instead of the null prompt.
Examples are below, where the rightmost image is the original input image.

Example with prompt "a rocket ship”:


Example with prompt "a photo of a man”:


Example with prompt "a photo of a dog”:

1.8 Visual Anagrams

Here we create visual anagrams with diffusion models: a single image that shows one thing right-side up and something else when flipped upside down. We denoise an image using the first text prompt to obtain a noise estimate $\epsilon_1$, and also denoise the same image flipped upside down with the second prompt to obtain a second noise estimate $\epsilon_2$. We then simply flip $\epsilon_2$ back and average the two noise estimates, and run the same denoising diffusion step with this averaged noise estimate.
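A sketch of the combined noise estimate (in practice each estimate also goes through CFG as in 1.6):

    # Noise estimate for the upright image with prompt 1
    eps1 = stage_1.unet(xt, t, encoder_hidden_states=embeds_prompt1).sample[:, :3]
    # Noise estimate for the flipped image with prompt 2, flipped back afterwards
    eps2 = stage_1.unet(torch.flip(xt, dims=[2]), t, encoder_hidden_states=embeds_prompt2).sample[:, :3]
    eps = (eps1 + torch.flip(eps2, dims=[2])) / 2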

Two outputted examples using prompts “an oil painting of people around a campfire” and “an oil painting of an old man”:


Example with prompts “a photo of the amalfi cost” and “an oil painting of a snowy mountain village”:

Example with prompts “a photo of an old woman” and “a dress”:

1.9 Hybrid Images

In this section we create hybrid images, where you see the high frequencies of one image up close and the low frequencies of another from afar. This is done with Factorized Diffusion, a technique similar to the visual anagrams: we create two noise estimates with two prompts, then combine the low frequencies of one estimate with the high frequencies of the other. To get the low frequencies of a noise estimate I pass it through torchvision.transforms.functional.gaussian_blur with a kernel size of 33 and a sigma of 2; the high frequencies are the estimate minus its blurred version.
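A sketch of the combined noise estimate for hybrid images:

    from torchvision.transforms.functional import gaussian_blur

    eps_low = gaussian_blur(eps1, kernel_size=33, sigma=2)            # low frequencies of estimate 1
    eps_high = eps2 - gaussian_blur(eps2, kernel_size=33, sigma=2)    # high frequencies of estimate 2
    eps = eps_low + eps_high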

Example images:

Part B:

Part 1: Training a Single-Step Denoising UNet

In part 1 we build a one-step denoiser: we train a denoiser $D_\theta$ to map a noisy image $z$ to a clean one $x$ by optimizing the L2 loss:

$$L = \mathbb{E}_{z,x}\|D_\theta(z) - x\|^2$$

1.1 Implementing the UNet

The denoiser is implemented as a UNet, like the one from part A. It takes an image with some amount of noise and outputs a denoised image of a digit. MNIST images are 28x28 pixels.

Figure 1: Unconditional UNet
Figure 2: Standard UNet Operations

1.2 Using the UNet to Train a Denoiser

We first need to train our denoiser on pairs of noisy and clean data $(z, x)$, where $z = x + \sigma\epsilon$ and $\epsilon \sim N(0, I)$. Here we visualize the noising process over $\sigma \in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$.

Figure 3: Varying levels of noise on MNIST digits
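Generating one of these training pairs is a single line (a sketch, where x is a batch of clean MNIST images in [0, 1]):

    sigma = 0.5
    z = x + sigma * torch.randn_like(x)                     # z = x + sigma * eps, eps ~ N(0, I)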

1.2.1 Training

We now train the model to denoise images noised with $\sigma = 0.5$. We use the MNIST training and test sets, a batch size of 256, and train over the dataset for 5 epochs. We use our UNet architecture with hidden dimension D = 128 and the Adam optimizer with a learning rate of 1e-4.
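A sketch of this training loop (unet is the unconditional UNet from 1.1; device is assumed to be defined):

    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    train_set = datasets.MNIST(root="./data", train=True, download=True,
                               transform=transforms.ToTensor())
    loader = DataLoader(train_set, batch_size=256, shuffle=True)
    optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)

    for epoch in range(5):
        for x, _ in loader:
            x = x.to(device)
            z = x + 0.5 * torch.randn_like(x)               # noisy input, sigma = 0.5
            loss = ((unet(z) - x) ** 2).mean()              # L2 loss against the clean image
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()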

Figure 4: Training Loss Curve
Figure 5: Results on digits from the test set after 1 epoch of training
Figure 6: Results on digits from the test set after 5 epochs of training

1.2.2 Out-of-Distribution Testing

From the previous section, our denoiser was trained with $\sigma = 0.5$, so here we see how well it does at other noise levels. Below are outputs of the denoiser on test-set digits at varying values of $\sigma$.

Figure 7: Results on digits from the test set with varying noise levels.

Part 2: Training a Diffusion Model

Now we are going to train a UNet model to iteratively denoise an image (a DDPM implementation). We change the UNet to predict the added noise $\epsilon$ instead of the clean image $x$, giving the loss

$$L = \mathbb{E}_{\epsilon,z}\|\epsilon_\theta(z) - \epsilon\|^2$$

where $\epsilon_\theta$ is the UNet's noise prediction. To generate an image $x$, we want to sample pure noise $\epsilon \sim N(0, I)$ and iteratively denoise it. We use timesteps with the same forward equation as earlier, $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$ with $\epsilon \sim N(0, I)$, and the staff-provided derivation to build the lists $\bar\alpha_t$, $\alpha_t$, and $\beta_t$ for all timesteps from 0 to $T$, with $T = 300$. We condition a single UNet on the timestep $t$, so the loss function becomes

$$L = \mathbb{E}_{\epsilon, x_0, t}\|\epsilon_\theta(x_t, t) - \epsilon\|^2$$
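A sketch of the schedule construction (the linear beta range here is an assumption consistent with standard DDPM; the staff derivation may differ):

    import torch

    T = 300
    betas = torch.linspace(1e-4, 0.02, T)                   # assumed linear beta schedule
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)           # bar alpha_t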

2.1 Adding Time Conditioning to UNet

We now adjust our UNet model to take in a scalar $t$. We add an FCBlock (fully-connected block) for the conditioning signal, where Linear(F_in, F_out) is a linear layer with F_in input features and F_out output features.

Figure 8: Conditioned UNet
Figure 9: FCBlock for conditioning
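A sketch of the FCBlock and of how the normalized timestep is injected, following the structure in Figures 8 and 9 (layer names and injection points here are my own assumptions):

    import torch.nn as nn

    class FCBlock(nn.Module):
        def __init__(self, f_in, f_out):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(f_in, f_out), nn.GELU(),
                                     nn.Linear(f_out, f_out))

        def forward(self, t):
            return self.net(t)

    # Conceptually, inside the UNet forward pass:
    #   t_emb1 = fc1(t)     # t is the timestep normalized to [0, 1], shape (B, 1)
    #   t_emb2 = fc2(t)
    #   unflat = unflat + t_emb1.view(-1, D, 1, 1)          # added to an intermediate feature map
    #   up1    = up1   + t_emb2.view(-1, D, 1, 1)           # and to an upsampling feature map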

2.2 Training the UNet

To train, we pick a random image from the training set and a random timestep $t$, and train the denoiser to predict the noise in $x_t$. We repeat this over different images and timesteps until the model converges, following the training pseudocode.

We use a hidden dimension of 64, the Adam optimizer with a learning rate of 1e-3, and an exponential learning-rate decay scheduler with gamma $0.1^{1.0/\text{num\_epochs}}$. Below is a plot of the training loss curve for the time-conditioned UNet over the whole training process.

Figure 10: Time-Conditioned UNet training loss curve
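A sketch of the training loop described above (the model(x_t, t) signature with $t$ normalized to [0, 1] is an assumption; loader and the schedule tensors are reused from earlier):

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(
        optimizer, gamma=0.1 ** (1.0 / num_epochs))

    for epoch in range(num_epochs):
        for x0, _ in loader:
            x0 = x0.to(device)
            t = torch.randint(0, T, (x0.shape[0],), device=device)   # random timestep per image
            eps = torch.randn_like(x0)
            abar = alphas_cumprod.to(device)[t].view(-1, 1, 1, 1)
            xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps          # forward process
            loss = ((model(xt, t.view(-1, 1).float() / T) - eps) ** 2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()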

2.3 Sampling from the UNet

We follow the same sampling process as in part A, but we don't predict the variance; instead we use the fixed $\beta_t$. The sampling algorithm is sketched below.
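A sketch of the sampling loop (the model signature matches the training sketch above; the variance uses the fixed $\beta_t$):

    @torch.no_grad()
    def sample(model, n=16):
        x = torch.randn(n, 1, 28, 28, device=device)
        for t in range(T - 1, -1, -1):
            t_norm = torch.full((n, 1), t / T, device=device)
            eps = model(x, t_norm)
            abar = alphas_cumprod[t]
            abar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
            x0_hat = (x - (1 - abar).sqrt() * eps) / abar.sqrt()     # clean-image estimate
            mean = (abar_prev.sqrt() * betas[t] / (1 - abar)) * x0_hat \
                 + (alphas[t].sqrt() * (1 - abar_prev) / (1 - abar)) * x
            z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
            x = mean + betas[t].sqrt() * z                           # add beta_t variance
        return x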

Below are the sampling results from the time-conditioned UNet after 5 and 20 epochs of training. Some samples are still not denoised very well.

2.4 Adding Class-Conditioning to the UNet

For better image quality and more control, we can condition our UNet on the class of the digit, 0 through 9 (each class represented as a one-hot-encoded vector), adding two more fully-connected blocks to feed the class embedding into the network. We also use dropout, setting the class-conditioning vector c to 0 10% of the time, so the model still learns unconditional generation. The training algorithm is similar to the previous section, now with the conditioning vector c and occasional unconditional training based on that dropout probability. Below is the pseudocode for class-conditioned training.
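In code, the class-conditioning part of a training step looks roughly like this (a fragment inside the training loop from 2.2; the model(x_t, c, t) signature is an assumption):

    import torch.nn.functional as F

    c = F.one_hot(labels, num_classes=10).float().to(device)
    drop = (torch.rand(c.shape[0], device=device) < 0.1).float().unsqueeze(1)
    c = c * (1 - drop)                                       # zero out the conditioning 10% of the time
    loss = ((model(xt, c, t_norm) - eps) ** 2).mean()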

2.5 Sampling from the Class-Conditioned UNet

This sampling process is also very similar to part A. We again use classifier-free guidance for better results, with $\gamma = 5.0$. Below is the sampling pseudocode, followed by my sampling results for the class-conditioned UNet after 5 and 20 epochs, with 4 instances of each digit.
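The per-step CFG noise estimate during class-conditioned sampling, roughly (a sketch):

    eps_cond = model(x, c, t_norm)                           # conditioned on the one-hot class
    eps_uncond = model(x, torch.zeros_like(c), t_norm)       # null class
    eps = eps_uncond + 5.0 * (eps_cond - eps_uncond)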