cs180-portfolio

Proj5a: Power of Diffusion Models

Diffusion model shenanigans. Using the DeepFloyd IF diffusion model trained by Stability AI. The first stage of this model takes in 64x64 images and produces 64x64 images. I did not upsample the images into 256x256 images using the second stage of the model due to a lack of Google Colab credits :(. The first part goes through the implementation of a diffusion model and its steps, while the second part implements some cool results from some fairly recent papers.

0.1 Seed + Setup

I used SEED=501.

Here are some of the output images after passing the prompt embedding through the stage 1 and stage 2 UNets of the model. It appears that the number of iterative steps the model takes affects the output image pretty drastically, even though the same word embedding was used.

10 steps

20 steps

40 steps

1.1 Forward Function

We need to implement a function that adds noise to an image. This is achieved with this formula:

\[x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \quad \text{where} \quad \epsilon \sim \mathcal{N}(0, 1)\]

We generate the noise $\epsilon$ from a standard normal distribution, which can be done via torch.randn_like, and look up the cumulative product $\bar\alpha_t$ (alphas_cumprod) at timestep $t$. As $t$ increases, so does the amount of noise added to the image.
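Below is a minimal sketch of this forward function, assuming alphas_cumprod is the precomputed tensor of cumulative products $\bar\alpha_t$ from the scheduler (the variable name is my assumption, not necessarily what the starter code calls it):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Add noise to a clean image im (shape [B, C, H, W]) at timestep t.

    alphas_cumprod is assumed to be a 1-D tensor of cumulative alpha
    products indexed by timestep. Returns the noisy image and the noise.
    """
    abar_t = alphas_cumprod[t]                  # \bar{alpha}_t for this timestep
    eps = torch.randn_like(im)                  # epsilon ~ N(0, 1)
    x_t = torch.sqrt(abar_t) * im + torch.sqrt(1 - abar_t) * eps
    return x_t, eps
```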

campanile.jpg

t=250

t=500

t=750

1.2 Classical Denoising

After noise-ifying images, we can train a diffusion model to estimate the denoising process for these noisified images. That way the diffusion model can later "generate" images by converting random noisy images not on the real image manifold into images related to the input fed into the model.

We will first try to implement a classical denoising method via Gaussian blur filtering. In particular, I used torchvision.transforms.functional.gaussian_blur with a kernel size of (7, 7) to blur the noisy images. This passes the noisy image through a low pass filter and therefore gets rid of some of the high frequency noise. However, as seen in the results, the Gaussian blur filter doesn't get rid of all the noise, and it also blurs the original image.
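For reference, the classical baseline is just a call to torchvision's Gaussian blur; the kernel size of (7, 7) matches what I used above, while the sigma value here is only an illustrative choice:

```python
import torchvision.transforms.functional as TF

def blur_denoise(noisy_im, kernel_size=7, sigma=2.0):
    """Classical 'denoising': low-pass filter the noisy image.

    kernel_size=(7, 7) matches the writeup; sigma is an illustrative guess.
    """
    return TF.gaussian_blur(noisy_im, kernel_size=[kernel_size, kernel_size], sigma=sigma)
```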

t=250

t=500

t=750

blur t=250

blur t=500

blur t=750

1.3 One Step Denoising

We can further improve the denoising by using a pretrained diffusion model to estimate the noise in the noisy image and then remove that estimated noise from the same noisy image to get closer to the original image. Since DeepFloyd was trained with text conditioning, we use the first stage UNet conditioned on the prompt "a high quality photo".

In comparison to the Gaussian blur filter, this method of denoising gets rid of all the noise. However, the predicted image still tends to be blurred and loses some of the structure and details that were in the original image.
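Below is a minimal sketch of the one-step estimate, assuming the noise prediction has already been obtained from the stage 1 UNet (for DeepFloyd that means slicing off the predicted-variance channels first) and that alphas_cumprod is the same cumulative-product tensor as in 1.1:

```python
import torch

def estimate_x0(x_t, t, noise_pred, alphas_cumprod):
    """Invert the forward equation: estimate the clean image x_0 from the
    noisy image x_t and the UNet's noise prediction at timestep t."""
    abar_t = alphas_cumprod[t]
    return (x_t - torch.sqrt(1 - abar_t) * noise_pred) / torch.sqrt(abar_t)
```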

t=250

t=500

t=750

one-step t=250

one-step t=500

one-step t=750

1.4 Iterative Denoising

Another method of denoising we can use is iterative denoising, the default denoising method used by diffusion models. It would be tedious and expensive to go through every single timestep, especially if $T$ is very large. Therefore, we iterate through strided_timesteps with a stride of 30. The formula is given below, with $t$ being the current timestep and $t'$ being an earlier timestep such that $t' < t$.

\[x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t} x_t + v_\sigma\] \[\alpha_t = \frac{\bar\alpha_t}{\bar\alpha_{t'}}\] \[\beta_t = 1 - \alpha_t\]

$x_0$ is the clean image estimated at each iterative step by inverting the forward-process formula, with the noise $\epsilon$ set to the noise estimated by the UNet.
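Here is a sketch of a single stride of this update, written directly from the formula above (the $v_\sigma$ term is passed in as-is and defaults to 0 here, which is a simplification):

```python
import torch

def iterative_denoise_step(x_t, t, t_prime, x0_hat, alphas_cumprod, v_sigma=0.0):
    """One stride of iterative denoising: move from timestep t to the earlier
    timestep t_prime using the clean-image estimate x0_hat."""
    abar_t = alphas_cumprod[t]
    abar_tp = alphas_cumprod[t_prime]
    alpha_t = abar_t / abar_tp            # alpha_t = abar_t / abar_{t'}
    beta_t = 1 - alpha_t                  # beta_t = 1 - alpha_t

    x_tp = (torch.sqrt(abar_tp) * beta_t / (1 - abar_t)) * x0_hat \
         + (torch.sqrt(alpha_t) * (1 - abar_tp) / (1 - abar_t)) * x_t \
         + v_sigma
    return x_tp
```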

t=90

t=240

t=390

t=540

t=690

campanile.jpg

iterative

one-step

gaussian blur

1.5 Diffusion Model Sampling

We are going to generate images from scratch by starting the iterative denoising at timestep $T$ (the max timestep) and feeding the model a pure-noise image generated via torch.randn_like, with the word embedding "a high quality photo". Here are some samples I generated using iterative denoising.
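Sampling from scratch is then just the same loop started from pure noise; a rough sketch, where iterative_denoise is an assumed helper wrapping the per-step update above:

```python
import torch

# Start from pure Gaussian noise at the max timestep (i_start=0 in my
# strided_timesteps) and run the full iterative denoising loop.
x_T = torch.randn(1, 3, 64, 64)              # pure noise, 64x64 like stage 1
sample = iterative_denoise(x_T, i_start=0)   # assumed helper, see section 1.4
```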

1.6 Classifier Free Guidance (CFG)

Some of the images generated by iterative denoising look really random or confusing. To fix this, we will use Classifier-Free Guidance, which combines a conditional and an unconditional noise estimate into a new noise estimate.

\[\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)\]

For these images, I used "a high quality photo" for the UNet embedding that would estimate conditional noise and a null prompt of "" as the unconditional noise. Furthermore, I used $\gamma=7$ when calculating the overall noise estimate.
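In code, the CFG estimate looks something like the sketch below, where unet_noise is a placeholder for "run the stage 1 UNet and keep only the predicted-noise channels":

```python
def cfg_noise(unet_noise, x_t, t, cond_embed, uncond_embed, gamma=7):
    """Classifier-free guidance: extrapolate past the conditional estimate.

    unet_noise is a placeholder callable for the stage 1 UNet noise query.
    """
    eps_cond = unet_noise(x_t, t, cond_embed)      # "a high quality photo"
    eps_uncond = unet_noise(x_t, t, uncond_embed)  # null prompt ""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```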

1.7 Image to Image Translation

Instead of passing in a randomly generated image, we will pass in a noise-ified version of the original image (using forward(img, t)) at different timesteps, in order to get the diffusion model to output something similar to the original image we noise-ified.

Side note: I used a strided_timesteps array that ranged from [990, 0] with a stride of 30. When i_start=0, t=990, which is the timestep at which forward(img, t) returns the noisiest version of the original image.
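In code this amounts to something like the following sketch, where forward and iterative_denoise_cfg stand in for the routines described earlier (their exact signatures are my assumption):

```python
# SDEdit-style translation: noise the original to strided_timesteps[i_start],
# then run CFG iterative denoising from that point onward.
i_start = 10
t = strided_timesteps[i_start]
x_t, _ = forward(original_im, t, alphas_cumprod)   # forward() from section 1.1
out = iterative_denoise_cfg(x_t, i_start=i_start)  # CFG loop from section 1.6
```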

campanile.jpg

i_start=0

i_start=3

i_start=5

i_start=7

i_start=10

i_start=20

nyc.jpg

i_start=0

i_start=3

i_start=5

i_start=7

i_start=10

i_start=20

sf.jpg

i_start=0

i_start=3

i_start=5

i_start=7

i_start=10

i_start=20

1.7.1 Hand Drawn and Web Images

Let’s see if CFG with DeepFloyd runs well on hand drawn images and images taken from the web!

web: jinx.jpg

i_start=0

i_start=3

i_start=5

i_start=7

i_start=10

i_start=20

jinx.jpg

hand drawn: pikachu?

i_start=0

i_start=3

i_start=5

i_start=7

i_start=10

i_start=20

pikachu.jpg

hand drawn: ditto?

i_start=0

i_start=3

i_start=5

i_start=7

i_start=10

i_start=20

ditto.jpg

1.7.2 Inpainting

We can use a mask so that the diffusion model only generates within the masked area: after each denoising step, every pixel outside the mask is replaced with the appropriately noised original image from the forward process.

\[x_t = \textbf{m} x_t + (1 - \textbf{m})\text{forward}(x_{orig}, t)\]
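A sketch of this masking step, reusing the forward function sketched in 1.1 (the argument names are mine):

```python
def inpaint_step(x_t, t, m, x_orig, alphas_cumprod):
    """Keep the generated content only inside the mask m (1 = regenerate,
    0 = keep); outside it, paste back the re-noised original image."""
    noised_orig, _ = forward(x_orig, t, alphas_cumprod)   # forward() from section 1.1
    return m * x_t + (1 - m) * noised_orig
```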

campanile.jpg

campanile.jpg

mask

to replace

inpainted

nyc

nyc.jpg

mask

to replace

inpainted

sh

sh.jpg

mask

to replace

inpainted

1.7.3 Text Conditional Image to Image Translation

We are going to run the image translation again, but we'll replace the generic embedding "a high quality photo" with a specific prompt. The generated images will look more like either the prompt or the original image passed into the model, depending on how noisy the initial forwarding process is.

"a rocket ship" $\longrightarrow$ campanile.jpg

i_start=0

i_start=3

i_start=5

i_start=7

i_start=10

i_start=20

campanile.jpg

"a lithograph of waterfalls" $\longrightarrow$ nyc.jpg

i_start=0

i_start=3

i_start=5

i_start=7

i_start=10

i_start=20

nyc.jpg

"an oil painting of a snowy mountain village" $\longrightarrow$ sf.jpg

i_start=0

i_start=3

i_start=5

i_start=7

i_start=10

i_start=20

sf.jpg

1.8 Visual Anagrams

We can create optical illusions with diffusion models by using the Visual Anagrams algorithm presented by this paper. Basically, we take two prompts, generate a CFG noise estimate for each, and then average the two noise estimates. However, one of the estimates must be computed on a flipped copy of the image and then flipped back, so that the second prompt's content appears when the final image is flipped. For this project, I just flipped along the x-axis (index 2 of the tensor) using torch.flip.

\[\epsilon_1 = \text{UNet}(x_t, t, p_1)\] \[\epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))\] \[\epsilon = (\epsilon_1 + \epsilon_2) / 2\]
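A sketch of the combined noise estimate, with unet_noise again standing in for a CFG noise query of the stage 1 UNet:

```python
import torch

def anagram_noise(unet_noise, x_t, t, p1_embed, p2_embed):
    """Visual anagram noise: one prompt on the upright image, the other on the
    flipped image (flipped back afterwards), then average the two estimates."""
    eps_1 = unet_noise(x_t, t, p1_embed)
    eps_2 = torch.flip(unet_noise(torch.flip(x_t, dims=[2]), t, p2_embed), dims=[2])
    return (eps_1 + eps_2) / 2
```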

Here are some examples:

old man

campfire

rocket ship

snowy mountain village

dog

waterfall

1.9 Hybrid Images

We can also create hybrid images by calculating the CFG noise estimates for two prompts and then combining the low frequencies of one estimate with the high frequencies of the other, as demonstrated in this paper on Factorized Diffusion.

\[\epsilon_1 = \text{UNet}(x_t, t, p_1)\] \[\epsilon_2 = \text{UNet}(x_t, t, p_2)\] \[\epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2)\]
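A sketch of the combined estimate, using a Gaussian blur as the low-pass filter and its residual as the high-pass filter; the kernel size and sigma here are illustrative choices, not necessarily the ones I used:

```python
import torchvision.transforms.functional as TF

def hybrid_noise(unet_noise, x_t, t, p_low_embed, p_high_embed, kernel=33, sigma=2.0):
    """Hybrid noise: low frequencies from one prompt's estimate, high
    frequencies from the other's."""
    eps_1 = unet_noise(x_t, t, p_low_embed)
    eps_2 = unet_noise(x_t, t, p_high_embed)
    low = TF.gaussian_blur(eps_1, kernel_size=[kernel, kernel], sigma=sigma)
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=[kernel, kernel], sigma=sigma)
    return low + high
```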

Here are some examples:

skull + waterfall

yin and yang + flowers

panda + sunset

proj5a reflection

I really enjoyed this project as it was my first time using a diffusion model. It was fun creating the hybrid and anagram images. I learned a lot about how diffusion models work, and hopefully I can do a deeper dive into diffusion models in 5b.

Proj5b: Creating a Diffusion Model

Part 1: Unconditional UNet

Modern diffusion models use the UNet architecture. Below is how the UNet architecture is structured.

I used the Unconditional UNet to train a denoiser on the MNIST dataset with batch_size=256 over 5 epochs. The UNet had a hidden dimension of D=128, and we optimized the MSE loss function using the Adam optimizer with a learning rate of 1e-4. Furthermore, the denoiser was trained on images with sigma=0.5 noise applied to them.
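Here is a condensed sketch of that training loop; UnconditionalUNet and train_loader are placeholders for the actual model class and MNIST data loader:

```python
import torch
import torch.nn as nn

model = UnconditionalUNet(in_channels=1, num_hiddens=128).cuda()   # placeholder class
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()
sigma = 0.5

for epoch in range(5):
    for x, _ in train_loader:                  # MNIST labels are unused here
        x = x.cuda()
        z = x + sigma * torch.randn_like(x)    # noisy input
        loss = criterion(model(z), x)          # predict the clean image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```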

Here’s the training loss log-scaled graph.

And here are some sample outputs of the model after the first and fifth epoch.

After Epoch 1

After Epoch 5

Let’s also see how well the denoiser trained on sigma=0.5 works on images noised with other values of $\sigma$.

The results are okay, but they could definitely look much better, especially when the input image has a lot of noise added to it.

Part 2: Time-Conditioned UNet

In order to create a time-conditioned UNet, we have to add some fully connected blocks to the unconditional UNet so that the timestep can affect certain stages of the UNet and produce a time-conditioned result.

Here’s the Time Conditioned UNet structure.

I used the Time-Conditioned UNet to train a noise estimator on the MNIST dataset with batch_size=128 over 20 epochs. The UNet had a hidden dimension of D=64, and we optimized the MSE loss function using the Adam optimizer with an initial learning rate of 1e-3 that decayed after each epoch.
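Here is a condensed sketch of one training epoch; TimeConditionalUNet, train_loader, the beta schedule, and the t/T normalization are all assumptions standing in for the actual implementation details:

```python
import torch
import torch.nn as nn

T = 300
betas = torch.linspace(1e-4, 0.02, T)                    # illustrative DDPM schedule
alphas_cumprod = torch.cumprod(1 - betas, dim=0).cuda()

model = TimeConditionalUNet(in_channels=1, num_hiddens=64).cuda()   # placeholder class
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.1 ** (1 / 20))
criterion = nn.MSELoss()

for epoch in range(20):
    for x, _ in train_loader:
        x = x.cuda()
        t = torch.randint(0, T, (x.shape[0],), device=x.device)  # random timestep per image
        abar = alphas_cumprod[t].view(-1, 1, 1, 1)                # per-example cumulative alpha
        eps = torch.randn_like(x)
        x_t = torch.sqrt(abar) * x + torch.sqrt(1 - abar) * eps   # forward process from 5a
        loss = criterion(model(x_t, t.float() / T), eps)          # predict the added noise
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                                              # decay the LR after each epoch
```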

Here’s the training loss log-scaled graph.

And here are some sample outputs by running the model on random noise after the fifth and twentieth epoch.

After Epoch 5

After Epoch 20

The generated hand-written digits don't look bad, but they could definitely look better. Furthermore, which digits get generated is essentially random, since the model is only conditioned on the timestep. We can improve these results by using a Class-Conditioned UNet.

Part 3: Class Conditioned UNet

This time, instead of only passing a timestep scalar into the FCBlocks, we will also pass class labels into the FCBlocks. The block produced from the class labels is then multiplied element-wise into the affected block (i.e. Unflatten) rather than added to it like the timestep parameter. This ensures that only a particular class label can generate a certain result. Also, note that we have to pass a one-hot encoding of each class label into the FCBlock, because we are plugging categorical data into the neural network and it needs to be interpreted as numbers.
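Here is a rough sketch of how the conditioning might be wired in at one stage of the UNet; the FCBlock structure and shapes are my guesses, not the exact starter-code API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCBlock(nn.Module):
    """Small MLP that maps a conditioning vector to per-channel values."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, v):
        return self.net(v)

class_fc = FCBlock(in_dim=10, out_dim=64)   # 10 MNIST classes, D=64 channels
time_fc = FCBlock(in_dim=1, out_dim=64)

def condition(h, c_labels, t_norm):
    """Inject conditions into a feature map h of shape (B, 64, H, W).

    c_labels: (B,) integer class labels; t_norm: (B,) timesteps scaled to [0, 1].
    The class embedding multiplies the feature map, the time embedding is added.
    """
    c1 = class_fc(F.one_hot(c_labels, num_classes=10).float())  # one-hot -> embedding
    t1 = time_fc(t_norm.unsqueeze(1))
    return c1.view(-1, 64, 1, 1) * h + t1.view(-1, 64, 1, 1)
```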

Here’s the Class Conditioned UNet structure.

I used the Class-Conditioned UNet to train a noise estimator on the MNIST dataset with batch_size=128 over 20 epochs. The UNet had a hidden dimension of D=64, and we optimized the MSE loss function using the Adam optimizer with an initial learning rate of 1e-3 that decayed after each epoch. The only difference this time is that I also pass the training label of each image into the model along with the timestep parameter.

Here are some sample outputs by running the model on random noise after the fifth and twentieth epoch.

After Epoch 5

After Epoch 20

We can see that by the 5th epoch, the results generated from random noise already look much better than the results generated by the Time-Conditioned UNet by the 20th epoch.

proj5b: reflection

This project was pretty fun, as it was one of my first hands-on experiences with PyTorch model training. I learned a lot about how neural networks work, and in particular about the UNet structure and diffusion models.
