Diffusion model shenanigans. This project uses the DeepFloyd IF diffusion model trained by Stability AI. Its first stage takes in `64x64` images and produces `64x64` images. I did not upsample the images to `256x256` using the second stage of the model because I ran out of Google Colab credits :(. The first part goes through the implementation of a diffusion model's sampling steps, while the second part implements some cool results from some fairly recent papers.
I used `SEED=501`.
Here are some of the output images after passing the prompts through the stage 1 and stage 2 UNets of the model. The number of iterative steps the model takes affects the output image pretty drastically, even though the same prompt embedding was used.
10 steps
20 steps
40 steps
We need to implement a function that adds noise to an image. This is achieved with this formula:
\[x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \quad \text{where} \quad \epsilon \sim N(0, 1)\]
Here $\epsilon$ is noise sampled from a standard normal distribution, which can be generated via `torch.randn_like`, and $\bar\alpha_t$ is the `alpha_cumprod` value at timestep $t$. As $t$ increases, the amount of noise added to the image increases.
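A minimal sketch of this forward process is below. The linear beta schedule used to build `alphas_cumprod` and the random stand-in image are assumptions for illustration only; in the project the values come from the pretrained model's scheduler.

```python
import torch

def forward(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Sample x_t from a clean image x0 using the closed-form forward process."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # epsilon ~ N(0, 1), same shape as x0
    return torch.sqrt(abar_t) * x0 + torch.sqrt(1.0 - abar_t) * eps

# Hypothetical linear beta schedule, only so there is an alphas_cumprod to index.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

torch.manual_seed(501)
x0 = torch.rand(1, 3, 64, 64) * 2 - 1  # stand-in image with values in [-1, 1]
x_250 = forward(x0, 250, alphas_cumprod)
x_750 = forward(x0, 750, alphas_cumprod)
```

Because $\bar\alpha_t$ shrinks as $t$ grows, `x_750` keeps much less of the original image than `x_250`.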
campanile.jpg | t=250 | t=500 | t=750
After noise-ifying images, we can train a diffusion model to estimate the denoising process for these noisy images. That way the diffusion model can later "generate" images by converting random noisy images that are not on the real image manifold into images related to the input fed into the model.
We will first try a classical denoising method, Gaussian blur filtering. In particular, I used `torchvision.transforms.functional.gaussian_blur` with a kernel size of `(7, 7)` to blur the noisy images. This passes the noisy image through a low-pass filter and therefore gets rid of some of the high-frequency noise. However, as the results show, the Gaussian blur filter doesn't get rid of all the noise, and it also blurs the original image.
t=250 | t=500 | t=750
blur t=250 | blur t=500 | blur t=750
We can further improve the denoising by using a pretrained diffusion model to estimate the noise in the noisy image and then removing that estimated noise from the same noisy image to get closer to the original image. Since DeepFloyd was trained with text conditioning, we run the first stage UNet conditioned on the prompt `"a high quality photo"`.
Compared to the Gaussian blur filter, this method of denoising removes the visible noise. However, the predicted image still tends to be blurred and loses some of the structure and details that were in the original image.
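One-step denoising amounts to solving the forward-process formula for $x_0$. The sketch below assumes a hypothetical `abar_t` value and uses the true noise in place of a UNet estimate, just to show that the inversion is exact when the noise estimate is perfect.

```python
import torch

def estimate_x0(x_t: torch.Tensor, eps_hat: torch.Tensor, abar_t: torch.Tensor) -> torch.Tensor:
    """Invert x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps for x0, given a noise estimate."""
    return (x_t - torch.sqrt(1.0 - abar_t) * eps_hat) / torch.sqrt(abar_t)

torch.manual_seed(501)
abar_t = torch.tensor(0.5)  # hypothetical alpha_cumprod value at timestep t
x0 = torch.rand(1, 3, 64, 64)
eps = torch.randn_like(x0)
x_t = torch.sqrt(abar_t) * x0 + torch.sqrt(1.0 - abar_t) * eps

# With the true noise, the clean image is recovered up to floating-point error.
# A real UNet estimate is imperfect, so the recovered image is only approximate.
x0_hat = estimate_x0(x_t, eps, abar_t)
```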
t=250 | t=500 | t=750
one-step t=250 | one-step t=500 | one-step t=750
Another method of denoising we can use is iterative denoising, the default denoising method used by diffusion models. It would be tedious and expensive to go through every timestep, especially if $T$ is very large. Therefore, we iterate through `strided_timesteps` with `stride=30`. Each update moves from the current timestep $t$ to an earlier timestep $t'$ such that $t' < t$.
At each iterative step, $x_0$ is the current estimate of the clean image, obtained by solving the forward-process formula for $x_0$ with $\epsilon$ set to the noise estimated by the UNet.
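The update formula itself is not shown in this write-up. A common form for strided DDPM sampling, assumed here, is

\[x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\,x_t + v_\sigma, \quad \alpha_t = \frac{\bar\alpha_t}{\bar\alpha_{t'}}, \quad \beta_t = 1 - \alpha_t\]

where $v_\sigma$ is a small random noise term. The sketch below implements the deterministic part of that assumed update; the `abar` values are hypothetical and the true noise stands in for the UNet output.

```python
import torch

def denoise_step(x_t, eps_hat, abar_t, abar_prev):
    """One strided update from timestep t to an earlier t' (noise term v_sigma omitted)."""
    alpha = abar_t / abar_prev
    beta = 1.0 - alpha
    # Clean-image estimate from the forward-process formula.
    x0_hat = (x_t - torch.sqrt(1.0 - abar_t) * eps_hat) / torch.sqrt(abar_t)
    coef_x0 = torch.sqrt(abar_prev) * beta / (1.0 - abar_t)
    coef_xt = torch.sqrt(alpha) * (1.0 - abar_prev) / (1.0 - abar_t)
    return coef_x0 * x0_hat + coef_xt * x_t

torch.manual_seed(501)
abar_t, abar_prev = torch.tensor(0.3), torch.tensor(0.6)  # hypothetical values
x0 = torch.rand(1, 3, 64, 64)
eps = torch.randn_like(x0)
x_t = torch.sqrt(abar_t) * x0 + torch.sqrt(1.0 - abar_t) * eps

x_prev = denoise_step(x_t, eps, abar_t, abar_prev)
```

Two sanity checks on this form: if $\bar\alpha_{t'} = 1$ (a fully clean target) the update returns the $x_0$ estimate, and if $t' = t$ it returns $x_t$ unchanged.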
t=90 | t=240 | t=390 | t=540 | t=690
campanile.jpg | iterative | one-step | gaussian blur
We are going to generate images from scratch by starting the iterative denoising at timestep $T$ (the max timestep) and feeding the model a random noisy image generated via `torch.randn_like`, together with the prompt embedding for `"a high quality photo"`. Here are some samples I generated using iterative denoising.
Some of the images generated by iterative denoising seem really random or confusing. To fix this, we will use Classifier-Free Guidance (CFG), which combines a conditional and an unconditional noise estimate into a new noise estimate.
\[\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)\]
For these images, I used `"a high quality photo"` as the prompt for the conditional noise estimate and a null prompt of `""` for the unconditional noise estimate. Furthermore, I used $\gamma=7$ when calculating the overall noise estimate.
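The CFG combination is a one-liner. In this sketch the two noise estimates are random stand-ins for the UNet outputs under the conditional and null prompts; the helper name `cfg_noise` is hypothetical.

```python
import torch

def cfg_noise(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, gamma: float = 7.0) -> torch.Tensor:
    """Classifier-free guidance: extrapolate from the unconditional toward the conditional estimate."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)

torch.manual_seed(501)
eps_c = torch.randn(1, 3, 64, 64)  # stand-in for the UNet output with the text prompt
eps_u = torch.randn(1, 3, 64, 64)  # stand-in for the UNet output with the null prompt ""
eps = cfg_noise(eps_c, eps_u, gamma=7.0)
```

With $\gamma = 0$ this reduces to the unconditional estimate and with $\gamma = 1$ to the conditional one; $\gamma > 1$ pushes past the conditional estimate.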
Instead of passing in a randomly generated image, we will pass in a noise-ified version of the original image (using `forward(img, t)`) at different timesteps in order to get the diffusion model to output something similar to the original image we noise-ified.
Side note: I used a `strided_timesteps` array that ranged from 990 down to 0 with a `stride=30`. When `i_start=0`, `t=990`, which is the timestep at which `forward(img, t)` returns the noisiest version of the original image.
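For reference, here is a small sketch of how such a schedule maps each `i_start` value to a timestep, assuming the array is built with a plain `range`:

```python
stride = 30
strided_timesteps = list(range(990, -1, -stride))  # 990, 960, ..., 30, 0

# Map each starting index used in the figures to its timestep.
for i_start in (0, 3, 5, 7, 10, 20):
    print(f"i_start={i_start:2d} -> t={strided_timesteps[i_start]}")
```

A larger `i_start` means a smaller starting `t`, so less noise is added and the result stays closer to the original image.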
i_start=0 | i_start=3 | i_start=5 | i_start=7 | i_start=10 | i_start=20
i_start=0 | i_start=3 | i_start=5 | i_start=7 | i_start=10 | i_start=20
i_start=0 | i_start=3 | i_start=5 | i_start=7 | i_start=10 | i_start=20
Let’s see if CFG with DeepFloyd runs well on hand drawn images and images taken from the web!
i_start=0 | i_start=3 | i_start=5 | i_start=7 | i_start=10 | i_start=20 | jinx.jpg
i_start=0 | i_start=3 | i_start=5 | i_start=7 | i_start=10 | i_start=20 | pikachu.jpg
i_start=0 | i_start=3 | i_start=5 | i_start=7 | i_start=10 | i_start=20 | ditto.jpg
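We can also inpaint with a binary mask $\textbf{m}$. After every denoising step, the pixels outside the mask are overwritten with the original image pushed through the forward process to the current timestep, so the diffusion model only generates new content inside the masked area (where $\textbf{m} = 1$).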
\[x_t = \textbf{m} x_t + (1 - \textbf{m})\text{forward}(x_{orig}, t)\]
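A minimal sketch of that masking step is below. The random tensors stand in for the current sample and for `forward(x_orig, t)`, and the square mask region is hypothetical.

```python
import torch

def apply_inpaint_mask(x_t: torch.Tensor, x_orig_noised: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Keep generated pixels where mask == 1 and the noised original everywhere else."""
    return mask * x_t + (1.0 - mask) * x_orig_noised

torch.manual_seed(501)
x_t = torch.randn(1, 3, 64, 64)            # current sample in the denoising loop
x_orig_noised = torch.randn(1, 3, 64, 64)  # stand-in for forward(x_orig, t)
mask = torch.zeros(1, 1, 64, 64)           # broadcasts over the channel dimension
mask[..., 16:48, 16:48] = 1.0              # hypothetical square region to regenerate

x_new = apply_inpaint_mask(x_t, x_orig_noised, mask)
```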
campanile.jpg | mask | to replace | inpainted
nyc.jpg | mask | to replace | inpainted
sh.jpg | mask | to replace | inpainted
We are going to run the image translation again, but we'll replace the generic prompt `"a high quality photo"` with a specific prompt. The generated images will look more like either the prompt or the original image passed into the model, depending on how noisy the initial forwarding process is.
`"a rocket ship"` $\longrightarrow$ campanile.jpg
i_start=0 | i_start=3 | i_start=5 | i_start=7 | i_start=10 | i_start=20 | campanile.jpg
`"a lithograph of waterfalls"` $\longrightarrow$ nyc.jpg
i_start=0 | i_start=3 | i_start=5 | i_start=7 | i_start=10 | i_start=20 | nyc.jpg
`"an oil painting of a snowy mountain village"` $\longrightarrow$ sf.jpg
i_start=0 | i_start=3 | i_start=5 | i_start=7 | i_start=10 | i_start=20 | sf.jpg
We can create optical illusions with diffusion models by using the Visual Anagrams algorithm presented by this paper. Basically, we take two prompts, compute a CFG noise estimate for each, and then combine (average) the two noise estimates. For the second prompt, the noisy image is flipped before it goes through the UNet and the resulting noise estimate is flipped back, so that the second prompt shows up when the final image is flipped. For this project, I just flipped along the x-axis (index 2 of the tensor) using `torch.flip`.
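The flip, estimate, flip-back, and average steps can be sketched as follows. `toy_unet` is a hypothetical stand-in that simply adds a per-prompt pattern, so that the effect of the flips is easy to verify; it is not a real noise estimator.

```python
import torch

def anagram_noise(unet, x_t, t, p1, p2):
    """Average the noise estimate for p1 with the flipped-back estimate for p2."""
    eps1 = unet(x_t, t, p1)
    x_flipped = torch.flip(x_t, dims=[2])                # flip along the height dimension
    eps2 = torch.flip(unet(x_flipped, t, p2), dims=[2])  # flip the estimate back
    return (eps1 + eps2) / 2

def toy_unet(x, t, prompt):
    """Hypothetical stand-in for a CFG noise estimate: adds a prompt-specific pattern."""
    return x + prompt

torch.manual_seed(501)
x_t = torch.randn(1, 3, 64, 64)
p1 = torch.randn(1, 3, 64, 64)  # stand-in "prompt" patterns
p2 = torch.randn(1, 3, 64, 64)
eps = anagram_noise(toy_unet, x_t, 0, p1, p2)
```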
Here are some examples:
old man | campfire | rocket ship | snowy mountain village | dog | waterfall
We can also create hybrid images by calculating the CFG noise estimates for two prompts and then combining the low frequencies of one estimate with the high frequencies of the other, as demonstrated in this paper on Factorized Diffusion.
\[\epsilon_1 = \text{UNet}(x_t, t, p_1)\]
\[\epsilon_2 = \text{UNet}(x_t, t, p_2)\]
\[\epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2)\]
Here are some examples:
skull + waterfall | yin and yang + flowers | panda + sunset
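A sketch of the low-pass/high-pass combination used for the hybrid images is below. It assumes a Gaussian blur as the low-pass filter and "input minus blur" as the high-pass filter; the kernel size of 33 and sigma of 2.0 are assumed values, not ones stated in this write-up.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size: int = 33, sigma: float = 2.0) -> torch.Tensor:
    """Normalized 2D Gaussian kernel (entries sum to 1)."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-coords**2 / (2 * sigma**2))
    g = g / g.sum()
    return torch.outer(g, g)

def lowpass(x: torch.Tensor, size: int = 33, sigma: float = 2.0) -> torch.Tensor:
    """Blur each channel independently with a depthwise convolution."""
    c = x.shape[1]
    k = gaussian_kernel(size, sigma).to(device=x.device, dtype=x.dtype)
    k = k.expand(c, 1, size, size).contiguous()
    x_pad = F.pad(x, [size // 2] * 4, mode="reflect")  # keep the output size equal to the input
    return F.conv2d(x_pad, k, groups=c)

def hybrid_noise(eps1: torch.Tensor, eps2: torch.Tensor) -> torch.Tensor:
    """Low frequencies of eps1 plus high frequencies of eps2."""
    return lowpass(eps1) + (eps2 - lowpass(eps2))

torch.manual_seed(501)
eps1 = torch.randn(1, 3, 64, 64)  # stand-ins for the two CFG noise estimates
eps2 = torch.randn(1, 3, 64, 64)
eps = hybrid_noise(eps1, eps2)
```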
I really enjoyed this project as it was my first time using a diffusion model. It was fun creating hybrid and anagram images. I learned a lot about how diffusion models work, and hopefully I can do a deeper dive into diffusion models with 5b.
Modern diffusion models commonly use a UNet architecture. Below is how the UNet architecture is structured.
I used the Unconditional UNet to train a denoiser on the MNIST dataset with `batch_size=256` over 5 epochs. The UNet had a hidden dimension of `D=128`, and we optimized the MSE loss function using the Adam optimizer with a learning rate of `1e-4`. Furthermore, I trained the denoiser on images with noise level `sigma=0.50` applied to them.
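A single training step under this setup might look like the sketch below. It assumes the noisy input is built as `z = x + sigma * eps` with Gaussian `eps`, and it uses a one-layer convolution as a hypothetical stand-in for the UNet and a random tensor in place of an MNIST batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(501)
sigma = 0.5
denoiser = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # hypothetical stand-in for the UNet
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

x = torch.rand(256, 1, 28, 28)       # stand-in for one MNIST batch with values in [0, 1]
z = x + sigma * torch.randn_like(x)  # assumed noising: add Gaussian noise scaled by sigma

loss = loss_fn(denoiser(z), x)       # the denoiser is trained to predict the clean image
optimizer.zero_grad()
loss.backward()
optimizer.step()
```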
Here’s the training loss graph on a log scale.
And here are some sample outputs of the model after the first and fifth epoch.
Let’s also see how well a denoiser trained with `sigma=0.5` works on images with other noise levels $\sigma$.
The results are okay, but they could definitely look much better, especially when the input image has a lot of noise added to it.
In order to create a time-conditioned UNet, we have to add some fully connected blocks to the unconditional UNet so that the timestep can affect some stages of the UNet and produce a time-conditioned result.
Here’s the Time Conditioned UNet structure.
I used the Time-Conditioned UNet to train a noise estimator on the MNIST dataset with `batch_size=128` over 20 epochs. The UNet had a hidden dimension of `D=64`, and we optimized the MSE loss function using the Adam optimizer with an initial learning rate of `1e-3`, which then decreased after each epoch.
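One way to get a learning rate that decreases after each epoch is an exponential decay scheduler. The write-up does not state the decay rule, so the `gamma` below (chosen so the rate falls by a factor of 10 over 20 epochs) is purely an assumption, and the linear layer is a hypothetical stand-in for the UNet.

```python
import torch

num_epochs = 20
model = torch.nn.Linear(1, 1)  # hypothetical stand-in for the UNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
gamma = 0.1 ** (1.0 / num_epochs)  # assumed decay factor per epoch
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

lrs = []
for epoch in range(num_epochs):
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used during this epoch
    loss = model(torch.zeros(1, 1)).sum()        # placeholder for the real training batches
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                             # decay once per epoch
```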
Here’s the training loss graph on a log scale.
And here are some sample outputs by running the model on random noise after the fifth and twentieth epoch.
The generated hand-written digits are not bad looking, but they definitely could look better. Furthermore, we have no control over which digit gets generated; the digit that comes out is random. We can improve these results by using a Class-Conditioned UNet.
This time, instead of only passing a timestep scalar into the FCBlocks, we will also be passing class labels into the FCBlocks. The output of the block that takes the class labels is then multiplied element-wise into the affected block (i.e. Unflatten) rather than added to it like the timestep output. This is to ensure that only a particular class label can generate a certain result. Another thing to note is that we have to pass a one-hot encoding of each class label into the FCBlock, because we are plugging categorical data into the neural network and it needs to be represented as numbers.
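The conditioning described above can be sketched as follows. The `FCBlock` layout (Linear, GELU, Linear), the channel count `2 * D`, and the `7x7` feature-map size are assumptions for illustration; only the one-hot encoding and the multiply-then-add pattern come from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCBlock(nn.Module):
    """Hypothetical fully connected block: Linear -> GELU -> Linear."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

torch.manual_seed(501)
D, num_classes, batch = 64, 10, 4
fc_class = FCBlock(num_classes, 2 * D)
fc_time = FCBlock(1, 2 * D)

labels = torch.tensor([3, 1, 4, 1])
c = F.one_hot(labels, num_classes=num_classes).float()  # (batch, 10) one-hot vectors
t = torch.rand(batch, 1)                                # normalized timesteps in [0, 1)

unflatten = torch.randn(batch, 2 * D, 7, 7)             # stand-in for the Unflatten block output
c1 = fc_class(c)[:, :, None, None]                      # reshape to broadcast over H and W
t1 = fc_time(t)[:, :, None, None]
out = c1 * unflatten + t1                               # class scales, time shifts
```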
Here’s the Class Conditioned UNet structure.
I used the Class-Conditioned UNet to train a noise estimator on the MNIST dataset with `batch_size=128` over 20 epochs. The UNet had a hidden dimension of `D=64`, and we optimized the MSE loss function using the Adam optimizer with an initial learning rate of `1e-3`, which then decreased after each epoch. The only difference this time is that I also pass the training label of each image, along with the timestep, into the model.
Here are some sample outputs by running the model on random noise after the fifth and twentieth epoch.
We can see that by the 5th epoch, the results generated from random noise already look much better than the results generated by the Time-Conditioned UNet at the 20th epoch.
This project was pretty fun, as it was one of my first hands-on experiences with PyTorch model training. I learned a lot about how neural networks work, and in particular about the UNet structure and diffusion models.