The Path to Understanding Image Generation and Stable Diffusion
The rise of image generation came to my attention in 2022, when stability.ai released Stable Diffusion and it generated stunning images. I had known about image generation for a long time, starting from variational autoencoders and generative adversarial networks (GANs), and the breakthrough of StyleGAN2, which powered thispersondoesnotexist.com (you should visit that).
I learned GANs using Keras and a book called GANs in Action. I remember feeding random noise into a generator network to produce handwritten digits from the MNIST dataset. It worked after hours of training the two networks. In another experiment, I trained GANs on my laptop to produce an outer box line around the digits. That worked too, and I stopped there and never came back to it.
Fast forward to now: Stable Diffusion and SDXL have gained a lot of attention from the community, and Midjourney has been producing incredible output. I tried generating images with Stable Diffusion in 2023, when a platform and API appeared that was handy to use and, ultimately, free. Replicate provides an SDXL model and hosts the inference so you can play with it. I tried several prompts and configurations and looked at the output. The images were no longer a surprise, though, because I had already seen plenty on Twitter and a couple from friends. I tried to understand how prompts are written, describing the object, background, color, and so on, which helps produce the image I expect.
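For illustration, here is a minimal sketch of calling a hosted SDXL model through Replicate's Python client. The model identifier and the input parameters (prompt, width, guidance_scale, etc.) are my assumptions about how a Replicate-hosted SDXL model is typically invoked, so check the model page for the exact version string and accepted inputs.

```python
# pip install replicate; set REPLICATE_API_TOKEN in your environment.
# Assumed model identifier and inputs -- check the model page on Replicate.
import replicate

output = replicate.run(
    "stability-ai/sdxl",  # hosted SDXL model (a version hash may be required)
    input={
        "prompt": "a watercolor painting of a red fox in a snowy forest",
        "negative_prompt": "blurry, low quality",
        "width": 1024,
        "height": 1024,
        "num_inference_steps": 30,  # more steps: slower, often sharper
        "guidance_scale": 7.5,      # how strongly the image follows the prompt
    },
)
print(output)  # typically a list of URLs to the generated images
```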
In this article, I list several concepts used in Stable Diffusion, drawing on the paper and on explanations from articles and blog posts. I hope the resources listed below help you understand the concepts and get hands-on with building it.
Variational Autoencoder
A Variational Autoencoder (VAE) is a type of generative model that uses a probabilistic approach to map the input into a latent space, which reduces the dimensionality of the input. A sample from the latent space is then decoded to produce output similar to the training data.
https://mbernste.github.io/posts/vae/
https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73
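To make the encode-sample-decode idea concrete, here is a minimal PyTorch sketch of a VAE for flattened 28x28 images. The layer sizes and the 784/16 dimensions are my own assumptions for illustration, not from any particular paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(256, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # Reconstruction term plus KL divergence pulling q(z|x) toward N(0, I)
    recon_loss = F.binary_cross_entropy(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl
```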
Generative Adversarial Network
A Generative Adversarial Network (GAN) is a type of deep learning architecture that consists of two models, a generator and a discriminator. The generator, as its name suggests, generates new samples that follow the learned training distribution. The discriminator learns to distinguish generated samples from real ones. Read more
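Here is a minimal PyTorch sketch of one adversarial training step. The network shapes (a 64-dimensional noise vector, 784-dimensional MNIST-like samples) are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Generator maps noise to a flattened image; discriminator scores real vs. fake
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real):  # real: (batch, 784) tensor of real samples in [-1, 1]
    batch = real.size(0)
    fake = G(torch.randn(batch, 64))

    # Discriminator: label real samples 1, generated samples 0
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()

    # Generator: try to fool the discriminator into predicting 1 on fakes
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
```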
Stable Diffusion
Finally, Stable Diffusion models are flexible models that can perform multiple tasks, such as text-to-image, image inpainting, and super-resolution. In this post, I briefly explain the text-to-image part, which combines a text encoder and an image decoder in an interesting way. A latent text-to-image diffusion model consists of multiple models working together: a text encoder, which uses a frozen CLIP ViT-L/14; an image information creator, which pairs a UNet architecture with a scheduling algorithm; and an image decoder. The image information creator takes the text encoder's output as conditioning and produces a latent representation that is fed into the image decoder to yield the final image. Read more
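As a hands-on sketch, the Hugging Face diffusers library exposes these same components behind one pipeline. The checkpoint id and parameter values below are assumptions for illustration, so substitute whichever Stable Diffusion checkpoint you want to use.

```python
# pip install diffusers transformers accelerate
import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint id; any Stable Diffusion v1.x checkpoint should work
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The pipeline bundles the pieces described above:
#   pipe.text_encoder -> frozen CLIP text encoder
#   pipe.unet         -> image information creator (UNet)
#   pipe.scheduler    -> algorithm scheduling the denoising steps
#   pipe.vae          -> image decoder (latents -> pixels)
image = pipe(
    "a cabin in a pine forest at sunrise, oil painting",
    num_inference_steps=30,  # denoising steps run by the UNet
    guidance_scale=7.5,      # classifier-free guidance strength
).images[0]
image.save("output.png")
```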