Harnessing the Power of AI: A Comprehensive Review of Image Synthesis with a
Focus on Diffusion Models
Bowen Xue
University of Science and Technology of China
xbwcyt@mail.ustc.edu.cn
Abstract
AI image synthesis has leapt from its inception to rapid development within a single decade. We are delighted to see that, with growing computing power and improved model architectures, image generation models have progressed from initially producing images of questionable quality [19] to generating high-resolution, highly realistic images in just a few seconds [32]. In this
article, I will enumerate the pioneering achievements in the
field of image generation and focus on discussing the dif-
fusion model framework used by current mainstream image
generation models. I will also showcase my personal fine-
tuning results and experiences, explore the current short-
comings, and discuss potential improvements for future im-
age generation models.
1. Introduction
Before generative models came into prominence, computers
had already learned to perform efficient image recognition
by leveraging deep learning [21]. Our computers possessed good vision, but at that time they had little artistic capability. As Feynman said, “What I cannot create, I do not understand.” Recognition and generation complement each other: a model internalizes the essence of its data using far fewer parameters than the amount of training data.
If a model clearly understands the essence of data, it can
also generate data. The first to break this deadlock was the
Variational Autoencoder (VAE)[19].
1.1. Variational Autoencoder
From a mathematical perspective, what image generation models actually do is estimate the distribution of real data in image space (Fig. 3).
Variational auto-encoders are a type of auto-encoder (Fig. 1), where an auto-encoder is a neural network used to learn efficient encodings of data [12]. The goal of an auto-encoder is to compress input data through a neural network and then reconstruct data that is as close as possible to the original input. A variational auto-encoder not only learns a compressed representation of the data encoding, but also learns the probabilistic distribution of that encoding, allowing us to sample from the distribution to generate new data points.
Figure 1. Auto-encoder structure
VAE builds upon the traditional autoencoder by adding
Gaussian noise to the encoder’s output, which allows the
decoder to be robust to noise, thereby improving the quality
of the generated results.
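To make this concrete, below is a minimal sketch (in PyTorch, with illustrative layer sizes not taken from any particular paper) of how a VAE encoder predicts a mean and log-variance, adds Gaussian noise via the reparameterization trick, and decodes the sampled code; generating new images only requires drawing from the prior and decoding.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal VAE sketch; the layer sizes (784 -> 32) are illustrative only."""
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # predicts mean and log-variance
        self.dec = nn.Linear(z_dim, x_dim)       # reconstructs the input

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        eps = torch.randn_like(mu)               # Gaussian noise
        z = mu + eps * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(z), mu, logvar

    def sample(self, n):
        # Generate new data: draw z from the prior N(0, I) and decode it.
        z = torch.randn(n, self.dec.in_features)
        return self.dec(z)
```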
VAE is still an unsupervised method, making it difficult to control the output of the decoder. Conditional VAE (CVAE) introduces conditional variables on top of the VAE, turning it into a supervised learning method [36]. In CVAE,
the operations of the encoder and decoder depend not only
on the input data but also on additional conditional vari-
ables, allowing us to make the decoder generate specific
images based on different conditional variables.
The VAE accomplishes the task of generating images [5, 40]; however, a VAE does not so much learn how to produce a realistic image as learn to generate images as similar as possible to those in the training set. Partly due to this limitation, Generative Adversarial Nets (GANs) [9] were proposed.
1.2. Generative Adversarial Nets
GAN views the training process as a game between two
separate networks: a generator network and a discrimina-
tor network that tries to classify samples. Whenever the
discriminator notices a difference between the generator’s model distribution and the real distribution, the generator slightly adjusts its parameters to make this difference disappear, until finally the generator accurately reproduces the real data distribution and the discriminator guesses randomly, unable to discern any difference (Fig. 2).
Figure 2. GAN structure
We can naturally use the decoder of a VAE as the gen-
erator for a GAN, while using a CNN as the discriminator.
In GANs, the training of the generator and discriminator is
a dynamic game-theoretic process. If there is a mismatch
in capabilities between the generator and discriminator, it
may lead to training instability, resulting in mode collapse
or non-convergence. Therefore, it is crucial to choose ap-
propriate structures for both the generator and discrimina-
tor.
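As a rough illustration of this dynamic game, here is a minimal sketch of one adversarial training step; the generator G, discriminator D, and their optimizers are hypothetical placeholders assumed to be defined elsewhere, and D is assumed to output raw logits.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z_dim=128):
    """One step of the adversarial game; G, D, and their optimizers are assumed given."""
    z = torch.randn(real.size(0), z_dim)

    # Discriminator update: push D(real) toward 1 and D(G(z)) toward 0.
    fake = G(z).detach()
    d_real, d_fake = D(real), D(fake)
    loss_D = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator update: push D(G(z)) toward 1 (non-saturating loss).
    d_fake = D(G(z))
    loss_G = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```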
DCGAN[28] introduces deconvolution layers into the
generator, where the deconvolution (transposed convolu-
tion) layers gradually enlarge the feature maps to produce
clearer images. PGGAN[16] progressively increases the
complexity of the generator and discriminator during the
iteration process. It begins training at low resolution and
gradually adds layers to generate high-resolution images.
In the original GAN, Jensen-Shannon divergence (JS di-
vergence) is chosen to measure the difference between the
real data distribution and the generated data distribution.
However, JS divergence is not a good measure when there
is little overlap between the distributions, as it can lead to
an overly powerful discriminator, causing the generator’s
gradients to vanish and affecting the stability of GAN train-
ing. WGAN[2] proposes the use of the Wasserstein distance
(also known as the Earth-Mover distance) as an alternative
to JS divergence. Wasserstein distance provides a smoother
gradient, even when there is minimal overlap between the
generated and real data distributions.
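For comparison with the cross-entropy objective above, a minimal sketch of the WGAN critic update might look as follows, with weight clipping standing in for the Lipschitz constraint as in the original WGAN paper; again, G, the critic D, and its optimizer are assumed given.

```python
import torch

def wgan_critic_step(G, D, opt_D, real, z_dim=128, clip=0.01):
    """One critic update in WGAN; G, the critic D, and opt_D are assumed given."""
    z = torch.randn(real.size(0), z_dim)
    fake = G(z).detach()
    # The critic maximizes E[D(real)] - E[D(fake)], an estimate of the
    # Wasserstein (Earth-Mover) distance between the two distributions.
    loss = -(D(real).mean() - D(fake).mean())
    opt_D.zero_grad(); loss.backward(); opt_D.step()
    for p in D.parameters():
        p.data.clamp_(-clip, clip)   # weight clipping enforces the Lipschitz constraint
    return -loss.item()              # estimated Wasserstein distance
```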
Just like the core idea of the previously mentioned
CVAE, we can also introduce conditional variables into
GANs, allowing GANs to benefit from human-annotated
samples. This is the idea behind CGAN [8].
Although the architecture of GANs has been continu-
ously upgraded to generate high-quality images, their inherent model structure makes them difficult to train [10, 24].
Figure 3. The image illustrates the mathematical concepts of an image generation model.
The
dynamic balance between the generator and the discrimi-
nator is hard to maintain, and the adversarial training in-
volving two networks makes it challenging for the model
to reach a state of convergence. The training losses of the
generator and discriminator are not traditional signals that
can clearly indicate optimization progress, making it diffi-
cult to determine the criteria for monitoring and terminating
the training process. This directly led to the emergence of
diffusion models.
2. Related Work
The diffusion model [13] abandons the innovative structure of GANs, the discriminator, and focuses purely on the generative model. The diffusion model learns to predict the noise in images through a process of continually adding noise. The process of generating images with a diffusion model, which is also a denoising process, is akin to the famous quote by the sculptor Michelangelo: “The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material.”
2.1. Diffusion model
Diffusion models[13, 23, 35, 38] are primarily inspired by
the diffusion processes in physics. In the context of gener-
ative models, diffusion models simulate a reverse process:
starting from a disordered random noise state, they gradually “denoise” it (Fig. 4) to generate structured data. This process is achieved through many iterative steps that progressively remove noise, until the initial noisy state transitions into a clear data output [6].
The core mathematical algorithm of the diffusion model
is very concise. We train the model to acquire the ability to
denoise images at any stage. During the sampling process,
we can start with initial noise and continuously denoise in
cycles until a satisfactory image is generated. This gives
the diffusion model excellent interpretability, allowing us
to adjust or stop at any step of the denoising process.
The performance of the diffusion model is quite impressive; however, for all generative models, high resolution remains a bottleneck.
Figure 4. Denoiser structure
Algorithm 1 Training
1: repeat
2:   $x_0 \sim q(x_0)$
3:   $t \sim \mathrm{Uniform}(\{1, \dots, T\})$
4:   $\epsilon \sim \mathcal{N}(0, I)$
5:   Take a gradient descent step on $\nabla_\theta \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,\ t\right) \right\|^2$
6: until converged

Algorithm 2 Sampling
1: $x_T \sim \mathcal{N}(0, I)$
2: for $t = T, \dots, 1$ do
3:   $z \sim \mathcal{N}(0, I)$ if $t > 1$, else $z = 0$
4:   $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z$
5: end for
6: return $x_0$
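The two algorithms above translate almost line for line into code. The following is a minimal PyTorch sketch, assuming a hypothetical noise-prediction network eps_model(x, t) and precomputed schedules alpha, alpha_bar, and sigma; it is illustrative rather than a faithful reproduction of any specific implementation.

```python
import torch

# Assumed given: eps_model(x, t), a network that predicts the added noise, and
# 1-D tensors alpha (alpha_t) and alpha_bar (their cumulative product), length T.

def ddpm_train_step(eps_model, x0, alpha_bar, T):
    t = torch.randint(1, T + 1, (x0.size(0),))            # random timestep per sample
    eps = torch.randn_like(x0)
    ab = alpha_bar[t - 1].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps           # forward (noising) process
    return ((eps - eps_model(x_t, t)) ** 2).mean()         # simple noise-prediction loss

@torch.no_grad()
def ddpm_sample(eps_model, shape, alpha, alpha_bar, sigma, T):
    x = torch.randn(shape)                                 # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        a, ab = alpha[t - 1], alpha_bar[t - 1]
        eps = eps_model(x, torch.full((shape[0],), t))
        x = (x - (1 - a) / (1 - ab).sqrt() * eps) / a.sqrt() + sigma[t - 1] * z
    return x                                               # x_0
```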
High resolution is a bottleneck because an image represented in RGB with a width of m pixels and a height of n pixels has a data shape of m × n × 3. For a high-resolution image, the amount of information it contains is enormous, making it difficult to process such data directly.
Although advanced sampling strategies[20, 34, 37] and
hierarchical methods[14, 41] can accelerate inference speed
in pixel space, training on high-resolution image data al-
ways requires computationally expensive gradients.
2.2. Latent Diffusion model
For this reason, Latent Diffusion Models[31] have been pro-
posed. Unlike traditional diffusion models, which operate
directly in high-dimensional data spaces, Latent Diffusion
Models first encode data into a lower-dimensional latent
space. Within this latent space, the model can learn and sim-
ulate data distributions more effectively, as the latent space
provides a more abstract and compressed representation of
the data. This typically results in the generation of higher-
quality samples.
The latent diffusion model primarily trains an encoder
that encodes from pixel space into latent space, a denoiser
that performs noise reduction in latent space, and a decoder
that restores from latent space back to pixel space.
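A rough sketch of how these three components fit together is shown below; encoder, decoder, and the latent sampling routine are assumed to exist (for example, the DDPM sampler sketched earlier, applied to latents), and the latent shape follows SD 1.5's 8x-downsampled, 4-channel layout.

```python
import torch

# Sketch of the three latent-diffusion stages. `encoder`, `decoder`, and a latent
# sampler are assumed given. In SD 1.5, a 512x512x3 image corresponds to a
# 4x64x64 latent, so the denoiser handles roughly 48x less data than raw pixels.

@torch.no_grad()
def ldm_generate(decoder, sample_latents, latent_shape=(1, 4, 64, 64)):
    z = sample_latents(latent_shape)      # diffusion runs entirely in latent space
    return decoder(z)                     # decode the clean latent back to pixels

def ldm_train_loss(encoder, latent_train_step, images):
    z0 = encoder(images)                  # compress images before any diffusion step
    return latent_train_step(z0)          # e.g. the DDPM loss above, but on latents
```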
Figure 5. This is the framework used by Stable Diffusion 1.5.
2.3. VQ-VAE
We often use VQ-VAE[42] as a bridge between pixel space
and latent space. VQ-VAE is a special type of VAE that
incorporates vector quantization (VQ) into the traditional
VAE framework. Vector quantization is a technique that
maps continuous input vectors to the nearest vector (known
as codebook vectors) in a finite set of vectors. In VQ-VAE,
this technique is used to discretize the latent space, mean-
ing the continuous representations output by the encoder are
mapped to the nearest vector in a predefined, fixed-size set
of codebook vectors.
Discrete latent spaces provide more stable representa-
tions, which are beneficial for generative models to learn
and understand the high-level structures of data [30]. In con-
trast, the continuous latent spaces of traditional VAEs may
lead to unstable generation quality due to the ”holes” prob-
lem in the latent space (i.e., some regions of the latent space
do not correspond to valid data).
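The quantization step itself is simple: each continuous encoding is replaced by its nearest codebook entry, with a straight-through estimator letting gradients flow back to the encoder. Below is a minimal sketch, not tied to any particular VQ-VAE implementation.

```python
import torch

def vector_quantize(z_e, codebook):
    """Map each continuous encoder output to its nearest codebook vector.

    z_e:      (N, D) continuous encoder outputs
    codebook: (K, D) learned embedding vectors
    """
    dists = torch.cdist(z_e, codebook)          # distances to every codebook entry
    idx = dists.argmin(dim=1)                   # index of the nearest code
    z_q = codebook[idx]                         # discrete (quantized) latent
    # Straight-through estimator: gradients flow to z_e as if quantization
    # were the identity, as described in the VQ-VAE paper.
    return z_e + (z_q - z_e).detach(), idx
```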
2.4. U-net
Figure 6. This is a very simple demonstration of the U-net archi-
tecture, where Dblock represents the process including downsam-
pling, and Ublock represents the process including upsampling.
I use images in pixel space as examples for ease of understand-
ing, though in reality, U-net processes encodings within the latent
space.
In the Stable Diffusion Model 1.5, we use a symmetric U-Net architecture where the input and output dimensions are equal (Fig. 6). This means that for every downsampling
step in the encoding process, there is a corresponding up-
sampling step. Overall, this ensures that the data shape re-
mains unchanged before and after denoising. In detail, we
employ skip connections between the encoder and decoder
at the same stages. The role of these skip connections is
to directly transfer the feature maps from the corresponding
stage of the encoder to the decoder during each decoding
stage. This helps to recover details that may be lost dur-
ing the downsampling process, thereby better preserving the
original input’s fine features during image reconstruction.
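The following toy sketch illustrates the symmetric down/up structure and a single skip connection; the channel counts are illustrative and far smaller than those in Stable Diffusion's U-Net.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy symmetric U-Net: one down stage, one up stage, one skip connection."""
    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)           # downsample
        self.mid = nn.Conv2d(ch * 2, ch * 2, 3, padding=1)
        self.up = nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1)    # upsample
        self.out = nn.Conv2d(ch * 2, ch, 3, padding=1)

    def forward(self, x):
        skip = x                               # feature map saved by the encoder stage
        h = self.up(self.mid(self.down(x)))
        h = torch.cat([h, skip], dim=1)        # skip connection restores lost detail
        return self.out(h)                     # same shape as the input
```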
The diffusion model uses the name “U-Net” following the original U-Net paper [33], but its architecture has changed significantly. The latent diffusion model introduces an attention mechanism into the U-Net, where data passes through a Spatial Transformer [31] during both the downsampling and upsampling processes. In these Transformer layers, we utilize the attention mechanism [43]. The self-attention mechanism
allows the model to consider information from all other in-
puts when processing a single input (such as a part of an
image). In image processing, this means the network can
automatically emphasize important areas of the image and
suppress unimportant parts, dynamically adjusting the sig-
nificance of each area based on global content. The cross-
attention mechanism, applied when conditional variables
are input, enables the model to generate corresponding im-
age parts based on the content described in the text, effec-
tively blending textual and visual information to enhance
the relevance and accuracy of the generated images.
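A single-head sketch of the attention used in these Transformer blocks is shown below; the dimensions are illustrative (for SD 1.5, the cross-attention context would be the 768-dimensional CLIP text embeddings), and passing the image tokens themselves as the context, with ctx_dim set equal to dim, reduces it to self-attention.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head attention sketch: queries come from image features, keys and
    values from the conditioning context (e.g. text embeddings). Sizes are illustrative."""
    def __init__(self, dim=320, ctx_dim=768):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(ctx_dim, dim, bias=False)
        self.v = nn.Linear(ctx_dim, dim, bias=False)

    def forward(self, x, context):
        # x: (B, N, dim) flattened image tokens; context: (B, M, ctx_dim)
        q, k, v = self.q(x), self.k(context), self.v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return attn @ v                        # each image token attends to the context
```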
2.5. LoRA
Figure 7. Diagram of LoRA [15]
Figure 8. This is a sample image of a Chinese landscape painting
generated by the Stable Diffusion 1.5 model, which I fine-tuned
using the LoRA method. It is combined with AnyText to generate
specific Chinese characters at specific locations.
LoRA[15] (Low-Rank Adaptation) is an efficient model
fine-tuning technique designed to endow pre-trained models
with specific functionalities by adjusting them. This method
achieves efficient compression of parameters by decompos-
ing the parameters to be updated (also known as parame-
ter residuals) into the product of two low-rank matrices. In
LoRA, training only these two low-rank matrices suffices
for the fine-tuning of the model, significantly reducing com-
putational costs while maintaining model performance.
LoRA fine-tuning differs from full-model fine-tuning in that we can freely choose the rank of the matrix decomposition to increase or decrease the number of parameters we adjust. This allows us to fine-tune the model with significantly lower memory usage and computational cost than full-model fine-tuning [44, 45].
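The idea can be summarized in a few lines: the pretrained weight is frozen, and only the low-rank factors A and B of the parameter residual are trained. Below is a minimal sketch; the rank and scaling values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA-adapted linear layer: the frozen weight W is kept, and the learned
    update is the low-rank product B @ A. Only A and B receive gradients."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # start at zero
        self.scale = alpha / rank

    def forward(self, x):
        # f(x) = W x + (B A) x * scale; the second term is the trained residual.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```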
In my fine-tuning of the diffusion model, I chose to use the LoRA
method to fine-tune the Spatial Transformer in the U-net of
SD1.5. SD1.5 is a text-to-image pretrained model based on
the LDM structure, trained and open-sourced by Stability
AI. Fig. 8 shows the results I obtained from fine-tuning, and
I will explain the fine-tuning process in detail and present
more results later.
2.6. CLIP
CLIP (Contrastive Language–Image Pre-training) [29] is an
innovative multimodal model developed by OpenAI that
leverages a contrastive learning mechanism to process and
understand the associations between images and text. By
learning from a large number of image-text pairs, the model
uses contrastive learning to enhance its capability of pro-
ducing unified semantic embeddings for both images and
text.
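The contrastive objective can be sketched in a few lines: image and text embeddings are normalized, a similarity matrix is built, and the matching pairs on the diagonal are treated as the correct classes in both directions. This is a simplified illustration rather than OpenAI's exact training code (for instance, the temperature is fixed here, whereas CLIP learns it).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive objective: matching image-text pairs (the diagonal)
    are pulled together, all mismatched pairs are pushed apart."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature        # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```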
In the SD1.5 architecture, we utilize the Text Encoder
from CLIP to convert the prompt, which controls image
generation, into embeddings that are applied to the cross-
attention layers [11] of the Spatial Transformer. Through the cross-attention layers, the control variables can influence all of the denoising steps [3, 4, 7, 17, 18], ultimately generating images that conform to the control variables.
Figure 9. Diagram of CLIP [29]
2.7. ControlNet
Figure 10. Diagram of ControlNet [46]
In addition to using text as a conditional variable, we
can also input various types of images as control conditions,
such as Canny edges, Hough lines, user scribbles, human
key points, segmentation maps, shape normals, depths, etc.
This allows image generation to be finely controlled, thanks to the powerful conditioning capabilities of ControlNet [46].
The core idea of ControlNet is to train an additional neu-
ral network on top of a pre-trained model. This architecture
views the large pre-trained model as a powerful backbone for learning diverse conditional controls. A trainable copy and the original locked model are connected through zero convolution layers, whose weights are initialized to zero so that their contribution grows gradually during training. This design ensures that no harmful noise is added to the deep features of the large diffusion model at the start of training, and it protects the large-scale pre-trained backbone inside the trainable copy from such noise.
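A zero convolution is easy to sketch: a 1x1 convolution whose weights and bias start at zero, so the ControlNet branch initially contributes nothing and its influence grows as training proceeds. The snippet below is a minimal illustration, not the authors' implementation.

```python
import torch.nn as nn

def zero_conv(channels):
    """A 1x1 convolution initialized to all zeros, as used to connect the
    trainable ControlNet copy to the frozen diffusion backbone."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# In the ControlNet branch, the output of each trainable block passes through
# such a layer before being added to the frozen model's features, e.g.
#   features = frozen_block(x) + zero_conv(c)(trainable_block(x + condition))
```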
2.8. AnyText
Generating accurate text has always been a major challenge
for diffusion models. Thanks to the fine control demon-
strated by ControlNet over generative models, the work on
AnyText[39] has further fine-tuned Text ControlNet. This
effort has successfully enabled diffusion models to generate
specific text in designated areas.
Figure 11. Diagram of AnyText [39]
The method for fine-tuning the text ControlNet is quite ingenious. To let the model know exactly which text we want, the corresponding glyphs are first rendered with a font, then passed through an OCR encoder and a linear layer. The result is merged into the embeddings alongside the prompt tokens that correspond to the text.
To specify where the text is generated, we also render the glyphs inside a specific masked region using a font. This rendering, together with the position mask, is passed into a fuse (mixing) layer whose output serves as the input to the ControlNet, as sketched below.
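As a purely hypothetical illustration of the glyph-rendering step (the font path, layout, and downstream encoders are placeholders, not AnyText's actual code), one could render the target characters and the position mask roughly as follows before handing them to the OCR encoder and fuse layer described above.

```python
from PIL import Image, ImageDraw, ImageFont

def render_glyph_line(text, box, canvas_size=(512, 512), font_path="font.ttf"):
    """Render the target characters inside the desired region and build the
    matching position mask; font_path and the layout are placeholders."""
    canvas = Image.new("L", canvas_size, 0)           # black background
    draw = ImageDraw.Draw(canvas)
    x0, y0, x1, y1 = box                              # region where the text should appear
    font = ImageFont.truetype(font_path, size=y1 - y0)
    draw.text((x0, y0), text, fill=255, font=font)    # rendered glyph image
    mask = Image.new("L", canvas_size, 0)
    ImageDraw.Draw(mask).rectangle(box, fill=255)     # position mask for the ControlNet
    return canvas, mask
```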
Thus, while diffusion models are capable of generating
images, we can also input various conditional variables to
finely control the model’s output, even enabling the model
to generate precise, specific text at designated locations.
3. LoRA on Diffusion Model
We have now reviewed the development history of diffusion models and introduced the state-of-the-art techniques built around them. Next, I demonstrate through experiments the strong generalization capability of diffusion models and how easily they can be fine-tuned.
Figure 12. Training set
I have collected 35 images of Chinese landscape paint-
ings from Harvard University’s online art center, the Art
Center at Metropolitan State University, the Princeton Uni-
versity Art Museum, and the Smithsonian Institution as a
training set.
My collection criteria are that the images are clear and
have distinct landscape painting features. I have cropped all
of them to a size of 512x512 pixels and have not performed
any image enhancement beyond that.
Next, I used CLIP to reverse-engineer a prompt for each image, captioning every training image so that the text encoder could be trained at the same time. Through multiple training runs, I discovered a labeling technique: annotate only the objects present in the picture rather than the layout or elements of the painting style. This prevents the text encoder and the U-net from both affecting the style at once, which could lead to chaotic results.
I simultaneously trained the U-net and the text encoder
on top of the Stable Diffusion 1.5 base model. I set the
learning rate for training the U-net at 1e-4 and the text en-
coder at 1e-5, as the text encoder is prone to overfitting.
I used ’cosine with restarts’ as the learning rate scheduler
and ’AdamW8bit’ as the optimizer. The training was con-
ducted for 20 epochs, with each image being learned five
times per epoch. I saved the weights after every two epochs.
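For reference, the hyperparameters above correspond roughly to the following configuration; the key names are illustrative and do not match any particular training script.

```python
lora_training_config = {
    "base_model": "stable-diffusion-v1-5",   # pretrained checkpoint being adapted
    "unet_lr": 1e-4,                          # LoRA learning rate for the U-Net
    "text_encoder_lr": 1e-5,                  # lower, since the text encoder overfits easily
    "lr_scheduler": "cosine_with_restarts",
    "optimizer": "AdamW8bit",
    "epochs": 20,
    "repeats_per_image": 5,                   # each image seen five times per epoch
    "save_every_n_epochs": 2,
    "resolution": 512,
}
```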
Figure 13. Training loss
After 3,500 training steps, we observed that the loss curve settled into regular oscillations. As with other generative models, the loss fluctuated between 0.1 and 0.2, suggesting that the model has not overfitted.
I used an NVIDIA RTX 4080 12GB for both training
and generation processes. I employed ’tree, water, moun-
tain’ as positive prompts and adopted the DPM++ 2M Kar-
ras sampling method. Each image was generated with 30 sampling steps at a resolution of 512x512. Using the
fine-tuned model, I continuously generated 25 sample im-
ages Fig. 14, demonstrating that our model has effectively
learned the style of Chinese landscape paintings and pos-
sesses strong generalization capabilities.
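The generation setup can be reproduced roughly as follows with the diffusers library, assuming a recent release and a hypothetical path to the LoRA weights produced by the fine-tuning above; DPM++ 2M Karras corresponds to the multistep DPM-Solver scheduler with Karras sigmas.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# Sketch of the inference setup; the LoRA weights path is a placeholder.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True   # DPM++ 2M Karras
)
pipe.load_lora_weights("path/to/landscape_lora.safetensors")

image = pipe(
    prompt="tree, water, mountain",
    num_inference_steps=30,
    height=512, width=512,
).images[0]
image.save("sample.png")
```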
Figure 14. Output samples
3.1. LoRA on Diffusion Model with AnyText
If we look closely, we can notice a flaw in the images we generate: they contain unintelligible text. One approach is to remove all text from the training set, but we can also use AnyText to generate meaningful text. Note that during training, AnyText freezes the parameters of the U-net and only trains the text ControlNet, so the LoRA fine-tuning we perform on the U-net does not interfere with the fine-tuning of the text ControlNet.
Therefore, on the same set of SD1.5 weights, I fine-tuned the U-net with the LoRA method while AnyText fine-tuned the ControlNet of SD1.5.
Figure 15. Output with AnyText
After loading the LoRA weights I trained together with AnyText, I found that the results matched my expectations: I could generate distinctly styled, meaningful text in the top-left corner. The model successfully created Chinese landscape paintings complete with calligraphy-style inscriptions.
I have open-sourced the trained model file on GitHub along with a detailed README that explains how to use it.
4. Conclusion
Making AI omnipotent has always been the dream of artificial intelligence scientists. Starting from the origins of generative models and proceeding to the diffusion model, we have enumerated the key technologies and innovations that play a crucial role, marveled at the achievements of generative models, and demonstrated their power through experiments with fine-tuned models. In the course of this article, I personally went through the entire process of training, generating with, and improving a diffusion model, and succeeded. We have seen the great potential and prospects of generative models.
5. Future Work
Although the diffusion model has been greatly improved
and has shown strong capabilities, we still have a long way
to go.
Architecture Problem: Currently, the denoiser in diffu-
sion models widely employs U-nets with attention mech-
anisms. However, the convolution layers in these models
struggle to capture long-range dependencies due to their
limited receptive field. Transformers, on the other hand,
excel at handling long-range dependencies. Replacing the
entire U-net with a transformer may help diffusion models
better model the global structure and relationships within
images, especially in complex scenes. Additionally, trans-
formers can easily integrate information from other modal-
ities (such as text, audio, etc.), making them convenient for
conditional control. The text-to-video model Sora [25] has successfully employed this DiT [26] architecture, and the yet-to-be-open-sourced SD3 [1] also claims to use the DiT architecture. The latest and most powerful open-source text-to-image model, Flux [22], boasts 12 billion parameters and claims to have achieved tremendous success with the same architecture.
Single-Stage vs. Dual-Stage: In this paper, the SD1.5
we used adopts a single-stage diffusion architecture, where
our model directly starts denoising from the initial noise.
Meanwhile, SDXL [27] experimented with a dual-stage diffusion architecture, in which one diffusion model generates the initial image and a second one further refines and denoises it. This architecture achieved better results; however, loading two diffusion models consumes a significant amount of computational resources.
Evaluation Metrics: Assessing the quality of images gen-
erated by generative models is challenging. Although there
are some metrics available, they still struggle to reflect hu-
man subjective assessment of image quality. Currently, the
Fréchet Inception Distance (FID) is commonly used as an
evaluation metric, which assesses the quality of generated
images by comparing the feature distributions of generated
images to real images. It has evaluative significance within
a certain range, but a more comprehensive set of metrics is
still needed to assess the capabilities of generative models.
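Concretely, FID fits a Gaussian to the Inception features of each image set and measures the Fréchet distance between the two Gaussians; a minimal sketch of the final distance computation (given precomputed feature means and covariances) is shown below.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, sigma1, mu2, sigma2):
    """Fréchet Inception Distance between two Gaussians fitted to Inception
    features: FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):        # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2 * covmean)
```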
Generation Details: Although the images output by dif-
fusion models are increasingly difficult for humans to dis-
tinguish as real or fake after continuous denoising, they are
prone to producing noise, and there can be obvious incoher-
ent colors in certain areas. How to optimize this issue is a
significant challenge.
References
[1] Stability AI. Stable diffusion 3. https://stability.ai/news/stable-diffusion-3. Accessed: 2024-08-16. 7
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou.
Wasserstein generative adversarial networks. In Proceedings
of the 34th International Conference on Machine Learning,
pages 214–223. PMLR, 2017. 2
[3] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended
diffusion for text-driven editing of natural images. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pages 18208–18218, 2022.
5
[4] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. In-
structpix2pix: Learning to follow image editing instructions.
In CVPR, 2023. 5
[5] Rewon Child. Very deep {vae}s generalize autoregressive
models and can outperform them on images. In International
Conference on Learning Representations, 2021. 1
[6] Prafulla Dhariwal and Alexander Nichol. Diffusion models
beat gans on image synthesis. In Advances in Neural Infor-
mation Processing Systems, pages 8780–8794. Curran Asso-
ciates, Inc., 2021. 2
[7] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin,
Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-
based text-to-image generation with human priors. In Com-
puter Vision–ECCV 2022: 17th European Conference, Tel
Aviv, Israel, October 23–27, 2022, Proceedings,Part XV,
2022. 5
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. In Advances in
Neural Information Processing Systems. Curran Associates,
Inc., 2014. 2
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. In Advances in
Neural Information Processing Systems. Curran Associates,
Inc., 2014. 1
[10] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent
Dumoulin, and Aaron C Courville. Improved training of
wasserstein gans. In Advances in Neural Information Pro-
cessing Systems. Curran Associates, Inc., 2017. 2
[11] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman,
Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image
editing with cross attention control. 2022. 4
[12] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimen-
sionality of data with neural networks. Science, 313(5786):
504–507, 2006. 1
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif-
fusion probabilistic models. In Advances in Neural Informa-
tion Processing Systems 33: Annual Conference on Neural
Information Processing Systems 2020, NeurIPS 2020, De-
cember 6-12, 2020, virtual, 2020. 2
[14] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet,
Mohammad Norouzi, and Tim Salimans. Cascaded diffusion
models for high fidelity image generation. J. Mach. Learn.
Res., 23(1), 2022. 3
[15] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-
Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.
Lora: Low-rank adaptation of large language models. In
The Tenth International Conference on Learning Represen-
tations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
4
[16] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen.
Progressive growing of gans for improved quality, stability,
and variation. In 6th International Conference on Learning
Representations, ICLR 2018, Vancouver, BC, Canada, April
30 - May 3, 2018, Conference Track Proceedings, 2018. 2
[17] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Hui-
wen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani.
Imagic: Text-based real image editing with diffusion mod-
els. In 2023 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 6007–6017, 2023. 5
[18] Gwanghyun Kim, Taesung Kwon, and Jong-Chul Ye. Diffu-
sionclip: Text-guided diffusion models for robust image ma-
nipulation. 2022 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pages 2416–2425, 2021. 5
[19] Diederik P. Kingma and Max Welling. Auto-encoding vari-
ational bayes. In ICLR, 2014. 1
[20] Zhifeng Kong and Wei Ping. On fast sampling of diffusion
probabilistic models. In ICML Workshop on Invertible Neu-
ral Networks, Normalizing Flows, and Explicit Likelihood
Models, 2021. 3
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
Imagenet classification with deep convolutional neural net-
works. In Advances in Neural Information Processing Sys-
tems. Curran Associates, Inc., 2012. 1
[22] Black Forest Labs. Black forest labs official website.
https://blackforestlabs.ai/. Accessed: 2024-
08-16. 7
[23] Sungbin Lim, EUN BI YOON, Taehyun Byun, Taewon
Kang, Seungwoo Kim, Kyungjae Lee, and Sungjoon Choi.
Score-based generative modeling through stochastic evolu-
tion equations in hilbert spaces. In Advances in Neural In-
formation Processing Systems, pages 37799–37812. Curran
Associates, Inc., 2023. 2
[24] Lars Mescheder. On the convergence properties of gan train-
ing. 2018. 2
[25] OpenAI. Sora. https://openai.com/index/sora/. 7
[26] William Peebles and Saining Xie. Scalable diffusion mod-
els with transformers. In Proceedings of the IEEE/CVF In-
ternational Conference on Computer Vision (ICCV), pages
4195–4205, 2023. 7
[27] Dustin Podell, Zion English, Kyle Lacey, Andreas
Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and
Robin Rombach. SDXL: Improving latent diffusion models
for high-resolution image synthesis. In The Twelfth Interna-
tional Conference on Learning Representations, 2024. 7
[28] Alec Radford, Luke Metz, and Soumith Chintala. Unsuper-
vised representation learning with deep convolutional gener-
ative adversarial networks. In 4th International Conference
on Learning Representations, ICLR 2016, San Juan, Puerto
Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
2
[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen
Krueger, and Ilya Sutskever. Learning transferable visual
models from natural language supervision. In Proceedings
of the 38th International Conference on Machine Learning,
ICML 2021, 18-24 July 2021, Virtual Event, 2021. 4, 5
[30] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Gen-
erating diverse high-fidelity images with vq-vae-2. In Ad-
vances in Neural Information Processing Systems. Curran
Associates, Inc., 2019. 3
[31] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Patrick Esser, and Björn Ommer. High-resolution image
synthesis with latent diffusion models. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pages 10684–10695, 2022. 3, 4
[32] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Patrick Esser, and Björn Ommer. High-resolution image syn-
thesis with latent diffusion models, 2022. 1
[33] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-
net: Convolutional networks for biomedical image segmen-
tation. In Medical image computing and computer-assisted
intervention–MICCAI 2015: 18th international conference,
Munich, Germany, October 5-9, 2015, proceedings, part III
18, pages 234–241. Springer, 2015. 4
[34] Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise
estimation for generative diffusion models. 2021. 3
[35] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan,
and Surya Ganguli. Deep unsupervised learning using
nonequilibrium thermodynamics. In Proceedings of the
32nd International Conference on Machine Learning, pages
2256–2265, Lille, France, 2015. PMLR. 2
[36] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning
structured output representation using deep conditional gen-
erative models. In Advances in Neural Information Process-
ing Systems, 2015. 1
[37] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois-
ing diffusion implicit models. In International Conference
on Learning Representations, 2021. 3
[38] Yang Song and Stefano Ermon. Generative modeling by es-
timating gradients of the data distribution. In Advances in
Neural Information Processing Systems. Curran Associates,
Inc., 2019. 2
[39] Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng,
and Xuansong Xie. Anytext: Multilingual visual text gener-
ation and editing. In The Twelfth International Conference
on Learning Representations, 2024. 5
[40] Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical
variational autoencoder. In Advances in Neural Information
Processing Systems, pages 19667–19679. Curran Associates,
Inc., 2020. 1
[41] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based
generative modeling in latent space. In Advances in Neural
Information Processing Systems, pages 11287–11302. Cur-
ran Associates, Inc., 2021. 3
[42] Aäron van den Oord, Oriol Vinyals, and Koray
Kavukcuoglu. Neural discrete representation learning.
In Advances in Neural Information Processing Systems
30: Annual Conference on Neural Information Processing
Systems 2017, December 4-9, 2017, Long Beach, CA, USA,
2017. 3
[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In Advances in Neu-
ral Information Processing Systems 30: Annual Conference
on Neural Information Processing Systems 2017, December
4-9, 2017, Long Beach, CA, USA, 2017. 4
[44] Zhengbo Wang and Jian Liang. Lora-pro: Are low-
rank adapters properly optimized? arXiv preprint
arXiv:2407.18242, 2024. 4
[45] Yuchen Zeng and Kangwook Lee. The expressive power of
low-rank adaptation. In The Twelfth International Confer-
ence on Learning Representations, 2024. 4
[46] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding
conditional control to text-to-image diffusion models. In
2023 IEEE/CVF International Conference on Computer Vi-
sion (ICCV), pages 3813–3824, 2023. 5