Harnessing the Power of AI: A Comprehensive Review of Image Synthesis with a
Focus on Diffusion Models
Bowen Xue
University of Science and Technology of China
xbwcyt@mail.ustc.edu.cn
Abstract
AI image synthesis has leapt from its inception to rapid development within a single decade. We are delighted to see that, with growing computing power and improved model architectures, image generation models have progressed from initially producing images of questionable quality [19] to generating high-resolution, highly realistic images in just a few seconds [32]. In this
article, I will enumerate the pioneering achievements in the
field of image generation and focus on discussing the dif-
fusion model framework used by current mainstream image
generation models. I will also showcase my personal fine-
tuning results and experiences, explore the current short-
comings, and discuss potential improvements for future im-
age generation models.
1. Introduction
Before generative models came into prominence, computers
had already learned to perform efficient image recognition
by leveraging deep learning [21]. Our computers possessed good vision, but at that time they had little artistic capability. As Feynman said, “What I cannot create, I do not understand.” Recognition and generation complement each other: a model internalizes the essence of its data using far fewer parameters than the amount of training data.
If a model clearly understands the essence of data, it can
also generate data. The first to break this deadlock was the
Variational Autoencoder (VAE)[19].
1.1. Variational Autoencoder
From a mathematical perspective, what image generation models actually do is estimate the distribution of real data in image space (Fig. 3).
Variational auto-encoders are a type of auto-encoder (Fig. 1), where an auto-encoder is a neural network used to learn efficient encodings of data [12]. The goal of an auto-encoder is to compress input data through a neural network and then reconstruct data that is as close as possible to the original input. A variational auto-encoder not only learns a compressed representation of the data encoding, but also learns the probabilistic distribution of that encoding, allowing us to sample from the distribution to generate new data points.
Figure 1. Auto-encoder structure
VAE builds upon the traditional autoencoder by adding
Gaussian noise to the encoder’s output, which allows the
decoder to be robust to noise, thereby improving the quality
of the generated results.
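To make this concrete, below is a minimal sketch (in PyTorch, with illustrative layer sizes not taken from any particular paper) of how a VAE encoder predicts a mean and log-variance, adds Gaussian noise via the reparameterization trick, and decodes the sampled code; generating new images only requires drawing from the prior and decoding.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal VAE sketch; the layer sizes (784 -> 32) are illustrative only."""
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # predicts mean and log-variance
        self.dec = nn.Linear(z_dim, x_dim)       # reconstructs the input

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        eps = torch.randn_like(mu)               # Gaussian noise
        z = mu + eps * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(z), mu, logvar

    def sample(self, n):
        # Generate new data: draw z from the prior N(0, I) and decode it.
        z = torch.randn(n, self.dec.in_features)
        return self.dec(z)
```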
VAE is still an unsupervised method, making it difficult to control the output of the decoder. Conditional VAE (CVAE) introduces conditional variables on top of the VAE, turning it into a supervised learning method [36]. In CVAE,
the operations of the encoder and decoder depend not only
on the input data but also on additional conditional vari-
ables, allowing us to make the decoder generate specific
images based on different conditional variables.
The VAE accomplishes the task of generating images [5, 40]; however, a VAE does not so much learn how to produce a realistic image as learn to generate images as similar as possible to those in the training set. Partly due to this limitation, Generative Adversarial Nets (GANs) [9] were proposed.
1.2. Generative Adversarial Nets
GAN views the training process as a game between two
separate networks: a generator network and a discrimina-
tor network that tries to classify samples. Whenever the
discriminator notices a difference between the generator’s model distribution and the real distribution, the generator slightly adjusts its parameters to make this difference disappear, until finally the generator accurately reproduces the real data distribution and the discriminator guesses randomly, unable to discern any difference (Fig. 2).
Figure 2. GAN structure
We can naturally use the decoder of a VAE as the gen-
erator for a GAN, while using a CNN as the discriminator.
In GANs, the training of the generator and discriminator is
a dynamic game-theoretic process. If there is a mismatch
in capabilities between the generator and discriminator, it
may lead to training instability, resulting in mode collapse
or non-convergence. Therefore, it is crucial to choose ap-
propriate structures for both the generator and discrimina-
tor.
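As a rough illustration of this dynamic game, here is a minimal sketch of one adversarial training step; the generator G, discriminator D, and their optimizers are hypothetical placeholders assumed to be defined elsewhere, and D is assumed to output raw logits.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z_dim=128):
    """One step of the adversarial game; G, D, and their optimizers are assumed given."""
    z = torch.randn(real.size(0), z_dim)

    # Discriminator update: push D(real) toward 1 and D(G(z)) toward 0.
    fake = G(z).detach()
    d_real, d_fake = D(real), D(fake)
    loss_D = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator update: push D(G(z)) toward 1 (non-saturating loss).
    d_fake = D(G(z))
    loss_G = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```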
DCGAN[28] introduces deconvolution layers into the
generator, where the deconvolution (transposed convolu-
tion) layers gradually enlarge the feature maps to produce
clearer images. PGGAN[16] progressively increases the
complexity of the generator and discriminator during the
iteration process. It begins training at low resolution and
gradually adds layers to generate high-resolution images.
In the original GAN, Jensen-Shannon divergence (JS di-
vergence) is chosen to measure the difference between the
real data distribution and the generated data distribution.
However, JS divergence is not a good measure when there
is little overlap between the distributions, as it can lead to
an overly powerful discriminator, causing the generator’s
gradients to vanish and affecting the stability of GAN train-
ing. WGAN[2] proposes the use of the Wasserstein distance
(also known as the Earth-Mover distance) as an alternative
to JS divergence. Wasserstein distance provides a smoother
gradient, even when there is minimal overlap between the
generated and real data distributions.
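For comparison with the cross-entropy objective above, a minimal sketch of the WGAN critic update might look as follows, with weight clipping standing in for the Lipschitz constraint as in the original WGAN paper; again, G, the critic D, and its optimizer are assumed given.

```python
import torch

def wgan_critic_step(G, D, opt_D, real, z_dim=128, clip=0.01):
    """One critic update in WGAN; G, the critic D, and opt_D are assumed given."""
    z = torch.randn(real.size(0), z_dim)
    fake = G(z).detach()
    # The critic maximizes E[D(real)] - E[D(fake)], an estimate of the
    # Wasserstein (Earth-Mover) distance between the two distributions.
    loss = -(D(real).mean() - D(fake).mean())
    opt_D.zero_grad(); loss.backward(); opt_D.step()
    for p in D.parameters():
        p.data.clamp_(-clip, clip)   # weight clipping enforces the Lipschitz constraint
    return -loss.item()              # estimated Wasserstein distance
```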
Just like the core idea of the previously mentioned
CVAE, we can also introduce conditional variables into
GANs, allowing GANs to benefit from human-annotated
samples. This is the idea behind CGAN [8].
Although the architecture of GANs has been continu-
ously upgraded to generate high-quality images, their inherent model structure makes them difficult to train [10, 24].
Figure 3. The image illustrates the mathematical concepts of an image generation model.
The
dynamic balance between the generator and the discrimi-
nator is hard to maintain, and the adversarial training in-
volving two networks makes it challenging for the model
to reach a state of convergence. The training losses of the
generator and discriminator are not traditional signals that
can clearly indicate optimization progress, making it diffi-
cult to determine the criteria for monitoring and terminating
the training process. This directly led to the emergence of
diffusion models.
2. Related Work
The diffusion model [13] abandons the innovative structure of GANs, the discriminator, and focuses purely on the generative model. The diffusion model learns to predict the noise in images through a process of continually adding noise. The process of generating images with a diffusion model, which is also a denoising process, is akin to the famous quote by the sculptor Michelangelo: “The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material.”
2.1. Diffusion model
Diffusion models[13, 23, 35, 38] are primarily inspired by
the diffusion processes in physics. In the context of gener-
ative models, diffusion models simulate a reverse process:
starting from a disordered random noise state, they gradually “denoise” it (Fig. 4) to generate structured data. This process is achieved through many iterative steps that progressively remove noise, until the initial noisy state transitions into a clear data output [6].
The core mathematical algorithm of the diffusion model
is very concise. We train the model to acquire the ability to
denoise images at any stage. During the sampling process,
we can start with initial noise and continuously denoise in
cycles until a satisfactory image is generated. This gives
the diffusion model excellent interpretability, allowing us
to adjust or stop at any step of the denoising process.
The performance of the diffusion model is quite impressive; however, for all generative models, high resolution remains a bottleneck.
Figure 4. Denoiser structure
Algorithm 1 Training
1: repeat
2:   $x_0 \sim q(x_0)$
3:   $t \sim \mathrm{Uniform}(\{1, \dots, T\})$
4:   $\epsilon \sim \mathcal{N}(0, I)$
5:   Take a gradient descent step on $\nabla_\theta \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,\ t\right) \right\|^2$
6: until converged

Algorithm 2 Sampling
1: $x_T \sim \mathcal{N}(0, I)$
2: for $t = T, \dots, 1$ do
3:   $z \sim \mathcal{N}(0, I)$ if $t > 1$, else $z = 0$
4:   $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z$
5: end for
6: return $x_0$
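The two algorithms above translate almost line for line into code. The following is a minimal PyTorch sketch, assuming a hypothetical noise-prediction network eps_model(x, t) and precomputed schedules alpha, alpha_bar, and sigma; it is illustrative rather than a faithful reproduction of any specific implementation.

```python
import torch

# Assumed given: eps_model(x, t), a network that predicts the added noise, and
# 1-D tensors alpha (alpha_t) and alpha_bar (their cumulative product), length T.

def ddpm_train_step(eps_model, x0, alpha_bar, T):
    t = torch.randint(1, T + 1, (x0.size(0),))            # random timestep per sample
    eps = torch.randn_like(x0)
    ab = alpha_bar[t - 1].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps           # forward (noising) process
    return ((eps - eps_model(x_t, t)) ** 2).mean()         # simple noise-prediction loss

@torch.no_grad()
def ddpm_sample(eps_model, shape, alpha, alpha_bar, sigma, T):
    x = torch.randn(shape)                                 # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        a, ab = alpha[t - 1], alpha_bar[t - 1]
        eps = eps_model(x, torch.full((shape[0],), t))
        x = (x - (1 - a) / (1 - ab).sqrt() * eps) / a.sqrt() + sigma[t - 1] * z
    return x                                               # x_0
```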
High resolution is a bottleneck because an image represented in RGB with a width of m pixels and a height of n pixels has a data shape of m × n × 3. For a high-resolution image, the amount of information it contains is enormous, making it difficult to process such data directly.
Although advanced sampling strategies[20, 34, 37] and
hierarchical methods[14, 41] can accelerate inference speed
in pixel space, training on high-resolution image data al-
ways requires computationally expensive gradients.
2.2. Latent Diffusion model
For this reason, Latent Diffusion Models[31] have been pro-
posed. Unlike traditional diffusion models, which operate
directly in high-dimensional data spaces, Latent Diffusion
Models first encode data into a lower-dimensional latent
space. Within this latent space, the model can learn and sim-
ulate data distributions more effectively, as the latent space
provides a more abstract and compressed representation of
the data. This typically results in the generation of higher-
quality samples.
The latent diffusion model primarily trains an encoder
that encodes from pixel space into latent space, a denoiser
that performs noise reduction in latent space, and a decoder
that restores from latent space back to pixel space.
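A rough sketch of how these three components fit together is shown below; encoder, decoder, and the latent sampling routine are assumed to exist (for example, the DDPM sampler sketched earlier, applied to latents), and the latent shape follows SD 1.5's 8x-downsampled, 4-channel layout.

```python
import torch

# Sketch of the three latent-diffusion stages. `encoder`, `decoder`, and a latent
# sampler are assumed given. In SD 1.5, a 512x512x3 image corresponds to a
# 4x64x64 latent, so the denoiser handles roughly 48x less data than raw pixels.

@torch.no_grad()
def ldm_generate(decoder, sample_latents, latent_shape=(1, 4, 64, 64)):
    z = sample_latents(latent_shape)      # diffusion runs entirely in latent space
    return decoder(z)                     # decode the clean latent back to pixels

def ldm_train_loss(encoder, latent_train_step, images):
    z0 = encoder(images)                  # compress images before any diffusion step
    return latent_train_step(z0)          # e.g. the DDPM loss above, but on latents
```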
Figure 5. This is the framework used by Stable Diffusion 1.5.
2.3. VQ-VAE
We often use VQ-VAE[42] as a bridge between pixel space
and latent space. VQ-VAE is a special type of VAE that
incorporates vector quantization (VQ) into the traditional
VAE framework. Vector quantization is a technique that
maps continuous input vectors to the nearest vector (known
as codebook vectors) in a finite set of vectors. In VQ-VAE,
this technique is used to discretize the latent space, mean-
ing the continuous representations output by the encoder are
mapped to the nearest vector in a predefined, fixed-size set
of codebook vectors.
Discrete latent spaces provide more stable representa-
tions, which are beneficial for generative models to learn
and understand the high-level structures of data [30]. In con-
trast, the continuous latent spaces of traditional VAEs may
lead to unstable generation quality due to the ”holes” prob-
lem in the latent space (i.e., some regions of the latent space
do not correspond to valid data).
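The quantization step itself is simple: each continuous encoding is replaced by its nearest codebook entry, with a straight-through estimator letting gradients flow back to the encoder. Below is a minimal sketch, not tied to any particular VQ-VAE implementation.

```python
import torch

def vector_quantize(z_e, codebook):
    """Map each continuous encoder output to its nearest codebook vector.

    z_e:      (N, D) continuous encoder outputs
    codebook: (K, D) learned embedding vectors
    """
    dists = torch.cdist(z_e, codebook)          # distances to every codebook entry
    idx = dists.argmin(dim=1)                   # index of the nearest code
    z_q = codebook[idx]                         # discrete (quantized) latent
    # Straight-through estimator: gradients flow to z_e as if quantization
    # were the identity, as described in the VQ-VAE paper.
    return z_e + (z_q - z_e).detach(), idx
```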
2.4. U-net
Figure 6. This is a very simple demonstration of the U-net archi-
tecture, where Dblock represents the process including downsam-
pling, and Ublock represents the process including upsampling.
I use images in pixel space as examples for ease of understand-
ing, though in reality, U-net processes encodings within the latent
space.
In the Stable Diffusion Model 1.5, we use a symmetric U-Net architecture where the input and output dimensions are equal (Fig. 6). This means that for every downsampling
step in the encoding process, there is a corresponding up-
sampling step. Overall, this ensures that the data shape re-
mains unchanged before and after denoising. In detail, we
employ skip connections between the encoder and decoder
at the same stages. The role of these skip connections is
to directly transfer the feature maps from the corresponding
stage of the encoder to the decoder during each decoding
stage. This helps to recover details that may be lost dur-
ing the downsampling process, thereby better preserving the
original input’s fine features during image reconstruction.
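The following toy sketch illustrates the symmetric down/up structure and a single skip connection; the channel counts are illustrative and far smaller than those in Stable Diffusion's U-Net.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy symmetric U-Net: one down stage, one up stage, one skip connection."""
    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)           # downsample
        self.mid = nn.Conv2d(ch * 2, ch * 2, 3, padding=1)
        self.up = nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1)    # upsample
        self.out = nn.Conv2d(ch * 2, ch, 3, padding=1)

    def forward(self, x):
        skip = x                               # feature map saved by the encoder stage
        h = self.up(self.mid(self.down(x)))
        h = torch.cat([h, skip], dim=1)        # skip connection restores lost detail
        return self.out(h)                     # same shape as the input
```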
The diffusion model uses the name “U-Net” following the original U-Net paper [33], but its architecture has changed significantly. The latent diffusion model introduces an attention mechanism into the U-Net, where data passes through a Spatial Transformer [31] during both the downsampling and upsampling processes. In these Transformer layers, we utilize the attention mechanism [43]. The self-attention mechanism
allows the model to consider information from all other in-
puts when processing a single input (such as a part of an
image). In image processing, this means the network can
automatically emphasize important areas of the image and
suppress unimportant parts, dynamically adjusting the sig-
nificance of each area based on global content. The cross-
attention mechanism, applied when conditional variables
are input, enables the model to generate corresponding im-
age parts based on the content described in the text, effec-
tively blending textual and visual information to enhance
the relevance and accuracy of the generated images.
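A single-head sketch of the attention used in these Transformer blocks is shown below; the dimensions are illustrative (for SD 1.5, the cross-attention context would be the 768-dimensional CLIP text embeddings), and passing the image tokens themselves as the context, with ctx_dim set equal to dim, reduces it to self-attention.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head attention sketch: queries come from image features, keys and
    values from the conditioning context (e.g. text embeddings). Sizes are illustrative."""
    def __init__(self, dim=320, ctx_dim=768):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(ctx_dim, dim, bias=False)
        self.v = nn.Linear(ctx_dim, dim, bias=False)

    def forward(self, x, context):
        # x: (B, N, dim) flattened image tokens; context: (B, M, ctx_dim)
        q, k, v = self.q(x), self.k(context), self.v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return attn @ v                        # each image token attends to the context
```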
2.5. LoRA
Figure 7. Diagram of LoRA [15]
Figure 8. This is a sample image of a Chinese landscape painting
generated by the Stable Diffusion 1.5 model, which I fine-tuned
using the LoRA method. It is combined with AnyText to generate
specific Chinese characters at specific locations.
LoRA[15] (Low-Rank Adaptation) is an efficient model
fine-tuning technique designed to endow pre-trained models
with specific functionalities by adjusting them. This method
achieves efficient compression of parameters by decompos-
ing the parameters to be updated (also known as parame-
ter residuals) into the product of two low-rank matrices. In
LoRA, training only these two low-rank matrices suffices
for the fine-tuning of the model, significantly reducing com-
putational costs while maintaining model performance.
LoRA fine-tuning differs from full-model fine-tuning in that we can freely choose the rank of the matrix decomposition to increase or decrease the number of parameters we adjust. This allows us to fine-tune the model with significantly lower memory usage and computational cost than full-model fine-tuning [44, 45].
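The idea can be summarized in a few lines: the pretrained weight is frozen, and only the low-rank factors A and B of the parameter residual are trained. Below is a minimal sketch; the rank and scaling values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA-adapted linear layer: the frozen weight W is kept, and the learned
    update is the low-rank product B @ A. Only A and B receive gradients."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # start at zero
        self.scale = alpha / rank

    def forward(self, x):
        # f(x) = W x + (B A) x * scale; the second term is the trained residual.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```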
In my fine-tuning of the diffusion model, I chose to use the LoRA
method to fine-tune the Spatial Transformer in the U-net of
SD1.5. SD1.5 is a text-to-image pretrained model based on
the LDM structure, trained and open-sourced by Stability
AI. Fig. 8 shows the results I obtained from fine-tuning, and
I will explain the fine-tuning process in detail and present
more results later.
2.6. CLIP
CLIP (Contrastive Language–Image Pre-training) [29] is an
innovative multimodal model developed by OpenAI that
leverages a contrastive learning mechanism to process and
understand the associations between images and text. By
learning from a large number of image-text pairs, the model
uses contrastive learning to enhance its capability of pro-
ducing unified semantic embeddings for both images and
text.
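The contrastive objective can be sketched in a few lines: image and text embeddings are normalized, a similarity matrix is built, and the matching pairs on the diagonal are treated as the correct classes in both directions. This is a simplified illustration rather than OpenAI's exact training code (for instance, the temperature is fixed here, whereas CLIP learns it).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive objective: matching image-text pairs (the diagonal)
    are pulled together, all mismatched pairs are pushed apart."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature        # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```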
In the SD1.5 architecture, we utilize the Text Encoder
from CLIP to convert the prompt, which controls image
generation, into embeddings that are applied to the cross-
attention layers [11] of the Spatial Transformer. Through the cross-attention layers, the control variables can influence all of the denoising steps [3, 4, 7, 17, 18], ultimately generating images that conform to the control variables.
Figure 9. Diagram of CLIP [29]
2.7. ControlNet
Figure 10. Diagram of ControlNet [46]
In addition to using text as a conditional variable, we
can also input various types of images as control conditions,
such as Canny edges, Hough lines, user scribbles, human
key points, segmentation maps, shape normals, depths, etc.
This allows image generation to be finely controlled, thanks to the powerful conditioning capabilities of ControlNet [46].
The core idea of ControlNet is to train an additional neu-
ral network on top of a pre-trained model. This architecture
views the large pre-trained model as a powerful backbone for learning diverse conditional controls. A trainable copy and the original locked model are connected through zero convolution layers, whose weights are initialized to zero so that their contribution grows gradually during training. This design ensures that no harmful noise is added to the deep features of the large diffusion model at the start of training, and it protects the large-scale pre-trained backbone inside the trainable copy from such noise.
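A zero convolution is easy to sketch: a 1x1 convolution whose weights and bias start at zero, so the ControlNet branch initially contributes nothing and its influence grows as training proceeds. The snippet below is a minimal illustration, not the authors' implementation.

```python
import torch.nn as nn

def zero_conv(channels):
    """A 1x1 convolution initialized to all zeros, as used to connect the
    trainable ControlNet copy to the frozen diffusion backbone."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# In the ControlNet branch, the output of each trainable block passes through
# such a layer before being added to the frozen model's features, e.g.
#   features = frozen_block(x) + zero_conv(c)(trainable_block(x + condition))
```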
2.8. AnyText
Generating accurate text has always been a major challenge
for diffusion models. Thanks to the fine control demon-
strated by ControlNet over generative models, the work on
AnyText[39] has further fine-tuned Text ControlNet. This
effort has successfully enabled diffusion models to generate
specific text in designated areas.
Figure 11. Diagram of AnyText [39]
The method for fine-tuning the text ControlNet is quite ingenious. To let the model know exactly which text we want, the corresponding glyphs are first rendered with a font, then passed through an OCR encoder and a linear layer. The result is merged into the embeddings alongside the prompt tokens that correspond to the text.
To specify where the text is generated, we also render the glyphs inside a specific masked region using a font. This rendering, together with the position mask, is passed into a fuse (mixing) layer whose output serves as the input to the ControlNet, as sketched below.
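As a purely hypothetical illustration of the glyph-rendering step (the font path, layout, and downstream encoders are placeholders, not AnyText's actual code), one could render the target characters and the position mask roughly as follows before handing them to the OCR encoder and fuse layer described above.

```python
from PIL import Image, ImageDraw, ImageFont

def render_glyph_line(text, box, canvas_size=(512, 512), font_path="font.ttf"):
    """Render the target characters inside the desired region and build the
    matching position mask; font_path and the layout are placeholders."""
    canvas = Image.new("L", canvas_size, 0)           # black background
    draw = ImageDraw.Draw(canvas)
    x0, y0, x1, y1 = box                              # region where the text should appear
    font = ImageFont.truetype(font_path, size=y1 - y0)
    draw.text((x0, y0), text, fill=255, font=font)    # rendered glyph image
    mask = Image.new("L", canvas_size, 0)
    ImageDraw.Draw(mask).rectangle(box, fill=255)     # position mask for the ControlNet
    return canvas, mask
```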
Thus, while diffusion models are capable of generating
images, we can also input various conditional variables to
finely control the model’s output, even enabling the model
to generate precise, specific text at designated locations.
3. LoRA on Diffusion Model
We have now reviewed the development history of diffusion models and introduced the state-of-the-art techniques built around them. Next, I demonstrate through experiments the strong generalization capability of diffusion models and how easily they can be fine-tuned.
Figure 12. Training set
I have collected 35 images of Chinese landscape paint-
ings from Harvard University’s online art center, the Art
Center at Metropolitan State University, the Princeton Uni-
versity Art Museum, and the Smithsonian Institution as a
training set.
My collection criteria are that the images are clear and
have distinct landscape painting features. I have cropped all
of them to a size of 512x512 pixels and have not performed
any image enhancement beyond that.
Next, I used CLIP to reverse-engineer a prompt for each image, captioning every training image so that the text encoder could be trained at the same time. Through multiple training runs, I discovered a labeling technique: annotate only the objects present in the picture rather than the layout or elements of the painting style. This prevents the text encoder and the U-net from both affecting the style at once, which could lead to chaotic results.
I simultaneously trained the U-net and the text encoder
on top of the Stable Diffusion 1.5 base model. I set the
learning rate for training the U-net at 1e-4 and the text en-
coder at 1e-5, as the text encoder is prone to overfitting.
I used ’cosine with restarts’ as the learning rate scheduler
and ’AdamW8bit’ as the optimizer. The training was con-
ducted for 20 epochs, with each image being learned five
times per epoch. I saved the weights after every two epochs.
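For reference, the hyperparameters above correspond roughly to the following configuration; the key names are illustrative and do not match any particular training script.

```python
lora_training_config = {
    "base_model": "stable-diffusion-v1-5",   # pretrained checkpoint being adapted
    "unet_lr": 1e-4,                          # LoRA learning rate for the U-Net
    "text_encoder_lr": 1e-5,                  # lower, since the text encoder overfits easily
    "lr_scheduler": "cosine_with_restarts",
    "optimizer": "AdamW8bit",
    "epochs": 20,
    "repeats_per_image": 5,                   # each image seen five times per epoch
    "save_every_n_epochs": 2,
    "resolution": 512,
}
```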
Figure 13. Training loss
After 3,500 training steps, we observed that the loss curve settled into regular oscillations. As with other generative models, the loss fluctuated between 0.1 and 0.2, suggesting that the model has not overfitted.
I used an NVIDIA RTX 4080 12GB for both training
and generation processes. I employed ’tree, water, moun-
tain’ as positive prompts and adopted the DPM++ 2M Kar-
ras sampling method. Each image was generated with 30 sampling steps at a resolution of 512x512. Using the
fine-tuned model, I continuously generated 25 sample im-
ages Fig. 14, demonstrating that our model has effectively
learned the style of Chinese landscape paintings and pos-
sesses strong generalization capabilities.
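The generation setup can be reproduced roughly as follows with the diffusers library, assuming a recent release and a hypothetical path to the LoRA weights produced by the fine-tuning above; DPM++ 2M Karras corresponds to the multistep DPM-Solver scheduler with Karras sigmas.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# Sketch of the inference setup; the LoRA weights path is a placeholder.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True   # DPM++ 2M Karras
)
pipe.load_lora_weights("path/to/landscape_lora.safetensors")

image = pipe(
    prompt="tree, water, mountain",
    num_inference_steps=30,
    height=512, width=512,
).images[0]
image.save("sample.png")
```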
Figure 14. Output samples
3.1. LoRA on Diffusion Model with AnyText
If we look closely, we can notice a flaw in the images we generate: they contain unintelligible text. One approach is to remove all text from the training set, but we can also use AnyText to generate meaningful text. Note that during training, AnyText freezes the parameters of the U-net and only trains the text ControlNet, so the LoRA fine-tuning we perform on the U-net does not interfere with the fine-tuning of the text ControlNet.
Therefore, on the same set of SD1.5 weights, I fine-tuned the U-net with the LoRA method while AnyText fine-tuned the ControlNet of SD1.5.
Figure 15. Output with AnyText
After loading the LoRA weights I trained together with AnyText, I found that the results matched my expectations: I could generate distinctly styled, meaningful text in the top-left corner. The model successfully created Chinese landscape paintings complete with calligraphy-style inscriptions.
I have open-sourced the trained model file on GitHub along with a detailed README that explains how to use it.
4. Conclusion
Making AI omnipotent has always been the dream of artificial intelligence scientists. Starting from the origins of generative models and proceeding to the diffusion model, we have enumerated the key technologies and innovations that play a crucial role, marveled at the achievements of generative models, and demonstrated their power through experiments with fine-tuned models. In the course of this article, I personally went through the entire process of training, generating with, and improving a diffusion model, and succeeded. We have seen the great potential and prospects of generative models.
5. Future Work
Although the diffusion model has been greatly improved
and has shown strong capabilities, we still have a long way
to go.
Architecture Problem: Currently, the denoiser in diffu-
sion models widely employs U-nets with attention mech-
anisms. However, the convolution layers in these models
struggle to capture long-range dependencies due to their
limited receptive field. Transformers, on the other hand,
excel at handling long-range dependencies. Replacing the
entire U-net with a transformer may help diffusion models
better model the global structure and relationships within
images, especially in complex scenes. Additionally, trans-
formers can easily integrate information from other modal-
ities (such as text, audio, etc.), making them convenient for
conditional control. The text-to-video model Sora [25] has successfully employed this DiT [26] architecture, and the yet-to-be-open-sourced SD3 [1] also claims to use the DiT architecture. The latest and most powerful open-source text-to-image model, Flux [22], boasts 12 billion parameters and claims to have achieved tremendous success with the same architecture.
Single-Stage vs. Dual-Stage: In this paper, the SD1.5
we used adopts a single-stage diffusion architecture, where
our model directly starts denoising from the initial noise.
Meanwhile, SDXL [27] experimented with a dual-stage diffusion architecture, in which one diffusion model generates the initial image and a second one further refines and denoises it. This architecture achieved better results; however, loading two diffusion models consumes a significant amount of computational resources.
Evaluation Metrics: Assessing the quality of images gen-
erated by generative models is challenging. Although there
are some metrics available, they still struggle to reflect hu-
man subjective assessment of image quality. Currently, the
Fréchet Inception Distance (FID) is commonly used as an
evaluation metric, which assesses the quality of generated
images by comparing the feature distributions of generated
images to real images. It has evaluative significance within
a certain range, but a more comprehensive set of metrics is
still needed to assess the capabilities of generative models.
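Concretely, FID fits a Gaussian to the Inception features of each image set and measures the Fréchet distance between the two Gaussians; a minimal sketch of the final distance computation (given precomputed feature means and covariances) is shown below.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, sigma1, mu2, sigma2):
    """Fréchet Inception Distance between two Gaussians fitted to Inception
    features: FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):        # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2 * covmean)
```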
Generation Details: Although the images output by dif-
fusion models are increasingly difficult for humans to dis-
tinguish as real or fake after continuous denoising, they are
prone to producing noise, and there can be obvious incoher-
ent colors in certain areas. How to optimize this issue is a
significant challenge.
References
[1] Stability AI. Stable diffusion 3. https://stability.ai/news/stable-diffusion-3. Accessed: 2024-08-16. 7
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou.
Wasserstein generative adversarial networks. In Proceedings
of the 34th International Conference on Machine Learning,
pages 214–223. PMLR, 2017. 2
[3] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended
diffusion for text-driven editing of natural images. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pages 18208–18218, 2022.
5
[4] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. In-
structpix2pix: Learning to follow image editing instructions.
In CVPR, 2023. 5
[5] Rewon Child. Very deep {vae}s generalize autoregressive
models and can outperform them on images. In International
Conference on Learning Representations, 2021. 1
[6] Prafulla Dhariwal and Alexander Nichol. Diffusion models
beat gans on image synthesis. In Advances in Neural Infor-
mation Processing Systems, pages 8780–8794. Curran Asso-
ciates, Inc., 2021. 2
[7] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin,
Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-
based text-to-image generation with human priors. In Com-
puter Vision–ECCV 2022: 17th European Conference, Tel
Aviv, Israel, October 23–27, 2022, Proceedings,Part XV,
2022. 5
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. In Advances in
Neural Information Processing Systems. Curran Associates,
Inc., 2014. 2
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. In Advances in
Neural Information Processing Systems. Curran Associates,
Inc., 2014. 1
[10] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent
Dumoulin, and Aaron C Courville. Improved training of
wasserstein gans. In Advances in Neural Information Pro-
cessing Systems. Curran Associates, Inc., 2017. 2
[11] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman,
Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image
editing with cross attention control. 2022. 4
[12] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimen-
sionality of data with neural networks. Science, 313(5786):
504–507, 2006. 1
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif-
fusion probabilistic models. In Advances in Neural Informa-
tion Processing Systems 33: Annual Conference on Neural
Information Processing Systems 2020, NeurIPS 2020, De-
cember 6-12, 2020, virtual, 2020. 2
[14] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet,
Mohammad Norouzi, and Tim Salimans. Cascaded diffusion
models for high fidelity image generation. J. Mach. Learn.
Res., 23(1), 2022. 3
[15] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-
Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.
Lora: Low-rank adaptation of large language models. In
The Tenth International Conference on Learning Represen-
tations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
4
[16] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen.
Progressive growing of gans for improved quality, stability,
and variation. In 6th International Conference on Learning
Representations, ICLR 2018, Vancouver, BC, Canada, April
30 - May 3, 2018, Conference Track Proceedings, 2018. 2
[17] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Hui-
wen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani.
Imagic: Text-based real image editing with diffusion mod-
els. In 2023 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 6007–6017, 2023. 5
[18] Gwanghyun Kim, Taesung Kwon, and Jong-Chul Ye. Diffu-
sionclip: Text-guided diffusion models for robust image ma-
nipulation. 2022 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pages 2416–2425, 2021. 5
[19] Diederik P. Kingma and Max Welling. Auto-encoding vari-
ational bayes. In ICLR, 2014. 1
[20] Zhifeng Kong and Wei Ping. On fast sampling of diffusion
probabilistic models. In ICML Workshop on Invertible Neu-
ral Networks, Normalizing Flows, and Explicit Likelihood
Models, 2021. 3
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
Imagenet classification with deep convolutional neural net-
works. In Advances in Neural Information Processing Sys-
tems. Curran Associates, Inc., 2012. 1
[22] Black Forest Labs. Black forest labs official website.
https://blackforestlabs.ai/. Accessed: 2024-
08-16. 7
[23] Sungbin Lim, EUN BI YOON, Taehyun Byun, Taewon
Kang, Seungwoo Kim, Kyungjae Lee, and Sungjoon Choi.
Score-based generative modeling through stochastic evolu-
tion equations in hilbert spaces. In Advances in Neural In-
formation Processing Systems, pages 37799–37812. Curran
Associates, Inc., 2023. 2
[24] Lars Mescheder. On the convergence properties of gan train-
ing. 2018. 2
[25] OpenAI. Sora. https://openai.com/index/sora/. 7
[26] William Peebles and Saining Xie. Scalable diffusion mod-
els with transformers. In Proceedings of the IEEE/CVF In-
ternational Conference on Computer Vision (ICCV), pages
4195–4205, 2023. 7
[27] Dustin Podell, Zion English, Kyle Lacey, Andreas
Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and
Robin Rombach. SDXL: Improving latent diffusion models
for high-resolution image synthesis. In The Twelfth Interna-
tional Conference on Learning Representations, 2024. 7
[28] Alec Radford, Luke Metz, and Soumith Chintala. Unsuper-
vised representation learning with deep convolutional gener-
ative adversarial networks. In 4th International Conference
on Learning Representations, ICLR 2016, San Juan, Puerto
Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
2
[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen
Krueger, and Ilya Sutskever. Learning transferable visual
models from natural language supervision. In Proceedings
of the 38th International Conference on Machine Learning,
ICML 2021, 18-24 July 2021, Virtual Event, 2021. 4, 5
[30] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Gen-
erating diverse high-fidelity images with vq-vae-2. In Ad-
vances in Neural Information Processing Systems. Curran
Associates, Inc., 2019. 3
[31] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Patrick Esser, and Björn Ommer. High-resolution image
synthesis with latent diffusion models. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pages 10684–10695, 2022. 3, 4
[32] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Patrick Esser, and Björn Ommer. High-resolution image syn-
thesis with latent diffusion models, 2022. 1
[33] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-
net: Convolutional networks for biomedical image segmen-
tation. In Medical image computing and computer-assisted
intervention–MICCAI 2015: 18th international conference,
Munich, Germany, October 5-9, 2015, proceedings, part III
18, pages 234–241. Springer, 2015. 4
[34] Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise
estimation for generative diffusion models. 2021. 3
[35] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan,
and Surya Ganguli. Deep unsupervised learning using
nonequilibrium thermodynamics. In Proceedings of the
32nd International Conference on Machine Learning, pages
2256–2265, Lille, France, 2015. PMLR. 2
[36] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning
structured output representation using deep conditional gen-
erative models. In Advances in Neural Information Process-
ing Systems, 2015. 1
[37] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois-
ing diffusion implicit models. In International Conference
on Learning Representations, 2021. 3
[38] Yang Song and Stefano Ermon. Generative modeling by es-
timating gradients of the data distribution. In Advances in
Neural Information Processing Systems. Curran Associates,
Inc., 2019. 2
[39] Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng,
and Xuansong Xie. Anytext: Multilingual visual text gener-
ation and editing. In The Twelfth International Conference
on Learning Representations, 2024. 5
[40] Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical
variational autoencoder. In Advances in Neural Information
Processing Systems, pages 19667–19679. Curran Associates,
Inc., 2020. 1
[41] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based
generative modeling in latent space. In Advances in Neural
Information Processing Systems, pages 11287–11302. Cur-
ran Associates, Inc., 2021. 3
[42] Aäron van den Oord, Oriol Vinyals, and Koray
Kavukcuoglu. Neural discrete representation learning.
In Advances in Neural Information Processing Systems
30: Annual Conference on Neural Information Processing
Systems 2017, December 4-9, 2017, Long Beach, CA, USA,
2017. 3
[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In Advances in Neu-
ral Information Processing Systems 30: Annual Conference
on Neural Information Processing Systems 2017, December
4-9, 2017, Long Beach, CA, USA, 2017. 4
[44] Zhengbo Wang and Jian Liang. Lora-pro: Are low-
rank adapters properly optimized? arXiv preprint
arXiv:2407.18242, 2024. 4
[45] Yuchen Zeng and Kangwook Lee. The expressive power of
low-rank adaptation. In The Twelfth International Confer-
ence on Learning Representations, 2024. 4
[46] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding
conditional control to text-to-image diffusion models. In
2023 IEEE/CVF International Conference on Computer Vi-
sion (ICCV), pages 3813–3824, 2023. 5