Machine Learning Learning Projects
by Wuyang
Continuing from the last project "build cartoon avatar diffusion model from scratch", I crafted a new model for generating cartoon avatars using the Huggingface Diffusers library.
For comprehensive details about this model, refer to the linked Jupyter notebook.
The project setup closely mirrors the previous one: the dataset, condition component, noise schedule, and denoising process all remain identical. The differences lie in two key aspects:
As shown in the following code block, the UNet is an instance of UNet2DConditionModel from the Huggingface diffusers library. Both the down and up blocks are CrossAttn blocks, and the model notably uses 32 attention heads.
This UNet is clearly more complex in structure than the one we built from scratch. Does increased model complexity result in visually superior generated images?
```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=(64, 64),
    in_channels=3,
    out_channels=3,
    down_block_types=("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D"),
    block_out_channels=(128, 256, 512),
    cross_attention_dim=1280,
    layers_per_block=2,
    only_cross_attention=True,
    attention_head_dim=32,
)
```
```
trainable model parameters: 188511363
all model parameters: 188511363
percentage of trainable model parameters: 100.00%
model weight size: 719.11 MB
adam optimizer size: 1438.23 MB
gradients size: 719.11 MB
```
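These statistics can be reproduced with a small helper along the following lines. This is a minimal sketch: the helper name `print_model_stats` is mine, and it assumes fp32 weights (4 bytes per parameter) with Adam keeping two extra moment buffers per parameter.

```python
def print_model_stats(model):
    # Count parameters that require gradients vs. all parameters.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable model parameters: {trainable}")
    print(f"all model parameters: {total}")
    print(f"percentage of trainable model parameters: {100 * trainable / total:.2f}%")

    # Rough memory estimates assuming fp32 (4 bytes per value): weights and
    # gradients each need one copy, Adam keeps two moment buffers per parameter.
    weight_mb = total * 4 / 1024**2
    print(f"model weight size: {weight_mb:.2f} MB")
    print(f"adam optimizer size: {2 * weight_mb:.2f} MB")
    print(f"gradients size: {weight_mb:.2f} MB")

print_model_stats(unet)
```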
Below are random samples of generated cartoon avatars after training for 6 epochs.
generated images at epoch 6
How do they compare to the images generated in the project "build cartoon avatar diffusion model from scratch"?
generated images of the model built from scratch at epoch 18
I also trained UNets with other settings; their main differences from the model outlined above are shown below.
Observing the generated images gives us a clear indication: with 8 attention heads, the model struggles to accurately depict color and hairstyle.
```python
unet_8_heads = UNet2DConditionModel(
    sample_size=(64, 64),
    in_channels=3,
    out_channels=3,
    down_block_types=("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D"),
    block_out_channels=(128, 256, 512),
    cross_attention_dim=1280,
    layers_per_block=2,
    attention_head_dim=8,  # attention_head_dim default value is 8
)
```
Increasing the number of heads to 16 improves color representation, but the portrayal of hairstyles remains lacking.
```python
unet_16_heads = UNet2DConditionModel(
    sample_size=(64, 64),
    in_channels=3,
    out_channels=3,
    down_block_types=("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D"),
    block_out_channels=(128, 256, 512),
    cross_attention_dim=1280,
    layers_per_block=2,
    attention_head_dim=16,
)
```
```
trainable model parameters: 180483203
all model parameters: 180483203
percentage of trainable model parameters: 100.00%
model weight size: 688.49 MB
adam optimizer size: 1376.98 MB
gradients size: 688.49 MB
```
We used a fixed guidance scale of 2.0 in the last blog post and didn't experiment with this parameter. According to *Exploring stable diffusion guidance*:
> The guidance scale parameter controls how strongly the generation process is guided towards the prompt. A value of 0 is equivalent to no guidance, and the image is completely unrelated to the prompt (it can, however, still be an interesting image). Increasing the guidance scale increases how closely the image resembles the prompt, but reduces consistency and tends to lead to an oversaturated and oversharpened image.
Can we witness similar influences of guidance on the cartoon avatar model?
Using identical condition embeddings, I ramped up the guidance from 0.0 to 9.0 and produced 16 samples for each guidance value.
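Sampling with a given guidance scale combines a conditional and an unconditional noise prediction at each denoising step. The sketch below illustrates that step, assuming the `unet` defined earlier; the function name and the unconditional embedding `uncond_emb` are illustrative, not the project's exact code.

```python
import torch

@torch.no_grad()
def guided_noise(unet, noisy_images, t, cond_emb, uncond_emb, guidance_scale):
    # Predict noise twice: once with the condition embedding and once with
    # the unconditional ("empty") embedding.
    noise_cond = unet(noisy_images, t, encoder_hidden_states=cond_emb).sample
    noise_uncond = unet(noisy_images, t, encoder_hidden_states=uncond_emb).sample
    # Classifier-free guidance: 0.0 ignores the condition entirely,
    # larger values push the prediction further towards the condition.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# The guidance values swept in the table below.
guidance_scales = [0.0, 1.0, 2.0, 3.0, 5.0, 7.0, 9.0]
```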
The table below illustrates that at a guidance value of 0, images appear random. By the time the guidance reaches 1.0, the images already show discernible patterns, and there isn't a significant qualitative distinction among images generated with guidance values ranging from 2.0 to 9.0.
| guidance scale | generated images given the same conditions |
|---|---|
| 0.0 | ![]() |
| 1.0 | ![]() |
| 2.0 | ![]() |
| 3.0 | ![]() |
| 5.0 | ![]() |
| 7.0 | ![]() |
| 9.0 | ![]() |
| epoch | random samples | observation |
|---|---|---|
| 0 | ![]() | learn outline and features of faces |
| 1 | ![]() | start picking up colors |
| 2 | ![]() | continue learning hairstyle and color |
| 3 | ![]() | continue learning hairstyle and color |
After epoch 3, the loss doesn't decrease dramatically and gradually saturates by epoch 6.
The two models have distinct complexities and settings, which makes a direct comparison unfair. Nevertheless, for the sake of completeness in this blog series, we compare them, acknowledging that it's not an apples-to-apples scenario.
| | model built from scratch | model built with diffusers |
|---|---|---|
| number of parameters | 76M | 188M |
| denoising model | Tau + UNet | Tau + UNet |
| UNet basic blocks | resnet blocks, multi-head cross attention | resnet blocks, transformer block (cross attention only) |
| number of attention heads | 4 | 32 |
| training time per epoch | ~10 mins | ~120 mins |
| optimizer | Adam | AdamW |
| learning rate schedule | StepLR | cosine_schedule_with_warmup |
| strength | excels in accurately depicting shapes, particularly a diverse range of hairstyles | produces images with pristine backgrounds and vibrant colors |
| weakness | tends to generate images with numerous noisy pixels and occasional color corruption | struggles in accurately portraying various styles of hairstyles |
| example | ![]() | ![]() |
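For reference, the optimizer and learning-rate schedule listed in the table for the diffusers model can be set up as in the following sketch, assuming the `unet` defined earlier; the learning rate, warmup steps, and steps-per-epoch values are placeholders, not the values actually used.

```python
import torch
from diffusers.optimization import get_cosine_schedule_with_warmup

num_epochs = 6          # the diffusers model was trained for 6 epochs
steps_per_epoch = 1000  # placeholder: depends on dataset size and batch size

# AdamW with a cosine schedule and linear warmup, as listed in the table.
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)  # lr is a placeholder
lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=500,  # placeholder
    num_training_steps=steps_per_epoch * num_epochs,
)
```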