Machine Learning Learning Projects
by Wuyang
Continuing from the last project "build cartoon avatar diffusion model from scratch", I crafted a new model for generating cartoon avatars using the Huggingface Diffusers library.
For comprehensive details about this model, refer to the linked Jupyter notebook.
The project setup closely mirrors the previous one: the dataset, condition component, noise schedule, and denoising process all remain identical. The differences lie in two key aspects:
As shown in the following code block, the UNet is an instance of UNet2DConditionModel from the Huggingface diffusers library. Both the down and up blocks are CrossAttn blocks, and the model notably uses 32 attention heads.
This UNet is clearly more complex in structure than the one we built from scratch. Does increased model complexity result in visually superior generated images?
```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=(64, 64),
    in_channels=3,
    out_channels=3,
    down_block_types=("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D"),
    block_out_channels=(128, 256, 512),
    cross_attention_dim=1280,
    layers_per_block=2,
    only_cross_attention=True,
    attention_head_dim=32,
)
```
```
trainable model parameters: 188511363
all model parameters: 188511363
percentage of trainable model parameters: 100.00%
model weight size: 719.11 MB
adam optimizer size: 1438.23 MB
gradients size: 719.11 MB
```
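These statistics can be reproduced with a small helper along the following lines. This is a minimal sketch: the helper name `print_model_stats` is mine, and it assumes fp32 weights (4 bytes per parameter) with Adam keeping two extra moment buffers per parameter.

```python
def print_model_stats(model):
    # Count parameters that require gradients vs. all parameters.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable model parameters: {trainable}")
    print(f"all model parameters: {total}")
    print(f"percentage of trainable model parameters: {100 * trainable / total:.2f}%")

    # Rough memory estimates assuming fp32 (4 bytes per value): weights and
    # gradients each need one copy, Adam keeps two moment buffers per parameter.
    weight_mb = total * 4 / 1024**2
    print(f"model weight size: {weight_mb:.2f} MB")
    print(f"adam optimizer size: {2 * weight_mb:.2f} MB")
    print(f"gradients size: {weight_mb:.2f} MB")

print_model_stats(unet)
```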
Below are random samples of generated cartoon avatars after training for 6 epochs.
generated images at epoch 6
How do they compare to the images generated in the project "build cartoon avatar diffusion model from scratch"?
generated images of the model built from scratch at epoch 18
I also trained UNets with other settings; their main differences from the model outlined above are shown below.
Observing the generated images gives us a clear indication: with 8 attention heads, the model struggles to accurately depict color and hairstyle.
```python
unet_8_heads = UNet2DConditionModel(
    sample_size=(64, 64),
    in_channels=3,
    out_channels=3,
    down_block_types=("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D"),
    block_out_channels=(128, 256, 512),
    cross_attention_dim=1280,
    layers_per_block=2,
    attention_head_dim=8,  # attention_head_dim default value is 8
)
```
Increasing the number of heads to 16 improves color representation, but the portrayal of hairstyles remains lacking.
```python
unet_16_heads = UNet2DConditionModel(
    sample_size=(64, 64),
    in_channels=3,
    out_channels=3,
    down_block_types=("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D"),
    block_out_channels=(128, 256, 512),
    cross_attention_dim=1280,
    layers_per_block=2,
    attention_head_dim=16,
)
```
```
trainable model parameters: 180483203
all model parameters: 180483203
percentage of trainable model parameters: 100.00%
model weight size: 688.49 MB
adam optimizer size: 1376.98 MB
gradients size: 688.49 MB
```
We used a fixed guidance scale of 2.0 in the last blog post and didn't experiment with this parameter. According to *Exploring stable diffusion guidance*:
> The guidance scale parameter controls how strongly the generation process is guided towards the prompt. A value of 0 is equivalent to no guidance, and the image is completely unrelated to the prompt (it can, however, still be an interesting image). Increasing the guidance scale increases how closely the image resembles the prompt, but reduces consistency and tends to lead to an oversaturated and oversharpened image.
Can we witness similar influences of guidance on the cartoon avatar model?
Using identical condition embeddings, I ramped up the guidance from 0.0 to 9.0 and produced 16 samples for each guidance value.
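Sampling with a given guidance scale combines a conditional and an unconditional noise prediction at each denoising step. The sketch below illustrates that step, assuming the `unet` defined earlier; the function name and the unconditional embedding `uncond_emb` are illustrative, not the project's exact code.

```python
import torch

@torch.no_grad()
def guided_noise(unet, noisy_images, t, cond_emb, uncond_emb, guidance_scale):
    # Predict noise twice: once with the condition embedding and once with
    # the unconditional ("empty") embedding.
    noise_cond = unet(noisy_images, t, encoder_hidden_states=cond_emb).sample
    noise_uncond = unet(noisy_images, t, encoder_hidden_states=uncond_emb).sample
    # Classifier-free guidance: 0.0 ignores the condition entirely,
    # larger values push the prediction further towards the condition.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# The guidance values swept in the table below.
guidance_scales = [0.0, 1.0, 2.0, 3.0, 5.0, 7.0, 9.0]
```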
The table below illustrates that at a guidance value of 0, images appear random. By the time the guidance reaches 1.0, the images already show discernible patterns, and there isn't a significant qualitative distinction among images generated with guidance values ranging from 2.0 to 9.0.
| guidance scale | generated images given the same conditions |
|---|---|
| 0.0 | ![]() |
| 1.0 | ![]() |
| 2.0 | ![]() |
| 3.0 | ![]() |
| 5.0 | ![]() |
| 7.0 | ![]() |
| 9.0 | ![]() |
| epoch | random samples | observation |
|---|---|---|
| 0 | ![]() | learn outline and features of faces |
| 1 | ![]() | start picking up colors |
| 2 | ![]() | continue learning hairstyle and color |
| 3 | ![]() | continue learning hairstyle and color |
After epoch 3, the loss doesn't decrease dramatically and gradually saturates by epoch 6.
The two models have distinct complexities and settings, which makes a direct comparison unfair. Nevertheless, for the sake of completeness in this blog series, we compare them, acknowledging that it's not an apples-to-apples scenario.
| | model built from scratch | model built with diffusers |
|---|---|---|
| number of parameters | 76M | 188M |
| denoising model | Tau + UNet | Tau + UNet |
| UNet basic blocks | resnet blocks, multi-head cross attention | resnet blocks, transformer block (cross attention only) |
| number of attention heads | 4 | 32 |
| training time per epoch | ~10 mins | ~120 mins |
| optimizer | Adam | AdamW |
| learning rate schedule | StepLR | cosine_schedule_with_warmup |
| strength | excels in accurately depicting shapes, particularly a diverse range of hairstyles | produces images with pristine backgrounds and vibrant colors |
| weakness | tends to generate images with numerous noisy pixels and occasional color corruption | struggles in accurately portraying various styles of hairstyles |
| example | ![]() | ![]() |
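For reference, the optimizer and learning-rate schedule listed in the table for the diffusers model can be set up as in the following sketch, assuming the `unet` defined earlier; the learning rate, warmup steps, and steps-per-epoch values are placeholders, not the values actually used.

```python
import torch
from diffusers.optimization import get_cosine_schedule_with_warmup

num_epochs = 6          # the diffusers model was trained for 6 epochs
steps_per_epoch = 1000  # placeholder: depends on dataset size and batch size

# AdamW with a cosine schedule and linear warmup, as listed in the table.
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)  # lr is a placeholder
lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=500,  # placeholder
    num_training_steps=steps_per_epoch * num_epochs,
)
```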