Google’s Snap & Transform: Unlocking Surprising Image Magic with MobileDiffusion
What is MobileDiffusion?
In “MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices,” Google researchers present a new method for fast on-device text-to-image generation. MobileDiffusion is an efficient latent diffusion model designed specifically for mobile devices. For one-step sampling at inference time, it adopts DiffusionGAN, which uses a GAN to model the denoising step while fine-tuning a pre-trained diffusion model. In tests on premium Android and iOS smartphones, MobileDiffusion produced a high-quality 512×512 image in half a second. Its model size of 520 million parameters is comparatively modest, making it particularly well-suited to mobile deployment. Improving the inference efficiency of text-to-image diffusion models has attracted a lot of recent activity. Prior research has focused primarily on minimizing the number of function evaluations (NFEs): effective distillation techniques and powerful numerical solvers such as DPM-Solver have cut the number of required sampling steps from many hundreds to single digits, and some newer methods, such as Adversarial Diffusion Distillation and DiffusionGAN, need only a single step.
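To illustrate the NFE-reduction trend described above, here is a minimal sketch using the open-source Stable Diffusion pipeline from Hugging Face diffusers as a stand-in, since MobileDiffusion’s weights are not publicly released; the model ID and step count below are illustrative assumptions.

```python
# Few-step sampling with a fast numerical solver (Stable Diffusion used as a
# stand-in for MobileDiffusion, whose weights are not publicly available).
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap in DPM-Solver++, a fast solver that needs far fewer denoising steps
# than the original ~50-step schedule.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# A handful of steps instead of hundreds; one-step methods such as
# DiffusionGAN push this all the way down to a single evaluation.
image = pipe("a photo of an astronaut riding a horse",
             num_inference_steps=8).images[0]
image.save("astronaut.png")
```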
The complexity of the model’s architecture, however, makes even a few evaluation steps sluggish on mobile devices. The architectural efficiency of text-to-image diffusion models has received far less attention so far. Some earlier works, such as SnapFusion, touch on the topic briefly by deduplicating neural network blocks, but they stop short of offering a complete guide to building highly efficient architectures because they do not exhaustively evaluate every component of the model. MobileDiffusion’s design takes latent diffusion models as its inspiration. It is made up of three parts: a text encoder, a diffusion UNet, and an image decoder. For the text encoder, the researchers use CLIP ViT-L/14, a mobile-friendly compact model with 125M parameters. The diffusion UNet and image decoder are examined next.
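For orientation, the following sketch shows how those three components fit together at inference time; all names, shapes, and the one-step loop are illustrative assumptions, not the released implementation.

```python
import torch

@torch.no_grad()
def generate(text_encoder, unet, decoder, token_ids,
             latent_shape=(1, 8, 64, 64), num_steps=1):
    """Three-stage latent diffusion pipeline: encode text, denoise, decode."""
    text_emb = text_encoder(token_ids)        # prompt conditioning (CLIP-style encoder)
    latents = torch.randn(latent_shape)       # start from Gaussian noise in latent space
    for t in reversed(range(num_steps)):      # num_steps=1 for DiffusionGAN-style sampling
        latents = unet(latents, t, text_emb)  # UNet predicts a cleaner latent
    return decoder(latents)                   # VAE decoder: 64x64x8 latents -> 512x512 RGB
```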
UViT architecture
In traditional text-to-image diffusion models, a transformer block comprises three layers: a self-attention layer that captures long-range dependencies among visual features, a cross-attention layer that handles text conditioning and feature interaction, and a feed-forward layer that post-processes the attention outputs. These transformer blocks do the heavy lifting for text understanding, but they are also a serious efficiency problem: the computational cost of the attention operation is quadratic in the sequence length. MobileDiffusion’s design follows the UViT architecture, which places additional transformer blocks at the UNet’s bottleneck. This choice was made because attention is far less resource-intensive at the bottleneck, where the reduced spatial resolution shortens the token sequence.
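As a concrete illustration of such a block, here is a hedged PyTorch sketch with self-attention over visual tokens, cross-attention on text embeddings, and a feed-forward layer; the dimensions are illustrative guesses, not MobileDiffusion’s actual configuration.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim=640, heads=8, text_dim=768):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, text_emb):
        # Self-attention cost grows quadratically with the number of visual
        # tokens, which is why UViT concentrates these blocks at the UNet
        # bottleneck, where the sequence is shortest.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_emb, text_emb, need_weights=False)[0]
        return x + self.ffn(self.norm3(x))

# Usage: 256 visual tokens of width 640, conditioned on 77 CLIP text tokens.
out = TransformerBlock()(torch.randn(1, 256, 640), torch.randn(1, 77, 768))
```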
Text-to-image diffusion models can produce very good images when prompted with text. The most popular models, such as Stable Diffusion, DALL·E, and Imagen, contain billions of parameters and are therefore costly to run, requiring powerful workstations or servers. Although inference solutions for Android and iOS have improved over the last year, rapid (sub-second) text-to-image generation on mobile devices has remained out of reach.
Features
The optimizations extended to the image decoder as well as the UNet. The team trained a variational autoencoder (VAE) that encodes an RGB image into an 8-channel latent variable whose spatial dimensions are reduced by a factor of 8; decoding a latent back to an image upsamples it by the same factor of 8. By trimming the width and depth of the original design, they arrived at a decoder architecture that is both lightweight and efficient. The result is a small decoder that significantly improves performance, with a latency improvement of over 50% and higher image quality; the paper has more details.
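A minimal sketch of such a lightweight decoder follows: three 2× upsampling stages take the 8-channel, 1/8-resolution latent back to a full-resolution RGB image. The channel widths and layer counts are illustrative assumptions, not the paper’s exact pruned architecture.

```python
import torch
import torch.nn as nn

class LightweightDecoder(nn.Module):
    def __init__(self, latent_channels=8, base_width=64):
        super().__init__()
        widths = [base_width * 4, base_width * 2, base_width]  # narrower than a standard VAE decoder
        layers = [nn.Conv2d(latent_channels, widths[0], 3, padding=1)]
        for w_in, w_out in zip(widths, widths[1:] + [widths[-1]]):
            layers += [
                nn.Upsample(scale_factor=2, mode="nearest"),   # three 2x stages => 8x upsampling
                nn.Conv2d(w_in, w_out, 3, padding=1),
                nn.SiLU(),
            ]
        layers.append(nn.Conv2d(widths[-1], 3, 3, padding=1))  # project to RGB
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

# 64x64 latents with 8 channels decode to a 512x512 RGB image.
img = LightweightDecoder()(torch.randn(1, 8, 64, 64))
assert img.shape == (1, 3, 512, 512)
```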
To address the training challenges of one-step sampling, the researchers initialized both the generator and the discriminator with weights from the pre-trained diffusion UNet. They hypothesized that the diffusion model’s internal features encode valuable information about the intricate interplay between visual and textual inputs, so starting from those weights greatly simplifies and stabilizes training.
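A hedged sketch of that initialization idea is below; `build_gan_from_diffusion`, the discriminator head, and the feature dimension are hypothetical stand-ins for the paper’s actual modules.

```python
import copy
import torch.nn as nn

def build_gan_from_diffusion(pretrained_unet: nn.Module, feat_dim: int = 1280):
    """Initialize a one-step generator and a discriminator from one pre-trained UNet."""
    # Generator: a full copy of the UNet, to be fine-tuned for single-step denoising.
    generator = copy.deepcopy(pretrained_unet)

    # Discriminator: another copy reused as a feature extractor, topped with a
    # small real/fake head. Sharing the pre-trained weights gives both networks
    # a strong prior on the interaction between visual and textual features.
    disc_backbone = copy.deepcopy(pretrained_unet)
    disc_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                              nn.Linear(feat_dim, 1))
    return generator, disc_backbone, disc_head
```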