Of Reduced Latency and Retaining Information: Benefits of Token Merging
In a revolutionary move for the research community, Meta AI, the artificial intelligence lab of Meta Platforms, has unveiled new research that reduces the latency of existing Vision Transformer (ViT) models without any additional training.
The approach, referred to as Token Merging (ToMe), combines similar tokens to reduce computation while keeping the information intact. Thanks to a lightweight matching algorithm, ToMe can merge tokens in each layer without any detrimental overhead.
ToMe has been assessed on major datasets across different domains: ImageNet-1K (images), Kinetics-400 (videos), and AudioSet-2M (audio). The evaluation revealed an increase in inference throughput of nearly 3x with minimal accuracy loss. To help the research community build on these advancements, Meta AI has released the ToMe code and benchmarks.
Token Merging can cut inference time in half & we expect it to unlock more use of large-scale ViT models in real-world applications. To enable the research community to build upon these advancements we’ve released the code and benchmarks here ⬇️https://t.co/2fYpwD7WqM
— Meta AI (@MetaAI) February 13, 2023
Token Merging – How it functions
To begin with, ViT converts image patches into “tokens” and then applies an attention mechanism in each layer, which allows the tokens to gather information from one another in proportion to their similarity.
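To make this concrete, here is a minimal PyTorch sketch of patch tokenization and similarity-driven attention. This is illustrative only, not Meta's code: the untrained patch embedding and the single attention head without learned query/key/value projections are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

image = torch.randn(1, 3, 224, 224)        # (batch, channels, height, width)

# Split the image into 16x16 patches and project each one to a 768-dim token.
patch_embed = torch.nn.Conv2d(3, 768, kernel_size=16, stride=16)
tokens = patch_embed(image).flatten(2).transpose(1, 2)   # (1, 196, 768)

# Self-attention: the softmax over dot-product similarity means the most
# similar tokens exchange the most information with each other.
q = k = v = tokens                          # no learned projections (sketch)
attn = F.softmax(q @ k.transpose(-2, -1) / 768 ** 0.5, dim=-1)
mixed = attn @ v                            # (1, 196, 768)
```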
To speed up ViT while preserving its accuracy, ToMe builds on two observations:
- A transformer's computation speed and memory use depend on the number of tokens it processes.
- These tokens are often redundant, so ToMe merges redundant tokens based on their similarity, reducing the token count while retaining the information they carry.
This stands in stark contrast to prior work on token pruning, which deletes tokens outright and can discard important information. A simplified sketch of similarity-based merging follows below.
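The sketch below was written for this article rather than taken from the ToMe release: tokens are split into two sets, each token in one set finds its most similar partner in the other via cosine similarity, and the r strongest pairs are averaged together. The real bipartite soft matching in the paper additionally uses the attention keys as similarity features and tracks how many patches each merged token represents.

```python
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs. x: (batch, tokens, dim)."""
    # Alternately assign tokens to two sets, A and B, so merges stay spread out.
    a, b = x[:, ::2, :], x[:, 1::2, :]

    # Cosine similarity between every token in A and every token in B.
    scores = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-2, -1)

    # For each A-token, its best match in B; keep the r strongest edges.
    best_score, best_idx = scores.max(dim=-1)
    merge_src = best_score.argsort(dim=-1, descending=True)[..., :r]

    out = []
    for i in range(x.shape[0]):
        src = merge_src[i]                   # A-tokens that get merged away
        dst = best_idx[i, src]               # their partners in B
        b_i = b[i].clone()
        # Average each pair into the B-side token. (If two A-tokens pick the
        # same partner, the last write wins in this sketch; the real code
        # averages all of them.)
        b_i[dst] = (b_i[dst] + a[i, src]) / 2
        keep = torch.ones(a.shape[1], dtype=torch.bool)
        keep[src] = False
        out.append(torch.cat([a[i, keep], b_i], dim=0))   # r fewer tokens
    return torch.stack(out)

x = torch.randn(2, 196, 768)
print(merge_tokens(x, r=16).shape)           # torch.Size([2, 180, 768])
```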
The best part about this approach is its simplicity: it blends easily into existing transformers. At its core, ToMe uses a fast, lightweight matching function to consolidate similar tokens, and this function can be inserted into the middle of any standard transformer block without much overhead.
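As a hypothetical illustration of that placement, the block below reuses merge_tokens from the previous sketch and inserts it between the attention and MLP sublayers, which is where the ToMe paper puts the merging step:

```python
import torch
import torch.nn as nn

class ToMeBlock(nn.Module):
    """A standard pre-norm transformer block with token merging inserted."""
    def __init__(self, dim: int = 768, heads: int = 12, r: int = 16):
        super().__init__()
        self.r = r                           # tokens merged away per block
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]        # standard self-attention
        x = merge_tokens(x, self.r)          # merging step: r fewer tokens
        return x + self.mlp(self.norm2(x))   # MLP now runs on fewer tokens

block = ToMeBlock()
print(block(torch.randn(2, 196, 768)).shape)  # torch.Size([2, 180, 768])
```

Because each block removes r tokens, the savings compound over the depth of the network, which is how the overall speedup accumulates.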
During inference, ToMe reduces the number of tokens over the course of the network, cutting down the time taken. Since ToMe does not modify the model's weights, it can be used alongside existing tools that speed up transformers, such as xFormers or half precision.
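For reference, the released repository documents usage along these lines, patching a pretrained timm ViT and choosing r, the number of tokens merged per layer. The exact calls follow its README as published and may change, so consult the repo itself for the current API:

```python
import timm
import tome   # install from github.com/facebookresearch/ToMe

# Load a pretrained ViT from timm and patch it with ToMe -- no retraining.
model = timm.create_model("vit_base_patch16_224", pretrained=True)
tome.patch.timm(model)
model.r = 16   # number of tokens to merge in each transformer block

# Since the weights are untouched, orthogonal speed-ups still compose:
model = model.half().cuda()   # e.g. half precision on GPU
```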
Token Merging – The Scope
Generality
When applied to ViT models trained on images, video, and audio, ToMe doubled inference speed across all modalities while having a negligible impact on accuracy. In several cases, no additional training was required.
Scalability
ToMe becomes more efficient with larger models and larger inputs. When applied without additional training to ViT models of varying parameter counts and image sizes, the accuracy drop consistently decreased as model and image size increased. This characteristic is especially important for deploying large-scale transformer models. ToMe also speeds up Stable Diffusion (a text-to-image model) by 1.7x and brings memory usage down by 63 percent with negligible loss of detail.
Versatility
ToMe can be applied easily to any network that uses standard transformer blocks. Though the paper focuses mostly on ViT models, ToMe can also accelerate, and decrease the memory usage of, popular architectures such as Stable Diffusion with only a slight loss of visual quality.
Final Thoughts
From the beginning, Vision Transformers have driven tremendous advances in computer vision, scalability, and generalization across domains, while expanding the scope of powerful unsupervised learning. But hardware limitations and time constraints can often hinder running massive models, which is why convolutional models, despite their lower accuracy, remain in practice. With ToMe, the inference time of ViT models can be reduced drastically, supporting the use of large-scale ViT models in real-world applications.