Of Reduced Latency and Retaining Information: Benefits of Token Merging

By AiT Analyst On Feb 14, 2023

In a revolutionary move for the research community, Meta AI, the artificial intelligence lab of Meta Platforms, unveiled new research that reduces the latency of existing Vision Transformer (ViT) models without any additional training.

The approach, called Token Merging (ToMe), combines similar tokens to reduce computation while keeping the information they carry intact. Using a lightweight algorithm, ToMe can merge tokens in each layer without adding significant overhead.

ToMe has been evaluated on major datasets across several domains, including ImageNet 1K (images), K400 (videos), and AudioSet-2M (audio). The evaluation showed an increase in inference throughput of almost 3 times, with minimal accuracy loss. To help the research community build upon these advancements, Meta AI released the ToMe code and benchmarks.

Token Merging can cut inference time in half & we expect it to unlock more use of large-scale ViT models in real-world applications. To enable the research community to build upon these advancements we’ve released the code and benchmarks here ⬇️https://t.co/2fYpwD7WqM

— Meta AI (@MetaAI) February 13, 2023

Token Merging – How it functions

To begin with, a ViT converts image patches into “tokens” and then applies an attention mechanism in each layer, which lets every token gather information from the other tokens in proportion to their similarity.
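As a rough illustration of "gathering information in proportion to similarity", the sketch below computes attention weights for one token against all tokens using a softmax over scaled dot products. It is a simplification under stated assumptions: a real ViT first applies learned query, key, and value projections, which are omitted here, and the function name and toy vectors are hypothetical.

```python
import math

def attention_weights(query, keys):
    """Softmax over scaled dot products between one query and every key.

    Assumption: the raw token vectors stand in for the learned
    query/key projections that a real ViT layer would apply.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three toy 2-dimensional tokens: the first two are similar, the third is not.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights = attention_weights(tokens[0], tokens)
```

The weights sum to 1, and the first token attends most strongly to itself and to its near-duplicate, which is the redundancy ToMe sets out to exploit.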

To speed up ViT models while preserving their accuracy, ToMe builds on two observations:

The computation time and memory use of a transformer depend on the number of tokens it processes.
These tokens are often redundant, so ToMe merges redundant tokens based on their similarity, reducing the number of tokens while retaining information.
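The first observation can be made concrete with a little arithmetic. The sketch below counts tokens for a patch-based image model and the number of pairwise attention scores per head in one layer. The function names are hypothetical, and the count covers only the attention score matrix, which grows quadratically with the token count; other parts of a transformer block grow roughly linearly.

```python
def num_tokens(image_size, patch_size, class_token=True):
    """Tokens for a square image split into square, non-overlapping patches."""
    per_side = image_size // patch_size
    return per_side * per_side + (1 if class_token else 0)

def attention_pair_count(tokens):
    """Entries in one attention score matrix: every token scores every token."""
    return tokens * tokens

full = num_tokens(224, 16)           # 14 x 14 patches plus one class token
half = full // 2                     # hypothetical: half the tokens merged away
ratio = attention_pair_count(full) / attention_pair_count(half)
```

Halving the token count cuts the attention score matrix to roughly a quarter of its size, which is why removing redundant tokens pays off in both time and memory.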

This is in stark contrast to prior work on token pruning, which deletes tokens outright and can therefore remove important information.
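The difference between pruning and merging can be shown on a toy example. This is a minimal sketch, assuming a plain element-wise average as the merge rule; the helper names are hypothetical, and the averaging rule is an illustrative assumption rather than a description of the released ToMe code.

```python
def prune(tokens, index):
    """Drop one token outright; its values no longer influence the output."""
    return [t for i, t in enumerate(tokens) if i != index]

def merge(tokens, i, j):
    """Replace tokens i and j with their element-wise average.

    Assumption: a simple unweighted mean, chosen for illustration only.
    The merged token is appended at the end, so order is not preserved.
    """
    merged = [(a + b) / 2 for a, b in zip(tokens[i], tokens[j])]
    kept = [t for k, t in enumerate(tokens) if k not in (i, j)]
    return kept + [merged]

tokens = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
pruned = prune(tokens, 1)      # token 1 is gone without a trace
merged = merge(tokens, 0, 1)   # token 1 still contributes to the result
```

Both calls reduce three tokens to two, but only the merged result still reflects the values of the token that was removed.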

The best part about this approach is that it is simple and fits easily into existing transformers. At its core, ToMe uses a fast, lightweight matching function to find and consolidate similar tokens. This function can be inserted into the middle of any standard transformer block without much overhead.
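One way such a matching function could look is sketched below: tokens are split into two alternating sets, each token in the first set is matched to its most similar token in the second, and the r best matches are merged by averaging. This is a simplified illustration under stated assumptions, not Meta AI's released implementation; the names `merge_similar` and `r`, the cosine similarity on raw token vectors, and the plain averaging are all choices made here for clarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def merge_similar(tokens, r):
    """Merge up to r tokens into their most similar partner (hypothetical helper)."""
    # Split tokens alternately so that matches are only made across the two sets.
    src, dst = tokens[0::2], tokens[1::2]
    if not dst:
        return list(tokens)
    # For every source token, find its single best match in the destination set.
    edges = []
    for i, s in enumerate(src):
        sims = [cosine(s, d) for d in dst]
        j = max(range(len(dst)), key=sims.__getitem__)
        edges.append((sims[j], i, j))
    # Keep only the r most similar pairs.
    edges.sort(reverse=True)
    chosen = {i: j for _, i, j in edges[:r]}
    # Average each destination token with the source tokens merged into it.
    sums = [list(d) for d in dst]
    counts = [1] * len(dst)
    for i, j in chosen.items():
        sums[j] = [a + b for a, b in zip(sums[j], src[i])]
        counts[j] += 1
    kept_src = [s for i, s in enumerate(src) if i not in chosen]
    merged_dst = [[v / c for v in total] for total, c in zip(sums, counts)]
    # Output order is unmerged source tokens first, then destination tokens.
    return kept_src + merged_dst

tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
result = merge_similar(tokens, r=1)
```

Here the two identical tokens are merged into one, while the two dissimilar tokens pass through unchanged, so four tokens become three. Each token is compared only with tokens in the other set rather than with every other token, which keeps the matching step cheap.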


During inference, ToMe gradually reduces the number of tokens over the course of the network, cutting down the time taken. Since ToMe does not otherwise alter the model, it can be combined with existing tools for speeding up transformers, such as xFormers or half precision.
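To see how the token count shrinks across a network, the sketch below assumes a fixed number of tokens, r, is merged away in every layer. The function name, the fixed-r schedule, and the example values are assumptions for illustration; the source does not state the schedule used in the experiments.

```python
def token_schedule(num_tokens, r, num_layers):
    """Token count entering the network and leaving each layer.

    Assumption: every layer removes r tokens by merging, and at least
    one token always remains.
    """
    counts = [num_tokens]
    for _ in range(num_layers):
        counts.append(max(counts[-1] - r, 1))
    return counts

# Hypothetical setting: 197 input tokens, 12 layers, 8 tokens merged per layer.
schedule = token_schedule(197, r=8, num_layers=12)
```

Because later layers process fewer tokens than earlier ones, the savings accumulate with depth, even though each individual layer removes only a small number of tokens.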

Token Merging – the Scope

Generality

When applied to ViT models trained on images, video, and audio, ToMe doubled inference speed across all modalities with a negligible impact on accuracy. In several cases, no additional training was required.

Scalability

ToMe becomes more effective with larger models and larger inputs. When applied without additional training to ViT models of different parameter counts and image sizes, the accuracy drop consistently shrank as model and image size increased. This characteristic is especially helpful when deploying large-scale transformer models. ToMe also speeds up Stable Diffusion (a text-to-image model) by 1.7x and brings down memory usage by 63 percent without losing any details.

Versatility

ToMe can be applied easily to any network that uses standard transformer blocks. Although the paper focuses mostly on ViT models, ToMe can also accelerate popular architectures such as Stable Diffusion and reduce their memory usage, with only a slight loss of visual quality.

Final Thoughts

Since their introduction, ViT models have brought tremendous advances in computer vision, with gains in scalability and in generalization across different domains, and have shown the potential of powerful unsupervised learning. But hardware limitations and time constraints often make it impractical to run massive models. As a result, convolutional models are still in use despite their lower accuracy. With ToMe, the inference time of ViT models can be reduced drastically, supporting the use of large-scale ViT models in real-world applications.

[To share your insights with us, please write to sghosh@martechseries.com].


To repurpose or use any of the content or material on this and our sister sites, explicit written permission needs to be sought.

Copyright © 2025 AiThority. All Rights Reserved. Privacy Policy
