[Interactive demo: example video-QA prompts, e.g., "Which animal hit the cat?", "How many main characters are there in the video?", "What animal saves the monkey?"]
The increasing demand to process long and high-resolution videos significantly burdens Large Vision-Language Models (LVLMs) due to the enormous number of visual tokens. Existing token reduction methods primarily focus on importance-based token pruning, which overlooks the redundancy caused by frame resemblance and repetitive visual elements.
We analyze the high similarity among vision tokens in LVLMs and reveal that the token similarity distribution condenses as layers deepen while maintaining ranking consistency. Leveraging these unique properties of similarity over importance, we introduce FrameFusion, a novel approach that combines similarity-based merging with importance-based pruning for better token reduction in LVLMs.
Experiments show that FrameFusion reduces vision tokens by 70%, achieving 3.4–4.4× LLM speedups and 1.6–1.9× end-to-end speedups, with an average performance impact of less than 3%.
The core concept of FrameFusion is to combine similarity-based merging with importance-based pruning. Unlike traditional methods that primarily employ importance-based token pruning, FrameFusion emphasizes similarity-based token merging, retaining only tokens that are both important and unique.
In the initial merging stage, FrameFusion merges vision tokens based on their similarity. It computes similarities between corresponding vision tokens of adjacent frames; tokens whose similarity exceeds a threshold are grouped with their analogous tokens from the previous frame, and each group is replaced by the element-wise average of its tokens.
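To make the merge step concrete, here is a minimal sketch (not the released FrameFusion code) that compares each frame's tokens only with the corresponding tokens of the previous frame and replaces each similar group with its element-wise running average. The `[num_frames, tokens_per_frame, hidden_dim]` tensor layout and the `sim_threshold` hyperparameter name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def merge_adjacent_frames(tokens: torch.Tensor, sim_threshold: float):
    """Merge vision tokens that closely resemble the token at the same
    position in the previous frame (illustrative sketch).

    tokens: [num_frames, tokens_per_frame, hidden_dim] vision tokens.
    Returns (merged, keep): merged token values and a boolean mask of
    tokens that survive merging.
    """
    num_frames, tokens_per_frame, _ = tokens.shape
    merged = tokens.clone()
    counts = torch.ones(num_frames, tokens_per_frame, device=tokens.device)
    keep = torch.ones(num_frames, tokens_per_frame, dtype=torch.bool, device=tokens.device)
    # owner[f, p] = frame index of the group representative for position p.
    owner = torch.arange(num_frames).unsqueeze(1).expand(num_frames, tokens_per_frame).clone()

    for f in range(1, num_frames):
        # O(N): compare each token only with the corresponding token of the previous frame.
        sim = F.cosine_similarity(tokens[f], tokens[f - 1], dim=-1)  # [tokens_per_frame]
        similar = sim > sim_threshold
        for p in similar.nonzero(as_tuple=True)[0].tolist():
            g = owner[f - 1, p].item()  # group this token with the previous frame's group
            # Element-wise running average over all tokens in the group.
            merged[g, p] = (merged[g, p] * counts[g, p] + tokens[f, p]) / (counts[g, p] + 1)
            counts[g, p] += 1
            keep[f, p] = False          # the current token is absorbed into the group
            owner[f, p] = g             # chain groups across consecutive similar frames
    return merged, keep
```

Comparing only corresponding positions of adjacent frames keeps the similarity computation linear in the number of tokens, which motivates the O(N) adjacent-only design choice noted below.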
The merging stage is applied across successive shallow LLM layers to progressively merge similar tokens, until the number of similar tokens falls below a threshold. After the merging stage, the remaining unique tokens advance to the pruning stage.
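A schematic of how this cascade might wrap the per-layer merge is sketched below; the `min_similar_ratio` stopping criterion and the `merge_fn` interface are assumptions for illustration, not the paper's exact formulation.

```python
def cascaded_merge(hidden_states, layers, merge_fn, sim_threshold, min_similar_ratio=0.1):
    """Apply similarity-based merging after successive shallow LLM layers.

    merge_fn(hidden_states, sim_threshold) is assumed to return the merged
    hidden states plus the number of merged tokens and the total token count.
    Cascading stops once few tokens remain similar enough to merge.
    """
    merging = True
    for layer in layers:
        hidden_states = layer(hidden_states)
        if merging:
            hidden_states, num_merged, num_total = merge_fn(hidden_states, sim_threshold)
            if num_merged / num_total < min_similar_ratio:
                merging = False  # hand the remaining unique tokens off to pruning
    return hidden_states
```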
After merging, FrameFusion further prunes unimportant tokens using cumulative attention scores. Given a user-defined computational cost budget, it calculates the maximum number of tokens that fits within the budget and then applies top-k importance pruning to retain only the most important of the remaining unique tokens.
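The sketch below illustrates this budget-driven top-k pruning with cumulative attention as the importance score; translating the budget directly into a token cap is a simplification of the paper's cost model.

```python
import torch

def importance_prune(hidden_states: torch.Tensor,
                     cum_attn_scores: torch.Tensor,
                     budget: float) -> torch.Tensor:
    """Keep only the most important vision tokens under a compute budget.

    hidden_states:   [num_tokens, hidden_dim] unique vision tokens left after merging.
    cum_attn_scores: [num_tokens] cumulative attention each vision token received.
    budget:          fraction of vision-token compute to keep (simplified here
                     to a direct cap on the number of retained tokens).
    """
    num_tokens = hidden_states.shape[0]
    max_keep = max(1, int(budget * num_tokens))                    # simplified cost model
    topk = torch.topk(cum_attn_scores, k=min(max_keep, num_tokens)).indices
    keep_idx = topk.sort().values                                  # restore original token order
    return hidden_states[keep_idx]
```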
| Observation | Design Choice |
|---|---|
| High similarity between adjacent frames | O(N) adjacent-only similarity computation |
| Token similarity distribution condenses as layers deepen | Apply merging at shallow layers |
| High similarity ranking consistency across layers | Cascaded merging strategy |
Benchmark scores under different vision token budgets (relative change versus the full-token baseline in parentheses):

| Token Budget | VideoMME | NExT-QA-MC | NExT-QA-OE | Max Drop |
|---|---|---|---|---|
| 30% | 61.3 (-3.0%) | 81.8 (-1.7%) | 31.7 (-1.2%) | 3.0% |
| 50% | 62.6 (-0.9%) | 82.7 (-0.6%) | 32.1 (0.0%) | 0.9% |
| 70% | 63.0 (-0.3%) | 82.8 (-0.5%) | 32.1 (0.0%) | 0.5% |
FrameFusion achieves 70% vision token reduction, 3.4–4.4× LLM speedups, and 1.6–1.9× end-to-end speedups, with an average performance impact of less than 3%.
Comparing the original video frames with FrameFusion-processed frames shows how our method maintains visual quality while reducing tokens.
[Frame comparison: original frame (256 tokens) vs. after FrameFusion (77 tokens)]
Token reduction maintains semantic understanding while significantly reducing computation.
@article{fu2024framefusion,
title = {FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models},
author = {Fu, Tianyu and Liu, Tengxuan and Han, Qinghao and Dai, Guohao and Yan, Shengen and Yang, Huazhong and Ning, Xuefei and Wang, Yu},
journal = {arXiv preprint arXiv:2501.01986},
year = {2024}
}