
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models

1Tsinghua University, 2Infinigence-AI, 3Peking University, 4Shanghai Jiao Tong University
*Equal contribution

FrameFusion reduces vision tokens by 70% in video LVLMs, achieving 3.4-4.4× speedup with <3% accuracy drop.

Abstract

The increasing demand to process long and high-resolution videos significantly burdens Large Vision-Language Models (LVLMs) due to the enormous number of visual tokens. Existing token reduction methods primarily focus on importance-based token pruning, which overlooks the redundancy caused by frame resemblance and repetitive visual elements.

We analyze the high similarity among vision tokens in LVLMs and reveal that the token similarity distribution condenses as layers deepen while its ranking stays consistent across layers. Leveraging these unique properties of similarity over importance, we introduce FrameFusion, a novel approach that combines similarity-based merging with importance-based pruning for better token reduction in LVLMs.

Experiments show that FrameFusion reduces vision tokens by 70%, achieving 3.4 – 4.4× LLM speedups and 1.6 – 1.9× end-to-end speedups, with an average performance impact of less than 3%.

Method Overview


FrameFusion first merges tokens whose similarity exceeds a specified threshold at shallow layers, then applies top-k importance pruning to meet the given computational budget.

Two-Stage Token Compression

The core concept of FrameFusion is to combine similarity-based merging with importance-based pruning. Unlike traditional methods that primarily employ importance-based token pruning, FrameFusion emphasizes similarity-based token merging, retaining only tokens that are both important and unique.

Merging Stage

In the initial merging stage, FrameFusion utilizes token similarity to merge vision tokens. It computes token similarities between corresponding vision tokens of adjacent frames. Tokens exceeding the similarity threshold are grouped with their analogous tokens from the previous frame. Within each group, FrameFusion performs element-wise averaging of all tokens.

The merging stage is applied across successive shallow LLM layers to progressively merge similar tokens, until the number of similar tokens falls below a threshold. After the merging stage, the remaining unique tokens advance to the pruning stage.
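The merging logic above can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions, not the paper's code: the function name `framefusion_merge`, the `(frames, tokens, dim)` tensor layout, and the use of cosine similarity are all ours. A token that stays similar to the token at the same position in the previous frame joins that token's group, and each group collapses to the element-wise mean of its members.

```python
import numpy as np

def framefusion_merge(frames, threshold=0.9):
    """Sketch of similarity-based merging (illustrative, not official code).

    frames: (F, T, D) array of vision tokens, T tokens per frame.
    Returns one averaged token per merge group.
    """
    F, T, D = frames.shape
    unit = frames / np.linalg.norm(frames, axis=-1, keepdims=True)

    # group[t, p]: id of the merge group that token (frame t, position p) joins
    group = np.zeros((F, T), dtype=int)
    group[0] = np.arange(T)
    next_id = T
    for t in range(1, F):
        # adjacent-only comparison keeps the cost linear in the token count
        sim = (unit[t] * unit[t - 1]).sum(-1)
        for p in range(T):
            if sim[p] > threshold:
                group[t, p] = group[t - 1, p]  # join the previous frame's group
            else:
                group[t, p] = next_id          # dissimilar: start a new group
                next_id += 1

    flat_tokens = frames.reshape(F * T, D)
    flat_groups = group.reshape(F * T)
    # each group collapses to the element-wise mean of its member tokens
    return np.stack([flat_tokens[flat_groups == g].mean(axis=0)
                     for g in np.unique(flat_groups)])
```

In this sketch, a chain of tokens that stays similar across consecutive frames collapses into a single averaged token anchored at its first occurrence, so static regions of a video shrink to one token regardless of how many frames they span.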

Pruning Stage

After merging, FrameFusion further prunes unimportant tokens using cumulative attention scores. Based on a user-defined computational cost budget, it calculates the maximum number of remaining tokens that fits within the budget, then applies top-k importance pruning to retain only the important tokens from the remaining unique ones.
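A minimal sketch of this pruning step, assuming the cumulative attention scores have already been gathered (the function name `importance_prune` and the `budget_ratio` knob are our own, not from the paper):

```python
import numpy as np

def importance_prune(tokens, attn_scores, budget_ratio=0.3):
    """Keep the top-k tokens by cumulative attention score, where k is
    derived from the user-defined compute budget.

    tokens: (N, D) remaining unique tokens after merging.
    attn_scores: (N,) cumulative attention each token has received.
    """
    k = max(1, int(round(budget_ratio * len(tokens))))
    # take the k highest-scoring tokens, then restore their original order
    keep = np.sort(np.argsort(attn_scores)[-k:])
    return tokens[keep], keep
```

For example, with 2,048 surviving tokens and a 30% budget, this keeps the 614 highest-scoring tokens in their original sequence order.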

Key Observations & Design

Observation 1:
Adjacent Similarity


High similarity between adjacent frames

Design Choice: O(N) adjacent-only computation

Observation 2:
Layer-wise Distribution


Token similarity distribution condenses as layers deepen

Design Choice: Apply merging at shallow layers

Observation 3:
Ranking Consistency


High similarity ranking consistency across layers

Design Choice: Cascaded merging strategy
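Observation 3 can be probed with a simple rank-correlation check between per-token similarity scores measured at two layers. This is an illustrative sketch, not the paper's exact metric: `rank_consistency` is our name, and it computes a plain Spearman correlation (assuming no tied scores).

```python
import numpy as np

def rank_consistency(sim_shallow, sim_deep):
    """Spearman rank correlation between per-token similarity scores at a
    shallow layer and a deeper layer. A value near 1 means the ranking is
    preserved, so similarity decisions made once at shallow layers remain
    valid for cascaded merging in later layers."""
    ra = np.argsort(np.argsort(sim_shallow)).astype(float)  # ranks (no ties assumed)
    rb = np.argsort(np.argsort(sim_deep)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))
```

Any monotone transformation of the shallow-layer scores (e.g. the distribution condensing at deeper layers) leaves this correlation at 1, which is exactly why a cascaded strategy can reuse shallow-layer similarity rankings.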

Results

Performance vs. Token Budget

Token Budget   VideoMME        NExT-QA-MC      NExT-QA-OE      Max Drop
30%            61.3 (-3.0%)    81.8 (-1.7%)    31.7 (-1.2%)    3.0%
50%            62.6 (-0.9%)    82.7 (-0.6%)    32.1 (0.0%)     0.9%
70%            63.0 (-0.3%)    82.8 (-0.5%)    32.1 (0.0%)     0.5%

Speedup Across Models

Speedup across different model sizes

FrameFusion achieves:

  • 3.4 – 4.4× LLM speedup
  • 1.6 – 1.9× end-to-end speedup
  • Scales better with larger models

Interactive Demo

Compare the original video frames with FrameFusion-processed frames. Use the slider to see how our method maintains visual quality while reducing tokens.


Original Frame
(256 tokens)


After FrameFusion
(77 tokens)

Token reduction maintains semantic understanding while significantly reducing computation.

BibTeX

@article{fu2024framefusion,
  title     = {FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models},
  author    = {Fu, Tianyu and Liu, Tengxuan and Han, Qinghao and Dai, Guohao and Yan, Shengen and Yang, Huazhong and Ning, Xuefei and Wang, Yu},
  journal   = {arXiv preprint arXiv:2501.01986},
  year      = {2024}
}