[Interactive demo: example video-QA prompts, e.g., "Which animal hit the cat?", "How many main characters are there in the video?", "What animal saves the monkey?"]
The increasing demand to process long and high-resolution videos significantly burdens Large Vision-Language Models (LVLMs) due to the enormous number of visual tokens. Existing token reduction methods primarily focus on importance-based token pruning, which overlooks the redundancy caused by frame resemblance and repetitive visual elements.
We analyze the high similarity among vision tokens in LVLMs and reveal that the token similarity distribution condenses as layers deepen while maintaining ranking consistency. Leveraging these unique properties of similarity over importance, we introduce FrameFusion, a novel approach that combines similarity-based merging with importance-based pruning for better token reduction in LVLMs.
Experiments show that FrameFusion reduces vision tokens by 70%, achieving 3.4–4.4× LLM speedups and 1.6–1.9× end-to-end speedups, with an average performance impact of less than 3%.
The core concept of FrameFusion is to combine similarity-based merging with importance-based pruning. Unlike traditional methods that primarily employ importance-based token pruning, FrameFusion emphasizes similarity-based token merging, retaining only tokens that are both important and unique.
In the initial merging stage, FrameFusion merges vision tokens based on their similarity. It computes similarities between corresponding vision tokens of adjacent frames; tokens whose similarity exceeds a threshold are grouped with their analogous tokens from the previous frame, and each group is replaced by the element-wise average of its tokens.
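To make the merge step concrete, here is a minimal sketch (not the released FrameFusion code) that compares each frame's tokens only with the corresponding tokens of the previous frame and replaces each similar group with its element-wise running average. The `[num_frames, tokens_per_frame, hidden_dim]` tensor layout and the `sim_threshold` hyperparameter name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def merge_adjacent_frames(tokens: torch.Tensor, sim_threshold: float):
    """Merge vision tokens that closely resemble the token at the same
    position in the previous frame (illustrative sketch).

    tokens: [num_frames, tokens_per_frame, hidden_dim] vision tokens.
    Returns (merged, keep): merged token values and a boolean mask of
    tokens that survive merging.
    """
    num_frames, tokens_per_frame, _ = tokens.shape
    merged = tokens.clone()
    counts = torch.ones(num_frames, tokens_per_frame, device=tokens.device)
    keep = torch.ones(num_frames, tokens_per_frame, dtype=torch.bool, device=tokens.device)
    # owner[f, p] = frame index of the group representative for position p.
    owner = torch.arange(num_frames).unsqueeze(1).expand(num_frames, tokens_per_frame).clone()

    for f in range(1, num_frames):
        # O(N): compare each token only with the corresponding token of the previous frame.
        sim = F.cosine_similarity(tokens[f], tokens[f - 1], dim=-1)  # [tokens_per_frame]
        similar = sim > sim_threshold
        for p in similar.nonzero(as_tuple=True)[0].tolist():
            g = owner[f - 1, p].item()  # group this token with the previous frame's group
            # Element-wise running average over all tokens in the group.
            merged[g, p] = (merged[g, p] * counts[g, p] + tokens[f, p]) / (counts[g, p] + 1)
            counts[g, p] += 1
            keep[f, p] = False          # the current token is absorbed into the group
            owner[f, p] = g             # chain groups across consecutive similar frames
    return merged, keep
```

Comparing only corresponding positions of adjacent frames keeps the similarity computation linear in the number of tokens, which motivates the O(N) adjacent-only design choice noted below.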
The merging stage is applied across successive shallow LLM layers to progressively merge similar tokens, until the number of similar tokens falls below a threshold. After the merging stage, the remaining unique tokens advance to the pruning stage.
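A schematic of how this cascade might wrap the per-layer merge is sketched below; the `min_similar_ratio` stopping criterion and the `merge_fn` interface are assumptions for illustration, not the paper's exact formulation.

```python
def cascaded_merge(hidden_states, layers, merge_fn, sim_threshold, min_similar_ratio=0.1):
    """Apply similarity-based merging after successive shallow LLM layers.

    merge_fn(hidden_states, sim_threshold) is assumed to return the merged
    hidden states plus the number of merged tokens and the total token count.
    Cascading stops once few tokens remain similar enough to merge.
    """
    merging = True
    for layer in layers:
        hidden_states = layer(hidden_states)
        if merging:
            hidden_states, num_merged, num_total = merge_fn(hidden_states, sim_threshold)
            if num_merged / num_total < min_similar_ratio:
                merging = False  # hand the remaining unique tokens off to pruning
    return hidden_states
```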
After merging, FrameFusion further prunes unimportant tokens using cumulative attention scores. Given a user-defined computational cost budget, it calculates the maximum number of tokens that fits within the budget and then applies top-k importance pruning to retain only the most important of the remaining unique tokens.
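The sketch below illustrates this budget-driven top-k pruning with cumulative attention as the importance score; translating the budget directly into a token cap is a simplification of the paper's cost model.

```python
import torch

def importance_prune(hidden_states: torch.Tensor,
                     cum_attn_scores: torch.Tensor,
                     budget: float) -> torch.Tensor:
    """Keep only the most important vision tokens under a compute budget.

    hidden_states:   [num_tokens, hidden_dim] unique vision tokens left after merging.
    cum_attn_scores: [num_tokens] cumulative attention each vision token received.
    budget:          fraction of vision-token compute to keep (simplified here
                     to a direct cap on the number of retained tokens).
    """
    num_tokens = hidden_states.shape[0]
    max_keep = max(1, int(budget * num_tokens))                    # simplified cost model
    topk = torch.topk(cum_attn_scores, k=min(max_keep, num_tokens)).indices
    keep_idx = topk.sort().values                                  # restore original token order
    return hidden_states[keep_idx]
```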
| Observation | Design Choice |
|---|---|
| High similarity between adjacent frames | O(N) adjacent-only similarity computation |
| Token similarity distribution condenses as layers deepen | Apply merging at shallow layers |
| High similarity ranking consistency across layers | Cascaded merging strategy |
Benchmark scores under different vision token budgets (relative change versus the full-token baseline in parentheses):

| Token Budget | VideoMME | NExT-QA-MC | NExT-QA-OE | Max Drop |
|---|---|---|---|---|
| 30% | 61.3 (-3.0%) | 81.8 (-1.7%) | 31.7 (-1.2%) | 3.0% |
| 50% | 62.6 (-0.9%) | 82.7 (-0.6%) | 32.1 (0.0%) | 0.9% |
| 70% | 63.0 (-0.3%) | 82.8 (-0.5%) | 32.1 (0.0%) | 0.5% |
FrameFusion achieves 70% vision token reduction, 3.4–4.4× LLM speedups, and 1.6–1.9× end-to-end speedups, with an average performance impact of less than 3%.
Comparing the original video frames with FrameFusion-processed frames shows how our method maintains visual quality while reducing tokens.
[Frame comparison: original frame (256 tokens) vs. after FrameFusion (77 tokens)]
Token reduction maintains semantic understanding while significantly reducing computation.
@article{fu2024framefusion,
title = {FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models},
author = {Fu, Tianyu and Liu, Tengxuan and Han, Qinghao and Dai, Guohao and Yan, Shengen and Yang, Huazhong and Ning, Xuefei and Wang, Yu},
journal = {arXiv preprint arXiv:2501.01986},
year = {2024}
}