Original Paper : Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models
OpenSource repo : https://github.com/dubcyfor3/Focus
Abstract
Focus introduces a multilevel concentration paradigm that hierarchically compresses vision-language inputs at three levels: (1) semantic-guided token pruning based on textual prompts, (2) spatial-temporal block-level concentration using localized comparisons, and (3) vector-level redundancy removal via motion-aware matching.
Focus leverages GEMM tiling, convolution-style layout, and cross-modal attention to minimize off-chip access while enabling high throughput.
Key words
token pruning
- Deleting tokens based on an importance score.
- “If a token is unimportant, drop it.”
token merging
- Merging similar tokens based on semantic information.
- “If many tokens are similar, merge them.”
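The two keywords above can be contrasted with a minimal NumPy sketch (illustrative only, not the paper's actual method; function names, the greedy neighbor-merge strategy, and thresholds are my own assumptions):

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Token pruning: drop tokens whose importance score is low,
    keeping only the top-k (assumed top-k policy for illustration)."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]            # indices of the k highest scores
    return tokens[np.sort(keep)]              # preserve original token order

def merge_tokens(tokens, sim_threshold=0.9):
    """Token merging: average adjacent tokens whose cosine similarity
    exceeds a threshold (greedy left-to-right sketch, not ToMe itself)."""
    out = [tokens[0]]
    for t in tokens[1:]:
        prev = out[-1]
        cos = t @ prev / (np.linalg.norm(t) * np.linalg.norm(prev) + 1e-8)
        if cos > sim_threshold:
            out[-1] = (prev + t) / 2          # merge similar neighbors
        else:
            out.append(t)
    return np.stack(out)
```

Pruning discards information outright, while merging tries to retain it in a compressed form; that trade-off is what the two slogans capture.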
Introduction
Hardware perspective
AdapTiV implements a simplified ToMe module in hardware.
CMC leverages video-codec-inspired compression (e.g., H.264 [65]) via an external codec block.
However, both approaches largely translate existing algorithms without embracing full hardware-algorithm co-design.
These prior accelerators either target only the visual encoder (e.g., ViT) without considering multi-modal models, or are designed without video input in mind.
In this study, we propose a novel architecture, Focus, to accelerate VLM inference by performing streaming concentration, a multilevel compression technique that removes visual and cross-modal redundancy in a streaming-friendly, on-chip processing fashion.
multilevel concentration (algorithmic perspective)
- Level 1 (token): a prompt-aware importance analyzer dynamically prunes visual tokens based on cross-modal attention with the textual prompt, improving both accuracy and efficiency.
- Level 2 (block): each patch is compared against its 7 spatio-temporal neighbors (3 in the current frame, 4 in past frames) to judge similarity.
- Level 3 (vector): shifted segments of a patch are matched in the time domain to remove redundancy; comparing partial segments within a vector, rather than whole vectors, preserves more information.
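The three levels can be sketched in NumPy to make the data flow concrete. This is a toy model of the ideas, not the paper's kernels; the function names, the mean-absolute-difference metric, the segment length, and all thresholds are assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Level 1: prompt-aware pruning -- score each visual token by the
# cross-modal attention it receives from the text prompt, keep top-k.
def prompt_aware_prune(visual, text, keep_ratio=0.5):
    attn = softmax(text @ visual.T / np.sqrt(visual.shape[1]))
    importance = attn.mean(axis=0)            # average over prompt tokens
    k = max(1, int(len(visual) * keep_ratio))
    keep = np.sort(np.argsort(importance)[-k:])
    return visual[keep]

# Level 2: block-level concentration -- a patch is redundant if it is
# close to any of its 7 spatio-temporal neighbors (3 in the current
# frame, 4 in the previous frame).
def block_is_redundant(patch, cur_neighbors, prev_neighbors, tol=0.05):
    refs = list(cur_neighbors) + list(prev_neighbors)   # 3 + 4 patches
    return any(np.abs(patch - r).mean() < tol for r in refs)

# Level 3: vector-level matching -- compare sub-segments so that a
# temporally shifted patch still matches instead of being kept whole.
def best_shift(vec, prev_vec, seg=4):
    diffs = [np.abs(vec[s:s + seg] - prev_vec[:seg]).mean()
             for s in range(len(vec) - seg + 1)]
    return int(np.argmin(diffs))              # offset of the best match
```

Chaining the levels, a frame first loses prompt-irrelevant tokens, then neighbor-similar blocks, and finally only the unmatched vector segments need to be stored or recomputed.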
multilevel concentration (architectural perspective)
block level
To be continued.
Personal Insights
Broadly, inference acceleration rests on: 1) sparsity management, 2) redundancy management, 3) quantization, 4) pruning, 5) parallelism.