Abstract
Prism is a training-free, spectral-aware approach that accelerates long-context LLM pre-filling by tackling the block-selection bottleneck of block-sparse attention, improving selection accuracy through energy-based temperature calibration of pooled representations.
Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for block importance estimation, but often resort to expensive token-level searching or scoring, resulting in significant selection overhead. In this work, we trace the inaccuracy of standard coarse-grained attention via mean pooling to a theoretical root cause: the interaction between mean pooling and Rotary Positional Embeddings (RoPE). We prove that mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions, effectively creating a "blind spot" for local positional information (e.g., slash patterns). To address this, we introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches. By applying energy-based temperature calibration, Prism restores the attenuated positional signals directly from pooled representations, enabling block importance estimation using purely block-level operations, thereby improving efficiency. Extensive evaluations confirm that Prism maintains accuracy parity with full attention while delivering up to 5.1× speedup.
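As a quick numerical illustration of the low-pass claim (a minimal sketch, not code from the paper; the head dimension, RoPE base, and block size below are assumed values), averaging the RoPE rotation phasors over a block shows how much of each frequency pair survives mean pooling:

```python
import numpy as np

# Attenuation of RoPE rotation factors under mean pooling over a block.
# For a RoPE frequency theta, positions t = 0..B-1 contribute e^{i*theta*t}.
# Mean pooling averages these unit phasors; the magnitude of the average is
# the gain that the pooled representation retains on that frequency pair.

def pooled_gain(theta: np.ndarray, block_size: int) -> np.ndarray:
    """|mean_{t=0..B-1} exp(i * theta * t)| for each RoPE frequency theta."""
    phases = np.exp(1j * np.outer(np.arange(block_size), theta))  # (B, d/2)
    return np.abs(phases.mean(axis=0))

d, base, B = 128, 10000.0, 64                     # head dim, RoPE base, block size (assumed)
theta = base ** (-2.0 * np.arange(d // 2) / d)    # standard RoPE frequencies

gain = pooled_gain(theta, B)
print("highest-frequency pairs:", gain[:4].round(3))   # near 0: destructive interference
print("lowest-frequency pairs: ", gain[-4:].round(3))  # near 1: passed through intact
```

The highest-frequency pairs, which carry the local positional structure behind slash patterns, are almost completely cancelled, while the low-frequency (semantic) pairs pass through nearly unchanged: mean pooling behaves as a low-pass filter, producing the "blind spot" described above.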
Community
TL;DR
Prism is a training-free method to accelerate long-context LLM pre-filling. It addresses the "blind spot" in standard mean pooling caused by Rotary Positional Embeddings (RoPE) by disentangling attention into high-frequency and low-frequency bands.
Key Features:
- Dual-Band Importance Estimation: Separates semantic (low-freq) and positional (high-freq) signals (see the sketch after this list).
- Energy-Based Calibration: Restores attenuated signals automatically.
- Speed: Up to 5.1× speedup on 128K context with negligible accuracy loss.
- Implementation: Purely block-level ops with custom kernels for efficient estimation.
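To make the dual-band idea concrete, here is a heavily simplified, purely block-level sketch. The function and variable names, the band split, and the exact form of the energy-based rescaling are illustrative assumptions, not the authors' kernel implementation:

```python
import torch

def dual_band_block_scores(q_pool, k_pool, hf_dims, lf_dims, eps=1e-6):
    """Illustrative dual-band block-importance scores from pooled queries/keys.

    q_pool, k_pool: (num_q_blocks, d) and (num_k_blocks, d) mean-pooled vectors.
    hf_dims, lf_dims: indices splitting the head dim into high-/low-frequency
    RoPE bands. Names and the calibration form are assumptions for illustration.
    """
    # Low-frequency branch: pooling preserves these dimensions, so plain
    # pooled dot products approximate the semantic block affinity.
    lf_scores = q_pool[:, lf_dims] @ k_pool[:, lf_dims].T

    # High-frequency branch: pooling has attenuated these dimensions, so the
    # raw logits are rescaled by an energy-based temperature (here, the product
    # of the pooled band norms) to restore their dynamic range before scoring.
    hf_q, hf_k = q_pool[:, hf_dims], k_pool[:, hf_dims]
    temperature = hf_q.norm(dim=-1, keepdim=True) * hf_k.norm(dim=-1)[None, :]
    hf_scores = (hf_q @ hf_k.T) / temperature.clamp_min(eps)

    # Combine the two bands into a single block-importance estimate.
    return lf_scores + hf_scores

# Example: 32 blocks, head dim 128, first 32 dims treated as high-frequency.
q_pool, k_pool = torch.randn(32, 128), torch.randn(32, 128)
scores = dual_band_block_scores(q_pool, k_pool, list(range(32)), list(range(32, 128)))
selected = scores.topk(k=8, dim=-1).indices  # key blocks kept per query block
```

In Prism itself this estimation is implemented with custom block-level kernels, and the top-scoring blocks then feed the sparse attention computation.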
Check out this blog introducing the idea of Prism:
https://efficacious-citrus-7a0.notion.site/Prism-Spectral-Aware-Block-Sparse-Attention-304d97f5df9d80318802f9cb37d18c3e
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- A Unified Sparse Attention via Multi-Granularity Compression (2025)
- Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection (2026)
- RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference (2026)
- SPLA: Block Sparse Plus Linear Attention for Long Context Modeling (2026)
- Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference (2026)
- STILL: Selecting Tokens for Intra-Layer Hybrid Attention to Linearize LLMs (2026)
- Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs (2026)