Multipole Semantic Attention: A Fast Approximation of Softmax Attention for Pretraining

Generative AI & LLMs
Published on arXiv: 2509.10406v1
Authors

Rupert Mitchell, Kristian Kersting

Abstract

We present Multipole Semantic Attention (MuSe), an efficient approximation of softmax attention that combines semantic clustering with multipole expansions from computational physics. Our method addresses the quadratic computational complexity of transformers in the context length by clustering queries and keys separately in their learned representation spaces, enabling a hierarchical two-stage attention mechanism. Unlike prior clustering approaches that group only keys or use unified clustering, we maintain separate clusterings that respect attention's asymmetric treatment of these spaces. We augment centroid-based (monopole) approximations with dipole corrections that capture directional variance within clusters, preserving richer information during training. The method operates as a drop-in replacement for standard attention, requiring only hyperparameter specification without architectural modifications. Our approach achieves $\mathcal{O}(NCD)$ complexity for acausal attention with $C$ clusters and $\mathcal{O}(NCD \log N)$ for causal attention. On isolated attention layers, we demonstrate $3\times$ speedup over CUDNN Flash Attention at 8k context length, with relative squared errors below 20%. For causal attention, we develop a hierarchical block decomposition that combines exact local computation with efficient long-range approximation. In end-to-end pretraining of a 30M parameter model on book-length texts with 16k context, we achieve 12.2% runtime reduction with only 0.36% loss degradation, establishing the viability of multipole approximations for efficient transformer pretraining.
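To give a sense of where the monopole and dipole terms come from, here is a first-order expansion consistent with the abstract's description; it is a sketch of the general idea, not necessarily the paper's exact formulation (scaling factors, the two-stage mechanism, and the causal block decomposition are omitted). For a cluster $c$ of keys with mean $\bar{k}_c$, write $k_j = \bar{k}_c + \delta_j$ and expand $e^{q \cdot \delta_j}$ to first order:

$$
\sum_{j \in c} e^{q \cdot k_j} v_j
= e^{q \cdot \bar{k}_c} \sum_{j \in c} e^{q \cdot \delta_j} v_j
\approx e^{q \cdot \bar{k}_c} \Big( \underbrace{\sum_{j \in c} v_j}_{\text{monopole}} \;+\; \underbrace{\Big(\sum_{j \in c} v_j\, \delta_j^{\top}\Big) q}_{\text{dipole correction}} \Big).
$$

The matching softmax-denominator term is $\sum_{j \in c} e^{q \cdot k_j} \approx n_c\, e^{q \cdot \bar{k}_c}$, since the $\delta_j$ sum to zero when $\bar{k}_c$ is the cluster mean. Each query then interacts with $C$ cluster summaries instead of $N$ individual keys, which is the source of the $\mathcal{O}(NCD)$ cost quoted above.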

Paper Summary

Problem
The main problem addressed is the quadratic computational complexity of softmax attention in transformers, which limits usable context length and makes long-context pretraining expensive: at 16k tokens, each attention head must score roughly $16{,}384^2 \approx 2.7 \times 10^8$ query-key pairs per layer.
Key Innovation
The key innovation is Multipole Semantic Attention (MuSe), a fast approximation of softmax attention that combines semantic clustering with multipole expansions from computational physics. Queries and keys are clustered separately in their learned representation spaces, and centroid-based (monopole) cluster summaries are augmented with dipole corrections that capture directional variance within each cluster. The resulting cost is $\mathcal{O}(NCD)$ for acausal attention and $\mathcal{O}(NCD \log N)$ for causal attention with $C$ clusters, i.e. linear and log-linear in context length, making MuSe a drop-in, more efficient alternative to standard attention.
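To make this concrete, below is a minimal NumPy sketch of the acausal monopole-plus-dipole approximation. It is an illustration under assumed simplifications, not the authors' implementation: only the keys are clustered (with plain k-means, whereas the paper clusters queries and keys separately in their learned spaces), there is no two-stage mechanism, no causal block decomposition, and no numerical stabilization; the function name muse_monopole_dipole and the cluster count C are placeholders.

```python
import numpy as np

def muse_monopole_dipole(Q, K, V, C=8, iters=10, seed=0):
    """Approximate softmax(Q K^T / sqrt(d)) V with per-cluster monopole + dipole terms.

    Q: (n, d) queries, K: (m, d) keys, V: (m, dv) values. Acausal (unmasked) case only.
    """
    n, d = Q.shape
    rng = np.random.default_rng(seed)

    # Cluster the keys with plain k-means (an illustrative stand-in for the
    # paper's semantic clustering; queries are not clustered here).
    cent = K[rng.choice(len(K), C, replace=False)].copy()
    for _ in range(iters):
        assign = ((K[:, None, :] - cent[None]) ** 2).sum(-1).argmin(1)
        for c in range(C):
            if (assign == c).any():
                cent[c] = K[assign == c].mean(0)

    scale = 1.0 / np.sqrt(d)
    num = np.zeros((n, V.shape[1]))   # softmax numerator, accumulated per cluster
    den = np.zeros((n, 1))            # softmax denominator
    for c in range(C):
        mask = assign == c
        if not mask.any():
            continue
        Kc, Vc = K[mask], V[mask]
        kbar = Kc.mean(0)                        # monopole: cluster centroid
        vsum = Vc.sum(0)                         # total value mass of the cluster
        D = Vc.T @ (Kc - kbar)                   # dipole moment, shape (dv, d)
        w = np.exp((Q @ kbar) * scale)[:, None]  # attention weight to the centroid
        num += w * (vsum[None, :] + (Q * scale) @ D.T)
        den += w * mask.sum()                    # first-order denominator term
    return num / den
```

Calling this with, say, Q, K, V of shape (8192, 64) and C = 32 makes each query interact with 32 cluster summaries instead of 8192 individual keys, which is where the linear-in-context-length scaling described above comes from.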
Practical Impact
By reducing the cost of long-context attention, MuSe makes it cheaper to pretrain transformers on much longer sequences, which benefits NLP tasks that depend on long-range context such as translation, summarization, and question answering. Because it is a drop-in replacement for standard attention, adopting it requires only hyperparameter choices rather than architectural changes. The authors demonstrate a 12.2% runtime reduction with only 0.36% loss degradation when pretraining a 30M parameter model on book-length texts with a 16k context.
Analogy / Intuitive Explanation
Think of MuSe as summarizing groups of similar tokens and attending to the summaries rather than to every token individually. Imagine following a long conversation among many people: instead of tracking each remark separately, you group the remarks by topic, note each topic's main idea (the centroid) and how the remarks spread around it (the dipole correction), and reason over those summaries. MuSe does the analogous thing with queries and keys, which is what makes attention efficient and scalable at long context lengths.
Paper Information
Categories:
cs.LG; MSC classes: 68W25, 68T50 (primary), 68W40, 68T07 (secondary); ACM classes: I.2.6, I.2.7
arXiv ID:

2509.10406v1
