Space Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple

Published: arXiv: 2601.16294v1
Authors

Evangelos Georganas, Alexander Heinecke, Pradeep Dubey

Abstract

General Matrix Multiplication (GEMM) is the cornerstone of Deep Learning and HPC workloads; accordingly, academia and industry have heavily optimized this kernel. Modern platforms with matrix multiplication accelerators exhibit high FLOP/Byte machine balance, which makes implementing optimal matrix multiplication challenging. On modern CPU platforms with matrix engines, state-of-the-art vendor libraries tune input tensor layouts, parallelization schemes, and cache blocking to minimize data movement across the memory hierarchy and maximize throughput. However, the best settings for these parameters depend strongly on the target platform (number of cores, memory hierarchy, cache sizes) and on the shapes of the matrices, making exhaustive tuning infeasible; in practice this leads to performance "glass jaws". In this work we revisit space-filling curves (SFC) to alleviate this cumbersome tuning problem. SFCs convert multi-dimensional coordinates (e.g. 2D) into a single dimension (1D), keeping nearby points in the high-dimensional space close in the 1D order. We partition the matrix multiplication computation space using recent advancements in generalized SFC (Generalized Hilbert Curves), and we obtain platform-oblivious and shape-oblivious matrix-multiplication schemes that exhibit an inherently high degree of data locality. Furthermore, we extend the SFC-based work partitioning to implement Communication-Avoiding (CA) algorithms that replicate the input tensors and provably minimize communication/data movement on the critical path. The integration of CA algorithms is seamless and yields compact code (~30 LOC), yet it achieves state-of-the-art results on multiple CPU platforms, outperforming vendor libraries by up to 2× (geometric-mean speedup) for a range of GEMM shapes.

Paper Summary

Problem
Deep learning and high-performance computing rely heavily on a fundamental operation called General Matrix Multiplication (GEMM). However, modern platforms with matrix multiplication accelerators make it challenging to implement optimal GEMM due to their high FLOP/byte machine balance. Current vendor libraries optimize input tensor layouts, parallelization schemes, and cache blocking to minimize data movement, but the best settings depend on the platform and matrix shapes, making exhaustive tuning infeasible.
Key Innovation
Researchers have revisited space-filling curves (SFC) to alleviate the problem of cumbersome tuning. They used recent advancements in generalized SFC (Generalized Hilbert Curves) to partition the GEMM computation space and obtain platform-oblivious and shape-oblivious matrix-multiplication schemes with high data locality. This innovation enables the implementation of Communication-Avoiding (CA) algorithms that provably minimize communication and data movement on the critical path.
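The SFC-based partitioning can be illustrated with a small sketch. The snippet below is a hypothetical toy in Python, not the paper's implementation: it uses the classic Hilbert index-to-coordinate mapping to visit the output tiles of a tiled matrix multiplication in space-filling-curve order. The paper itself uses *generalized* Hilbert curves, which also handle non-power-of-two grids; here a power-of-two tile grid is assumed for simplicity.

```python
def d2xy(n, d):
    """Map a 1D Hilbert index d to 2D (x, y) on an n x n grid (n = power of 2)."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                   # rotate the sub-quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_gemm(A, B, tile):
    """Tiled C = A @ B that visits output tiles in Hilbert order.

    Toy assumptions: square matrices (lists of lists), size divisible
    by `tile`, and a power-of-two tile grid.
    """
    m = len(A)
    grid = m // tile                  # tiles per dimension
    C = [[0.0] * m for _ in range(m)]
    for d in range(grid * grid):
        ti, tj = d2xy(grid, d)        # next output tile along the curve
        for i in range(ti * tile, (ti + 1) * tile):
            for j in range(tj * tile, (tj + 1) * tile):
                C[i][j] = sum(A[i][k] * B[k][j] for k in range(m))
    return C
```

Because consecutive Hilbert indices map to adjacent tiles, successive iterations tend to reuse the same row panel of A or column panel of B, which is the locality property the SFC partitioning exploits.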
Practical Impact
The integration of CA algorithms into the SFC-based work partitioning yields compact code (~30 LOC) that achieves state-of-the-art results on multiple CPU platforms. The approach outperforms vendor libraries by up to 2× (geometric-mean speedup) for a range of GEMM shapes, a significant improvement for deep learning and high-performance computing applications. Because the CA algorithms integrate seamlessly into the SFC-based framework, the scheme is easy to adopt across platforms and use cases.
Analogy / Intuitive Explanation
Think of space-filling curves as a way to organize a large library of books in a single row, while keeping nearby books close to each other in the row. This organization enables efficient access to books, similar to how SFC-based matrix multiplication enables efficient access to data in the memory hierarchy, reducing communication and data movement. The analogy highlights the key innovation of using SFC to improve data locality and reduce communication in matrix multiplication.
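The locality intuition can be made concrete with a small experiment. The toy Python snippet below (an illustration, not from the paper) compares a row-major tile order against a Hilbert order on an 8×8 grid of output tiles, counting the worst-case number of distinct A row-panels plus B column-panels touched by any window of 8 consecutive tiles, as a rough proxy for the cache working set.

```python
def d2xy(n, d):
    """Standard Hilbert index -> (x, y) mapping on an n x n grid (n = power of 2)."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y


def max_working_set(order, window):
    """Worst-case count of distinct A row-panels plus B column-panels
    touched by any `window` consecutive output tiles."""
    worst = 0
    for start in range(len(order) - window + 1):
        win = order[start:start + window]
        rows = {i for i, _ in win}    # A row-panels touched
        cols = {j for _, j in win}    # B column-panels touched
        worst = max(worst, len(rows) + len(cols))
    return worst


g = 8                                 # 8x8 grid of output tiles
row_major = [(d // g, d % g) for d in range(g * g)]
hilbert = [d2xy(g, d) for d in range(g * g)]
print(max_working_set(row_major, g), max_working_set(hilbert, g))
```

The Hilbert order keeps any window of consecutive tiles inside a compact 2D region, so it touches fewer distinct panels than the row-major sweep for the same window size, which is the "nearby books stay nearby" property of the library analogy.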
Paper Information
Categories: cs.DC, cs.AI
arXiv ID: 2601.16294v1
