Latency and Token-Aware Test-Time Compute

Generative AI & LLMs
arXiv: 2509.09864v1
Authors

Jenny Y. Huang, Mehul Damani, Yousef El-Kurdi, Ramon Astudillo, Wei Sun

Abstract

Inference-time scaling has emerged as a powerful way to improve large language model (LLM) performance by generating multiple candidate responses and selecting among them. However, existing work on dynamic allocation for test-time compute typically considers only parallel generation methods such as best-of-N, overlooking incremental decoding methods like beam search, and has largely ignored latency, focusing only on token usage. We formulate inference-time scaling as a problem of dynamic compute allocation and method selection, where the system must decide which strategy to apply and how much compute to allocate on a per-query basis. Our framework explicitly incorporates both token cost and wall-clock latency, the latter being critical for user experience and particularly for agentic workflows where models must issue multiple queries efficiently. Experiments on reasoning benchmarks show that our approach consistently outperforms static strategies, achieving favorable accuracy-cost trade-offs while remaining practical for deployment.
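To make the "dynamic compute allocation and method selection" framing concrete, here is a minimal sketch of how a per-query decision rule could weigh predicted accuracy against both token cost and wall-clock latency. The strategy names, cost weights, and estimated numbers below are illustrative assumptions, not the authors' implementation or measurements.

```python
# Hypothetical per-query (method, budget) selection under token and latency costs.
# All names, weights, and estimates are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Strategy:
    name: str            # e.g. "best_of_n" or "beam_search"
    n: int               # number of candidates / beam width
    est_tokens: float    # expected tokens generated for this query
    est_latency: float   # expected wall-clock seconds for this query
    est_accuracy: float  # predicted probability of a correct answer

def pick_strategy(candidates, lam_tokens=0.00002, lam_latency=0.005):
    """Choose the (method, compute budget) pair with the best utility:
    predicted accuracy minus weighted token and latency costs."""
    def utility(s: Strategy) -> float:
        return s.est_accuracy - lam_tokens * s.est_tokens - lam_latency * s.est_latency
    return max(candidates, key=utility)

# Parallel best-of-N finishes quickly (candidates decode concurrently) but burns
# more tokens; beam search is token-leaner but serial, hence slower per query.
options = [
    Strategy("best_of_n", n=8, est_tokens=4000, est_latency=6.0, est_accuracy=0.82),
    Strategy("beam_search", n=4, est_tokens=2500, est_latency=11.0, est_accuracy=0.85),
    Strategy("single_sample", n=1, est_tokens=500, est_latency=5.0, est_accuracy=0.70),
]
print(pick_strategy(options).name)
```

Changing the two weights shifts the selection toward cheaper or faster strategies, which is the accuracy-cost trade-off the abstract describes.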

Paper Summary

Problem
Large language models (LLMs) have shown great promise in reasoning-intensive domains, but their performance is often limited by the amount of computation they can perform at inference time. Current approaches to inference-time scaling, which generate multiple candidate responses and select the best one, can be expensive and inefficient. The core issue is that a fixed strategy overspends compute on easy queries while under-provisioning harder ones, wasting tokens and adding latency without a matching gain in accuracy.
Key Innovation
This research proposes a new framework for inference-time scaling that addresses this inefficiency. The framework, called Latency and Token-Aware Test-Time Compute, jointly determines which strategy to apply and how much compute to allocate per query, taking into account both token cost and wall-clock latency. This moves beyond prior work that focused solely on token usage and ignored latency, which is critical for user experience.
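The reason latency has to be modeled separately from tokens is that parallel and incremental decoding spend them differently. Below is a rough, assumed cost model contrasting the two; the formulas and numbers are simplifying illustrations, not results from the paper.

```python
# Illustrative (assumed) cost model contrasting parallel and incremental decoding.
# Formulas and numbers are simplifying assumptions, not measurements from the paper.

def best_of_n_costs(n, tokens_per_answer, tokens_per_sec):
    """Parallel sampling: token usage grows with N, but candidates decode
    concurrently, so wall-clock latency is roughly one generation."""
    total_tokens = n * tokens_per_answer
    latency = tokens_per_answer / tokens_per_sec
    return total_tokens, latency

def beam_search_costs(beam_width, tokens_per_answer, tokens_per_sec, step_overhead=1.3):
    """Incremental decoding: fewer total tokens than wide sampling, but per-step
    candidate scoring and pruning add serial overhead to latency."""
    total_tokens = beam_width * tokens_per_answer
    latency = step_overhead * tokens_per_answer / tokens_per_sec
    return total_tokens, latency

print(best_of_n_costs(n=8, tokens_per_answer=500, tokens_per_sec=50))            # (4000, 10.0)
print(beam_search_costs(beam_width=4, tokens_per_answer=500, tokens_per_sec=50)) # (2000, 13.0)
```

Under this toy model, best-of-N costs more tokens but less wall-clock time, while beam search does the reverse; a token-only objective would never see that distinction, which is why the framework scores both.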
Practical Impact
By adapting to query difficulty and allocating compute accordingly, LLMs can achieve better accuracy-efficiency trade-offs: they can tackle harder tasks without incurring prohibitive inference costs, making them more practical for real-world deployment. The approach is especially relevant to agentic workflows, where a model issues many queries in sequence and both token budgets and latency become critical.
Analogy / Intuitive Explanation
Imagine you're trying to solve a math problem and you're not sure you have the right answer. A traditional LLM gives you one answer and hopes it's correct. With this framework, the LLM first judges how hard the problem looks, then decides how much effort it deserves: a quick single attempt for an easy question, or several candidate solutions compared and the most confident one chosen for a hard question. It's like a team of experts who not only pool their answers but also decide up front how many of them a given problem really needs, achieving better accuracy without wasting everyone's time.
Paper Information
Categories: cs.LG, cs.AI, cs.CL
arXiv ID: 2509.09864v1
