Latency and Token-Aware Test-Time Compute

Generative AI & LLMs
arXiv: 2509.09864v1
Authors

Jenny Y. Huang, Mehul Damani, Yousef El-Kurdi, Ramon Astudillo, Wei Sun

Abstract

Inference-time scaling has emerged as a powerful way to improve large language model (LLM) performance by generating multiple candidate responses and selecting among them. However, existing work on dynamic allocation for test-time compute typically considers only parallel generation methods such as best-of-N, overlooking incremental decoding methods like beam search, and has largely ignored latency, focusing only on token usage. We formulate inference-time scaling as a problem of dynamic compute allocation and method selection, where the system must decide which strategy to apply and how much compute to allocate on a per-query basis. Our framework explicitly incorporates both token cost and wall-clock latency, the latter being critical for user experience and particularly for agentic workflows where models must issue multiple queries efficiently. Experiments on reasoning benchmarks show that our approach consistently outperforms static strategies, achieving favorable accuracy-cost trade-offs while remaining practical for deployment.
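To make the "dynamic compute allocation and method selection" framing concrete, here is a minimal sketch of how a per-query decision rule could weigh predicted accuracy against both token cost and wall-clock latency. The strategy names, cost weights, and estimated numbers below are illustrative assumptions, not the authors' implementation or measurements.

```python
# Hypothetical per-query (method, budget) selection under token and latency costs.
# All names, weights, and estimates are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Strategy:
    name: str            # e.g. "best_of_n" or "beam_search"
    n: int               # number of candidates / beam width
    est_tokens: float    # expected tokens generated for this query
    est_latency: float   # expected wall-clock seconds for this query
    est_accuracy: float  # predicted probability of a correct answer

def pick_strategy(candidates, lam_tokens=0.00002, lam_latency=0.005):
    """Choose the (method, compute budget) pair with the best utility:
    predicted accuracy minus weighted token and latency costs."""
    def utility(s: Strategy) -> float:
        return s.est_accuracy - lam_tokens * s.est_tokens - lam_latency * s.est_latency
    return max(candidates, key=utility)

# Parallel best-of-N finishes quickly (candidates decode concurrently) but burns
# more tokens; beam search is token-leaner but serial, hence slower per query.
options = [
    Strategy("best_of_n", n=8, est_tokens=4000, est_latency=6.0, est_accuracy=0.82),
    Strategy("beam_search", n=4, est_tokens=2500, est_latency=11.0, est_accuracy=0.85),
    Strategy("single_sample", n=1, est_tokens=500, est_latency=5.0, est_accuracy=0.70),
]
print(pick_strategy(options).name)
```

Changing the two weights shifts the selection toward cheaper or faster strategies, which is the accuracy-cost trade-off the abstract describes.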

Paper Summary

Problem
Large language models (LLMs) have shown great promise in reasoning-intensive domains, but their performance is often limited by the amount of computation they can perform at inference time. Current approaches to inference-time scaling, which generate multiple candidate responses and select the best one, can be expensive and inefficient. The core issue is that a fixed strategy overspends compute on easy queries while under-provisioning harder ones, wasting tokens and adding latency without a matching gain in accuracy.
Key Innovation
This research proposes a new framework for inference-time scaling that addresses this inefficiency. The framework, called Latency and Token-Aware Test-Time Compute, jointly determines which strategy to apply and how much compute to allocate per query, taking into account both token cost and wall-clock latency. This moves beyond prior work that focused solely on token usage and ignored latency, which is critical for user experience.
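The reason latency has to be modeled separately from tokens is that parallel and incremental decoding spend them differently. Below is a rough, assumed cost model contrasting the two; the formulas and numbers are simplifying illustrations, not results from the paper.

```python
# Illustrative (assumed) cost model contrasting parallel and incremental decoding.
# Formulas and numbers are simplifying assumptions, not measurements from the paper.

def best_of_n_costs(n, tokens_per_answer, tokens_per_sec):
    """Parallel sampling: token usage grows with N, but candidates decode
    concurrently, so wall-clock latency is roughly one generation."""
    total_tokens = n * tokens_per_answer
    latency = tokens_per_answer / tokens_per_sec
    return total_tokens, latency

def beam_search_costs(beam_width, tokens_per_answer, tokens_per_sec, step_overhead=1.3):
    """Incremental decoding: fewer total tokens than wide sampling, but per-step
    candidate scoring and pruning add serial overhead to latency."""
    total_tokens = beam_width * tokens_per_answer
    latency = step_overhead * tokens_per_answer / tokens_per_sec
    return total_tokens, latency

print(best_of_n_costs(n=8, tokens_per_answer=500, tokens_per_sec=50))            # (4000, 10.0)
print(beam_search_costs(beam_width=4, tokens_per_answer=500, tokens_per_sec=50)) # (2000, 13.0)
```

Under this toy model, best-of-N costs more tokens but less wall-clock time, while beam search does the reverse; a token-only objective would never see that distinction, which is why the framework scores both.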
Practical Impact
By adapting to query difficulty and allocating compute accordingly, LLMs can achieve better accuracy-efficiency trade-offs: they can tackle harder tasks without incurring prohibitive inference costs, making them more practical for real-world deployment. The approach is especially relevant to agentic workflows, where a model issues many queries in sequence and both token budgets and latency become critical.
Analogy / Intuitive Explanation
Imagine you're trying to solve a math problem and you're not sure you have the right answer. A traditional LLM gives you one answer and hopes it's correct. With this framework, the LLM first judges how hard the problem looks, then decides how much effort it deserves: a quick single attempt for an easy question, or several candidate solutions compared and the most confident one chosen for a hard question. It's like a team of experts who not only pool their answers but also decide up front how many of them a given problem really needs, achieving better accuracy without wasting everyone's time.
Paper Information
Categories: cs.LG, cs.AI, cs.CL
arXiv ID: 2509.09864v1
