$p1$: Better Prompt Optimization with Fewer Prompts

Published: arXiv: 2604.08801v1
Authors

Zhaolin Gao, Yu Wang, Bo Liu, Thorsten Joachims, Kianté Brantley, Wen Sun

Abstract

Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness varies widely across tasks. We study what makes a task amenable to prompt optimization. We show that the reward variance across different system prompts can be decomposed into two components: variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. Prompt optimization succeeds when variance among system prompts is sufficiently large, but fails when variance among responses dominates variance among system prompts. Surprisingly, we further show that scaling to more user prompts can hurt optimization by reducing variance among system prompts, especially on heterogeneous datasets where different user prompts favor different system prompts. Motivated by this insight, we propose $p1$, a simple user prompt filtering method that selects a small subset of user prompts with high variance across candidate system prompts. This subset of user prompts makes it possible to distinguish a good system prompt from a bad one, making system prompt optimization easier. Experiments on reasoning benchmarks show that $p1$ substantially improves prompt optimization over training on the full dataset and outperforms strong baselines such as GEPA. Notably, training on only two prompts from AIME 24 yields a system prompt that generalizes well to other reasoning benchmarks.
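The decomposition described in the abstract is an instance of the law of total variance. A sketch in our own notation (the symbols below are our labeling, not necessarily the paper's): let $s$ denote a system prompt, $y \sim \pi(\cdot \mid s)$ a sampled response, and $R(s, y)$ the reward.

$$
\operatorname{Var}_{s,\,y}\!\big[R(s,y)\big]
= \underbrace{\mathbb{E}_{s}\!\Big[\operatorname{Var}_{y \sim \pi(\cdot \mid s)}\big[R(s,y)\big]\Big]}_{\text{variance among responses}}
+ \underbrace{\operatorname{Var}_{s}\!\Big[\mathbb{E}_{y \sim \pi(\cdot \mid s)}\big[R(s,y)\big]\Big]}_{\text{variance among system prompts}}
$$

When the first term dominates, reward differences between candidate system prompts are drowned out by generation noise, and optimization struggles; when the second term is large, good and bad system prompts are easy to tell apart.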

Paper Summary

Problem
The main problem this paper addresses is the inconsistent performance of prompt optimization in language models. Despite its potential to improve task performance without updating the model's weights, prompt optimization often fails to yield significant gains on certain tasks. The researchers aim to understand what makes a task amenable to prompt optimization and how to improve its effectiveness.
Key Innovation
The key innovation of this paper is a simple user prompt filtering method called p1. p1 selects a small subset of user prompts whose rewards vary most across candidate system prompts, which makes good and bad system prompts easy to tell apart and thus makes system prompt optimization easier. This approach is motivated by the observation that increasing the number of user prompts can reduce variance among system prompts, especially on heterogeneous datasets where different user prompts favor different system prompts.
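The filtering step can be sketched in a few lines. The sketch below is illustrative, not the paper's exact algorithm: it assumes we already have a mean reward for each (user prompt, candidate system prompt) pair, scores each user prompt by the variance of its rewards across the candidate system prompts, and keeps the top-k. The function name `p1_filter` and the scoring rule are our own assumptions.

```python
import statistics

def p1_filter(rewards, k=2):
    """Select the k user prompts whose mean rewards vary most
    across candidate system prompts.

    rewards: dict mapping user_prompt -> list of mean rewards,
             one entry per candidate system prompt.
    (Illustrative sketch; the paper's exact scoring rule may differ.)
    """
    scored = {
        up: statistics.pvariance(per_system)
        for up, per_system in rewards.items()
    }
    # Highest-variance user prompts discriminate system prompts best.
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Toy example: mean rewards of 3 candidate system prompts on 4 user prompts.
rewards = {
    "q1": [0.9, 0.1, 0.5],   # strongly separates the system prompts
    "q2": [0.5, 0.5, 0.5],   # uninformative: every system prompt ties
    "q3": [0.8, 0.2, 0.7],
    "q4": [0.4, 0.5, 0.4],   # nearly uninformative
}
print(p1_filter(rewards, k=2))  # → ['q1', 'q3']
```

In this toy setup, "q2" scores zero variance because every candidate system prompt performs identically on it, so it contributes nothing to distinguishing candidates; the filter drops it in favor of "q1" and "q3".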
Practical Impact
By selecting a small subset of user prompts with high variance across candidate system prompts, p1 substantially improves prompt optimization on reasoning benchmarks while reducing the amount of training data needed. This can translate into better performance on downstream tasks that rely on language models, such as question answering, text summarization, and language translation.
Analogy / Intuitive Explanation
Imagine you're trying to find the best way to get to a destination, and you have multiple maps (user prompts) to choose from. Some maps are very good at providing the best route, while others are not as accurate. The problem is that if you have too many maps, it becomes harder to find the best one. p1 is like a filter that helps you select the best maps (user prompts) by looking for those that provide the most varied and accurate routes (high variance among system prompts). This allows you to find the best system prompt more easily, leading to better performance on tasks.
Paper Information
Categories:
cs.LG cs.CL