Semantic Invariance in Agentic AI

Generative AI & LLMs
arXiv: 2603.13173v1
Authors

I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, Carlos T. Calafate

Abstract

Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordination systems. However, deploying LLM agents in consequential applications requires assurance that their reasoning remains stable under semantically equivalent input variations, a property we term semantic invariance. Standard benchmark evaluations, which assess accuracy on fixed, canonical problem formulations, fail to capture this critical reliability dimension. To address this shortcoming, we present a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents, applying eight semantic-preserving transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, and contrastive formulation) across seven foundation models spanning four distinct architectural families: Hermes (70B, 405B), Qwen3 (30B-A3B, 235B-A22B), DeepSeek-R1, and gpt-oss (20B, 120B). Our evaluation encompasses 19 multi-step reasoning problems across eight scientific domains. The results reveal that model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91), while larger models exhibit greater fragility.
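As a concrete illustration of how an invariance score of this kind can be computed, the sketch below compares a model's answer to the canonical prompt against its answers to transformed prompts using off-the-shelf sentence embeddings. The encoder choice (all-MiniLM-L6-v2 via sentence-transformers) and the 0.9 threshold are illustrative assumptions, not the paper's exact setup.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative encoder; the paper does not specify which embedding
# model backs its semantic-similarity metric.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(answer_a: str, answer_b: str) -> float:
    """Cosine similarity between the sentence embeddings of two answers."""
    emb = encoder.encode([answer_a, answer_b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def invariance_rate(canonical_answer: str, variant_answers: list[str],
                    threshold: float = 0.9) -> float:
    """Fraction of variant answers whose similarity to the canonical
    answer clears the threshold (0.9 here is an assumed cutoff)."""
    hits = sum(semantic_similarity(canonical_answer, v) >= threshold
               for v in variant_answers)
    return hits / len(variant_answers)
```

Under this reading, a model counts as semantically invariant on a problem when all (or a target fraction) of its transformed-prompt answers clear the threshold.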

Paper Summary

Problem
Large Language Models (LLMs) are increasingly used as autonomous reasoning agents, but their reliability remains a concern: they are sensitive to superficial input variations that preserve semantic content. This fragility undermines the trustworthiness of LLM agents in real-world deployments, where input formulations are inherently variable and uncontrolled.
Key Innovation
The researchers present a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents. The framework applies eight semantic-preserving transformations to each problem and evaluates seven foundation models spanning four distinct architectural families, measuring whether semantically equivalent inputs yield consistent outputs.
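To make the testing loop concrete, here is a minimal sketch of such a harness, assuming a hypothetical query_model callable standing in for any LLM API. Only the context and formulation transformations are approximated inline; paraphrase, reordering, expansion, and contraction would typically require an auxiliary rewriting model.

```python
# Minimal metamorphic-testing harness sketch. `query_model` is a
# hypothetical stand-in for an LLM API call; the rewrites below only
# approximate the paper's transformations for illustration.

def identity(p: str) -> str:
    return p

def academic_context(p: str) -> str:
    return f"In a formal academic setting, consider the following problem. {p}"

def business_context(p: str) -> str:
    return f"A business analyst needs an answer to the following. {p}"

def contrastive(p: str) -> str:
    return f"{p} Also explain why plausible alternative answers would be wrong."

TRANSFORMS = {
    "identity": identity,
    "academic_context": academic_context,
    "business_context": business_context,
    "contrastive": contrastive,
    # paraphrase, fact reordering, expansion, and contraction omitted:
    # they need an auxiliary rewriting model rather than a template
}

def run_metamorphic_suite(problem: str, query_model) -> dict[str, str]:
    """Ask the model each semantically equivalent variant of one problem
    and record the answers for downstream consistency scoring."""
    return {name: query_model(fn(problem)) for name, fn in TRANSFORMS.items()}
```

The returned answers can then be scored pairwise with a similarity metric like the one sketched under the abstract above.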
Practical Impact
This research has significant implications for deploying LLM agents in high-stakes environments. The finding that model scale does not predict robustness challenges conventional assumptions about agent capability and highlights the need to weigh robustness alongside raw benchmark performance. The framework gives practitioners a concrete tool for evaluating reliability, which can inform model selection and deployment decisions.
Analogy / Intuitive Explanation
Imagine a conversational AI assistant that can answer questions on various topics. If it is asked the same question phrased in different ways, it should give the same answer. Current LLMs, however, are like a person whose answer shifts depending on how the question is worded. The researchers' framework tests this reliability directly: it rephrases each question in several meaning-preserving ways and checks whether the model's answers stay consistent.
Paper Information
Categories: cs.AI, cs.CL
Published Date: March 2026
arXiv ID: 2603.13173v1