Generating Literature-Driven Scientific Theories at Scale

Published: arXiv: 2601.16282v1
Authors

Peter Jansen, Peter Clark, Doug Downey, Daniel S. Weld

Abstract

Contemporary automated scientific discovery has focused on agents for generating scientific experiments, while systems that perform higher-level scientific activities such as theory building remain underexplored. In this work, we formulate the problem of synthesizing theories consisting of qualitative and quantitative laws from large corpora of scientific literature. We study theory generation at scale, using 13.7k source papers to synthesize 2.9k theories, examining how generation using literature-grounding versus parametric knowledge, and accuracy-focused versus novelty-focused generation objectives change theory properties. Our experiments show that, compared to using parametric LLM memory for generation, our literature-supported method creates theories that are significantly better at both matching existing evidence and at predicting future results from 4.6k subsequently-written papers.

Paper Summary

Problem
The paper addresses the lack of automated systems that build scientific theories from large corpora of scientific literature. Automated scientific discovery has so far focused on agents that generate scientific experiments, while higher-level activities such as theory building remain underexplored.
Key Innovation
The researchers introduce THEORIZER, a system that reads a large corpus of scientific papers and synthesizes candidate theories, each consisting of qualitative and quantitative laws. In the reported study, 13.7k source papers are used to synthesize 2.9k theories. Two generation settings are compared: a literature-supported method and a simpler baseline that relies on the LLM's parametric knowledge. The study also contrasts accuracy-focused and novelty-focused generation objectives.
Practical Impact
Automated theory generation systems like THEORIZER could guide future experiments by compressing the knowledge in a scientific domain into a set of governing laws that predict experimental outcomes. In the paper's experiments, literature-supported theories were significantly better than theories generated from parametric LLM memory at both matching existing evidence and predicting results from 4.6k subsequently written papers. Such laws could support a more systematic translation of empirically observed regularities into useful technologies.
Analogy / Intuitive Explanation
Think of THEORIZER as a librarian who reads thousands of scientific papers in a field, distills their key findings into candidate laws, and uses those laws to explain existing results and predict future ones. Just as a good librarian points you to the most relevant sources, THEORIZER grounds its theories in what the literature reports rather than in memory alone.
Paper Information
Categories: cs.CL, cs.AI
arXiv ID: 2601.16282v1
