Recomposer: Event-roll-guided generative audio editing

Published: arXiv: 2509.05256v1
Authors

Daniel P. W. Ellis, Eduardo Fonseca, Ron J. Weiss, Kevin Wilson, Scott Wisdom, Hakan Erdogan, John R. Hershey, Aren Jansen, R. Channing Moore, Manoj Plakal

Abstract

Editing complex real-world sound scenes is difficult because individual sound sources overlap in time. Generative models can fill in missing or corrupted details based on their strong prior understanding of the data domain. We present a system for editing individual sound events within complex scenes able to delete, insert, and enhance individual sound events based on textual edit descriptions (e.g., "enhance Door") and a graphical representation of the event timing derived from an "event roll" transcription. We present an encoder-decoder transformer working on SoundStream representations, trained on synthetic (input, desired output) audio example pairs formed by adding isolated sound events to dense, real-world backgrounds. Evaluation reveals the importance of each part of the edit descriptions: action, class, timing. Our work demonstrates "recomposition" is an important and practical application.
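The "event roll" in the abstract can be pictured as a grid with one row per sound class and one column per time frame. The sketch below is illustrative only and is not the authors' code: the `Event` type, the 10-frames-per-second resolution, and the text rendering are assumptions introduced here.

```python
import math
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical container for one transcribed sound event."""
    label: str    # sound class, e.g. "Door"
    onset: float  # start time in seconds
    offset: float  # end time in seconds


def event_roll(events, classes, duration, frame_rate=10.0):
    """Build a len(classes) x n_frames binary grid; 1 marks an active class.

    frame_rate (frames per second) is an assumed value, not taken from the paper.
    """
    n_frames = round(duration * frame_rate)
    roll = [[0] * n_frames for _ in classes]
    for ev in events:
        row = classes.index(ev.label)
        start = max(0, int(ev.onset * frame_rate))
        stop = min(n_frames, math.ceil(ev.offset * frame_rate))
        for t in range(start, stop):
            roll[row][t] = 1
    return roll


def render(roll, classes):
    """Draw the roll as text: '#' where a class is active, '.' elsewhere."""
    return "\n".join(
        f"{name:>6} " + "".join("#" if v else "." for v in row)
        for name, row in zip(classes, roll)
    )


classes = ["Door", "Speech"]
events = [Event("Door", 0.5, 1.0), Event("Speech", 0.0, 1.5)]
roll = event_roll(events, classes, duration=2.0)
print(render(roll, classes))
```

Overlapping events simply occupy different rows in the same columns, which is why a roll can specify timing for one class even when other sounds are active at the same moment.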

Paper Summary

Problem
Editing complex real-world sound scenes is challenging because individual sound sources often overlap in time. Traditional audio editing software modifies selected spans of the waveform directly, but a span that contains overlapping events cannot be changed for one source without also changing the others.
Key Innovation
The Recomposer system takes a sound-event-oriented approach to editing: users delete, insert, or enhance individual sound events within complex scenes by supplying a textual edit description (e.g., "enhance Door") and a graphical representation of the event timing derived from an event roll. The system uses an encoder-decoder transformer working on SoundStream representations, trained on synthetic (input, desired output) audio example pairs formed by adding isolated sound events to dense, real-world backgrounds.
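The synthetic training pairs described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: waveforms are plain lists of floats, and the function names, the `boost` factor for "enhance", and the exact input/target choice for each action are all hypothetical.

```python
def place_event(length, event, onset):
    """Return a silent track of `length` samples with `event` copied in at `onset`."""
    track = [0.0] * length
    for i, sample in enumerate(event):
        if 0 <= onset + i < length:
            track[onset + i] = sample
    return track


def mix(background, event_track, gain=1.0):
    """Sample-wise sum of the background and the (optionally scaled) event track."""
    return [b + gain * e for b, e in zip(background, event_track)]


def make_pair(background, event, onset, action, label, boost=2.0):
    """Build one (input, target) example for an edit action.

    Assumed conventions (not confirmed by the source):
      delete  : input contains the event, target is the background alone
      insert  : input is the background alone, target contains the event
      enhance : both contain the event, target has it scaled by `boost`
    """
    track = place_event(len(background), event, onset)
    with_event = mix(background, track)
    if action == "delete":
        source, target = with_event, list(background)
    elif action == "insert":
        source, target = list(background), with_event
    elif action == "enhance":
        source, target = with_event, mix(background, track, gain=boost)
    else:
        raise ValueError(f"unknown action: {action}")
    span = (onset, min(len(background), onset + len(event)))
    return {"prompt": f"{action} {label}", "input": source,
            "target": target, "span": span}


background = [0.5] * 8   # stand-in for a dense real-world recording
door = [1.0, -1.0]       # stand-in for an isolated sound event
example = make_pair(background, door, onset=3, action="delete", label="Door")
print(example["prompt"], example["span"])
```

Because the isolated event is mixed in synthetically, the clean target is known exactly for every action, which is what makes supervised training on (input, desired output) pairs possible.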
Practical Impact
The Recomposer system could make precise edits to individual sound events within complex scenes practical, something that waveform-level tools handle poorly when sources overlap. Possible applications include film and television post-production, music production, and live event sound design.
Analogy / Intuitive Explanation
Imagine trying to edit a recording of a busy street by changing only specific sounds, like car horns or chirping birds. The Recomposer system lets you do just that: name a specific sound (event) within the scene, mark when it occurs, and edit it with minimal effect on the rest of the audio. This is like having a "sound-editing wand" that lets you target a specific sound and remove it, make it more prominent, or add a new one at a chosen moment.
Paper Information
Categories: cs.SD, cs.AI, cs.LG, eess.AS
arXiv ID: 2509.05256v1