Disentangled representations via score-based variational autoencoders

Published: arXiv:2512.17127v1
Authors

Benjamin S. H. Lyo, Eero P. Simoncelli, Cristina Savin

Abstract

We present the Score-based Autoencoder for Multiscale Inference (SAMI), a method for unsupervised representation learning that combines the theoretical frameworks of diffusion models and VAEs. By unifying their respective evidence lower bounds, SAMI formulates a principled objective that learns representations through score-based guidance of the underlying diffusion process. The resulting representations automatically capture meaningful structure in the data: SAMI recovers ground-truth generative factors in our synthetic dataset, learns factorized, semantic latent dimensions from complex natural images, and encodes video sequences into latent trajectories that are straighter than those of alternative encoders, despite training exclusively on static images. Furthermore, SAMI can extract useful representations from pre-trained diffusion models with minimal additional training. Finally, the explicitly probabilistic formulation provides new ways to identify semantically meaningful axes in the absence of supervised labels, and its mathematical exactness allows us to make formal statements about the nature of the learned representation. Overall, these results indicate that implicit structural information in diffusion models can be made explicit and interpretable through synergistic combination with a variational autoencoder.

Paper Summary

Problem
The main problem addressed in this paper is how to learn representations that decompose complex sensory data into abstract, interpretable factors, a process known as disentanglement. Disentangled representations offer several benefits, including improved interpretability, better generalization, and stronger transfer across different tasks.
Key Innovation
The key innovation of this work is the Score-based Autoencoder for Multiscale Inference (SAMI), a method that combines the strengths of variational autoencoders (VAEs) and diffusion models to learn structured, interpretable latent representations. SAMI uses a conditional diffusion model as the generative component of a VAE, which allows it to reuse the inference network for conditioning. This approach enables the model to learn a simple latent posterior approximation that imposes strong factorization constraints on the latent code, resulting in disentangled representations.
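To make the combination concrete, the following is a minimal toy sketch of the kind of objective the summary describes: a denoising-score-matching term whose score network is conditioned on a latent code from a VAE-style encoder, plus a KL term pulling the posterior toward a factorized Gaussian prior. All names, shapes, and linear "networks" here are hypothetical stand-ins, not the paper's actual architecture or implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2-D data, 1-D latent code. Purely illustrative.
D, Z = 2, 1
W_enc = rng.normal(size=(Z, D)) * 0.1        # encoder mean weights
W_score = rng.normal(size=(D, D + Z)) * 0.1  # conditional score-net weights

def encode(x):
    """Gaussian posterior q(z|x): linear mean, unit variance for simplicity."""
    mu = W_enc @ x
    logvar = np.zeros(Z)
    return mu, logvar

def score_net(x_t, z):
    """Toy conditional score model s(x_t, z): linear in [x_t, z].
    (A real model would also condition on the noise level.)"""
    return W_score @ np.concatenate([x_t, z])

def sami_style_loss(x, sigma=0.5):
    """One-sample Monte Carlo estimate of a z-conditioned denoising
    score-matching term plus KL(q(z|x) || N(0, I)) -- the two
    ingredients the summary describes, combined into one objective."""
    mu, logvar = encode(x)
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=Z)  # reparameterized sample
    noise = rng.normal(size=D)
    x_t = x + sigma * noise                 # noised data point
    target = -noise / sigma                 # score of the Gaussian noising kernel
    dsm = np.sum((score_net(x_t, z) - target) ** 2)
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return dsm + kl

x = rng.normal(size=D)
loss = sami_style_loss(x)
print(round(float(loss), 3))
```

In a real system both terms would be averaged over noise levels and minibatches, and the score network would be a deep conditional model; the point of the sketch is only that a single loss couples the encoder to the diffusion process, which is what lets the factorized prior shape the latent code.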
Practical Impact
This research has several practical implications. First, SAMI can extract useful representations from pre-trained diffusion models with minimal additional training, which is valuable when a large pre-trained generative model is available but its internal representation of the data is not directly accessible. Second, the explicitly probabilistic formulation of SAMI provides new ways to identify semantically meaningful axes without supervised labels, which matters in settings where labels are unavailable but the underlying structure of the data must still be uncovered.
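The first point above can be sketched as follows: treat the score network as frozen (standing in for a pre-trained diffusion model) and update only the encoder. Everything here is a hypothetical toy, with linear maps, fixed noise samples, and finite-difference gradient descent standing in for real training; none of the names come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
D, Z = 2, 1

# Stand-in for a pre-trained conditional score network: its weights
# are fixed ("frozen") and never updated below.
W_frozen = rng.normal(size=(D, D + Z)) * 0.1
def frozen_score(x_t, z):
    return W_frozen @ np.concatenate([x_t, z])

# Only the encoder weights are trained.
W_enc = rng.normal(size=(Z, D)) * 0.1

def loss_fn(W, x, noise, z_eps, sigma=0.5):
    """Score-matching-plus-KL style loss with fixed random draws,
    so the objective is deterministic in the encoder weights W."""
    mu = W @ x
    z = mu + z_eps                       # reparameterized latent sample
    x_t = x + sigma * noise
    target = -noise / sigma
    dsm = np.sum((frozen_score(x_t, z) - target) ** 2)
    kl = 0.5 * np.sum(mu ** 2)           # KL to N(0, I), unit posterior variance
    return dsm + kl

x = rng.normal(size=D)
noise = rng.normal(size=D)
z_eps = rng.normal(size=Z)

before = loss_fn(W_enc, x, noise, z_eps)
lr, h = 0.05, 1e-5
for _ in range(50):
    # Forward-difference gradient estimate over the encoder weights only.
    grad = np.zeros_like(W_enc)
    base = loss_fn(W_enc, x, noise, z_eps)
    for i in range(Z):
        for j in range(D):
            Wp = W_enc.copy()
            Wp[i, j] += h
            grad[i, j] = (loss_fn(Wp, x, noise, z_eps) - base) / h
    W_enc -= lr * grad
after = loss_fn(W_enc, x, noise, z_eps)
print(after < before)
```

The loss decreases while the "pre-trained" score weights stay untouched, which is the sense in which a useful encoder can be obtained with minimal additional training.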
Analogy / Intuitive Explanation
Imagine you are trying to understand a complex painting. The painting is made up of many different colors, textures, and shapes, but you want to understand the underlying story or message that the artist is trying to convey. A disentangled representation would be like breaking down the painting into its individual components, such as the colors, textures, and shapes, and then understanding how they relate to each other to form the overall story. SAMI is like a tool that helps to break down complex data into its individual components, allowing us to understand the underlying structure and meaning of the data.
Paper Information
Categories: stat.ML, cs.LG
Published Date:
arXiv ID: 2512.17127v1
