What Matters in Virtual Try-Off? Dual-UNet Diffusion Model For Garment Reconstruction

Published on arXiv: 2604.08716v1
Authors

Loc-Phat Truong, Meysam Madadi, Sergio Escalera

Abstract

Virtual Try-On (VTON) has seen rapid advancements, providing a strong foundation for generative fashion tasks. However, the inverse problem, Virtual Try-Off (VTOFF), which aims to reconstruct the canonical garment from a draped-on image, remains a less understood domain, distinct from the heavily researched field of VTON. In this work, we seek to establish a robust architectural foundation for VTOFF by studying and adapting various diffusion-based strategies from VTON and general Latent Diffusion Models (LDMs). We focus our investigation on the Dual-UNet Diffusion Model architecture and analyze three axes of design: (i) Generation Backbone: comparing Stable Diffusion variants; (ii) Conditioning: ablating different mask designs, masked/unmasked inputs for image conditioning, and the utility of high-level semantic features; and (iii) Losses and Training Strategies: evaluating the impact of the auxiliary attention-based loss, perceptual objectives, and multi-stage curriculum schedules. Extensive experiments reveal trade-offs across various configuration options. Evaluated on the VITON-HD and DressCode datasets, our framework achieves state-of-the-art performance with a 9.5% reduction on the primary metric DISTS and competitive performance on LPIPS, FID, KID, and SSIM, providing both stronger baselines and insights to guide future Virtual Try-Off research.

Paper Summary

Problem
Virtual Try-Off (VTOFF) is the inverse problem of Virtual Try-On (VTON): the goal is to reconstruct the original, canonical garment from an image in which it is draped on a person. The task is challenging because the garment is only partially visible and often occluded by the body pose, making it difficult to accurately recover the appearance and shape of the unseen regions.
Key Innovation
The paper proposes a new framework for VTOFF built on a Dual-UNet Diffusion Model architecture, which combines the strengths of Stable Diffusion variants, Latent Diffusion Models (LDMs), and auxiliary modules to prevent distortion of fine-grained details. The framework consists of two branches: a Generation branch that denoises the garment image, and a Conditioning branch that extracts features from the person image (including high-level semantic features) to condition the generation process.
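The two-branch design can be illustrated with a minimal sketch. This is not the paper's implementation: the real branches are full Stable Diffusion UNets with timestep embeddings and multi-scale feature sharing, whereas here each branch is reduced to a single convolution and the conditioning features are injected through one cross-attention layer. All class and parameter names below are hypothetical, chosen only to make the dual-branch conditioning idea concrete.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Toy stand-in for a UNet branch: maps 4-channel latents to features."""
    def __init__(self, ch: int = 16):
        super().__init__()
        self.enc = nn.Conv2d(4, ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.enc(x)

class DualUNetSketch(nn.Module):
    """Generation branch cross-attends over features from a conditioning branch."""
    def __init__(self, ch: int = 16):
        super().__init__()
        self.cond = Branch(ch)  # processes latents of the draped-on (person) image
        self.gen = Branch(ch)   # processes the noisy garment latents being denoised
        self.attn = nn.MultiheadAttention(ch, num_heads=4, batch_first=True)
        self.out = nn.Conv2d(ch, 4, kernel_size=3, padding=1)

    def forward(self, noisy_garment: torch.Tensor,
                person_latents: torch.Tensor) -> torch.Tensor:
        q = self.gen(noisy_garment)      # queries: (B, C, H, W)
        kv = self.cond(person_latents)   # keys/values from the conditioning branch
        b, c, h, w = q.shape
        # Flatten spatial dims to token sequences for attention: (B, H*W, C)
        q_seq = q.flatten(2).transpose(1, 2)
        kv_seq = kv.flatten(2).transpose(1, 2)
        fused, _ = self.attn(q_seq, kv_seq, kv_seq)
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        return self.out(fused)  # toy analogue of the predicted noise

model = DualUNetSketch()
garment = torch.randn(2, 4, 8, 8)   # noisy garment latents
person = torch.randn(2, 4, 8, 8)    # person-image latents
pred = model(garment, person)       # same shape as the garment latents
```

The key design choice this mirrors is that the conditioning branch never generates pixels itself; it only supplies keys and values, so garment details from the person image can be transferred into the generation branch at matching spatial locations.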
Practical Impact
The proposed framework can be applied in real-world scenarios such as product retrieval, fashion dataset construction, and person-to-person Virtual Try-On. By achieving state-of-the-art performance on the VITON-HD and DressCode datasets, it provides a strong foundation for future VTOFF research and applications. Additionally, because VTOFF and VTON share much of their architecture, the insights from this study can also inform Virtual Try-On systems.
Analogy / Intuitive Explanation
Imagine trying to reconstruct a puzzle from a partially completed picture. In VTOFF, the input image is like the partially completed puzzle, and the goal is to fill in the missing pieces to create the original garment. The proposed framework uses a combination of techniques to "solve" the puzzle, generating a more accurate and realistic representation of the garment.
Paper Information
Categories: cs.CV
arXiv ID: 2604.08716v1