Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation

Published on arXiv: 2602.13055v1
Authors

Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Nicu Sebe, Mubarak Shah

Abstract

Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). However, neither RLHF nor DPO takes into account the fact that learning certain preferences is more difficult than learning other preferences, rendering the optimization process suboptimal. To address this gap in text-to-image generation, we recently proposed Curriculum-DPO, a method that organizes image pairs by difficulty. In this paper, we introduce Curriculum-DPO++, an enhanced method that combines the original data-level curriculum with a novel model-level curriculum. More precisely, we propose to dynamically increase the learning capacity of the denoising network as training advances. We implement this capacity increase via two mechanisms. First, we initialize the model with only a subset of the trainable layers used in the original Curriculum-DPO. As training progresses, we sequentially unfreeze layers until the configuration matches the full baseline architecture. Second, as the fine-tuning is based on Low-Rank Adaptation (LoRA), we implement a progressive schedule for the dimension of the low-rank matrices. Instead of maintaining a fixed capacity, we initialize the low-rank matrices with a dimension significantly smaller than that of the baseline. As training proceeds, we incrementally increase their rank, allowing the capacity to grow until it converges to the same rank value as in Curriculum-DPO. Furthermore, we propose an alternative ranking strategy to the one employed by Curriculum-DPO. Finally, we compare Curriculum-DPO++ against Curriculum-DPO and other state-of-the-art preference optimization approaches on nine benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at https://github.com/CroitoruAlin/Curriculum-DPO.
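
The abstract describes two capacity-growth mechanisms: progressive unfreezing of trainable layers and a progressive schedule for the LoRA rank. The sketch below illustrates how the second mechanism could be implemented in PyTorch; the class name GrowableLoRALinear, the initial and final ranks, and the rank-doubling schedule are illustrative assumptions, not details taken from the paper or its repository.

```python
# Minimal sketch of a LoRA layer whose active rank grows over training
# (illustrative only; the paper's actual implementation may differ).
import torch
import torch.nn as nn


class GrowableLoRALinear(nn.Module):
    """Linear layer with a LoRA adapter whose active rank can grow during training."""

    def __init__(self, base: nn.Linear, r_init: int = 4, r_max: int = 64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # pretrained weights stay frozen
        self.r_max = r_max
        self.rank = r_init                     # currently active rank
        # Allocate at the final rank; only the first `rank` components are used.
        self.A = nn.Parameter(torch.empty(r_max, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r_max))
        nn.init.normal_(self.A, std=0.02)      # B = 0, so the adapter starts as a no-op

    def grow(self, new_rank: int) -> None:
        # Model-level curriculum step: expose more low-rank components.
        self.rank = min(new_rank, self.r_max)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        A = self.A[: self.rank]                # (rank, in_features)
        B = self.B[:, : self.rank]             # (out_features, rank)
        return self.base(x) + x @ A.t() @ B.t()


def rank_schedule(step: int, total_steps: int, r_init: int = 4, r_max: int = 64) -> int:
    """Illustrative schedule: double the active rank at fixed fractions of training."""
    frac = step / max(total_steps, 1)
    return min(r_init * (2 ** int(frac * 5)), r_max)
```

In this sketch, the inactive columns of B are initialized to zero, so enabling additional rank components leaves the layer's output unchanged at the moment of growth; capacity increases without perturbing the current model. This is one plausible way to grow capacity smoothly and is not necessarily the initialization used by the authors.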

Paper Summary

Problem
The paper addresses a shortcoming of Direct Preference Optimization (DPO) and reinforcement learning from human feedback (RLHF) in text-to-image generation: neither method accounts for the fact that some preferences are harder to learn than others, which makes the optimization process suboptimal.
Key Innovation
The researchers propose Curriculum-DPO++, an enhanced method that combines the data-level curriculum of Curriculum-DPO with a novel model-level curriculum. The model-level curriculum dynamically increases the learning capacity of the denoising network as training advances, allowing for better generalization. Capacity grows through two mechanisms: progressively unfreezing trainable layers and progressively increasing the rank of the LoRA matrices, as sketched below.
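
The following is a hedged sketch of the complementary mechanism, progressive layer unfreezing. The helper name unfreeze_up_to_stage and the grouping of parameters by name prefix are hypothetical; they only illustrate the idea of starting from a subset of trainable layers and gradually reaching the full Curriculum-DPO configuration.

```python
# Illustrative layer-unfreezing schedule (not the authors' code).
import torch.nn as nn


def unfreeze_up_to_stage(model: nn.Module,
                         stage_groups: list[list[str]],
                         current_stage: int) -> None:
    """Enable gradients only for the parameter groups scheduled up to `current_stage`.

    `stage_groups[i]` holds parameter-name prefixes that become trainable once
    training reaches stage i (an assumed grouping, not the paper's schedule).
    """
    for stage, prefixes in enumerate(stage_groups):
        trainable = stage <= current_stage
        for name, param in model.named_parameters():
            if any(name.startswith(p) for p in prefixes):
                param.requires_grad_(trainable)


# Hypothetical usage: start with only the last decoder blocks trainable,
# then unfreeze earlier blocks as training (and data difficulty) advances.
# stage_groups = [["up_blocks.3"], ["up_blocks.2", "up_blocks.1"], ["mid_block", "down_blocks"]]
# unfreeze_up_to_stage(unet, stage_groups, current_stage=0)
```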
Practical Impact
The method can be applied to various image generation tasks and, beyond image generation, to fine-tuning Large Language Models (LLMs). Its ability to increase learning capacity at the same pace as data complexity enables better generalization, producing models that follow the input prompt more closely and generate more visually appealing images.
Analogy / Intuitive Explanation
Think of the learning process as a puzzle with increasing difficulty levels. Curriculum-DPO++ is like a dynamic puzzle solver that starts with a simplified version of the puzzle and gradually adds complexity as it progresses, allowing it to learn and adapt more efficiently. This approach enables the model to tackle increasingly challenging examples and produce better results.
Paper Information
Categories: cs.CV, cs.AI, cs.LG
Published Date:
arXiv ID: 2602.13055v1
