**Training Diffusion Models with Reinforcement Learning: Unlocking the Potential**
Diffusion models have gained widespread recognition as the go-to method for generating complex and high-dimensional outputs, such as AI art and synthetic images. These models have found success in various applications, including drug design and continuous control. While diffusion models are primarily designed to match training data, their true potential lies in achieving downstream objectives. This article explores how diffusion models can be trained directly on these objectives using reinforcement learning (RL).
**Reinforcement Learning for Diffusion Models**
When applying RL to diffusion models, the basic assumption is that we have a reward function that evaluates the quality of a generated sample (e.g. an image). The goal is for the diffusion model to generate samples that maximize this reward function. Diffusion models are typically trained with a loss function that approximates maximum likelihood estimation (MLE), encouraging the generation of samples that closely resemble the training data. In the RL setting, however, there is no training data, only samples from the diffusion model and their associated rewards.
One approach is to treat the samples as training data and incorporate the rewards by weighting the loss for each sample accordingly. This method, known as reward-weighted regression (RWR), approximates the reward maximization process. However, RWR suffers from multiple levels of approximation, leading to suboptimal performance. To address this, a new algorithm called denoising diffusion policy optimization (DDPO) is introduced.
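The reward-weighting idea behind RWR can be sketched in a few lines: each sample's standard denoising loss is scaled by an exponential weight on its reward. This is a minimal sketch of one common RWR variant, not the exact scheme from any specific paper; the temperature and normalization choices vary.

```python
import math

def rwr_weights(rewards, temperature=1.0):
    """Exponential reward weights, normalized over the batch (one common
    RWR variant; the exact weighting scheme differs between papers)."""
    w = [math.exp(r / temperature) for r in rewards]
    total = sum(w)
    return [x / total for x in w]

def rwr_loss(denoising_losses, rewards, temperature=1.0):
    """Weight each sample's standard diffusion (denoising) loss by its
    reward weight, so high-reward samples dominate the regression."""
    weights = rwr_weights(rewards, temperature)
    return sum(w * l for w, l in zip(weights, denoising_losses))
```

Because the samples are treated as if they were training data, this is only an approximation to reward maximization, which is the gap DDPO is designed to close.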
**Denoising Diffusion Policy Optimization (DDPO)**
DDPO leverages the insight that maximizing the reward of the final sample is better achieved by considering the entire sequence of denoising steps. The diffusion process is reframed as a multi-step Markov decision process (MDP), where each denoising step is treated as an action. The agent receives a reward only at the final step of each denoising trajectory, when the final sample is generated. This framework allows the utilization of powerful RL algorithms designed for multi-step MDPs.
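The MDP framing above can be sketched as follows. Here `denoise_step` and `reward_fn` are hypothetical stand-ins for the diffusion sampler and the reward model; the point is only the structure: each denoising step is an action with an exact log-likelihood, and the reward is sparse, arriving at the end of the trajectory.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class DenoisingStep:
    """One transition of the denoising MDP: the state is the current noisy
    sample (plus prompt and timestep), the action is the next, less-noisy
    sample, and the log-prob comes from the denoiser's output distribution."""
    state: Any      # (prompt, t, x_t)
    action: Any     # x_{t-1}, sampled from p_theta(x_{t-1} | x_t, prompt)
    log_prob: float

def rollout(denoise_step: Callable, x_T, prompt, T, reward_fn):
    """Run one denoising trajectory; the reward arrives only at the final step."""
    trajectory, x = [], x_T
    for t in range(T, 0, -1):
        x_next, logp = denoise_step(x, t, prompt)  # hypothetical sampler call
        trajectory.append(DenoisingStep((prompt, t, x), x_next, logp))
        x = x_next
    return trajectory, reward_fn(x, prompt)  # sparse terminal reward
```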
Instead of relying on the approximate likelihood of the final sample, DDPO uses the exact likelihood of each individual denoising step, which is straightforward to compute. The algorithm comes in two policy gradient variants: DDPO_SF, which uses the simple score function (REINFORCE) estimator, and DDPO_IS, which uses an importance-sampled estimator. DDPO_IS is the better-performing of the two, and its implementation closely resembles proximal policy optimization (PPO).
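A PPO-style clipped surrogate over the per-step likelihood ratios gives the flavor of the importance-sampled variant. This is a simplified sketch for a single trajectory: in practice the terminal rewards are normalized into advantages across a batch, and the log-probs come from the denoiser's Gaussian at each step.

```python
import math

def clipped_is_objective(new_log_probs, old_log_probs, reward, clip_eps=0.1):
    """PPO-style clipped importance-sampling objective for one denoising
    trajectory (a sketch of the importance-sampled estimator, not the
    authors' exact implementation)."""
    total = 0.0
    for new_lp, old_lp in zip(new_log_probs, old_log_probs):
        ratio = math.exp(new_lp - old_lp)  # exact per-step likelihood ratio
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        # take the pessimistic (lower) of the two bounds, as in PPO
        total += min(ratio * reward, clipped * reward)
    return total
```

The clipping keeps the update trustworthy when the current model drifts away from the model that generated the samples, which is what allows multiple gradient steps per batch of trajectories.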
**Finetuning Stable Diffusion Using DDPO**
To demonstrate the effectiveness of DDPO, Stable Diffusion v1-4 is finetuned using DDPO_IS on four different tasks, each defined by a unique reward function:
1. **Compressibility**: Measures how easily the image can be compressed using the JPEG algorithm. The reward is the negative file size of the image (in kB) when saved as a JPEG.
2. **Incompressibility**: Assesses the level of difficulty in compressing the image using the JPEG algorithm. The reward is the positive file size of the image (in kB) when saved as a JPEG.
3. **Aesthetic Quality**: Evaluates the aesthetic appeal of the image to the human eye. The reward is obtained from the LAION aesthetic predictor, a neural network trained on human preferences.
4. **Prompt-Image Alignment**: Gauges how well the generated image represents the requirements stated in the prompt. This task involves feeding the image into LLaVA, a large vision-language model, which describes the image. The similarity between the description and the original prompt is computed using BERTScore.
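The (in)compressibility rewards are the simplest to make concrete: encode the image and use the resulting file size in kB, negated or not. In the actual tasks the encoder is JPEG (e.g. via an image library); the sketch below takes the encoder as a parameter and uses `zlib` only as a stand-in so it stays self-contained.

```python
import zlib

def compressibility_reward(image_bytes, encode_fn=zlib.compress):
    """Reward = negative encoded size in kB (smaller file => higher reward).
    encode_fn would be JPEG encoding in the real task; zlib is a stand-in."""
    return -len(encode_fn(image_bytes)) / 1000.0

def incompressibility_reward(image_bytes, encode_fn=zlib.compress):
    """Reward = positive encoded size in kB (harder to compress => higher reward)."""
    return -compressibility_reward(image_bytes, encode_fn)
```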
During finetuning, a set of prompts is provided to Stable Diffusion. For the first three tasks, simple prompts of the form “a(n) [animal]” are used. In the case of prompt-image alignment, prompts take the form of “a(n) [animal] [activity]”. It is observed that Stable Diffusion often struggles to generate images that match these unusual animal-activity combinations, making RL finetuning a critical component for improvement.
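The prompt templates can be sketched as follows. The animal and activity lists here are placeholders for illustration, not the actual sets used in the experiments.

```python
import random

ANIMALS = ["dolphin", "bear", "octopus"]                 # placeholder list
ACTIVITIES = ["riding a bike", "washing dishes"]         # placeholder list

def article(word):
    """Pick 'a' or 'an' based on the first letter (the 'a(n)' in the template)."""
    return "an" if word[0] in "aeiou" else "a"

def simple_prompt(rng=random):
    """'a(n) [animal]' prompt, used for the first three tasks."""
    animal = rng.choice(ANIMALS)
    return f"{article(animal)} {animal}"

def alignment_prompt(rng=random):
    """'a(n) [animal] [activity]' prompt, used for prompt-image alignment."""
    animal, activity = rng.choice(ANIMALS), rng.choice(ACTIVITIES)
    return f"{article(animal)} {animal} {activity}"
```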
**Performance of DDPO on Different Rewards**
The performance of DDPO is illustrated using the simple rewards of compressibility, incompressibility, and aesthetic quality. All images are generated using the same random seed. A noticeable qualitative difference is observed between the “vanilla” Stable Diffusion and the RL-finetuned models. Interestingly, the aesthetic quality model tends towards minimalist black-and-white line drawings, reflecting the aesthetic preferences captured by the LAION aesthetic predictor.
**DDPO on Prompt-Image Alignment**
DDPO is further demonstrated on the more complex task of prompt-image alignment. The training process is visualized through snapshots, showcasing the samples generated for the same prompt and random seed over time. The initial sample is obtained from vanilla Stable Diffusion. Notably, the model gradually shifts towards a cartoon-like style, which was not the intention. This shift is attributed to the fact that the pretraining data often features animals engaging in human-like activities in a cartoon-like style. The model capitalizes on this existing knowledge to align with the prompt more effectively.
**Unexpected Generalization and Overoptimization**
Similar to the surprising generalization observed in finetuning large language models with RL, text-to-image diffusion models also exhibit this phenomenon. For instance, the aesthetic quality model, finetuned on prompts of 45 common animals, generalizes not only to unseen animals but also to everyday objects. Similarly, the prompt-image alignment model, trained on the same set of animals and a few activities, generalizes to unseen animals, unseen activities, and even novel combinations of the two.
Overoptimization, a common challenge in reward-based finetuning, is also observed in these tasks: the models eventually sacrifice meaningful image content in order to maximize reward. Optimizing for prompt-image alignment additionally reveals that LLaVA is susceptible to typographic attacks: DDPO fools the model by rendering text into the image that loosely resembles the prompt, rather than actually depicting the described content.
By leveraging DDPO, diffusion models can be trained directly on downstream objectives through RL. This approach allows for more effective reward maximization, surpassing the limitations of existing methods. The use of RL finetuning with Stable Diffusion demonstrates significant improvements in tasks related to compressibility, incompressibility, aesthetic quality, and prompt-image alignment. The unexpected generalization and overoptimization phenomena add further depth to the capabilities and challenges of training diffusion models.