Video Diffusion Alignment via Reward Gradients

Anonymous Authors

Abstract

We have made significant progress towards building foundational video diffusion models. As these models are trained on large-scale unsupervised data, it has become crucial to adapt them to specific downstream tasks, such as video-text alignment or ethical video generation. Adapting these models via supervised fine-tuning requires collecting target datasets of videos, which is challenging and tedious. In this work, we instead utilize pre-trained reward models that are learned via preferences on top of powerful discriminative models. These models provide dense gradient information with respect to the generated RGB pixels, which is critical for learning efficiently in complex search spaces such as videos. We show that our approach enables alignment of video diffusion models for aesthetic generation, text-video alignment, and long-horizon video generation 3X longer than the training sequence length. We also show that our approach learns much more efficiently, in terms of both reward queries and compute, than previous gradient-free approaches for video generation.
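The core argument above, that dense reward gradients make search far more query-efficient than gradient-free optimization, can be illustrated with a toy numpy sketch. This is not the VADER implementation (which backpropagates reward gradients through a video diffusion model); the quadratic reward, the 64-dimensional "pixel" vector, and the query budgets below are all illustrative assumptions.

```python
import numpy as np

def reward(x, target):
    """Toy differentiable reward: higher when x is closer to target."""
    return -float(np.sum((x - target) ** 2))

def reward_grad(x, target):
    """Analytic gradient of the toy reward with respect to x."""
    return -2.0 * (x - target)

rng = np.random.default_rng(0)
target = rng.standard_normal(64)  # stands in for "ideal" pixels
x0 = np.zeros(64)

# Gradient-based alignment (the idea behind VADER): follow dR/dx directly.
x = x0.copy()
for _ in range(25):               # 25 gradient queries
    x += 0.1 * reward_grad(x, target)

# Gradient-free baseline: random hill climbing, reward queries only.
y, best = x0.copy(), reward(x0, target)
for _ in range(2000):             # 2000 reward queries
    cand = y + 0.1 * rng.standard_normal(64)
    if reward(cand, target) > best:
        y, best = cand, reward(cand, target)

print(f"gradient-based reward after   25 queries: {reward(x, target):.4f}")
print(f"gradient-free  reward after 2000 queries: {best:.4f}")
```

With gradient access, each query moves every coordinate toward the optimum at once, so 25 queries nearly maximize the reward; the gradient-free baseline must probe blindly in 64 dimensions and remains far from the optimum even with 80x the query budget.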

Aesthetic Reward, in-distribution prompts

We compare VADER against its base model, ModelScope, and prior alignment approaches DDPO and DiffusionDPO (labeled DPO), using the aesthetic reward on in-distribution prompts.

"A man smiles as he stirs his food in the pot"
VADER (Ours)
ModelScope
DDPO
DPO
"A woman in a purple top pulling food out of a oven"
VADER (Ours)
ModelScope
DDPO
DPO
"At night on a street with a group of a bicycle riders riding down the road together"
VADER (Ours)
ModelScope
DDPO
DPO


HPS Reward, in-distribution prompts

We compare VADER against its base model, ModelScope, and prior alignment approaches DDPO and DiffusionDPO (labeled DPO), using the HPS reward on in-distribution prompts.

"some people holding umbrellas and standing by a car in the rain"
VADER (Ours)
ModelScope
DDPO
DPO
"A woman eating fresh vegetables from a bowl"
VADER (Ours)
ModelScope
DDPO
DPO
"A man getting food ready while people watch."
VADER (Ours)
ModelScope
DDPO
DPO


Aesthetic Reward, OOD prompts

We compare VADER against its base model, ModelScope, and prior alignment approaches DDPO and DiffusionDPO (labeled DPO), using the aesthetic reward on out-of-distribution (OOD) prompts.

"a dolphin riding a bike"
VADER (Ours)
ModelScope
DDPO
DPO
"a wolf washing the dishes"
VADER (Ours)
ModelScope
DDPO
DPO
"a sheep playing chess"
VADER (Ours)
ModelScope
DDPO
DPO


HPS Reward, OOD prompts

We compare VADER against its base model, ModelScope, and prior alignment approaches DDPO and DiffusionDPO (labeled DPO), using the HPS reward on out-of-distribution (OOD) prompts.

"a chicken riding a bike"
VADER (Ours)
ModelScope
DDPO
DPO
"a monkey washing the dishes"
VADER (Ours)
ModelScope
DDPO
DPO
"a deer playing chess"
VADER (Ours)
ModelScope
DDPO
DPO
