"A bird building a nest..."
Video diffusion models (VDMs) have demonstrated remarkable capabilities in text-to-video (T2V) generation. Despite this success, VDMs still suffer from degraded image quality and flickering artifacts. To address these issues, some approaches introduce preference learning, exploiting human feedback to enhance video generation. However, these methods largely adopt routines from the image domain without an in-depth investigation of video-specific preference optimization.
In this paper, we reexamine the design of video preference learning from two key aspects, the feedback source and the feedback tuning methodology, and present OnlineVPO, a more efficient preference learning framework tailored specifically for VDMs. On the feedback source, we find that the image-level reward models commonly used in existing methods fail to provide human-aligned video preference signals due to the modality gap. In contrast, video quality assessment (VQA) models show superior alignment with human perception of video quality. Building on this insight, we propose leveraging VQA models as a proxy for humans to provide more modality-aligned feedback for VDMs. Regarding the preference tuning methodology, we introduce an online DPO algorithm tailored for VDMs. It not only offers superior scalability when optimizing videos at higher resolutions and longer durations compared with existing methods, but also mitigates the insufficient-optimization issue caused by off-policy learning through online preference generation and curriculum preference update designs.
Extensive experiments on open-source video diffusion models demonstrate that OnlineVPO is a simple yet effective and, more importantly, scalable preference learning algorithm for video diffusion models.
Current video feedback learning methods primarily rely on (1) image-based reward models (e.g., ImageReward), where we observe a significant modality gap: image rewards capture spatial details but fail to capture temporal distortions (e.g., flickering, motion inconsistency); and (2) off-policy optimization (e.g., offline DPO), which suffers from insufficient optimization due to the static preference dataset, leading to suboptimal video quality.
Figure 1: Comparison of different preference optimization paradigms for VDMs. Existing methods suffer from misaligned feedback, limited scalability, and off-policy optimization.
We propose OnlineVPO, a framework designed to align video diffusion models (VDMs) with human preferences using video-centric signals. Unlike offline methods, our approach generates preferences on-the-fly and updates the policy via an online curriculum, ensuring the feedback stays aligned with the current policy distribution.
Figure 2: Overview of the OnlineVPO framework. We use a VQA model for reward labeling and optimize the VDM via direct preference optimization in an online, iterative manner.
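To make the on-the-fly loop in Figure 2 concrete, below is a minimal Python sketch of one training step. It assumes a trainable policy `vdm` with hypothetical `sample` and `dpo_loss` methods, a frozen reference copy `ref_vdm`, and a `vqa_score` wrapper around an off-the-shelf VQA model; these names are illustrative and not taken from the released code.

```python
# Hypothetical sketch of one OnlineVPO training step (names are placeholders).
import torch

def onlinevpo_step(vdm, ref_vdm, vqa_score, prompts, optimizer, beta=0.1):
    # 1) Online preference generation: sample two candidate videos per prompt
    #    from the *current* policy, so the feedback stays on-policy.
    with torch.no_grad():
        vids_a = vdm.sample(prompts)          # [B, T, C, H, W]
        vids_b = vdm.sample(prompts)
        scores_a, scores_b = vqa_score(vids_a), vqa_score(vids_b)  # [B]

    # 2) Label preferences with the VQA model acting as a human proxy.
    a_wins = (scores_a >= scores_b)[:, None, None, None, None]
    winner = torch.where(a_wins, vids_a, vids_b)
    loser = torch.where(a_wins, vids_b, vids_a)

    # 3) DPO-style update on the freshly labeled pair (no replay buffer kept).
    loss = vdm.dpo_loss(winner, loser, prompts, ref_model=ref_vdm, beta=beta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Sampling two candidates per prompt and ranking them with the VQA model keeps every preference pair drawn from the current policy distribution, which is the key difference from offline DPO on a fixed dataset.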
We summarize the differences between OnlineVPO and existing methods in the table below. Our approach achieves online, video-centric preference optimization and further exhibits superior scalability and effectiveness.
We evaluate OnlineVPO on the standard VBench benchmark. Our method achieves significant improvements over baselines, particularly in video quality and temporal consistency metrics.
We further investigate the scalability of OnlineVPO. The direct preference optimization formulation avoids the heavy memory burden of storing extensive replay buffers, allowing OnlineVPO to efficiently fine-tune VDMs at higher resolutions and longer frame sequences with significantly lower GPU memory than the ReFL baseline.
Figure 3: GPU memory usage comparison. OnlineVPO exhibits superior scalability across increasing resolutions and frame lengths.
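For intuition on why no replay buffer is needed, the following sketch shows a Diffusion-DPO-style pairwise loss computed directly on the current winner/loser latents. The `model`, `ref_model`, and noise `scheduler` interfaces (`add_noise`, `num_timesteps`) are assumed placeholders, not the paper's implementation.

```python
# Hedged sketch of a Diffusion-DPO-style pairwise loss on noised video latents.
# Only the current winner/loser minibatch is needed per update.
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(model, ref_model, x_w, x_l, cond, scheduler, beta=0.1):
    # Sample one diffusion timestep and shared noise for the pair.
    t = torch.randint(0, scheduler.num_timesteps, (x_w.shape[0],), device=x_w.device)
    noise = torch.randn_like(x_w)
    noisy_w = scheduler.add_noise(x_w, noise, t)
    noisy_l = scheduler.add_noise(x_l, noise, t)

    # Denoising errors under the trainable policy (per-sample mean over all dims).
    err_w = F.mse_loss(model(noisy_w, t, cond), noise, reduction="none").mean(dim=(1, 2, 3, 4))
    err_l = F.mse_loss(model(noisy_l, t, cond), noise, reduction="none").mean(dim=(1, 2, 3, 4))

    # Denoising errors under the frozen reference model (no gradients).
    with torch.no_grad():
        ref_err_w = F.mse_loss(ref_model(noisy_w, t, cond), noise, reduction="none").mean(dim=(1, 2, 3, 4))
        ref_err_l = F.mse_loss(ref_model(noisy_l, t, cond), noise, reduction="none").mean(dim=(1, 2, 3, 4))

    # Push the winner's denoising error down relative to the loser's,
    # regularized against the reference model.
    logits = -beta * ((err_w - ref_err_w) - (err_l - ref_err_l))
    return -F.logsigmoid(logits).mean()
```

Because each update touches only the freshly generated pair, memory grows with the size of a single minibatch at the target resolution and frame count rather than with a stored sample buffer, which is the basis of the comparison in Figure 3.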
Below are visual comparisons. OnlineVPO generates videos with higher visual fidelity and smoother motion compared to the base models.
"A bird building a nest..."
"A cat wearing sunglasses..."
"A corgi playing drum kit"
"Iron Man flying in the sky"
"Lightning striking Eiffel Tower"
"A person sweeping floor"
"Gwen Stacy reading a book"
"Motion colour drop in water"
@misc{zhang2024onlinevpoalignvideodiffusion,
      title={OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization},
      author={Jiacheng Zhang and Jie Wu and Weifeng Chen and Yatai Ji and Xuefeng Xiao and Weilin Huang and Kai Han},
      year={2024},
      eprint={2412.15159},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.15159},
}