Weiqi Li, Shijie Zhao, Chong Mou, Xuhan Sheng, Zhenyu Zhang,
Qian Wang, Junlin Li, Li Zhang, Jian Zhang
School of Electronic and Computer Engineering, Peking University
ByteDance Inc.
As virtual reality gains popularity, the demand for controllable creation of immersive and dynamic omnidirectional videos (ODVs) is increasing. While previous text-to-ODV generation methods achieve impressive results, they struggle with content inaccuracies and inconsistencies due to reliance solely on textual inputs. Although recent motion control techniques provide fine-grained control for video generation, directly applying these methods to ODVs often results in spatial distortion and unsatisfactory performance, especially with complex spherical motions. To tackle these challenges, we propose OmniDrag, the first approach enabling both scene- and object-level motion control for accurate, high-quality omnidirectional image-to-video generation. Building on pretrained video diffusion models, we introduce an omnidirectional control module, which is jointly fine-tuned with temporal attention layers to effectively handle complex spherical motion. In addition, we develop a novel spherical motion estimator that accurately extracts motion-control signals and allows users to perform drag-style ODV generation by simply drawing handle and target points. We also present a new dataset, named Move360, addressing the scarcity of ODV data with large scene and object motions. Experiments demonstrate the significant superiority of OmniDrag in achieving holistic scene-level and fine-grained object-level control for ODV generation. Our code, models, and collected dataset will be made available.
Overall pipeline of proposed OmniDrag. (a) During training, spherical motion is extracted by the proposed spherical motion estimator. The Omni Controller and temporal attention layers in the UNet denoiser are jointly fine-tuned. (b) During inference, OmniDrag allows users to simply select handle and target points on the reference image and generates ODVs with the corresponding motion.
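Since drags are drawn on an equirectangular reference image but the motion is fundamentally spherical, a natural first step is to map the pixel-space handle and target points onto the sphere. The sketch below is an illustrative assumption, not the paper's spherical motion estimator: it only applies the standard equirectangular-to-spherical coordinate mapping, and the function names (`pixel_to_spherical`, `drag_to_sphere`) are hypothetical.

```python
import math

def pixel_to_spherical(u, v, width, height):
    """Map an equirectangular pixel (u, v) to spherical coordinates.

    Returns (longitude, latitude) in radians, with longitude in
    [-pi, pi) and latitude in [-pi/2, pi/2]. Assumes the image spans
    the full sphere, as in a standard equirectangular projection.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    return lon, lat

def drag_to_sphere(handle, target, width, height):
    """Convert a user drag (handle -> target, both in pixels) into a
    pair of points on the sphere, the natural domain for defining
    spherical motion-control signals."""
    return (pixel_to_spherical(*handle, width, height),
            pixel_to_spherical(*target, width, height))
```

For example, on a 2048x1024 equirectangular image, a handle at the image center maps to longitude 0 and latitude 0, and dragging a quarter of the image width to the right corresponds to a 90-degree rotation in longitude.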
Visual comparisons with DragAnything and DragNUWA.