arxiv:2512.16915

StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors

Published on Dec 18 · Submitted by Guibao SHEN on Dec 19

Abstract

StereoPilot, a feed-forward model leveraging a learnable domain switcher and cycle consistency loss, synthesizes high-quality stereo video directly without depth maps, outperforming existing methods in visual fidelity and computational efficiency.

AI-generated summary

The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: https://hit-perfect.github.io/StereoPilot/.
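
The two training components named in the abstract lend themselves to a compact sketch. The PyTorch snippet below is a minimal illustration, assuming an embedding-based domain switcher and an L1 cycle loss; all names (StereoSynth, cycle_loss, the direction embedding) are hypothetical stand-ins, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class StereoSynth(nn.Module):
    """Toy stand-in for a feed-forward target-view synthesizer.

    Hypothetical interface: conditions on a learnable format embedding
    (the "domain switcher" between parallel and converged stereo) and a
    direction embedding (left->right vs. right->left).
    """
    def __init__(self, channels=3, width=64):
        super().__init__()
        self.format_embed = nn.Embedding(2, width)  # 0: parallel, 1: converged
        self.dir_embed = nn.Embedding(2, width)     # 0: L->R, 1: R->L
        self.encoder = nn.Conv2d(channels, width, 3, padding=1)
        self.decoder = nn.Conv2d(width, channels, 3, padding=1)

    def forward(self, src, fmt, direction):
        feat = torch.relu(self.encoder(src))
        cond = self.format_embed(fmt) + self.dir_embed(direction)
        return self.decoder(feat + cond[:, :, None, None])

def cycle_loss(model, left, fmt):
    """Left -> synthesized right -> reconstructed left; penalize drift."""
    to_right = torch.zeros_like(fmt)  # direction id 0: L->R
    to_left = torch.ones_like(fmt)    # direction id 1: R->L
    right_hat = model(left, fmt, to_right)
    left_hat = model(right_hat, fmt, to_left)
    return (left_hat - left).abs().mean()

# Usage on random frames: one parallel clip, one converged clip.
model = StereoSynth()
left = torch.randn(2, 3, 64, 64)
fmt = torch.tensor([0, 1])
loss = cycle_loss(model, left, fmt)
loss.backward()
```

Per the abstract, the real backbone builds on generative video priors rather than this toy conv stack, but the conditioning and loss wiring would follow the same shape.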

Community

Paper submitter

StereoPilot replaces the fragile "Depth-Warp-Inpaint" pipeline with a single-step feed-forward model that leverages video diffusion priors for efficient, high-fidelity monocular-to-stereo conversion. By training on the large-scale UniStereo dataset with a learnable domain switcher, it provides a unified and efficient solution for both parallel and converged 3D video formats.
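
For contrast, here is a minimal sketch of the "Warp" stage of the conventional DWI pipeline that StereoPilot sidesteps, assuming disparity comes from a monocular depth estimator; the function name and interface are illustrative, not taken from any cited implementation.

```python
import torch
import torch.nn.functional as F

def warp_with_disparity(left, disparity):
    """Classic DWI "Warp" step: shift the left view horizontally by a
    per-pixel disparity to approximate the right view.

    left:      (B, C, H, W) source frames
    disparity: (B, 1, H, W) horizontal shift in pixels, typically derived
               from a monocular depth estimate (the "Depth" step)
    """
    B, C, H, W = left.shape
    # Base sampling grid in normalized [-1, 1] coordinates.
    ys = torch.linspace(-1, 1, H).view(1, H, 1).expand(B, H, W)
    xs = torch.linspace(-1, 1, W).view(1, 1, W).expand(B, H, W)
    # Shift x by disparity, converted from pixels to normalized units.
    xs = xs + disparity.squeeze(1) * (2.0 / (W - 1))
    grid = torch.stack([xs, ys], dim=-1)  # (B, H, W, 2)
    # Regions visible only in the right view have no valid source here;
    # DWI pipelines must detect them and hand them to the "Inpaint" step.
    return F.grid_sample(left, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```

Any error in the estimated disparity lands directly in the warped view and is then compounded by inpainting, which is the error propagation the abstract attributes to DWI; a single feed-forward pass avoids staging these failures.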

Models citing this paper: 1

Datasets citing this paper: 0

Spaces citing this paper: 0

Collections including this paper: 0