For Unmanned Aerial Vehicle (UAV) swarms operating in large-scale unknown environments, lightweight long-range mapping is crucial for safe navigation. Traditional stereo cameras, constrained by a short fixed baseline, suffer from a limited perception range. To overcome this limitation, we present Flying Co-Stereo, a cross-agent collaborative stereo vision system that leverages the wide-baseline spatial configuration of two UAVs for long-range dense mapping. Realizing this capability presents several challenges. First, the independent motion of each UAV produces a dynamic, continuously changing stereo baseline, making accurate and robust baseline estimation difficult. Second, efficiently establishing feature correspondences across independently moving viewpoints is constrained by the limited computational capacity of onboard edge devices.
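The baseline-range trade-off behind this motivation follows from the standard stereo depth-error model, where relative depth error grows linearly with depth: dZ/Z ≈ Z·σ_d/(f·b) for disparity noise σ_d, focal length f, and baseline b. A minimal sketch, using illustrative camera parameters that are not from the paper:

```python
def max_perception_range(baseline_m, focal_px, disp_noise_px, rel_err):
    """Largest depth at which stereo relative depth error stays below rel_err.

    Stereo depth error grows quadratically with depth:
        dZ ~ Z**2 * disp_noise / (focal * baseline)
    so dZ/Z <= rel_err  =>  Z <= rel_err * focal * baseline / disp_noise.
    """
    return rel_err * focal_px * baseline_m / disp_noise_px

# Illustrative (hypothetical) parameters: 600 px focal length,
# 0.5 px disparity noise, 10% relative depth-error budget.
focal, noise, eps = 600.0, 0.5, 0.10

z_fixed = max_perception_range(0.10, focal, noise, eps)  # compact fixed baseline
z_wide = max_perception_range(2.0, focal, noise, eps)    # cross-UAV wide baseline
print(f"0.10 m baseline: {z_fixed:.0f} m; 2.0 m baseline: {z_wide:.0f} m")
```

Under this model the usable range scales linearly with the baseline, which is why a multi-meter inter-UAV baseline extends perception far beyond a compact onboard stereo rig.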
To tackle these challenges, we introduce the Flying Co-Stereo system within a novel Collaborative Dynamic-Baseline Stereo Mapping (CDBSM) framework. We first develop a dual-spectrum visual-inertial-ranging estimator to achieve robust and precise online estimation of the baseline between the two UAVs. In addition, we propose a hybrid feature association strategy that integrates cross-agent feature matching, based on a computationally intensive yet accurate deep neural network, with lightweight intra-agent feature tracking based on optical flow. Furthermore, benefiting from the wide baseline between the two UAVs, our system accurately recovers long-range co-visible 3D sparse points. We then employ a monocular depth network to predict up-to-scale dense depth maps, which are refined to metric scale via exponential fitting against the triangulated sparse points.
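The final scale-refinement step can be sketched as follows. The exact fitting model is not specified here, so this sketch assumes a simple exponential relation, metric ≈ a·exp(b·rel), fitted by log-linear least squares between the network's up-to-scale depths and the metric depths of the triangulated sparse points; the function names are hypothetical:

```python
import numpy as np

def fit_exponential_scale(rel_depth, metric_depth):
    """Fit metric = a * exp(b * rel) by least squares on log(metric).

    rel_depth:    up-to-scale depths from the monocular network at sparse pixels
    metric_depth: metric depths of co-visible points triangulated over the
                  wide inter-UAV baseline (assumed model; the paper only
                  states that exponential fitting recovers metric scale)
    """
    # log(metric) = log(a) + b * rel is linear, so a degree-1 polyfit suffices.
    b, log_a = np.polyfit(rel_depth, np.log(metric_depth), 1)
    return np.exp(log_a), b

def apply_scale(rel_depth_map, a, b):
    """Convert an up-to-scale dense depth map to metric depth."""
    return a * np.exp(b * rel_depth_map)

# Synthetic self-check: recover known parameters a=2.0, b=0.8.
rng = np.random.default_rng(0)
rel = rng.uniform(0.1, 1.0, 50)
metric = 2.0 * np.exp(0.8 * rel)
a, b = fit_exponential_scale(rel, metric)
print(round(a, 3), round(b, 3))  # ≈ 2.0, 0.8
```

In practice the sparse correspondences would be the co-visible points triangulated across the two UAVs, and the fitted curve would then rescale every pixel of the dense depth map.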
Extensive real-world experiments demonstrate that the proposed Flying Co-Stereo system achieves robust and accurate dynamic baseline estimation in complex environments while maintaining efficient feature matching on resource-constrained onboard computers under varying viewpoints. Ultimately, our system achieves dense 3D mapping at distances of up to 70 meters with a relative error between 2.3% and 9.7%. This corresponds to up to a 350% improvement in maximum perception range and up to a 450% increase in coverage area compared with conventional stereo vision systems with fixed compact baselines.
The system architecture of Flying Co-Stereo within our proposed Collaborative Dynamic-Baseline Stereo Mapping framework.
Comparison of reconstructions from MVSAnywhere and SimpleRecon with our proposed CDBSM, shown alongside ground truth.
@article{Wang2025FlyingCoStereo,
author = {Zhaoying Wang and Xingxing Zuo and Wei Dong},
title = {Flying Co-Stereo: Enabling Long-Range Aerial Dense Mapping via Collaborative Stereo Vision of Dynamic-Baseline},
journal = {arXiv preprint arXiv:2506.00546},
year = {2025},
}