Abstract
The promise of monocular Simultaneous Localization and Mapping (SLAM) is profound: endowing autonomous systems with the ability to perceive, understand, and navigate uncharted environments using a single camera, the most ubiquitous and biologically inspired sensor. For over two decades, the core mathematical and algorithmic challenges of real-time structure from motion have been largely conquered, enabling robust operation on constrained, short-term sequences. However, the leap from a functioning demonstrator to a reliable, persistent spatial intelligence agent hinges on solving two intertwined grand challenges: long-term localization and global map consistency. This article contends that these are not merely peripheral issues but the central bottlenecks preventing the deployment of monocular SLAM in real-world applications that demand longevity, such as domestic robotics, augmented reality, and autonomous inspection. We explore the fundamental limitations of purely visual methods in managing scale drift, the catastrophic impact of accumulated incremental error on map utility, and the necessity of map maintenance and correction over extended periods. The discussion traces the evolution of techniques from covisibility graphs and pose-graph optimization to modern explicit and implicit mapping strategies, analyzing how each confronts the pathologies of long-term operation. Ultimately, we argue that the future of robust, consistent monocular SLAM lies not in increasingly complex purely visual systems, but in the principled, tight integration of other weak or intermittent cues, whether inertial measurements, sparse depth, semantic permanence, or learned priors, to anchor the visual SLAM system to a stable, consistent, and reusable representation of the world.
This work is licensed under a Creative Commons Attribution 4.0 International License.
