From Pixels to Actions: The Hidden Role of SLAM in VLA Model Training and Evaluation
Published on arXiv, 2026
Can an autonomous vehicle understand "park next to the red car"? This seemingly simple instruction demands a remarkable confluence of capabilities: seeing the scene, understanding language, knowing where it is, and predicting what happens when it acts. Vision-Language-Action (VLA) models promise to unify these capabilities, but their success quietly depends on solving a much older problem. This article traces the evolution from early end-to-end driving systems to modern VLAs, revealing a dependency the research community often overlooks: robust Simultaneous Localization and Mapping (SLAM) is the foundation upon which VLA training, evaluation, and deployment rest. We show that VLAs are not a complete autonomy stack but sophisticated sub-systems embedded within a larger one. They inherit spatial intelligence from classical systems: SLAM provides the pose, maps provide the route, and neural rendering provides the evaluation substrate.

The article is structured as follows. Section 2 traces the architectural evolution from PilotNet to modern VLAs. Section 3 clarifies the VLA's role in the autonomy stack for autonomous vehicles and robotics, and its dependencies on upstream systems. Section 4 examines the connection between VLA evaluation and SLAM, covering pose estimation, neural rendering, and the metric-scale imperative. We conclude that, as more companies embrace the VLA paradigm, evaluation will increasingly rely on neural rendering engines powered by robust SLAM solutions, capable of assessing VLA models seamlessly through a generic interface.
Citation: Das, S., 2026. From Pixels to Actions: The Hidden Role of SLAM in VLA Model Training and Evaluation.
