The evolution of end-to-end Vision-Language-Action (VLA) architectures in robotics
1 Department of Mechanical and Energy Engineering, Institute for Robotics Research, Southern University of Science and Technology, Shenzhen 518055, China
2 The Autonomous Driving Center, XMotors.ai, Inc., Santa Clara, CA 95054, USA
3 State Key Laboratory of High-performance Precision Manufacturing, Dalian University of Technology, Dalian 116024, China
4 Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, TX 76019, USA
Abstract

Vision-Language-Action (VLA) models represent a fundamental architectural shift in robotic learning, replacing modular perception-reasoning-control pipelines with unified frameworks that jointly optimize multimodal understanding and motor control. While large language models (LLMs) have enabled natural language grounding in robotics, the core challenge remains how to effectively fuse visual perception, linguistic reasoning, and continuous action generation within a single coherent architecture. This survey systematically decomposes modern VLA systems into three critical components: multimodal perception encoders, cross-modal fusion mechanisms, and action decoders. We critically evaluate how design choices in each component affect generalization, sample efficiency, and the task complexity a system can handle. We then distinguish two dominant architectural paradigms: end-to-end models that map observations directly to actions through learned representations, and hierarchical models that decompose tasks into explicit planning and execution stages. Through a comparative analysis of their trade-offs in zero-shot generalization, interpretability, and long-horizon performance, we identify key open challenges in semantic grounding, spatial reasoning, and sim-to-real transfer that will determine the viability of VLAs for real-world deployment.
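
To make this three-component decomposition concrete, the sketch below shows one minimal, hypothetical end-to-end VLA forward pass in PyTorch. Every module name, layer size, and the 7-DoF action dimension is an illustrative assumption, not a reconstruction of any specific published model.

import torch
import torch.nn as nn

class MinimalVLA(nn.Module):
    """Illustrative three-component VLA: perception encoders,
    cross-modal fusion, and an action decoder (all sizes hypothetical)."""

    def __init__(self, vocab_size=32000, d_model=512, action_dim=7):
        super().__init__()
        # (1) Multimodal perception encoders: image patches and language tokens.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=16, stride=16),  # 16x16 patch embedding
            nn.Flatten(2),                                     # (B, d_model, N_patches)
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # (2) Cross-modal fusion: a small Transformer over the joint token sequence.
        fusion_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=4)
        # (3) Action decoder: pooled fused features -> continuous action (e.g., 7-DoF).
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, image, instruction_ids):
        vis_tokens = self.vision_encoder(image).transpose(1, 2)  # (B, N_patches, d_model)
        txt_tokens = self.text_embed(instruction_ids)            # (B, L, d_model)
        fused = self.fusion(torch.cat([vis_tokens, txt_tokens], dim=1))
        return self.action_head(fused.mean(dim=1))               # (B, action_dim)

# Usage: one RGB frame plus a tokenized instruction -> a continuous action.
model = MinimalVLA()
action = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 12)))
print(action.shape)  # torch.Size([1, 7])

In this sketch a single Transformer attends jointly over visual patches and instruction tokens, and a pooled readout regresses the action directly, corresponding to the end-to-end paradigm; a hierarchical model would instead insert an explicit planning stage (e.g., subgoal prediction) between fusion and execution.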

Keywords

Vision-Language-Action (VLA); embodied AI; robot learning
