The evolution of end-to-end Vision-Language-Action (VLA) architectures in robotics
1 Department of Mechanical and Energy Engineering, Institute for Robotics Research, Southern University of Science and Technology, Shenzhen 518055, China
2 The Autonomous Driving Center, XMotors.ai, Inc., Santa Clara, CA 95054, USA
3 State Key Laboratory of High-performance Precision Manufacturing, Dalian University of Technology, Dalian 116024, China
4 Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, TX 76019, USA
Abstract

Vision-Language-Action (VLA) models represent a fundamental architectural shift in robotic learning, replacing modular perception-reasoning-control pipelines with unified frameworks that jointly optimize multimodal understanding and motor control. While large language models (LLMs) have enabled natural language grounding in robotics, the core challenge remains how to effectively fuse visual perception, linguistic reasoning, and continuous action generation within a single coherent architecture. This survey systematically decomposes modern VLA systems into three critical components: multimodal perception encoders, cross-modal fusion mechanisms, and action decoders. We critically evaluate how design choices in each component affect generalization, sample efficiency, and the task complexity a system can handle. We then distinguish two dominant architectural paradigms: end-to-end models that map observations directly to actions through learned representations, and hierarchical models that decompose tasks into explicit planning and execution stages. Through a comparative analysis of their trade-offs in zero-shot generalization, interpretability, and long-horizon performance, we identify key open challenges in semantic grounding, spatial reasoning, and sim-to-real transfer that will determine the viability of VLAs for real-world deployment.
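
To make this three-component decomposition concrete, the sketch below shows one minimal, hypothetical end-to-end VLA forward pass in PyTorch. Every module name, layer size, and the 7-DoF action dimension is an illustrative assumption, not a reconstruction of any specific published model.

import torch
import torch.nn as nn

class MinimalVLA(nn.Module):
    """Illustrative three-component VLA: perception encoders,
    cross-modal fusion, and an action decoder (all sizes hypothetical)."""

    def __init__(self, vocab_size=32000, d_model=512, action_dim=7):
        super().__init__()
        # (1) Multimodal perception encoders: image patches and language tokens.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=16, stride=16),  # 16x16 patch embedding
            nn.Flatten(2),                                     # (B, d_model, N_patches)
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # (2) Cross-modal fusion: a small Transformer over the joint token sequence.
        fusion_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=4)
        # (3) Action decoder: pooled fused features -> continuous action (e.g., 7-DoF).
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, image, instruction_ids):
        vis_tokens = self.vision_encoder(image).transpose(1, 2)  # (B, N_patches, d_model)
        txt_tokens = self.text_embed(instruction_ids)            # (B, L, d_model)
        fused = self.fusion(torch.cat([vis_tokens, txt_tokens], dim=1))
        return self.action_head(fused.mean(dim=1))               # (B, action_dim)

# Usage: one RGB frame plus a tokenized instruction -> a continuous action.
model = MinimalVLA()
action = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 12)))
print(action.shape)  # torch.Size([1, 7])

In this sketch a single Transformer attends jointly over visual patches and instruction tokens, and a pooled readout regresses the action directly, corresponding to the end-to-end paradigm; a hierarchical model would instead insert an explicit planning stage (e.g., subgoal prediction) between fusion and execution.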

Keywords

Vision-Language-Action (VLA); embodied AI; robot learning
