Multimodal trajectory prediction based on dynamic scene encoding and relational reasoning
1 State Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University, Changchun, China
2 College of Automotive Engineering, Jilin University, Changchun, China
  • Citation
    Song L, Li Z, Xiong Z, Wei Z, Zhao R, et al. Multimodal trajectory prediction based on dynamic scene encoding and relational reasoning. Artif. Intell. Auton. Syst. 2026(1):0005, https://doi.org/10.55092/aias20260005. 
  • DOI
    10.55092/aias20260005
  • Copyright
    Copyright 2026 by the authors. Published by ELSP.
Abstract

Autonomous vehicles must effectively predict the potential future trajectories of surrounding agents. Current trajectory prediction methods have two main limitations. First, traditional feature fusion methods merge scene features sequentially in a simplistic manner and often overlook the intricate interrelations among scene elements, which leads to incomplete selection and insufficient use of useful features. Second, in multimodal trajectory prediction, the mode collapse inherent to probabilistic approaches results in an inadequate expression of the uncertainty in agent intent, while proposal-based methods that depend too heavily on anchors can generate implausible trajectories. To address these limitations, we present the Dynamic scene and Relational reasoning Transformer (DRTR), a novel multimodal trajectory prediction method based on dynamic scene encoding and relational reasoning. A pivotal aspect of DRTR is a dynamic closed-loop modeling framework that combines scene features to output three dynamic features: dynamic traffic flow, dynamic agents, and interactions between agents. This framework captures the dynamic scene and its interrelations comprehensively. DRTR then initializes a set of trajectory suggestions representing different modalities and refines them by sequentially fusing and querying the dynamic scene features, so that the predictions are both accurate and multimodal. To further enhance model expressiveness, we introduce a feature selection network based on relational reasoning, which recognizes deep relationships between scene elements and selects beneficial contextual features. Experiments on the Argoverse 1 dataset indicate that DRTR achieves superior performance, particularly in multimodal trajectory prediction.
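
The refinement of trajectory suggestions by sequentially fusing and querying dynamic scene features can be illustrated with a minimal sketch. It assumes that each mode query attends to one group of scene features at a time through single-head dot-product attention with a residual update. The function names, the feature dimension, the fusion order, the random placeholder features, and the absence of learned projections are all illustrative assumptions, not the authors' DRTR implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head dot-product cross-attention with a residual update.

    queries:  K vectors (one per trajectory mode), each of length d
    features: N scene-feature vectors, each of length d
    Hypothetical simplification: no learned projections, keys == values.
    """
    d = len(queries[0])
    updated = []
    for q in queries:
        # Scaled similarity between this mode query and every scene feature.
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d) for f in features]
        weights = softmax(scores)
        # Attention-weighted average of the scene features.
        context = [sum(w * f[j] for w, f in zip(weights, features)) for j in range(d)]
        # Residual update keeps the query's own mode information.
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def refine_suggestions(queries, scene_feature_groups):
    """Fuse each group of scene features into the mode queries, one group after another."""
    for features in scene_feature_groups:
        queries = cross_attend(queries, features)
    return queries


def random_vectors(rng, n, d):
    """Placeholder feature generator (assumption: stands in for a real encoder)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
d, num_modes = 8, 6
mode_queries = random_vectors(rng, num_modes, d)

# Placeholders standing in for the three dynamic feature types named in the abstract.
traffic_flow = random_vectors(rng, 10, d)
agents = random_vectors(rng, 5, d)
interactions = random_vectors(rng, 7, d)

refined = refine_suggestions(mode_queries, [traffic_flow, agents, interactions])
print(len(refined), len(refined[0]))
```

In a full model, each refined query would still need to be decoded into a trajectory and a mode probability; that decoding step is omitted here because the abstract does not describe it.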

Keywords

dynamic scene encoding; relational reasoning; multimodal prediction; trajectory prediction
