Embodied Question Answering (EQA) enables robots to explore their environment and answer human questions. It is important for human-robot interaction and has been significantly enhanced by recent advances in Vision-Language Models (VLMs). However, EQA in real-world scenarios remains challenging: human-posed questions often contain noise that can interfere with an embodied agent’s reasoning, which poses particular difficulties for language beginners and non-expert users. To address this, we introduce NoisyEQA, a benchmark designed to evaluate a robot's ability to identify and correct noisy questions. NoisyEQA is built around three common types of noise observed in real-world applications (Memory Noise, Perception Noise, and Semantic Noise) and is generated through an automated dataset creation framework. Additionally, we propose a ‘Self-Correction’ prompting mechanism to improve the robustness of EQA against noise, and a novel evaluation metric that measures both noise detection capability and answer quality. Our comprehensive study reveals that current embodied agents often struggle to detect noise in questions, leading to responses that frequently contain incorrect information. Our Self-Correction prompting mechanism effectively improves the accuracy of agent answers.
Embodied Question Answering; navigation; embodied LLM; active agents