This paper introduces FotoBot, a vision-driven autonomous robot photographer designed to enhance human-robot interaction (HRI) and to optimize camera parameter control through real-time visual perception. FotoBot integrates a Generative Pre-trained Transformer (GPT) for natural language communication and the Bipedal Toric Space (BTS) for vision-guided camera viewpoint control. Using GPT, FotoBot interprets and responds to user instructions and adjusts its behavior accordingly. BTS, introduced in this paper for camera position planning, compresses the camera position representation into three parameters related to photo composition. The BTS representation is converted analytically into Cartesian navigation goals for robot execution. Adopting BTS ensures that the planned camera positions around targets are feasible for the robot and adhere to cinematographic standards. Deployed on a biped robot platform, FotoBot demonstrates comprehensive navigation capabilities, effective human-robot interaction, and high-quality autonomous photography. User trials conducted at the Hong Kong Science Park validated FotoBot’s ability to navigate complex terrain and capture high-quality photographs while responding to user instructions. Videos and code are available on the project website: https://sites.google.com/view/fotobot/fotobot.
Keywords: autonomous robotic photography; human-robot interaction; embodied AI