
ISSN: 2960-1436 (Print)
ISSN: 2960-1444 (Online)
CODEN: RLABAV
For any inquiries regarding journal development, the peer review process, copyright matters, or other general questions, please contact the editorial office.
E-Mail: rl@elspub.com
For production or technical issues, please contact the production team.
E-Mail: production@elspub.com
This survey provides a comprehensive synthesis of methods, datasets, metrics, and deployment strategies for underwater object detection, tracing the evolution from classic convolutional neural network (CNN)-based detectors to emerging transformer and hybrid architectures. It unifies fragmented literature into a structured taxonomy and integrates results from studies published between 2014 and 2025. The paper reviews benchmark datasets, discusses training practices, evaluation protocols, and reproducibility standards under challenging aquatic conditions, and proposes a deployment playbook that considers latency, VRAM, energy, and hardware constraints. Beyond technical performance, it addresses responsible AI practices and ethical challenges in marine observation. By highlighting open problems in multimodal fusion, self-supervised learning, and on-device adaptation, this work aims to guide future research and the practical deployment of underwater vision systems in real-world marine applications.
This paper introduces FotoBot, a vision-driven autonomous robot photographer designed to enhance human–robot interaction (HRI) and optimize camera parameter control through real-time visual perception. FotoBot integrates Generative Pre-trained Transformers (GPT) for natural language communication and Bipedal Toric Space (BTS) for vision-guided camera viewpoint control. Using GPT, FotoBot interprets and responds to user instructions and adjusts its behavior accordingly. BTS, introduced in this paper for camera position planning, compresses the camera position representation into three parameters related to photo composition; this representation is then analytically converted into Cartesian navigation goals for robot execution. Adopting BTS keeps the robot’s planned positions around targets feasible and consistent with cinematographic standards. Deployed on a biped robot platform, FotoBot demonstrates comprehensive navigation capabilities, effective human–robot interaction, and strong auto-photography performance. User trials conducted at the Hong Kong Science Park validated FotoBot’s ability to navigate complex terrain and capture high-quality photographs while responding to user instructions. Videos and code are available on the project website: https://sites.google.com/view/fotobot/fotobot.
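The abstract does not specify the BTS formulation, so the following is only a minimal illustrative sketch of the general idea of converting composition parameters into a Cartesian goal. It assumes a simplified planar case with two targets, where the composition parameter `alpha` is the angle the two targets subtend at the camera and `t` selects a position along the resulting circular arc (by the inscribed angle theorem). The function names, the two-parameter planar setup, and the parameter meanings are all assumptions for illustration and are not taken from the paper.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def camera_position(a: Point, b: Point, alpha: float, t: float) -> Point:
    """Planar sketch (assumption, not the paper's BTS): return the point on the
    circular arc from which targets `a` and `b` subtend the angle `alpha`.

    `t` in (0, 1) moves the camera along the arc from `b` towards `a`,
    on the left-hand side of the direction a -> b.
    """
    if not 0.0 < alpha < math.pi:
        raise ValueError("alpha must lie strictly between 0 and pi")
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    dx, dy = b[0] - a[0], b[1] - a[1]
    d = math.hypot(dx, dy)
    if d == 0.0:
        raise ValueError("targets must be distinct")

    ux, uy = dx / d, dy / d          # unit vector along a -> b
    nx, ny = -uy, ux                 # unit normal (left of a -> b)
    r = d / (2.0 * math.sin(alpha))  # circle radius (inscribed angle theorem)
    h = d / (2.0 * math.tan(alpha))  # signed offset of the centre from the chord
    # Arc angle measured from the u-axis around the circle centre.
    phi = (alpha - math.pi / 2.0) + t * (2.0 * math.pi - 2.0 * alpha)

    mx, my = (a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0
    px = mx + h * nx + r * (math.cos(phi) * ux + math.sin(phi) * nx)
    py = my + h * ny + r * (math.cos(phi) * uy + math.sin(phi) * ny)
    return (px, py)


def subtended_angle(p: Point, a: Point, b: Point) -> float:
    """Angle between the rays p -> a and p -> b, in radians."""
    ax, ay = a[0] - p[0], a[1] - p[1]
    bx, by = b[0] - p[0], b[1] - p[1]
    return math.atan2(abs(ax * by - ay * bx), ax * bx + ay * by)


if __name__ == "__main__":
    goal = camera_position((0.0, 0.0), (2.0, 0.0), math.radians(60), 0.5)
    print("navigation goal:", goal)
```

In this simplified setting the conversion is closed-form, which is the property the abstract attributes to BTS; a full system would additionally need the third parameter and the robot's reachability constraints, which are not described here.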
In complex assembly scenarios, Multimodal Large Language Models (MLLMs), despite their strong vision-language understanding capabilities, remain limited in their ability to produce structured and executable assembly plans directly from raw visual observations. This difficulty is particularly evident in black-box settings, where prompt design depends heavily on human experience and repeated trial-and-error, often leading to unstable results and high iteration costs. To address these issues, this paper presents a Perception-Recognition-Planning-Action (PRPA) framework for robotic assembly that enables the direct derivation of assembly instructions from scene images. The framework incorporates two key components. A self-prompt Segment Anything Model (SAM) is used to automatically generate structured and verifiable visual representations of assembly parts, ensuring consistent inputs for subsequent reasoning. In addition, a discrete prompt optimization mechanism is introduced to refine prompts for black-box MLLMs through iterative quality assessment and targeted symbolic modifications, improving the reliability of part recognition, semantic attribute extraction, and functional relationship modeling. Together, these components allow the system to generate temporally ordered and physically feasible assembly action sequences, which are represented as symbolic assembly plans suitable for both human interpretation and robotic execution. By combining MLLM-based reasoning with structured assembly planning, the proposed approach shifts the role of language models from interpreting predefined instructions to directly supporting instruction generation from visual input. Experimental results show that the proposed prompt optimization mechanism reduces the average number of reasoning attempts by 48% and achieves 95% stability in part recognition.
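The discrete prompt optimization loop described above (iterative quality assessment followed by targeted symbolic modifications) can be sketched roughly as a greedy search. This is a hedged illustration only: the clause-based prompt representation, the `add_clause` edit, the `toy_score` function standing in for the black-box MLLM quality assessment, and the stopping threshold are all hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Prompt = Tuple[str, ...]            # a prompt as an ordered tuple of discrete clauses
Edit = Callable[[Prompt], Prompt]   # a symbolic modification of a prompt


def add_clause(clause: str) -> Edit:
    """Hypothetical edit: append `clause` unless it is already present."""
    def edit(prompt: Prompt) -> Prompt:
        return prompt if clause in prompt else prompt + (clause,)
    return edit


def optimize_prompt(
    seed: Prompt,
    edits: Sequence[Edit],
    score: Callable[[Prompt], float],
    target: float = 0.95,
    max_rounds: int = 10,
) -> Tuple[Prompt, float, List[float]]:
    """Greedy search: apply the single edit that most improves the score,
    and stop at the target, at a plateau, or after `max_rounds`."""
    best, best_score = seed, score(seed)
    history = [best_score]
    for _ in range(max_rounds):
        if best_score >= target or not edits:
            break
        # Quality assessment of every one-edit neighbour of the current prompt.
        scored = [(score(candidate), candidate)
                  for candidate in (edit(best) for edit in edits)]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no edit helps; stop instead of looping
        best, best_score = top, top_score
        history.append(best_score)
    return best, best_score, history


# Toy stand-in for the black-box assessment: fraction of required clauses present.
REQUIRED = (
    "List every visible part.",
    "Give the attributes of each part.",
    "State how the parts connect.",
)


def toy_score(prompt: Prompt) -> float:
    return sum(clause in prompt for clause in REQUIRED) / len(REQUIRED)


seed: Prompt = ("You are an assembly planner.",)
edits = [add_clause(clause) for clause in REQUIRED]
best, best_score, history = optimize_prompt(seed, edits, toy_score)

if __name__ == "__main__":
    print(best_score, history)
```

In a real black-box setting, `score` would wrap calls to the MLLM and a check of its output, so each evaluation is expensive; the greedy one-edit-per-round design is one simple way to limit the number of such calls.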