Home | Robot Learning | Journal

Deep learning for underwater object detection: a comprehensive survey of models, datasets, and challenges

DOI: 10.55092/rl20260020

Hari Bhandari, Pengcheng Liu

Survey | 16 Jun 2026 | Open Access

This survey synthesizes methods, datasets, metrics, and deployment strategies for underwater object detection, tracing the field from convolutional neural network (CNN)-based detectors to emerging transformer and hybrid architectures. It unifies fragmented literature into a structured taxonomy and integrates results from studies published between 2014 and 2025. The paper reviews benchmark datasets, discusses training and evaluation practices under challenging aquatic conditions together with reproducibility standards, and proposes a deployment playbook that accounts for latency, VRAM, energy, and hardware constraints. Beyond technical performance, it addresses responsible AI practices and ethical challenges in marine observation. By highlighting open problems in multimodal fusion, self-supervised learning, and on-device adaptation, this work aims to guide future research and practical deployment of underwater vision systems.

FotoBot: an embodied AI photography robot system its design, prototyping and application

DOI: 10.55092/rl20260019

Dawei Wang, Chang Chen, Yipeng Pan, Xinzheng Tang, Hua Chen, Jia Pan

Article | 26 May 2026 | Open Access

This paper introduces FotoBot, a vision-driven autonomous robot photographer designed to enhance human–robot interaction (HRI) and optimize camera parameter control through real-time visual perception. FotoBot integrates Generative Pre-trained Transformers (GPT) for natural language communication and Bipedal Toric Space (BTS) for vision-guided camera viewpoint control. Using GPT, FotoBot interprets and responds to user instructions and adjusts its behavior accordingly. BTS, introduced in this paper for camera position planning, compresses the camera position representation into three parameters related to photo composition. The BTS representation is analytically converted into Cartesian navigation goals for robot execution. The adoption of BTS ensures that camera positions around targets are feasible for the robot and adhere to cinematographic standards. Deployed on a biped robot platform, FotoBot demonstrates navigation, human–robot interaction, and auto-photography capabilities. User trials conducted at the Hong Kong Science Park validated FotoBot’s ability to navigate complex terrain and capture high-quality photographs while responding to user instructions. Videos and code are available on the project website: https://sites.google.com/view/fotobot/fotobot.

Robotic assembly via self-prompt Segment Anything Model and discrete prompt optimization

DOI: 10.55092/rl20260018

Qi Guo, Xing Liu, Haitao Chang, Zhengxiong Liu, Panfeng Huang

Article | 22 May 2026 | Open Access

In complex assembly scenarios, Multimodal Large Language Models (MLLMs), despite their strong vision-language understanding capabilities, remain limited in their ability to produce structured and executable assembly plans directly from raw visual observations. This difficulty is particularly evident in black-box settings, where prompt design depends heavily on human experience and repeated trial-and-error, often leading to unstable results and high iteration costs. To address these issues, this paper presents a Perception-Recognition-Planning-Action (PRPA) framework for robotic assembly that enables the direct derivation of assembly instructions from scene images. The framework incorporates two key components. A self-prompt Segment Anything Model (SAM) is used to automatically generate structured and verifiable visual representations of assembly parts, ensuring consistent inputs for subsequent reasoning. In addition, a discrete prompt optimization mechanism is introduced to refine prompts for black-box MLLMs through iterative quality assessment and targeted symbolic modifications, improving the reliability of part recognition, semantic attribute extraction, and functional relationship modeling. Together, these components allow the system to generate temporally ordered and physically feasible assembly action sequences, which are represented as symbolic assembly plans suitable for both human interpretation and robotic execution. By combining MLLM-based reasoning with structured assembly planning, the proposed approach shifts the role of language models from interpreting predefined instructions to directly supporting instruction generation from visual input. Experimental results show that the proposed prompt optimization mechanism reduces the average number of reasoning attempts by 48% and achieves 95% stability in part recognition.

Top Downloaded

Review on path planning for obstacle avoidance oriented to micro-/nanorobots

DOI: 10.55092/rl20240002

Tongzhou Ye, Tianhao Peng, Lidong Yang

Review | 14 Nov 2024 | Open Access

Path planning algorithms are indispensable for controlling micro-/nanorobots through complex and unknown environments in the biomedical and medical fields. As the tasks become more complex, higher-quality paths are required so that micro-/nanorobots avoid obstacles and move safely and efficiently. This review conducts a comparative analysis of path planning algorithms to clarify how they are applied and optimized for different environments. According to the environment modeling approach, existing path planning algorithms are classified into three categories: searching, sampling, and dynamic. Searching algorithms directly retrieve the minimum-cost global path from the modeled static waypoints. Sampling algorithms use randomly sampled waypoints within the target space, which eliminates the need for environment modeling. Dynamic algorithms use local paths to regulate the motion of micro-/nanorobots in real time. Deep learning networks based on big data are expected to become an important research direction for the control and navigation of micro-/nanorobots. The advantages and limitations of path planning algorithms in varied spatial contexts are illustrated through detailed examples and descriptions, giving a comprehensive picture of their performance and applicability. The review also highlights recent advances in motion control for micro-/nanorobots.
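
As a rough illustration of the searching category described in this abstract, the sketch below retrieves a minimum-cost global path from a small static waypoint graph. It uses Dijkstra's algorithm as a stand-in; the review does not say this is the specific method it covers, and the waypoint names, edge costs, and the function name `min_cost_path` are hypothetical.

```python
import heapq

def min_cost_path(graph, start, goal):
    """Dijkstra's algorithm over a static waypoint graph.

    graph maps each waypoint to a list of (neighbor, edge_cost) pairs.
    Returns (total_cost, path) or (inf, []) if the goal is unreachable.
    """
    best = {start: 0.0}          # cheapest known cost to each waypoint
    parent = {}                  # back-pointers for path reconstruction
    frontier = [(0.0, start)]    # priority queue ordered by cost so far
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node == goal:
            path = [node]
            while node in parent:
                node = parent[node]
                path.append(node)
            return cost, path[::-1]
        if cost > best.get(node, float("inf")):
            continue             # stale queue entry, a cheaper route was found
        for nxt, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(frontier, (new_cost, nxt))
    return float("inf"), []

# Hypothetical waypoint graph; costs could stand for distance or risk.
waypoints = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 1.5), ("D", 5.0)],
    "C": [("D", 1.0)],
    "D": [],
}
cost, path = min_cost_path(waypoints, "A", "D")
print(cost, path)
```

Because the graph is fixed before the search starts, this kind of planner suits static environments; the sampling and dynamic categories in the abstract relax that requirement.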

Survey on heterogeneous aquatic robot systems: communication, perception, navigation, control, decision-making and energy management

DOI: 10.55092/rl20250003

Ruonan Liu, Xiuzhong Hu, Zihan Jiang, Junzhi Wang, Weidong Zhang

Survey | 30 May 2025 | Open Access

Heterogeneous aquatic robot systems, consisting of ROVs, AUVs, ASVs, and UAVs, are vital for environmental exploration, monitoring, and task execution. This paper presents advancements in critical technologies within these systems, focusing on communication (underwater acoustic, radio, and optical), multi-sensor fusion, and collaborative navigation techniques. It reviews control strategies like deep reinforcement learning, end-to-end control, and large model-based methods, addressing autonomous decision-making and adaptability in complex environments. The paper also discusses energy management strategies for efficient storage, utilization, and recovery. Furthermore, it explores the ethical and environmental impacts of deploying such systems, emphasizing sustainability and minimizing ecological disruptions. Finally, case studies and applications in ocean exploration and environmental monitoring are highlighted, showcasing the real-world utility and future potential of heterogeneous aquatic robot systems. This work provides valuable insights into the technological, ethical, and practical considerations for developing these systems.

Optimizing scene flow with neural rigidity prior

DOI: 10.55092/rl20240004

Zhiheng Feng, Jiuming Liu, Hesheng Wang

Article | 28 Nov 2024 | Open Access

Scene flow estimation provides low-level 3D motion understanding of dynamic scenes. In this paper, we propose an optimization-based scene flow estimation method with a neural rigidity prior for autonomous driving environments. Specifically, we use the rigidity prior of dynamic scenes to partition the point clouds into pillars of different resolutions. The flow vector of a point is then represented as the average of the local rigid transformations associated with the pillars to which it belongs. To model local rigidity, we employ a neural implicit representation that encodes the intrinsic constraints of the pillars. Our method achieves state-of-the-art accuracy on three commonly used autonomous driving datasets (Argoverse, Waymo, and nuScenes) and even surpasses previous supervised learning-based methods. Experimental results demonstrate the effectiveness of our method, particularly on sparse points in autonomous driving scenes.
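
The averaging step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: pillars are taken to be x-y grid cells, each rigid motion is reduced to a yaw rotation plus a translation, the average is a plain mean, and the motions are set by hand rather than predicted by a neural implicit representation. The names `rigid_flow`, `pillar_index`, and `point_flow` are hypothetical.

```python
import math

def rigid_flow(point, yaw, translation):
    """Flow induced on a point by a rigid motion: R(yaw) * p + t - p."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    moved = (c * x - s * y + translation[0],
             s * x + c * y + translation[1],
             z + translation[2])
    return tuple(m - p for m, p in zip(moved, point))

def pillar_index(point, resolution):
    """Assumed pillar id: the x-y grid cell at the given resolution."""
    return (math.floor(point[0] / resolution),
            math.floor(point[1] / resolution))

def point_flow(point, pillar_motions):
    """Average the rigid flows of every pillar that contains the point.

    pillar_motions: {resolution: {pillar_id: (yaw, translation)}}
    """
    flows = []
    for resolution, motions in pillar_motions.items():
        motion = motions.get(pillar_index(point, resolution))
        if motion is not None:
            flows.append(rigid_flow(point, *motion))
    if not flows:
        return (0.0, 0.0, 0.0)
    return tuple(sum(axis) / len(flows) for axis in zip(*flows))

# Two pillar resolutions, each assigning a hand-set motion to cell (0, 0).
motions = {
    2.0: {(0, 0): (0.0, (1.0, 0.0, 0.0))},
    4.0: {(0, 0): (0.0, (0.0, 1.0, 0.0))},
}
flow = point_flow((1.0, 0.5, 0.0), motions)
print(flow)
```

Here the point lies in one pillar at each resolution, so its flow is the mean of a pure x-shift and a pure y-shift.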