In complex assembly scenarios, Multimodal Large Language Models (MLLMs), despite their strong vision-language understanding capabilities, remain limited in their ability to produce structured and executable assembly plans directly from raw visual observations. This difficulty is particularly evident in black-box settings, where prompt design depends heavily on human experience and repeated trial-and-error, often leading to unstable results and high iteration costs. To address these issues, this paper presents a Perception-Recognition-Planning-Action (PRPA) framework for robotic assembly that enables the direct derivation of assembly instructions from scene images. The framework incorporates two key components. A self-prompt Segment Anything Model (SAM) is used to automatically generate structured and verifiable visual representations of assembly parts, ensuring consistent inputs for subsequent reasoning. In addition, a discrete prompt optimization mechanism is introduced to refine prompts for black-box MLLMs through iterative quality assessment and targeted symbolic modifications, improving the reliability of part recognition, semantic attribute extraction, and functional relationship modeling. Together, these components allow the system to generate temporally ordered and physically feasible assembly action sequences, which are represented as symbolic assembly plans suitable for both human interpretation and robotic execution. By combining MLLM-based reasoning with structured assembly planning, the proposed approach shifts the role of language models from interpreting predefined instructions to directly supporting instruction generation from visual input. Experimental results show that the proposed prompt optimization mechanism reduces the average number of reasoning attempts by 48% and achieves 95% stability in part recognition.
Keywords: assembly sequence planning; MLLM-based semantic understanding; self-prompt SAM; prompt optimization
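To make the idea of discrete prompt optimization concrete, the following is a minimal illustrative sketch, not the implementation evaluated in this paper. It assumes a greedy search in which each round applies a small set of symbolic edits to the current prompt, scores every candidate, and keeps the best one only if it improves on the current score. All names (`optimize_prompt`, `toy_score`, the three edit functions) are hypothetical, and `toy_score` is a keyword-based stand-in for the quality assessment that would, in practice, query the black-box MLLM and evaluate its output.

```python
from typing import Callable, List, Tuple

Edit = Callable[[str], str]  # a symbolic edit maps a prompt to a modified prompt


def optimize_prompt(
    seed: str,
    edits: List[Edit],
    score: Callable[[str], float],
    max_rounds: int = 5,
    target: float = 0.95,
) -> Tuple[str, float, List[Tuple[str, float]]]:
    """Greedy discrete search: keep the best-scoring edited prompt each round."""
    best, best_score = seed, score(seed)
    history = [(best, best_score)]
    for _ in range(max_rounds):
        if best_score >= target:
            break  # quality is already good enough
        # Apply every edit to the current prompt; drop edits that change nothing.
        scored = [(score(c), c) for c in (e(best) for e in edits) if c != best]
        if not scored:
            break
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no edit improves the prompt; stop early
        best, best_score = top, top_score
        history.append((best, best_score))
    return best, best_score, history


# Hypothetical quality assessment: fraction of required cues present in the prompt.
# A real system would instead score the MLLM's response to the prompt.
REQUIRED_CUES = ("json", "part name", "relation")


def toy_score(prompt: str) -> float:
    text = prompt.lower()
    return sum(cue in text for cue in REQUIRED_CUES) / len(REQUIRED_CUES)


def add_format(prompt: str) -> str:
    return prompt if "json" in prompt.lower() else prompt + " Answer in JSON."


def add_names(prompt: str) -> str:
    return prompt if "part name" in prompt.lower() else prompt + " List each part name."


def add_relations(prompt: str) -> str:
    if "relation" in prompt.lower():
        return prompt
    return prompt + " State each mating relation between parts."


best, final_score, history = optimize_prompt(
    "Describe the assembly parts in the image.",
    [add_format, add_names, add_relations],
    toy_score,
)
print(best)
print(final_score, len(history))
```

In this toy setting each round adds one missing instruction, so the search stops as soon as the score reaches the target. The greedy accept-if-better rule is only one possible design; it is used here because it keeps the number of scoring calls per round bounded by the number of edits.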