A survey on omni-modal language models
1 School of Computer and Artificial Intelligence, Shandong Jianzhu University, Jinan, China
2 School of Software, Shandong University, Jinan, China
3 School of Computer Science and Technology, Shandong University, Qingdao, China
Abstract

This paper provides a comprehensive review of Omni-Modal Language Models (OMLMs), focusing on their evolution, technical challenges, application scenarios, and evaluation frameworks. OMLMs represent a significant leap beyond traditional unimodal and multimodal models by unifying modalities such as text, images, audio, and video within a cohesive architecture. These models aim to emulate human-like multimodal perception, achieving semantic alignment and dynamic interaction across diverse data sources. Key topics covered include modality alignment, semantic fusion, and joint representation learning, alongside applications in fields such as healthcare, education, and industrial quality inspection. The paper also examines vertical adaptation paths, knowledge injection mechanisms, real-time optimization strategies, and a multi-dimensional evaluation system. Finally, future research directions are proposed, including improvements in generalization, task adaptability, energy efficiency, and ethical considerations, all critical for the widespread deployment of OMLMs in complex, real-world scenarios.

Keywords

omni-modal language models; semantic fusion; modality alignment; joint representation learning; cross-modal interaction
