A job failure identification method of heterogeneous supercomputing platforms based on semantic analysis of multi-source logs.

Bibliographic Details
Title: A job failure identification method of heterogeneous supercomputing platforms based on semantic analysis of multi-source logs.
Authors: HU, He (1); ZHAO, Yi (1), zhaoyi@cnic.cn; GU, Beibei (1,2); ZHAO, Yunqing (1)
Source: Computer Engineering & Science / Jisuanji Gongcheng yu Kexue. Sep2025, Vol. 47 Issue 9, p1535-1543. 9p.
Subjects: Heterogeneous computing, Anomaly detection (Computer security), Text mining, Distributed computing, Data mining
Abstract: This paper presents a method for detecting job anomalies on large-scale, distributed, heterogeneous HPC platforms. Job runtime logs are essential for detecting anomalies, but their sheer volume makes manual analysis impractical. To address this, we introduce a multi-source log semantic analysis approach that applies latent Dirichlet allocation (LDA) to logs from multiple sources. The method models how topics evolve over time and matches them against patterns from historical faulty jobs to predict anomalies. Experiments on a domestic HPC platform show 95.2% precision. The method improves predictive capability and helps users and administrators diagnose issues quickly, thereby improving the availability and efficiency of the HPC environment. [ABSTRACT FROM AUTHOR]
Database: Engineering Source
ISSN: 1007-130X
DOI: 10.3969/j.issn.1007-130X.2025.09.002
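
Illustrative sketch (not from the article): the abstract describes applying LDA to multi-source logs, tracking topic evolution over time, and matching against historical faulty job patterns. The record gives no implementation details, so the Python below is only a minimal sketch under stated assumptions. It assumes each time window of a job's merged logs is one tokenized document, uses a small collapsed Gibbs sampler for LDA, and compares topic trajectories by mean cosine similarity. The job names, log tokens, topic count, hyperparameters, and similarity measure are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def fit_lda(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one topic mixture per document."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Start from a random topic assignment for every token.
    assign = [[rng.randrange(n_topics) for _ in doc] for doc in docs]

    def bump(d, w, z, delta):
        doc_topic[d][z] += delta
        topic_word[z][w] += delta
        topic_total[z] += delta

    for d, doc in enumerate(docs):
        for w, z in zip(doc, assign[d]):
            bump(d, w, z, 1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                bump(d, w, assign[d][i], -1)  # remove the token's current topic
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                bump(d, w, z, 1)  # add it back under the resampled topic

    return [
        [(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def trajectory_similarity(t1, t2):
    """Mean cosine similarity over time-aligned windows (an assumed matching rule)."""
    n = min(len(t1), len(t2))
    return sum(cosine(a, b) for a, b in zip(t1, t2)) / n if n else 0.0


# Hypothetical data: each job is a list of time windows, each window a bag of
# tokens merged from several log sources (scheduler, system, application).
jobs = {
    "faulty_hist": [["job", "start", "gpu", "init"],
                    ["gpu", "ecc", "error", "retry"],
                    ["mpi", "abort", "gpu", "ecc"]],
    "healthy_hist": [["job", "start", "gpu", "init"],
                     ["step", "loss", "checkpoint", "write"],
                     ["step", "loss", "job", "done"]],
    "new_job": [["job", "start", "gpu", "init"],
                ["gpu", "ecc", "error", "retry"],
                ["mpi", "abort", "ecc", "error"]],
}

# Fit one model over all windows so every job shares the same topic space.
names = list(jobs)
windows = [w for name in names for w in jobs[name]]
theta = fit_lda(windows, n_topics=3)

# Slice the per-window mixtures back into one topic trajectory per job.
trajectories, pos = {}, 0
for name in names:
    trajectories[name] = theta[pos:pos + len(jobs[name])]
    pos += len(jobs[name])

for ref in ("faulty_hist", "healthy_hist"):
    score = trajectory_similarity(trajectories["new_job"], trajectories[ref])
    print(ref, round(score, 3))
```

In this sketch, a new job whose trajectory is more similar to a historical faulty trajectory than to healthy ones would be flagged for review. Any decision threshold would need to be tuned on real labeled jobs, and the toy corpus here is far too small for the scores to be meaningful.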