A job failure identification method of heterogeneous supercomputing platforms based on semantic analysis of multi-source logs.
| Title: | A job failure identification method of heterogeneous supercomputing platforms based on semantic analysis of multi-source logs. |
|---|---|
| Authors: | HU, He (1); ZHAO, Yi (1), zhaoyi@cnic.cn; GU, Beibei (1,2); ZHAO, Yunqing (1) |
| Source: | Computer Engineering & Science / Jisuanji Gongcheng yu Kexue. Sep 2025, Vol. 47, Issue 9, pp. 1535-1543. 9 pages. |
| Subjects: | Heterogeneous computing, Anomaly detection (Computer security), Text mining, Distributed computing, Data mining |
| Abstract: | This paper presents a method for detecting job anomalies on large-scale, distributed, heterogeneous HPC platforms. Analyzing job runtime logs is vital for detecting anomalies, but the sheer volume of logs makes manual inspection impractical. To address this, the authors introduce a multi-source log semantic analysis approach that uses latent Dirichlet allocation (LDA) to analyze logs from various sources. By modeling how topics evolve over time and matching that evolution against the patterns of historical faulty jobs, the method predicts anomalies. Experiments on a domestic HPC platform show 95.2% precision. The method improves predictive capability and helps users and administrators diagnose issues quickly, which in turn improves the availability and efficiency of the HPC environment. [ABSTRACT FROM AUTHOR] |
| Copyright: | Copyright of Computer Engineering & Science / Jisuanji Gongcheng yu Kexue is the property of Computer Engineering & Science and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) |
| Database: | Engineering Source |
| Publication type: | Academic Journal |
| Accession number: | 188481648 |
| Permalink: | https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=188481648 |
| DOI: | 10.3969/j.issn.1007-130X.2025.09.002 |
| Language: | Chinese |
| ISSN: | 1007-130X (print) |
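
The abstract describes matching a job's topic evolution against historical faulty-job patterns but does not say how the match is computed. The following is only a minimal sketch of one plausible matching step, under stated assumptions: each job is represented as a list of per-time-window topic distributions (assumed to come from an LDA model fitted elsewhere), similarity is the mean cosine similarity over aligned windows, and a fixed threshold decides whether a job resembles a known failure. The function names, the threshold value, and the pattern labels `oom` and `network` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def trajectory_similarity(job, pattern):
    """Mean cosine similarity over the time windows both trajectories share."""
    n = min(len(job), len(pattern))
    if n == 0:
        return 0.0
    return sum(cosine(job[i], pattern[i]) for i in range(n)) / n


def match_failure(job, patterns, threshold=0.9):
    """Return (label, score) for the best-matching pattern.

    The label is None when the best score is below the threshold.
    The threshold of 0.9 is an illustrative assumption, not a value from the paper.
    """
    best_label, best_score = None, 0.0
    for label, pattern in patterns.items():
        score = trajectory_similarity(job, pattern)
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= threshold:
        return best_label, best_score
    return None, best_score


# Hypothetical historical failure patterns: three time windows, three topics each.
patterns = {
    "oom": [[0.8, 0.1, 0.1], [0.5, 0.4, 0.1], [0.1, 0.8, 0.1]],
    "network": [[0.8, 0.1, 0.1], [0.6, 0.1, 0.3], [0.1, 0.1, 0.8]],
}

# Hypothetical running job whose topic mix drifts toward the second topic.
job = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.2, 0.7, 0.1]]

label, score = match_failure(job, patterns)
print(label, round(score, 3))
```

In this toy data the running job drifts the same way as the `oom` pattern, so that label is returned. A job whose topic mix stays roughly uniform across windows falls below the threshold and yields `None`. A real system would need to align windows of different lengths and calibrate the threshold on labeled history.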