A job failure identification method of heterogeneous supercomputing platforms based on semantic analysis of multi-source logs.

Bibliographic Details
Title: A job failure identification method of heterogeneous supercomputing platforms based on semantic analysis of multi-source logs.
Authors: HU, He¹; ZHAO, Yi¹ (zhaoyi@cnic.cn); GU, Beibei¹,²; ZHAO, Yunqing¹
Source: Computer Engineering & Science / Jisuanji Gongcheng yu Kexue. Sep 2025, Vol. 47, Issue 9, pp. 1535-1543 (9 pages).
Subjects: Heterogeneous computing, Anomaly detection (Computer security), Text mining, Distributed computing, Data mining
Abstract: This paper presents a method for detecting job anomalies on large-scale, distributed, heterogeneous HPC platforms. Job runtime logs are essential for detecting anomalies, but their sheer volume makes manual inspection impractical. To address this, we introduce a multi-source log semantic analysis approach that applies latent Dirichlet allocation (LDA) to logs from various sources. The method models how log topics evolve over time and matches that evolution against patterns from historical faulty jobs to predict anomalies. Experiments on a domestic HPC platform show 95.2% precision. The method improves predictive capability and helps users and administrators diagnose issues quickly, thereby improving the availability and efficiency of the HPC environment. [ABSTRACT FROM AUTHOR]
Copyright of Computer Engineering & Science / Jisuanji Gongcheng yu Kexue is the property of Computer Engineering & Science and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Engineering Source
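Illustrative note: the matching step named in the abstract (comparing a job's topic evolution with historical faulty job patterns) might be sketched as below. This is not the authors' implementation. The abstract does not say how trajectories are compared, so the cosine similarity measure, the 0.9 threshold, the pattern labels, and all numbers are assumptions made for illustration. The per-window topic distributions are assumed to come from an LDA model fitted elsewhere.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

# One topic-probability vector per time window (assumed to be LDA output).
Trajectory = List[Sequence[float]]


def cosine(p: Sequence[float], q: Sequence[float]) -> float:
    """Cosine similarity between two topic-probability vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def trajectory_similarity(job: Trajectory, pattern: Trajectory) -> float:
    """Mean window-by-window cosine similarity over the shared prefix."""
    n = min(len(job), len(pattern))
    if n == 0:
        return 0.0
    return sum(cosine(job[i], pattern[i]) for i in range(n)) / n


def match_failure(
    job: Trajectory,
    patterns: Dict[str, Trajectory],
    threshold: float = 0.9,  # hypothetical cut-off, not from the paper
) -> Tuple[Optional[str], float]:
    """Return the best-matching pattern label and its score.

    The label is None when no pattern reaches the threshold.
    """
    best_label, best_score = None, 0.0
    for label, pattern in patterns.items():
        score = trajectory_similarity(job, pattern)
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= threshold:
        return best_label, best_score
    return None, best_score


# Hypothetical three-topic example; topic order: scheduler, I/O, memory.
patterns = {
    "io_timeout": [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.1, 0.8, 0.1]],
    "out_of_memory": [[0.8, 0.1, 0.1], [0.4, 0.1, 0.5], [0.1, 0.1, 0.8]],
}
job = [[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.1, 0.7, 0.2]]
steady_job = [[0.8, 0.1, 0.1]] * 3

label, score = match_failure(job, patterns)
steady_label, steady_score = match_failure(steady_job, patterns)
```

In this toy data the first job drifts toward the I/O topic in the same way as the hypothetical "io_timeout" pattern, while the steady job stays on the scheduler topic and matches neither pattern above the threshold. A divergence measure or a time-warping alignment could replace the window-by-window cosine average if jobs progress at different speeds.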
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=188481648
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.3969/j.issn.1007-130X.2025.09.002
    Languages:
      – Code: chi
        Text: Chinese
    PhysicalDescription:
      Pagination:
        PageCount: 9
        StartPage: 1535
    Subjects:
      – SubjectFull: Heterogeneous computing
        Type: general
      – SubjectFull: Anomaly detection (Computer security)
        Type: general
      – SubjectFull: Text mining
        Type: general
      – SubjectFull: Distributed computing
        Type: general
      – SubjectFull: Data mining
        Type: general
    Titles:
      – TitleFull: A job failure identification method of heterogeneous supercomputing platforms based on semantic analysis of multi-source logs.
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: HU, He
      – PersonEntity:
          Name:
            NameFull: ZHAO, Yi
      – PersonEntity:
          Name:
            NameFull: GU, Beibei
      – PersonEntity:
          Name:
            NameFull: ZHAO, Yunqing
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 01
              M: 09
              Text: Sep2025
              Type: published
              Y: 2025
          Identifiers:
            – Type: issn-print
              Value: 1007130X
          Numbering:
            – Type: volume
              Value: 47
            – Type: issue
              Value: 9
          Titles:
            – TitleFull: Computer Engineering & Science / Jisuanji Gongcheng yu Kexue
              Type: main