Xpert: Empowering Incident Management with Query Recommendations via Large Language Models.

Bibliographic Details
Title: Xpert: Empowering Incident Management with Query Recommendations via Large Language Models.
Authors: Jiang, Yuxuan¹ (jyuxuan@umich.edu); Zhang, Chaoyun² (chaoyunzhang@microsoft.com); He, Shilin² (shilin.he@microsoft.com); Yang, Zhihao³ (zhihaoyang@stu.pku.edu.cn); Ma, Minghua² (minghuama@microsoft.com); Qin, Si² (si.qin@microsoft.com); Kang, Yu² (yu.kang@microsoft.com); Dang, Yingnong⁴ (dang.yingnong@microsoft.com); Rajmohan, Saravan⁵ (saravan.rajmohan@microsoft.com); Lin, Qingwei² (qlin@microsoft.com); Zhang, Dongmei² (dongmeiz@microsoft.com)
Source: ICSE: International Conference on Software Engineering. 2024, p1-13. 13p.
Subjects: Query languages (Computer science), Language models, Domain-specific programming languages, Machine learning, Artificial intelligence
Abstract: Large-scale cloud systems play a pivotal role in modern IT infrastructure. However, incidents occurring within these systems can lead to service disruptions and adversely affect user experience. To swiftly resolve such incidents, on-call engineers depend on crafting domain-specific language (DSL) queries to analyze telemetry data. However, writing these queries can be challenging and time-consuming. This paper presents a thorough empirical study on the use of KQL queries, where KQL is a DSL employed for incident management in a large-scale cloud management system at Microsoft. The findings underscore the importance and viability of KQL query recommendation to enhance incident management. Building upon these insights, we introduce Xpert, an end-to-end machine learning framework that automates the KQL recommendation process. By leveraging historical incident data and large language models, Xpert generates customized KQL queries tailored to new incidents. Furthermore, Xpert incorporates a novel performance metric called Xcore, enabling a thorough evaluation of query quality from three comprehensive perspectives. We conduct extensive evaluations of Xpert, demonstrating its effectiveness in offline settings. Notably, we deploy Xpert in the real production environment of a large-scale incident management system at Microsoft, validating its efficiency in supporting incident management. To the best of our knowledge, this paper represents the first empirical study of its kind, and Xpert stands as a pioneering DSL query recommendation framework designed for incident management. [ABSTRACT FROM AUTHOR]
Copyright of ICSE: International Conference on Software Engineering is the property of Association for Computing Machinery and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Engineering Source
DOI: 10.1145/3597503.3639081