Key-based data augmentation with curriculum learning for few-shot code search.
| Title: | Key-based data augmentation with curriculum learning for few-shot code search. |
|---|---|
| Authors: | Zhang, Fan (fanzhang@hnu.edu.cn); Peng, Manman (pengmanman@hnu.edu.cn); Wu, Qiang (wuqiang@hnu.edu.cn); Shen, Yuanyuan (shenyuanyuan@hnu.edu.cn) |
| Source: | Neural Computing & Applications. Jan. 2025, Vol. 37, Issue 3, pp. 1475-1490 (16 pages). |
| Subjects: | Domain-specific programming languages, Data augmentation, Programming languages, Curriculum frameworks, Natural languages |
| Abstract: | Given a natural language query, code search aims to find matching code snippets from a codebase. Recent works are mainly designed for mainstream programming languages with large amounts of training data. However, code search is also needed for domain-specific programming languages, which have fewer training data, and it is a heavy burden to label a large amount of training data for each domain-specific language. To this end, we propose DAFCS, a data augmentation framework with curriculum learning for few-shot code search tasks. Specifically, we first collect unlabeled codes in the same programming language as the original codes, which can provide additional semantic signals to the original codes. Second, we employ an occlusion-based method to identify key statements in code fragments. Third, we design a set of new key-based augmentation operations for the original codes. Finally, we use curriculum learning to reasonably schedule augmented samples for training well-performing models. We conduct retrieval experiments on a public dataset and find that DAFCS surpasses state-of-the-art methods by 5.42% and 5.05% in the Solidity and SQL domain-specific languages, respectively. Our study shows that DAFCS, which adopts data augmentation and curriculum learning strategies, can achieve promising performance in few-shot code search tasks. [ABSTRACT FROM AUTHOR] |
| Copyright of Neural Computing & Applications is the property of Springer Nature and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) | |
| Database: | Engineering Source |
| ISSN: | 09410643 |
| DOI: | 10.1007/s00521-024-10670-9 |
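
The abstract names two techniques, occlusion-based identification of key statements and curriculum scheduling of augmented samples, without giving details. The following is only a minimal illustrative sketch of those two general ideas, not the DAFCS implementation: the `token_overlap` scorer, the function names, and the per-sample difficulty values are all hypothetical stand-ins, and a real system would presumably score query-code pairs with a trained retrieval model.

```python
from typing import Callable, List, Tuple


def token_overlap(query: str, code: str) -> float:
    """Toy relevance score (assumption): fraction of query tokens present in the code."""
    q = set(query.lower().split())
    c = set(code.lower().replace("(", " ").replace(")", " ").replace(";", " ").split())
    return len(q & c) / len(q) if q else 0.0


def key_statements(
    query: str,
    statements: List[str],
    score: Callable[[str, str], float] = token_overlap,
) -> List[Tuple[int, float]]:
    """Occlusion: remove each statement in turn and record how far the score drops.

    Returns (statement index, score drop) pairs, largest drop first.
    """
    base = score(query, "\n".join(statements))
    drops = []
    for i in range(len(statements)):
        occluded = statements[:i] + statements[i + 1:]
        drops.append((i, base - score(query, "\n".join(occluded))))
    return sorted(drops, key=lambda d: d[1], reverse=True)


def curriculum_order(samples: List[Tuple[str, float]]) -> List[str]:
    """Easy-to-hard schedule: sort samples by ascending (hypothetical) difficulty."""
    return [name for name, _ in sorted(samples, key=lambda p: p[1])]


# Hypothetical SQL snippet split into statements, with a matching query.
ranked = key_statements(
    "select name from users",
    ["SELECT name", "FROM users", "ORDER BY id"],
)
schedule = curriculum_order([("hard", 0.9), ("easy", 0.1), ("mid", 0.5)])
```

In this toy case, removing `ORDER BY id` leaves the score unchanged, so it ranks last, while the two statements that carry the query terms rank as key. How DAFCS actually measures statement importance or sample difficulty is not stated in this record.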