How Well Can Masked Language Models Spot Identifiers That Violate Naming Guidelines?
Published in
2023 IEEE 23rd International Working Conference on Source Code Analysis and Manipulation (SCAM)
Type
conference paper
Date Issued
2023
Author(s)
Abstract
Using meaningful identifiers in source code reduces the risk of errors and the cognitive load of developers, and speeds up the development process. Recent research has therefore looked into AI-based analysis of identifiers, for which large-scale language models appear to offer great potential. Based on token probabilities, such models can suggest identifiers that are likely to appear in a given context. While current research has used language models to predict the most likely identifier names, studies on assessing the quality of given identifiers are scarce. To this end, we explore adherence to identifier naming guidelines as a proxy for identifier quality, and propose and evaluate two unsupervised approaches for spotting violations: first, a generative approach that uses the probability distribution of the language model directly, without fine-tuning; second, a discriminative method that fine-tunes the model's encoder to discriminate between original identifiers and similar drop-in replacements suggested by a weak AI. We demonstrate that the proposed approaches can successfully detect violations of common guidelines for identifier naming. To do so, we have developed a dataset built on widely accepted identifier naming guidelines. The manually annotated dataset contains more than 6000 dense annotations of identifiers for 28 common guidelines. Using this data, we show that the generative approach achieves the best results, but that the particular masking strategy and scoring method matter substantially. We also demonstrate that our approach outperforms other recent code transformers. In a per-guideline analysis, we highlight the potential and limitations of language models and provide a blueprint for training and evaluating their ability to identify bad identifier names in source code. We make our dataset and model implementations publicly available to encourage future research on AI-based identifier quality assessment.
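The generative approach described in the abstract can be illustrated by pseudo-log-likelihood scoring: each sub-token of an identifier is masked in turn, and the masked language model's probability for the original token is accumulated; a low total score flags the identifier as unlikely in its context. The sketch below is not the authors' implementation — the function names and the stand-in probability callable are hypothetical, and a real setup would replace the callable with a forward pass through a masked code model such as CodeBERT.

```python
import math
from typing import Callable, List

def identifier_score(
    tokens: List[str],
    identifier_positions: List[int],
    mask_prob: Callable[[List[str], int], float],
) -> float:
    """Pseudo-log-likelihood of an identifier under a masked LM.

    `mask_prob(masked_tokens, pos)` stands in for a model call that
    returns the probability assigned to the original token at `pos`
    when that position is masked.
    """
    score = 0.0
    for pos in identifier_positions:
        masked = tokens.copy()
        masked[pos] = "<mask>"  # mask one sub-token of the identifier
        score += math.log(mask_prob(masked, pos))
    return score

# Toy comparison: a descriptive loop variable vs. an opaque one.
# The fixed probabilities below are illustrative, not model outputs.
good = identifier_score(["for", "item", "in", "items"], [1], lambda m, p: 0.4)
bad = identifier_score(["for", "x1", "in", "items"], [1], lambda m, p: 0.02)
```

In this sketch, `good > bad`, so thresholding the score would flag `x1` rather than `item`; the abstract's finding that masking strategy and scoring method matter substantially applies to exactly this masking and aggregation choice.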
Keywords
Masked Language Models
Source Code Identifiers
Naming Guidelines
Code Quality
Identifier Quality Assessment
Generative Approaches
Language Model Fine-tuning
Code Transformers
AI-based Code Analysis