Journal of Shanghai Normal University (Natural Sciences), editorial office website

Submitted: 2024-10-13  Accepted: 2024-10-30  Last revised: 2024-10-22
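A Python code smell detection method based on ensemble learning
Cao Yue, Chen Junhua
College of Information, Mechanical and Electrical Engineering, Shanghai Normal University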

Abstract:

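Python is used in a wide range of software projects and domains, and code smells in Python programs can significantly impair maintainability, understandability, and testability. This paper proposes an ensemble-learning-based method for detecting five code smells in Python programs: Long Method, Large Class, Long Base Class List, Long Parameter List, and Long Scope Chaining. The method applies two ensemble techniques, stacking and voting, and compares them with six machine learning models: AdaBoost, SVM, GBT, DT, KNN, and GNB. Ensemble learning builds and combines multiple classifiers to solve the same problem, with the aim of improving overall accuracy, generalization, and robustness while reducing the risk of overfitting. The experimental dataset contains 1000 samples for each code smell, with 18 features extracted from the source code. To evaluate the ensemble methods, 10-fold cross-validation is applied, which assesses a model by partitioning the original dataset into a training set used to fit the model and a test set used to evaluate it. To improve accuracy, the paper also uses parameter optimization based on grid search. The experimental results show that stacking performs well on Long Base Class List and Long Parameter List, with F1 scores of 0.97 and 0.89 respectively. Voting performs well on Long Method, Large Class, Long Base Class List, and Long Parameter List, with F1 scores of 0.94, 0.83, 0.97, and 0.92 respectively, and it outperforms stacking overall. In addition, for both the individual machine learning models and the ensemble methods, using grid search yields higher accuracy than using 10-fold cross-validation.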
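
The abstract does not list the 18 features, so the following is only a sketch of how a few size metrics related to three of the smells (long parameter list, long method, long base class list) could be read from source code with Python's standard `ast` module. The function name `extract_metrics`, the metric names, and the sample class are illustrative assumptions, not the authors' feature set.

```python
import ast


def extract_metrics(source):
    """Collect a few size metrics per function and class (illustrative only)."""
    rows = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            # Count every declared parameter, including *args and **kwargs.
            param_count = (
                len(args.posonlyargs)
                + len(args.args)
                + len(args.kwonlyargs)
                + int(args.vararg is not None)
                + int(args.kwarg is not None)
            )
            rows.append({
                "kind": "function",
                "name": node.name,
                "param_count": param_count,
                "line_count": node.end_lineno - node.lineno + 1,
            })
        elif isinstance(node, ast.ClassDef):
            methods = [
                child for child in node.body
                if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef))
            ]
            rows.append({
                "kind": "class",
                "name": node.name,
                "base_count": len(node.bases),
                "method_count": len(methods),
                "line_count": node.end_lineno - node.lineno + 1,
            })
    return rows


# Hypothetical input used only to exercise the function.
SAMPLE = '''
class Report(Base, Mixin, Serializable):
    def render(self, title, body, footer, *, width=80):
        header = title.upper()
        return header + body + footer
'''

metrics = extract_metrics(SAMPLE)
by_name = {row["name"]: row for row in metrics}
print(by_name["Report"]["base_count"], by_name["render"]["param_count"])
```

Rows like these could then serve as feature vectors for a classifier. The approach described in the abstract learns the decision from labeled samples rather than applying fixed thresholds to such metrics.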

Keywords: Python; ensemble learning; code metrics; code smells

DOI:

CLC number: TP311

Funding: National Natural Science Foundation of China (61672355)

A Python code smell detection method based on ensemble learning

Cao Yue, Chen Junhua

Shanghai Normal University

Abstract:

Python is used in a wide range of software projects and fields, and code smells in Python can significantly impair maintainability, comprehensibility, and testability. This paper proposes a Python code smell detection method based on ensemble learning that detects five types of code smells (LongMethod, LongClass, LongBaseClassList, LongParameterList, and LongScopeChaining). The method uses two ensemble learning techniques, stacking and voting, and compares them with six machine learning models: AdaBoost, SVM, GBT, DT, KNN, and GNB. Ensemble learning constructs and combines multiple classifiers to solve the same problem, aiming to improve overall accuracy, generalization ability, and robustness while reducing the risk of overfitting. The experimental dataset contains 1000 samples for each code smell, with 18 features extracted from the source code. To evaluate the ensemble methods, 10-fold cross-validation is applied, which assesses a model by dividing the original dataset into a training set for fitting the model and a test set for evaluating it. To improve accuracy, the paper also adopts parameter optimization based on grid search. The results show that the stacking ensemble detects LongBaseClassList and LongParameterList well, with F1 scores of 0.97 and 0.89, respectively. The voting ensemble performs well on four smells, LongMethod, LongClass, LongBaseClassList, and LongParameterList, with F1 scores of 0.94, 0.83, 0.97, and 0.92, respectively, and it outperforms the stacking ensemble. Moreover, for both the individual machine learning models and the ensemble methods, using grid search yields higher accuracy than using 10-fold cross-validation.
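
As a rough illustration of the evaluation vocabulary used in the abstract, the sketch below implements hard (majority) voting, the F1 score, and a 10-fold index split in plain Python. All function names and the toy predictions are hypothetical. The paper's actual classifiers, features, and data are not reproduced here, and a real experiment would more likely rely on a library such as scikit-learn.

```python
from collections import Counter


def majority_vote(predictions):
    """Hard voting: `predictions` holds one label list per classifier."""
    # With an odd number of binary classifiers no ties can occur; on a tie,
    # Counter.most_common keeps the label that was seen first.
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


# Toy labels: 1 = smelly, 0 = clean. Each classifier makes different mistakes.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
clf_a = [1, 1, 0, 0, 0, 1, 1, 0]
clf_b = [1, 0, 1, 0, 1, 0, 1, 0]
clf_c = [1, 1, 1, 0, 0, 0, 0, 0]

ensemble = majority_vote([clf_a, clf_b, clf_c])
print("F1 of clf_a:   ", f1_score(y_true, clf_a))
print("F1 of ensemble:", f1_score(y_true, ensemble))

# Splitting 1000 samples (the per-smell dataset size) into 10 folds.
folds = list(k_fold_indices(1000, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

In this toy case the three classifiers err on different samples, so the vote corrects each individual mistake. That is the intuition behind combining classifiers, although real ensembles rarely gain this much.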

Key words: Python; ensemble learning; code metrics; code smells