TY - JOUR
T1 - MAHAKIL
T2 - Diversity Based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction
AU - Bennin, Kwabena Ebo
AU - Keung, Jacky
AU - Phannachitta, Passakorn
AU - Monden, Akito
AU - Mensah, Solomon
N1 - Publisher Copyright:
© 1976-2012 IEEE.
PY - 2018/6/1
Y1 - 2018/6/1
N2 - Highly imbalanced data typically make accurate predictions difficult. Unfortunately, software defect datasets tend to have fewer defective modules than non-defective modules. Synthetic oversampling approaches address this concern by creating new minority defective modules to balance the class distribution before a model is trained. Notwithstanding the successes achieved by these approaches, they mostly result in over-generalization (high rates of false alarms) and generate near-duplicated data instances (less diverse data). In this study, we introduce MAHAKIL, a novel and efficient synthetic oversampling approach for software defect datasets that is based on the chromosomal theory of inheritance. Exploiting this theory, MAHAKIL interprets two distinct sub-classes as parents and generates a new instance that inherits different traits from each parent and contributes to the diversity within the data distribution. We extensively compare MAHAKIL with SMOTE, Borderline-SMOTE, ADASYN, Random Oversampling and the No sampling approach using 20 releases of defect datasets from the PROMISE repository and five prediction models. Our experiments indicate that MAHAKIL improves the prediction performance for all the models and achieves better and more significant pf values than the other oversampling approaches, based on Brunner's statistical significance test and Cliff's effect sizes. Therefore, MAHAKIL is strongly recommended as an efficient alternative for defect prediction models built on highly imbalanced datasets.
AB - Highly imbalanced data typically make accurate predictions difficult. Unfortunately, software defect datasets tend to have fewer defective modules than non-defective modules. Synthetic oversampling approaches address this concern by creating new minority defective modules to balance the class distribution before a model is trained. Notwithstanding the successes achieved by these approaches, they mostly result in over-generalization (high rates of false alarms) and generate near-duplicated data instances (less diverse data). In this study, we introduce MAHAKIL, a novel and efficient synthetic oversampling approach for software defect datasets that is based on the chromosomal theory of inheritance. Exploiting this theory, MAHAKIL interprets two distinct sub-classes as parents and generates a new instance that inherits different traits from each parent and contributes to the diversity within the data distribution. We extensively compare MAHAKIL with SMOTE, Borderline-SMOTE, ADASYN, Random Oversampling and the No sampling approach using 20 releases of defect datasets from the PROMISE repository and five prediction models. Our experiments indicate that MAHAKIL improves the prediction performance for all the models and achieves better and more significant pf values than the other oversampling approaches, based on Brunner's statistical significance test and Cliff's effect sizes. Therefore, MAHAKIL is strongly recommended as an efficient alternative for defect prediction models built on highly imbalanced datasets.
KW - Software defect prediction
KW - class imbalance learning
KW - classification problems
KW - data sampling methods
KW - synthetic sample generation
UR - http://www.scopus.com/inward/record.url?scp=85028936214&partnerID=8YFLogxK
U2 - 10.1109/TSE.2017.2731766
DO - 10.1109/TSE.2017.2731766
M3 - Article
AN - SCOPUS:85028936214
SN - 0098-5589
VL - 44
SP - 534
EP - 550
JO - IEEE Transactions on Software Engineering
JF - IEEE Transactions on Software Engineering
IS - 6
ER -