TY - JOUR
T1 - Investigating the use of exemplary data for software vulnerability prediction
AU - Kudjo, Patrick Kwaku
AU - Mensah, Solomon
AU - Owusu, Ebenezer
AU - Appati, Justice Kwame
N1 - Publisher Copyright: © The Author(s) under exclusive licence to The Society for Reliability Engineering, Quality and Operations Management (SREQOM), India and The Division of Operation and Maintenance, Lulea University of Technology, Sweden 2025.
PY - 2025
Y1 - 2025
N2 - Vulnerability prediction models (VPMs) are statistical machine learning algorithms that are trained to identify vulnerable components in large software systems. Recently, a wide range of software metrics, like the number of dependencies and the size of code between modules, have been evaluated as potential indicators (i.e., features) for building VPMs. Notwithstanding the success achieved by these approaches, none of these models has performed better in vulnerability prediction. This study investigates the use of exemplary data (i.e., Bellwether instances) for vulnerability prediction and explores the impact of Bellwether on VPMs. Specifically, we use n-grams to identify features of vulnerable Java code for improved prediction accuracy. We evaluate our approach on ten Java Android applications extracted from the F-Droid repository. Six machine learning algorithms are used, and the prediction results are evaluated in terms of precision, recall, F-measure, ROC-AUC, and Yuen’s statistical test. The findings indicate that the Bellwether method outperformed the growing portfolio, with F-measure values ranging from 18.5% to 94.4% across the studied datasets. We found that the decision tree emerged as the best model (AUC value of 0.81) compared with the other classifiers when trained with Bellwether instances. Hence, we recommend the application of Bellwether instances when setting up VPMs.
AB - Vulnerability prediction models (VPMs) are statistical machine learning algorithms that are trained to identify vulnerable components in large software systems. Recently, a wide range of software metrics, like the number of dependencies and the size of code between modules, have been evaluated as potential indicators (i.e., features) for building VPMs. Notwithstanding the success achieved by these approaches, none of these models has performed better in vulnerability prediction. This study investigates the use of exemplary data (i.e., Bellwether instances) for vulnerability prediction and explores the impact of Bellwether on VPMs. Specifically, we use n-grams to identify features of vulnerable Java code for improved prediction accuracy. We evaluate our approach on ten Java Android applications extracted from the F-Droid repository. Six machine learning algorithms are used, and the prediction results are evaluated in terms of precision, recall, F-measure, ROC-AUC, and Yuen’s statistical test. The findings indicate that the Bellwether method outperformed the growing portfolio, with F-measure values ranging from 18.5% to 94.4% across the studied datasets. We found that the decision tree emerged as the best model (AUC value of 0.81) compared with the other classifiers when trained with Bellwether instances. Hence, we recommend the application of Bellwether instances when setting up VPMs.
KW - Bellwether method
KW - Classification
KW - Exemplary data
KW - N-gram
KW - Vulnerability prediction
UR - https://www.scopus.com/pages/publications/105019653422
U2 - 10.1007/s13198-025-03017-7
DO - 10.1007/s13198-025-03017-7
M3 - Article
AN - SCOPUS:105019653422
SN - 0975-6809
JO - International Journal of System Assurance Engineering and Management
JF - International Journal of System Assurance Engineering and Management
ER -