TY - JOUR
T1 - An Empirical Study on Small-Sized Datasets Based on Eubank’s Optimal Spacing Theorem
AU - Abedu, Samuel
AU - Mensah, Solomon
AU - Boafo, Frederick
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd. 2024.
PY - 2025/1
Y1 - 2025/1
N2 - Conventional machine learning methods for software effort estimation (SEE) have seen increasing research interest. In contrast, few studies attempt to evaluate how well deep learning techniques perform in SEE, which can be attributed to the relatively small sizes of SEE datasets. The goal of this study is to establish a threshold for small-sized datasets in SEE and to investigate how well selected deep learning and traditional machine learning models perform on such datasets. Plausible SEE datasets are extracted from the existing literature and ranked by their attributes and number of project cases. The ranking of project instances is discretized into three classes (small, medium, and large) using Eubank’s optimal spacing theory. Using leave-one-out cross-validation, each small-sized dataset is used to train two deep learning models and five conventional machine learning models, and each model’s predictive performance is evaluated using its mean absolute error. Results show that, on small-sized datasets, deep learning models outperform traditional machine learning models in prediction accuracy, contradicting prior findings. Regularisation techniques can therefore be used in conjunction with deep learning to address SEE.
AB - Conventional machine learning methods for software effort estimation (SEE) have seen increasing research interest. In contrast, few studies attempt to evaluate how well deep learning techniques perform in SEE, which can be attributed to the relatively small sizes of SEE datasets. The goal of this study is to establish a threshold for small-sized datasets in SEE and to investigate how well selected deep learning and traditional machine learning models perform on such datasets. Plausible SEE datasets are extracted from the existing literature and ranked by their attributes and number of project cases. The ranking of project instances is discretized into three classes (small, medium, and large) using Eubank’s optimal spacing theory. Using leave-one-out cross-validation, each small-sized dataset is used to train two deep learning models and five conventional machine learning models, and each model’s predictive performance is evaluated using its mean absolute error. Results show that, on small-sized datasets, deep learning models outperform traditional machine learning models in prediction accuracy, contradicting prior findings. Regularisation techniques can therefore be used in conjunction with deep learning to address SEE.
KW - Deep learning
KW - Eubank’s optimal spacing theory
KW - Small-sized
KW - Software effort estimation
KW - Traditional machine learning
UR - http://www.scopus.com/inward/record.url?scp=85212428342&partnerID=8YFLogxK
U2 - 10.1007/s42979-024-03517-6
DO - 10.1007/s42979-024-03517-6
M3 - Article
AN - SCOPUS:85212428342
SN - 2662-995X
VL - 6
JO - SN Computer Science
JF - SN Computer Science
IS - 1
M1 - 1
ER -