TY - JOUR
T1 - An automatic software vulnerability classification framework using term frequency-inverse gravity moment and feature selection
AU - Chen, Jinfu
AU - Kudjo, Patrick Kwaku
AU - Mensah, Solomon
AU - Brown, Selasie Aformaley
AU - Akorfu, George
N1 - Publisher Copyright:
© 2020 Elsevier Inc.
PY - 2020/9
Y1 - 2020/9
N2 - Vulnerability classification is an important activity in software development and software quality maintenance. A typical vulnerability classification model usually involves a stage of term selection, in which the relevant terms are identified via feature selection. It also involves a stage of term-weighting, in which the document weights for the selected terms are computed, and a stage for classifier learning. Generally, the term frequency-inverse document frequency (TF-IDF) model is the most widely used term-weighting metric for vulnerability classification. However, several issues hinder the effectiveness of the TF-IDF model for document classification. To address this problem, we propose and evaluate a general framework for vulnerability severity classification using the term frequency-inverse gravity moment (TF-IGM). Specifically, we extensively compare the term frequency-inverse gravity moment, term frequency-inverse document frequency, and information gain feature selection using five machine learning algorithms on ten vulnerable software applications containing a total number of 27,248 security vulnerabilities. The experimental result shows that: (i) the TF-IGM model is a promising term weighting metric for vulnerability classification compared to the classical term-weighting metric, (ii) the effectiveness of feature selection on vulnerability classification varies significantly across the studied datasets and (iii) feature selection improves vulnerability classification.
AB - Vulnerability classification is an important activity in software development and software quality maintenance. A typical vulnerability classification model usually involves a stage of term selection, in which the relevant terms are identified via feature selection. It also involves a stage of term-weighting, in which the document weights for the selected terms are computed, and a stage for classifier learning. Generally, the term frequency-inverse document frequency (TF-IDF) model is the most widely used term-weighting metric for vulnerability classification. However, several issues hinder the effectiveness of the TF-IDF model for document classification. To address this problem, we propose and evaluate a general framework for vulnerability severity classification using the term frequency-inverse gravity moment (TF-IGM). Specifically, we extensively compare the term frequency-inverse gravity moment, term frequency-inverse document frequency, and information gain feature selection using five machine learning algorithms on ten vulnerable software applications containing a total number of 27,248 security vulnerabilities. The experimental result shows that: (i) the TF-IGM model is a promising term weighting metric for vulnerability classification compared to the classical term-weighting metric, (ii) the effectiveness of feature selection on vulnerability classification varies significantly across the studied datasets and (iii) feature selection improves vulnerability classification.
KW - Classification
KW - Feature selection
KW - Machine learning algorithms
KW - Severity
KW - Software vulnerability
KW - Term-weighting
UR - http://www.scopus.com/inward/record.url?scp=85085160465&partnerID=8YFLogxK
U2 - 10.1016/j.jss.2020.110616
DO - 10.1016/j.jss.2020.110616
M3 - Article
AN - SCOPUS:85085160465
SN - 0164-1212
VL - 167
JO - Journal of Systems and Software
JF - Journal of Systems and Software
M1 - 110616
ER -