A novel data balancing technique via resampling majority and minority classes toward effective classification

Mahmudul Hasan, Md. Fazle Rabbi, Md. Nahid Sultan, Adiba Mahjabin Nitu, Md. Palash Uddin

Abstract


Classification is a predictive modelling task in machine learning (ML) in which a class label is predicted for an example described by a predefined set of features. In applications such as handwritten character recognition, spam identification, disease detection, and signal identification, classification requires training data with many features and labelled instances. In medical informatics, high precision and recall are required in addition to high classifier accuracy. Most real-life datasets are imbalanced, which hampers the overall performance of classifiers. Existing data balancing techniques process the whole dataset at once, which sometimes causes overfitting or underfitting. We propose a data balancing technique that follows a divide-and-conquer procedure: the dataset is clustered into several segments, and both oversampling and undersampling are performed on each cluster. Finally, the clusters are joined together to build a balanced dataset. We evaluate the technique on samples of two heart disease datasets, Hungarian and Long Beach, using logistic regression and random forest classifiers as representative ML algorithms. We compare the proposed technique with the existing SMOTE, NearMiss, and SMOTETomek data balancing techniques. Both algorithms perform better on the dataset balanced by the proposed technique. These results suggest that the technique is a strong candidate strategy for handling imbalanced data.
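
The abstract does not name the clustering method, the per-cluster target size, or the exact oversampling and undersampling rules, so the following is only a minimal sketch of the divide-and-conquer idea under stated assumptions. A small NumPy k-means stands in for the clustering step, plain random resampling stands in for the paper's operations, and every class inside a cluster is resampled to that cluster's mean class count. The function names `kmeans_labels` and `balance_by_cluster` and the parameters `k` and `seed` are hypothetical.

```python
import numpy as np


def kmeans_labels(X, k, rng, n_iter=20):
    """Tiny k-means (an assumed clustering step; the abstract names none)."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every row to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members; empty clusters stay put.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def balance_by_cluster(X, y, k=3, seed=0):
    """Cluster the data, resample each class per cluster, then rejoin."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    clusters = kmeans_labels(X, k, rng)

    keep = []
    for c in range(k):
        idx = np.flatnonzero(clusters == c)
        classes, counts = np.unique(y[idx], return_counts=True)
        if len(classes) < 2:
            # Empty or single-class cluster: nothing to balance against.
            keep.append(idx)
            continue
        # Assumed target: the mean class count inside this cluster.
        target = int(round(counts.mean()))
        for cls, n in zip(classes, counts):
            members = idx[y[idx] == cls]
            # Oversample with replacement below the target,
            # undersample without replacement otherwise.
            keep.append(rng.choice(members, size=target, replace=bool(n < target)))

    # Join the per-cluster selections into one dataset.
    keep = np.concatenate(keep)
    return X[keep], y[keep]
```

Calling `balance_by_cluster(X, y, k=3)` returns the resampled feature matrix and labels. In this sketch, clusters that contain a single class are passed through unchanged, so the joined dataset is balanced within each mixed cluster but not necessarily overall; how the paper treats such clusters is not stated in the abstract.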

Keywords


divide and conquer; heart disease prediction; imbalance data handling; machine learning; medical informatics

DOI: http://doi.org/10.12928/telkomnika.v21i6.25211

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

TELKOMNIKA Telecommunication, Computing, Electronics and Control
ISSN: 1693-6930, e-ISSN: 2302-9293
Universitas Ahmad Dahlan, 4th Campus
Jl. Ringroad Selatan, Kragilan, Tamanan, Banguntapan, Bantul, Yogyakarta, Indonesia 55191
Phone: +62 (274) 563515, 511830, 379418, 371120
Fax: +62 274 564604