A novel data balancing technique via resampling majority and minority classes toward effective classification

Classification is a predictive modelling task in machine learning (ML) in which a class label is determined for an example described by predefined features. In tasks such as recognizing handwritten characters, identifying spam, detecting disease, and identifying signals, classification requires training data with many features and labelled instances. In medical informatics, high precision and recall are mandatory in addition to high classifier accuracy. Most real-life datasets are imbalanced, which hampers the overall performance of the classifiers. Existing data balancing techniques operate on the whole dataset at once, which sometimes causes overfitting and underfitting. We propose a data balancing technique that follows a divide-and-conquer procedure: the dataset is clustered into several segments, and both oversampling and undersampling are performed on each cluster. Finally, the clusters are joined together to build a balanced dataset. We chose sample data from two heart disease datasets: Hungarian and Long Beach. Logistic regression and the random forest classifier serve as the representative ML algorithms. We compare our proposed technique with the existing SMOTE, NearMiss, and SMOTETomek data balancing techniques. Both algorithms perform better on the dataset balanced with the proposed technique. This technique can be an optimal solution for handling imbalanced data.


INTRODUCTION
In practical use, classification models frequently face the imbalanced dataset problem, where the number of instances from the majority class is substantially larger than that from the minority class, preventing the model from learning well from the minority class [1]. This becomes a significant issue when the minority class is the crucial one, as in disease diagnosis, churn prediction, or fraud identification. Oversampling the minority class and undersampling the majority class are the standard methods for addressing this issue, but each has its weaknesses. Vanilla oversampling replicates a random subset of the minority class; as a result, it adds no new information [2]. Undersampling deletes random samples from the majority class, which loses some of the information in the original data. When the dataset is highly imbalanced, oversampling creates a massive amount of synthetic data for the minority class, which reduces the variance of that class and increases the bias during classification [3]. Oversampling sometimes causes model overfitting, while undersampling causes loss of information and reduces the performance of the classifiers [4]. Existing hybrid oversampling and undersampling methods try to fix these issues, but they fail for the data distributions found in some domains, namely healthcare informatics, biostatistics, and bioinformatics. There, the instances of individual classes lie close to each other and sometimes overlap. This misguides machine learning (ML) classifiers and creates ambiguity during the learning stage of the models.

Dataset balancing is a powerful preprocessing technique in ML, and many researchers use it in different domains. Among the work in this area, Batista et al. [5] perform a comprehensive experimental evaluation comparing ten techniques for dealing with the class imbalance problem on thirteen University of California, Irvine (UCI) datasets. They found through their experiments that class imbalance does not consistently reduce the efficiency of learning systems. To detect code smells, some researchers use ML and found that this procedure offers minimal performance because of the highly imbalanced dataset. They use the synthetic minority oversampling technique (SMOTE) in the preprocessing stage and conclude that data balancing does not dramatically improve the performance of the models [6]. An extension of this work by the same researchers [7] uses five different data balancing techniques and shows their impact on code smell detection in object-oriented systems. The results demonstrate that skipping the balancing stage does not significantly impact accuracy. In another study, Lemaître et al. [8] present imbalanced-learn, a Python toolbox with an application programming interface (API) for handling imbalanced datasets in ML. They discuss several existing data balancing techniques and compare the models on binary and multiclass data balancing; they also describe how each method works, whether by oversampling or undersampling. For heart disease prediction, the study in [9] uses a hybrid approach combining SMOTE with edited nearest neighbor (ENN) to balance the dataset. [13].

The high bias of the balanced data leads to unstable classification reports across classes. Sometimes the accuracy is satisfactory, but the precision and recall fluctuate greatly among the classes, leaving room to improve the classifier's performance. To mitigate these issues, in this study:
− We propose a divide-and-conquer-based data balancing technique that keeps the classifier's performance stable, achieves high accuracy, and prevents overfitting and underfitting.
− To check the performance of the proposed data balancing technique, we evaluate all possible combinations of the balancing techniques and classifiers.
− We treat noisy and missing values using random forest regression to make the dataset more ML-trainable.

PROPOSED METHOD

Method overview
In this study, we take two healthcare informatics datasets as samples of imbalanced data. Our proposed methodology includes data preprocessing; we then apply each data balancing technique to the dataset individually and fit the balanced dataset to two ML algorithms, logistic regression (LR) and the random forest classifier (RFC). The top-down view of the proposed method is shown in Figure 1. We evaluate the performance of each combination using accuracy, precision, recall, and F1 score, and finally show the receiver operating characteristic (ROC) curve of each combination. To check the stability of the results across folds, we use stratified K-fold cross-validation on the imbalanced data and K-fold cross-validation on the balanced datasets. We compare the result of each combination with the combinations that use the proposed data balancing technique, and all results are presented in the results section.

Preprocessing techniques
The two datasets are noisy and have many missing values. We apply random forest regression [15] to estimate the missing values and fill in the data. The feature with the missing value is treated as the dependent variable and the other features as independent variables; the regression output is then placed in the missing position. This process is performed on the columns that have a large number of missing values. For the columns that contain only a few missing values, we fill the gaps with the arithmetic mean of the column.
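The mean-filling step for sparsely missing columns can be sketched as follows. This is a minimal illustration, assuming missing entries are represented as `None`; the function name and the sample column are hypothetical, and the regression-based filling used for heavily missing columns is not covered here.

```python
import statistics


def fill_missing_with_mean(column):
    """Replace None entries with the arithmetic mean of the observed values."""
    observed = [value for value in column if value is not None]
    if not observed:
        raise ValueError("column has no observed values to average")
    mean = statistics.fmean(observed)
    return [mean if value is None else value for value in column]


# Hypothetical column with two missing entries.
sample_column = [200.0, None, 240.0, 220.0, None]
filled = fill_missing_with_mean(sample_column)
print(filled)
```

Only the observed entries contribute to the mean, so the filled values do not shift the column average.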
We also check the datasets for outliers and handle them using Tukey fences [16]. The data are divided by three quartiles: Q1, Q2, and Q3. The first quartile, Q1, is the value below which 25% of the values in the dataset fall. The third quartile, Q3, is the value above which 25% of the values fall. Outliers are values that fall below the lower limit or above the upper limit defined in (1) and (2). Outliers that fall below the lower limit are replaced with the lower limit, and outliers above the upper limit are replaced with the upper limit.
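A minimal sketch of this clipping step is given below. It assumes the limits take the usual Tukey form, lower = Q1 − k·IQR and upper = Q3 + k·IQR with IQR = Q3 − Q1 and k = 1.5; the multiplier, the quartile interpolation method, and the function name are assumptions rather than values stated in this paper.

```python
import statistics


def clip_with_tukey_fences(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed multiplier."""
    # quantiles() with n=4 returns the three cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values outside the fences are replaced by the nearest fence.
    return [min(max(v, lower), upper) for v in values]


clipped = clip_with_tukey_fences([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(clipped)
```

In this example Q1 = 3 and Q3 = 7, so the upper limit is 13 and the value 100 is replaced by 13, while the remaining values are left unchanged.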

Baseline data balancing techniques
The data are balanced using SMOTE, NearMiss, and the combined over- and undersampling technique SMOTETomek. The effect of each technique, compared with our proposed method, is tabulated in the results section.
SMOTE: SMOTE stands out as a widely used approach for addressing imbalanced data [17]. This technique creates synthetic instances for the underrepresented class, enlarging the dataset without sacrificing information from the original records.
NearMiss: NearMiss is a common undersampling method for rectifying skewed data. It removes majority-class samples selected by their distance to minority-class samples, which can lead to data loss. Hence, an underfitting model may result in some scenarios.
SMOTETomek: SMOTETomek is a hybrid strategy for dealing with imbalanced datasets that employs both oversampling and undersampling [18]. As a result, the performance measures used for classification move either up or down depending on the dataset's underlying statistical properties.
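To make the distance-based undersampling idea concrete, the following sketch follows the NearMiss-1 variant as it is commonly described: keep the majority samples whose average distance to their k closest minority samples is smallest. It is an illustration only, with hypothetical function and parameter names; in practice a library such as imbalanced-learn [8] would normally be used.

```python
import math


def near_miss_1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points closest, on average, to their k nearest minority points."""

    def avg_dist_to_closest_minority(point):
        distances = sorted(math.dist(point, m) for m in minority)[:k]
        return sum(distances) / len(distances)

    return sorted(majority, key=avg_dist_to_closest_minority)[:n_keep]


majority_points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
minority_points = [(10.0, 0.0), (11.0, 0.0)]
kept = near_miss_1(majority_points, minority_points, n_keep=2, k=2)
print(kept)
```

The discarded majority points are those farthest from the minority class, which is where the data loss mentioned above comes from.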

Proposed data balancing technique
The above data balancing techniques use oversampling, undersampling, or both to balance the dataset, treating the whole dataset as a single cluster on which the balancing operation is performed. In our proposed data balancing technique, we divide the dataset into several clusters based on the data characteristics determined by K-means clustering [19]. We balance each cluster separately, then merge the individual clusters to create the final balanced dataset. In each cluster, we first find the majority and minority classes, then apply the resampling techniques. In existing approaches, the majority and minority classes are fixed, but in our proposed technique they change based on the data samples of each cluster. In each cluster, we choose a random data point from the minority class and calculate the distance between it and its k nearest neighbors. Then, we multiply the distance by a random number between 0 and 1 and add the resulting new data point as a synthetic sample for the minority class. This step is repeated until the minority class reaches the desired proportion. Then, we choose a random data point from the majority class and check its nearest neighbors. If the neighbor is from the minority class, we remove the random data point. Selecting the observations x and y requires the following properties:
− The nearest neighbor of observation x is y.
− The nearest neighbor of observation y is x.
− x and y belong to different classes.
This means that one of x and y belongs to the majority class and the other to the minority class, and we select the two as a pair. Let $d(x_i, x_j)$ denote the Euclidean distance between the data points $x_i$ and $x_j$, where $x_i$ is a minority class sample and $x_j$ is a majority class sample. If there is no sample $x_k$ that satisfies the condition

$d(x_i, x_k) < d(x_i, x_j)$ or $d(x_j, x_k) < d(x_i, x_j)$

then $(x_i, x_j)$ is the selected pair.
This technique can be used to identify and eliminate data samples from the majority class with the smallest Euclidean distance to the data from the minority class (i.e. the data from the majority class that is closest to the data from the minority class, thus making it ambiguous to differentiate).
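The per-cluster procedure described in this section can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the cluster assignments have already been computed (for example by K-means), that the desired proportion is 1:1 within each cluster, and that labels are binary. All function names are hypothetical. The undersampling step is implemented as removal of the majority member of each mutual-nearest-neighbour pair with differing classes (a Tomek link).

```python
import math
import random
from collections import Counter, defaultdict


def oversample(minority, n_new, k, rng):
    """Interpolate n_new synthetic points between minority points and their k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        if not neighbours:  # a single minority point can only be duplicated
            synthetic.append(tuple(base))
            continue
        target = rng.choice(neighbours)
        gap = rng.random()  # random number between 0 and 1
        synthetic.append(tuple(b + gap * (t - b) for b, t in zip(base, target)))
    return synthetic


def drop_majority_in_tomek_links(points, labels, majority):
    """Remove majority points that form a mutual-nearest-neighbour pair with a minority point."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    nn = [nearest(i) for i in range(len(points))]
    drop = {i for i, j in enumerate(nn)
            if nn[j] == i and labels[i] != labels[j] and labels[i] == majority}
    keep = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep]


def balance_by_cluster(X, y, cluster_ids, k=5, seed=0):
    """Balance each cluster separately, then merge the clusters."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for x, label, c in zip(X, y, cluster_ids):
        groups[c].append((tuple(x), label))
    X_out, y_out = [], []
    for c in sorted(groups):
        pts = [p for p, _ in groups[c]]
        labs = [label for _, label in groups[c]]
        counts = Counter(labs)
        if len(counts) == 2:  # majority/minority are decided per cluster
            (maj, n_maj), (mino, n_min) = counts.most_common(2)
            minority = [p for p, label in zip(pts, labs) if label == mino]
            new_points = oversample(minority, n_maj - n_min, k, rng)
            pts += new_points
            labs += [mino] * len(new_points)
            pts, labs = drop_majority_in_tomek_links(pts, labs, maj)
        X_out += pts
        y_out += labs
    return X_out, y_out
```

Note that the majority class is recomputed inside every cluster, so a class that is the minority overall can be the majority in a particular cluster, as described above.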

Machine learning algorithms
LR: LR is a statistical model that estimates the likelihood of an event occurring based on a set of independent variables. The model is particularly geared towards classification problems [20]. Notably, LR yields binary outcomes: the event either occurs or does not [21].
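As a small illustration of how LR turns features into a binary outcome, the sketch below applies the logistic function to a weighted sum of the features and thresholds the resulting probability. The weights, bias, and 0.5 threshold are hypothetical values chosen for the example, not fitted parameters from this study.

```python
import math


def predict_proba(features, weights, bias):
    """Probability of the positive class under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Binary outcome: 1 if the estimated probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical parameters for a two-feature model.
weights, bias = [1.0, -0.5], -0.5
print(predict_proba([2.0, 1.0], weights, bias))
print(predict_label([2.0, 1.0], weights, bias))
```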
RFC: RFC is an extension of bagging, an ensemble technique, and is built by combining numerous decision trees [22]. It addresses overfitting by choosing a subset of the candidate features while building each tree, in contrast to a single decision tree that considers all features. The process builds distinct decision trees from the training dataset and then combines their outcomes to yield the final result. In classification tasks, RFC uses a voting mechanism in which the class with the highest vote count is selected as the final prediction [23].
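The voting step can be sketched in a few lines. The example assumes each tree has already produced a class label for the sample; the function name is hypothetical, and ties are resolved in favour of the label that appears first.

```python
from collections import Counter


def forest_predict(tree_votes):
    """Return the class label with the highest vote count among the trees."""
    return Counter(tree_votes).most_common(1)[0][0]


# Hypothetical votes from five trees for one sample.
print(forest_predict([1, 0, 1, 1, 0]))
```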

Performance measure techniques
The purpose of the classification report is to evaluate classifier performance. Accuracy is one of the metrics used to gauge the efficacy of classification algorithms. Precision measures the proportion of samples predicted to be positive that are indeed positive [24]. This metric is valuable when the aim is to minimize false positives. Recall indicates how many of the actual positive samples are captured by the positive predictions [25]. The F1 score, the harmonic mean of precision and recall, combines both measures. It can be more informative than accuracy on imbalanced binary classification datasets. Furthermore, the receiver operating characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR). This graphical representation helps assess classifier performance across different thresholds.
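These measures can be computed directly from the confusion counts, as the following sketch shows. It assumes binary labels with 1 as the positive class; the function names are hypothetical, and undefined ratios are reported as 0.0 by convention.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def safe_div(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


def roc_point(y_true, scores, threshold):
    """One (FPR, TPR) point of the ROC curve at the given score threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return safe_div(fp, fp + tn), safe_div(tp, tp + fn)


y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))
```

Sweeping the threshold in `roc_point` from high to low traces the ROC curve from (0, 0) towards (1, 1).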

RESULT AND DISCUSSION
In this study, we propose a divide-and-conquer-based data balancing technique and compare classifier performance to establish the superiority of the proposed technique. Firstly, we apply LR and RFC to the imbalanced dataset and then use SMOTE, NearMiss, SMOTETomek, and the proposed balancing technique one by one and check the performance of the models.
Table 1 presents the state of the data before and after data balancing. We show five different states of both datasets. The table shows that the datasets are imbalanced and that they become balanced after the balancing techniques are applied; the number of instances sometimes increases and sometimes decreases. We use each balanced dataset separately with the ML classifiers and check the results using the classification report.

Performance of the classifiers on imbalance datasets
We apply LR and RFC to both datasets. Table 2 shows the results for the Long Beach and Hungarian datasets in their imbalanced state. The ML classifiers do not perform well, and the precision, recall, and F1 score of the individual classes are unstable. In Table 2, precision and recall are lower for class 0 than for class 1 with both classifiers. The gap is significant in both datasets, but it is smaller for the Hungarian dataset because the Long Beach data is more imbalanced than the Hungarian data. To solve this issue, we need data balancing techniques that yield stable output and good accuracy.

Performance of the classifiers after balancing in Long Beach dataset
Firstly, we use the balancing techniques on the Long Beach dataset, and the results are tabulated in Tables 3 and 4. The tables indicate that the proposed data balancing technique outperforms the other balancing techniques with both classifiers. On the imbalanced Long Beach dataset, the accuracy of LR is 78% and that of RF is 79%, but the other metrics are unstable for both classes. Applying SMOTE, NearMiss, and SMOTETomek improves the stability of the other performance measures, but accuracy falls. The proposed data balancing technique shows a different picture: it improves the classification accuracy and also stabilizes the other performance measures for both classes, which supports the superiority of the proposed technique. The performance of each combination is shown as ROC curves in Figure 2. We also apply 10-fold cross-validation to the imbalanced and balanced datasets and obtain a small standard deviation in average accuracy. Each fold shows good performance, and most cases are stable.
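The fold construction behind stratified cross-validation can be sketched as follows. This is a simplified round-robin assignment with a hypothetical function name, intended only to show how each fold keeps roughly the class ratio of the full dataset; it is not the exact procedure of any particular library.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Split sample indices into n_folds folds that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class across the folds in turn.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Hypothetical imbalanced labels: 20 samples of class 0 and 10 of class 1.
example_labels = [0] * 20 + [1] * 10
example_folds = stratified_folds(example_labels, n_folds=5)
print([len(fold) for fold in example_folds])
```

Each fold serves once as the test set while the remaining folds form the training set; the mean and standard deviation of the per-fold accuracy then summarize stability.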

Performance of the classifiers after balancing in Hungarian dataset
We apply the same methodology to the Hungarian dataset to demonstrate the superiority of the proposed data balancing technique. As with the previous dataset, our proposed technique performs better than the other data balancing techniques. In the imbalanced state, LR and RF both show 77% accuracy. Our proposed method helps the classifiers: the accuracy of LR rises from 77% to 89% and that of RF from 77% to 91%. Comparing Table 2 with Table 5 shows that the results are more stable on the balanced dataset than on the imbalanced dataset. Precision and recall are stable with SMOTE balancing, but our proposed technique gives more balanced results and better classifier accuracy. We visualize the ROC curves of all possible combinations in Figure 3, which shows the superiority of the proposed data balancing technique. In Table 6, we show the 10-fold cross-validation scores of the algorithms and get a small standard deviation in every case. The results clearly indicate that we can choose RF as the classifier to predict heart disease after applying the proposed data balancing technique.

CONCLUSION AND FUTURE WORK
This study proposes a new data balancing technique to predict heart disease from two well-known datasets. We use the LR and RF ML algorithms and first check the performance of the classifiers on the imbalanced dataset. There, the performance of the classifiers is below the desired level, and precision and recall are unstable across the two classes. Then, we apply three existing data balancing techniques and check the performance of the classifiers. The overall performance sometimes falls, but precision and recall are more stable than on the imbalanced dataset. In 10-fold cross-validation, the performance of the classifiers is stable, and accuracy fluctuates within a narrow range. The overall results show that the proposed data balancing technique outperforms the three existing data balancing techniques, and RF shows better accuracy than LR. This study focuses on the binary classification problem, and the proposed data balancing technique is suitable for binary classification. Future work will focus on multi-label classification and will consider other domains as study areas.

ISSN: 1693-6930
each combination with the proposed data balancing techniques combinations and all the results present in the result section.

Figure 1. Top-down approach of the proposed method

Table 1. State of the data sample before and after data balancing

Table 2. Performance of the classifiers without balancing on the Long Beach and Hungarian datasets

Performance of the classifiers after balancing in Long Beach dataset
Firstly, we use the balancing techniques on the Long Beach dataset and results are tabulated in Tables

TELKOMNIKA Telecommun Comput El Control, Vol. 21, No. 6, December 2023: 1308-1316
Table 6, we show the 10-fold cross-validation score of the algorithms and get a small standard deviation in every case. The result clearly indicates that we can choose RF as a classifier to predict heart disease after applying the proposed data balancing technique.

Table 3. Performance of the classifiers after balancing on the Long Beach dataset

Table 4. Stratified K-fold and K-fold cross-validation scores for the Long Beach dataset

Table 5. Performance of the classifiers after balancing on the Hungarian dataset

Figure 3. ROC curve of all possible combinations in the Hungarian dataset