Enhancement of student performance prediction using modified K-nearest neighbor

ABSTRACT


INTRODUCTION
The growth of the internet and communication technologies has contributed to the spread of e-learning, which supports countries facing a rising scarcity of instructors [1]. Students, or individuals in general, who are looking for information can obtain it effectively and with minimal effort at any time and from anywhere. This encourages colleges and educational organizations to adopt online learning frameworks, which in turn increases the volume of student data. However, e-learning has many impediments and challenges, and student drop-out rates are higher than in conventional learning [2,3].
Educational data mining (EDM) is used to develop models that can improve e-learning systems, because the data gathered from such systems often covers large numbers of students [4]. The many variables that EDM can exploit for model building help the educational system perform better. The developed model can support the decisions of educational institutions and universities about the future of their students, for example by identifying students who are likely to succeed in a given course and recognizing students who may drop out or fail, so that they receive more attention as the course progresses [5]. K-nearest neighbor (KNN) is one of the simplest EDM algorithms [6]. It is computationally simple: it relies on a similarity measure such as the Euclidean distance, and the test sample is assigned the majority class of the K closest training samples [7]. KNN is an instance-based learner, sometimes called a lazy learner, because it defers all work until a new student (test sample) has to be classified (i.e., there is no training phase), and most of its power relies on the matching scheme [8]. KNN has some drawbacks, which can be listed as follows [9,10]:
- The computational overhead is extremely high, as the distance from each new student to all training samples must be calculated.
- The storage requirement is large and grows in proportion to the size of the training set.
- KNN has a low accuracy rate on multidimensional data sets.
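The traditional KNN procedure described above can be sketched as follows. This is a minimal illustration and not the authors' implementation (which the paper reports was written in C#); the function names and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # One distance per training sample: this exhaustive scan is the overhead
    # listed as the first drawback above.
    dists = sorted((euclidean(x, query), label) for x, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical, for illustration only).
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]]
train_y = ["Fail", "Fail", "Fail", "Pass", "Pass"]
prediction = knn_predict(train_X, train_y, [1.1, 1.0], k=3)
```

Every query scans the full training set, so both the running time and the number of stored distances grow with the training size.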
Researchers have offered a variety of techniques for dealing with the issues of the traditional KNN algorithm and improving its performance. The authors of [11] combined a genetic algorithm (GA) with KNN to improve classification performance. The GA was used to select the k neighbors directly and calculate the distances needed to classify the test samples. The proposed method was compared with the traditional KNN, CART and SVM classifiers. The results showed that the proposed method reduced complexity and improved accuracy.
The authors in [12] addressed the large-sample computation problem by using the CURE clustering algorithm with KNN to obtain representative samples of the original dataset for text categorization. The proposed method classified 6500 news articles from 8 categories of the Sina website with improved computation speed compared to traditional KNN, but it did not enhance the accuracy of KNN, which is considered a major limitation of the method.
The author in [13] focused on improving the performance of KNN by combining local mean based KNN with distance weight KNN. The proposed method was applied to four datasets from UCI, Kaggle, and KEEL, in addition to a real dataset from a public senior high school. The results showed that the classification accuracy of the proposed method was higher than that of KNN, but the research ignored the additional execution time that results from combining the two methods.
The computational complexity of KNN for classifying a single new instance is O(n), where n is the number of training samples [14]. Therefore, this study analyzes prototype storage, computation time and accuracy in detail. This paper proposes a solution that introduces an acceleration scheme to overcome the drawbacks of KNN by combining moment descriptors with traditional KNN. Moment descriptors have been used successfully in multimedia research for various applications, such as musical similarity and song year prediction [15], speeding up color image fractal compression [16] and enhancing fractal audio compression [17].
The training set is arranged into subsets; samples that belong to the same subset have similar descriptor values. The proposed method, referred to as FKNN, does not have to test each new sample (i.e., compute its distance) against all training samples. Instead, FKNN computes the descriptor value of each test sample (new student) and matches the sample only with the predetermined subset of training samples that has a similar descriptor value. This significantly reduces the execution time (distance comparison time) and memory requirements. In addition, each training subset is formed on the basis of a weighted moment descriptor that captures the importance of the selected attributes for different samples, which enables each training subset to contain the most similar samples. This, in turn, increases classification accuracy and avoids double majority classification (i.e., misclassification).

RESEARCH METHOD
This study includes two phases as part of the methodology, as follows:

Dataset collection and preparation
The collection, preprocessing and feature selection of the datasets follow our earlier research work [18]. This study used three datasets. The first is the Iraqi student performance prediction dataset, which was collected through a questionnaire in three Iraqi secondary schools, covering both the applied and biology branches of the final stage during the second semester of 2018, and uploaded to [19] with a full description. The second and third datasets (the student alcohol consumption datasets) were obtained from UCI Portugal [20] and comprise two files: student-mat.csv and student-por.csv. Dataset preprocessing includes the following steps:
- Dataset encoding: the dataset contains attributes of various data types, for instance binary, interval, numeric and categorical (nominal, ordinal). KNN requires data to be in numerical form. There are many feature encoding methods for transforming categorical data into numeric data, such as label (integer) encoding, one-hot encoding, binary encoding and hashing. In this research, the datasets are encoded using a label encoder, which is a common method for transforming categorical features into numerical labels. The numerical labels always lie between 0 and (#attribute_value-1).
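The label encoding step can be illustrated with a short sketch. It assumes that categories are numbered in alphabetical order, which the paper does not specify; the helper names and the sample column values are hypothetical.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer in 0 .. (number of distinct values - 1).

    Assumption: codes follow the alphabetical order of the categories.
    """
    return {category: code for code, category in enumerate(sorted(set(values)))}


def encode(values, mapping):
    """Replace every categorical value by its integer label."""
    return [mapping[v] for v in values]


# Hypothetical values of one categorical attribute.
mjob = ["teacher", "health", "other", "teacher", "at_home"]
mapping = fit_label_encoder(mjob)
encoded = encode(mjob, mapping)
```

With four distinct categories, the labels fall in the range 0 to 3, matching the bound 0 to (#attribute_value-1) stated above.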


Enhancement of student performance prediction using modified K-nearest neighbor (Saja Taha Ahmed)

The proposed method
In this study, the proposed FKNN uses the concept of a moment descriptor, which is a set of parameters that describe the distribution of material [21]. The main idea is the similarity between the attribute values of a new student (test sample) and those of previously registered students (training samples): if two samples have the same descriptor, they are expected to have approximately similar performance. From this point of view, the contribution of this research is to enhance the performance of KNN by employing moment descriptors to pre-classify students. This strategy uses the descriptors as a reference indicator to pre-classify the training samples into groups with a specific descriptor value based on social and academic factors. The reason for adopting this classification concept is that the descriptor of each student acts as a signature that differentiates the student's behavior. Therefore, instead of performing a full search over the whole training set during distance computation, distances are computed only for a subset of the samples.
The moments are determined from two first-order moments, D1 and D2, as shown mathematically in (1), where v is the length of the feature subset, S represents attribute j of sample i in a dataset, and W is a weight function chosen for better separation control. The adopted weights in this research are given in (2) and (3), where j=0 … v-1. The descriptor of both training and testing samples is determined using (4). The proposed FKNN needs to determine an index value for each sample (i.e., student): the determined descriptor (Des) is converted into an integer value within the range [0, No_sub], where No_sub is the number of training subsets, and the descriptor index value for each sample (Des_Index) is computed using (5).
In addition, the proposed FKNN needs to construct a data structure (DS) to guarantee faster access to the samples. This data structure contains the identification number and descriptor index of every training sample. The entries of the data structure are arranged in ascending order of descriptor index. Therefore, all samples that have the same descriptor index form a class (i.e., a training sample subset) in contiguous locations. The pre-classification of the training set into subsets is described by the pseudo-code given as Algorithm 1.
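The sorted data structure described above can be sketched as follows. The sketch takes the descriptor index of each training sample as given input, since it does not reproduce equations (1) to (5); the `Record` type, the function name and the index values are hypothetical.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    index: int       # descriptor index (Des_Index), assumed to be computed already
    identifier: int  # position of the student in the training set


def build_sorted_ds(des_index: List[int]) -> List[Record]:
    """Build DS and sort it in ascending order of descriptor index."""
    ds = [Record(idx, i) for i, idx in enumerate(des_index)]
    # sorted() is stable, so samples with equal indices keep their training order.
    return sorted(ds, key=lambda r: r.index)


# Hypothetical descriptor indices for six training samples.
des_index = [2, 0, 1, 2, 0, 4]
ds = build_sorted_ds(des_index)
```

After sorting, samples 1 and 4 (index 0) are adjacent, followed by sample 2 (index 1), samples 0 and 3 (index 2), and sample 5 (index 4), so each subset occupies contiguous positions.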
The next step is to calculate the frequency of each descriptor index in the sorted data structure (DS) and to set up an array of pointers that indicates the start and end of each training subset. In this way, the limits of each subset are indicated by pointers that act as leading signs to reach the intended class immediately (Algorithm 2). After the training samples have been sorted, the matching process takes place by applying KNN. Because the samples of a training subset share a similar descriptor index and are stored at contiguous locations, each test sample is compared only with the specific training subset selected by its descriptor index. This training subset has fewer samples than the full training dataset. In addition, the most similar samples (in terms of their attributes) are most probably found in the training subset that has a similar descriptor index. This leads to a substantial reduction in the running time of KNN and improves classification accuracy. The similarity measurement is based on the Euclidean distance between the test sample and the samples of the training subset. The calculated distances are stored in an array sorted in ascending order. If a distance is zero, the label of the corresponding sample is taken directly as the target class; otherwise, the k nearest training samples are picked and the target class of the new sample is determined by majority voting. Algorithm 3 gives the steps involved in applying the proposed FKNN to test samples.
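The matching step can be sketched as below. This is an illustrative reading of the description above, not the paper's Algorithm 3: the subsets are stored in a plain dictionary keyed by descriptor index for brevity, the descriptor index of the query is assumed to be known, and the behavior for an empty subset is not covered because the text does not specify it.

```python
import math
from collections import Counter


def classify_in_subset(subset, query, k=3):
    """Classify `query` using only the training subset with the same descriptor index.

    `subset` is a non-empty list of (features, label) pairs (assumption).
    """
    dists = sorted((math.dist(features, query), label) for features, label in subset)
    # A zero distance means an identical training sample: take its label directly.
    if dists[0][0] == 0.0:
        return dists[0][1]
    # Otherwise vote among the k nearest samples of the subset.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical subsets keyed by descriptor index.
subsets = {
    0: [([1.0, 2.0], "Fail"), ([1.5, 2.5], "Fail"), ([3.0, 1.0], "Pass")],
    1: [([8.0, 9.0], "Pass"), ([7.5, 9.5], "Pass")],
}
exact = classify_in_subset(subsets[0], [1.0, 2.0])  # identical sample exists
voted = classify_in_subset(subsets[0], [1.2, 2.2])  # decided by majority vote
```

Only the samples of one subset are visited per query, which is where the reduction in distance computations comes from.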

RESULTS AND ANALYSIS
The experiments and the application system were implemented using Visual Studio .NET C# 2015. The proposed method is evaluated using holdout validation, which splits each dataset into two sets: 70% training and 30% test. Accuracy (ACC) measures the proportion of instances correctly classified by the machine learning algorithm out of all tested instances [22]. As mentioned earlier, the main aim of this work is the prediction of student performance. For this purpose, a target class label is formulated for each dataset, which can be either "Pass" or "Fail". The UCI datasets contain three grades, G1, G2, and G3, with values from 0 to 20. If the student has a grade equal to or greater than 10, the student is classified under the "Pass" label; otherwise, the student is classified under the "Fail" label. In the Iraqi dataset, grade values are within the range 0-100. If the student has a grade equal to or higher than 50, the student is labeled "Pass"; otherwise, the student is labeled "Fail".
The students' performance in the UCI datasets is predicted with the final semester grade (G3) as the target class. For the Iraqi dataset, the target class is predicted using the second-semester average (Avg2). In this work, to compare results among datasets, certain parameters are fixed: the number of descriptor classes (i.e., the number of bins) is set to five and K is set to three nearest neighbors.
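The evaluation setup can be sketched as follows. The pass marks (10 and 50) and the 70/30 split come from the text; shuffling before the split, the fixed seed and the sample grades are assumptions, since the paper does not say how the holdout partition was drawn.

```python
import random


def to_label(grade, pass_mark):
    """Return "Pass" when the grade reaches the pass mark, otherwise "Fail"."""
    return "Pass" if grade >= pass_mark else "Fail"


def holdout_split(samples, train_fraction=0.7, seed=0):
    """Split samples into training and test parts (shuffling is an assumption)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of test instances whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical G3 grades on the 0-20 scale of the UCI datasets.
g3 = [0, 9, 10, 15, 20, 8, 12, 11, 7, 14]
labels = [to_label(g, pass_mark=10) for g in g3]
train, test = holdout_split(labels)
```

For the Iraqi dataset the same `to_label` helper would be called with `pass_mark=50`.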
With respect to the issues of traditional KNN, the proposed FKNN runs faster than traditional KNN for all test samples, since FKNN requires fewer comparisons: the distance of each new sample is calculated only to the subset of training data that has the same descriptor index as the new sample. This can also reduce memory requirements significantly. In contrast, traditional KNN is slow because it depends on an exhaustive search of each new sample against all training data, and it requires more memory to store the distances to all training samples. Figure 1 indicates that the running time of the proposed method is improved compared to that of traditional KNN.
A common way to measure the effect of processing is to compare the outcome of interest before processing with that after processing. The percentage change measures an item's change in value relative to its original value. Suppose X is the baseline value and Y is the post-processing value. The percentage change (PC) can be calculated using (6) [23]:
PC = ((X - Y) / X) * 100 (6)
Table 1 summarizes the percentage change in running time based on the results shown in Figure 1. It can be seen that the proposed FKNN reduces the running time of the traditional KNN by 90.25%, 87.53%, and 75.4% for the Por, Math, and Iraqi datasets, respectively.
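As a worked instance of (6), the short sketch below computes the percentage change for a pair of running times. The timing values are hypothetical and are not taken from Figure 1 or Table 1.

```python
def percentage_change(baseline, after):
    """Percentage change of `after` relative to `baseline`, as in (6)."""
    return (baseline - after) / baseline * 100.0


# Hypothetical running times in milliseconds: baseline KNN vs. an accelerated run.
pc = percentage_change(200.0, 50.0)
```

Here the running time drops from 200 to 50, so (200 - 50) / 200 * 100 gives a reduction of 75%.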
The proposed FKNN achieves better classification accuracy than traditional KNN for all datasets. This is because the proposed FKNN relies on the weighted moment descriptors of the samples to construct training subsets that carry higher class-discriminatory information. This can lead to getting the best matching distance among new and training samples. Via this selective scheme, the misclassification problem of traditional KNN can be significantly overcome. Referring to Table 2, the proposed method obtains the highest accuracy of 100% for the Iraqi student performance dataset. In addition, the proposed method enhances classification accuracy for final semester grade (G3) prediction, with a percentage change in accuracy of 36.3%, 23.7%, and 20% for the Por, Math and Iraqi datasets, respectively.
Table 3 shows a comparison of the proposed FKNN with the research work of [24], which uses the Por dataset from UCI to predict student performance based on eight features: G2, G1, failures, higher, Medu, school, studytime, and Fedu. It also compares the proposed FKNN with the research work of [25], which uses the Math dataset from UCI to predict student performance based on 19 features including the class attribute: sex, famsize, address, pstatus, medu, fedu, mjob, fjob, traveltime, studytime, schoolsup, higher, internet, romantic, freetime, Dalc, Walc, health, success. It is clear that the proposed FKNN surpasses all methods used in these studies for the two UCI datasets (Por and Math).

CONCLUSIONS
This study presented the FKNN algorithm, which combines a sample indexing mechanism with KNN to deal with the major problems of traditional KNN. The computational overhead, memory requirement, multidimensionality (the number of samples) and misclassification problems were substantially reduced due to the pre-classification of training data based on the descriptors used by the selective search strategy. Classification accuracy was enhanced by the proposed FKNN method because the training samples are grouped according to their similarity. The results showed a significant enhancement in accuracy, with the highest increase reaching 36.3%, and an improvement in the computation time of KNN, with the highest time reduction reaching 90.25% for the UCI student-por.csv dataset. The experiments confirmed that the proposed FKNN outperformed traditional KNN for all educational datasets. Therefore, the proposed algorithm is useful for real-time systems such as e-learning environments and could be used for larger datasets.


ISSN: 1693-6930 TELKOMNIKA Telecommun Comput El Control, Vol. 18, No. 4, August 2020: 1777-1783

The limits of each subset are indicated by pointers that act as leading signs to reach the intended class immediately. The steps for building the array of pointers are illustrated in the pseudo-code presented as Algorithm 2.

Algorithm 1. Training Set Pre-Classification
Input: Training samples as a matrix [# students, # attributes]
Output: Sorted data structure (DS) containing the samples classified according to their descriptor index field value.
Define DS as a data structure: an array of records with two elements, the student's descriptor index and the student's position (or identifier) in the training set.
Set the number of descriptor classes to No_sub.
For each index j of feature vector length // Calculate weights in the range of feature vector length.
Begin
    Compute w1[j] using equation 2
    Compute w2[j] using equation 3
End
For each student i in the training set
Begin
    For each attribute j in the feature vector
        Compute D1 and D2 based on equation 1
    Compute Descriptor of student i (Des) using equation 4
    Compute Sample Descriptor Index (Des_Index) using equation 5
    Set DS[i].Index=Des_Index
    Set DS[i].Identifier=i
End
Sort elements of data structure (DS) according to descriptor index field.
Return DS

Algorithm 2. Pointers
Input: Sorted data structure (DS) of samples' descriptors and identifiers
Output: An array of pointers Pointer[#No_sub]
Define Freq[#No_sub] as an array of integers, initialized to zero, that holds the occurrences of each descriptor index (Des_Index) in DS.
For each student i in the training dataset
Begin
    Set X=DS[i].Index
    Increment Freq[X] by one
End
Set Pointer[0]=0
For each value n from 1 to No_sub-1
    Set Pointer[n]=Pointer[n-1]+Freq[n-1]
Return Pointer
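The pointer construction of Algorithm 2 can be sketched in Python as a prefix sum over the frequency array. The sketch assumes that every descriptor index lies in 0 .. No_sub-1 and that the indices are already sorted in ascending order; the function name and the sample data are hypothetical.

```python
def build_pointers(sorted_indices, no_sub):
    """Return (pointer, freq) for descriptor indices sorted in ascending order.

    pointer[n] is the position in DS where subset n starts and freq[n] is its
    size. Assumption: every index lies in 0 .. no_sub - 1.
    """
    freq = [0] * no_sub
    for idx in sorted_indices:
        freq[idx] += 1
    pointer = [0] * no_sub
    # Each subset starts where the previous one ends (running prefix sum).
    for n in range(1, no_sub):
        pointer[n] = pointer[n - 1] + freq[n - 1]
    return pointer, freq


# Hypothetical descriptor indices of a sorted DS with five subsets.
sorted_indices = [0, 0, 1, 2, 2, 4]
pointer, freq = build_pointers(sorted_indices, no_sub=5)
# Subset n occupies positions pointer[n] .. pointer[n] + freq[n] - 1.
subset_two = sorted_indices[pointer[2]:pointer[2] + freq[2]]
```

An empty subset (index 3 in this example) simply has a frequency of zero, so its start pointer coincides with that of the next subset.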


matching distance among new and training samples. Via this selective scheme, the misclassification problem of traditional KNN can be significantly overcome. Referring to Table

Figure 1. The comparison of the proposed FKNN and traditional KNN running time for final grade prediction

Table 1. Percentage change in running time of proposed FKNN

Table 2. The accuracy improvement of the proposed method vs. traditional KNN for final grade prediction

Table 3. Accuracy comparison of the proposed FKNN and other methods for UCI datasets