Comparison analysis of Bangla news articles classification using support vector machine and logistic regression

In the information age, the number of Bangla news articles on the internet is growing rapidly. For organization, every news site has a particular structure and categorization scheme. News article classification is a method of assigning a document to one of several predefined categories. This research discusses the classification of Bangla news articles from online platforms and makes a constructive comparison of several classification algorithms. For Bangla news article classification, term frequency-inverse document frequency (TF-IDF) weighting and a count vectorizer have been used for feature extraction.


INTRODUCTION
Text data is the most comprehensive source of information, but due to its unstructured nature it is difficult and time-consuming to draw insights from it. Advances in machine learning (ML) and natural language processing (NLP) are making it easier to analyze text data through text classification. Text classification techniques are used to organize, structure, and categorize text data for sentiment analysis, topic labelling, spam detection, intent detection, and so on [1]. This study applies text classification techniques to Bangla news articles, as news articles are the most common form of text online. Many studies have been conducted on text classification, and different classification techniques such as rule-based methods, decision trees, k-nearest neighbors (KNN), naive Bayes, logistic regression, support vector machines (SVM), and neural networks (NN) have been developed [2]. The literature shows that researchers have focused on classifying texts in languages such as English and Arabic [3], [4]; the amount of work on the Bengali language, however, is much smaller. The authors of [5] used machine learning algorithms to categorize Bangla newspaper articles into five distinct groups, employing logistic regression, SVM, and a multi-layer neural network. The multi-layer dense neural network approach achieved an accuracy of 95.50%, higher than the other models. In [6], the behavior of least-squares SVM, twin SVM, and least-squares twin SVM (LS-TWSVM) classifiers on news data is shown for handling multi-category data; the performance evaluation showed that LS-TWSVM is the best of the three with 92.96% accuracy. Maisha et al.
[7] performed sentiment analysis on Bangla news using a "pipeline" class along with six state-of-the-art supervised ML algorithms: decision tree (DT), multinomial naive Bayes (MNB), k-nearest neighbor (KNN), logistic regression (LR), random forest (RF), and Lagrangian support vector machine (LSVM). The random forest algorithm outperformed all the others, securing 98% accuracy under the percentage-split method. This methodology categorizes the content as either positive or negative. Rabbimov and Kobilov [8] used SVM, a decision tree classifier, random forest, LR, and MNB among six machine-learning algorithms to conduct multi-class text categorization of internet Uzbek news articles.
For SVM, they used the radial basis function kernel (RBF SVM), which gave the best accuracy (86.88%) among the classifiers. Several supervised machine learning as well as deep learning algorithms for categorizing Bengali news documents are discussed in [9]. A novel method for classifying Bangla textual content has been developed in [10]; the deep learning recurrent neural network (RNN) with an attention layer and the RNN with BiLSTM achieved accuracy rates of 97.72% and 86.56%, respectively. The authors of [11] proposed a model structure named the DCLSTM-MLP model for the categorization of news text documents, a customized algorithm that combines deep learning components such as long short-term memory (LSTM), a convolutional neural network (CNN), and a multi-layer perceptron (MLP). By applying this model structure, they tried to solve the problems of textual length, the complexity of extracting features from news content, and categorizing news text effectively, and achieved 94.82% accuracy. However, the main issue of their paper is that the number of samples is small and the distribution of the different types of news is unequal, which limits the model's effectiveness.
Kowsher et al. [12] used several word embedding methods to encode the text of Bangla newspaper data, as well as machine learning algorithms to categorize the encoded text. A deep recurrent neural network was used to provide a new method of evaluating Bangla news articles in [13]; the deep recurrent neural network featuring Bi-LSTM obtained 98.33% in Bengali text categorization, which is higher than previous well-known classification techniques. For classifying text or documents, many supervised techniques are used, such as KNN, naive Bayes (NB), DT, n-grams, and neural networks (NNet), but according to previous literature reviews, SVM [14] is the most frequently used classifier. The authors of [15] proposed a news article classification framework based on deep hybrid learning and compared it to traditional text classification to demonstrate its superiority for network news text classification; it outperforms the standard methods for classifying news texts in terms of overall performance. The authors of [16] showed a comparative analysis among DT, KNN, NB, and Rocchio's algorithm, in which SVM outperformed all the other classifiers. Aside from English documents, there has also been extensive research on other languages. The work in [17] focused on the classification of a self-created Indonesian news corpus, which includes four separate categories and 472 Indonesian newspaper articles from a variety of sources. They produced five models with ten epochs each, using 377 articles for training and 95 for testing, with a CNN having the highest accuracy, about 90.74%, in categorizing Indonesian news data. A study [18] conducted on an Indonesian news corpus found that the combination of TF-IDF and MNB beats other classification models such as multivariate Bernoulli naive Bayes (BNB) and SVM, with an accuracy of 85%. In terms of Arabic text classification, Arabic medical text documents are classified in [19], where the authors used a rule-based classifier for
classification of Arabic text, with an accuracy of 90.6%. They used three classification algorithms: majority voting, ordered decision list, and KNN, which are used to validate the model. Among the few works covering Bangla, Chy et al. [20] worked on Bangla news classification using a naive Bayes classifier. The problem is that if the testing data set contains a categorical value that was not part of the training data set, the naive Bayes model assigns it zero probability and cannot make any prediction. Nahar et al. [21] demonstrated a comparative analysis utilizing a naive Bayes classifier, SVM, and neural networks to filter Bangla sports and political news on online networks from text data. Mandal and Sen [22] showed a comparative performance evaluation of four supervised learning techniques: decision tree, naive Bayes, k-nearest neighbor, and SVM for Bengali text categorization. They used TF-IDF to compose a feature vector from 1,000 web documents containing 22,218 words. Close to our research work, using TF-IDF as a feature selection approach and SVM as a classifier, they attained a classification accuracy of 92.57% for 12 categories. In this study, LR and SVM are used to classify Bangla news articles, as the study in [23] shows that logistic regression outperforms other techniques such as random forest and the k-nearest neighbours algorithm. The model classifies Bangla news articles and labels them with a topic based on their contents. It extracts features from the corpus using the TF-IDF feature vector, since this enables the extraction of relevant features as well as the removal of common words.

METHOD
The aim of news classification is to allocate categories according to the content of a news article. As in English text classification, preprocessing of the Bangla news articles and feature set extraction are needed prior to training and model construction for document classification. The overall Bangla news article classification process used in this experiment is illustrated in Figure 1.

Data collection
Despite the limited amount of data, Hossain et al. [24] discovered that a text graph convolutional network (GCN) performed better than GRU-LSTM, BiLSTM, Char-CNN, LSTM, and bidirectional encoder representations from transformers (BERT) in classifying online Bangla news. To date, there is a scarcity of standard datasets in the Bangla language, so a dataset has been prepared by scraping news articles from various electronic news sites such as 'https://www.prothomalo.com/'. At the time of scraping, the articles were labeled with their categories. To assess the classification results, we have collected around 12.5 K labeled news articles spanning 20 categories. Among them, the top 12 categories are considered for this research work, as listed in Table 1.
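Although the paper does not include the scraping code, the labeling-at-scrape-time step can be sketched with Python's standard library. The `ParagraphExtractor` class and the `parse_article` helper below are illustrative assumptions, not the authors' actual scraper; real news sites need site-specific selectors, and fetching the HTML itself (e.g. with `urllib.request`) is omitted:

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> tags of one article page."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and data.strip():
            self.chunks.append(data.strip())

def parse_article(html, category):
    """Extract the body text of one page and attach its category label."""
    parser = ParagraphExtractor()
    parser.feed(html)
    return {"text": " ".join(parser.chunks), "label": category}
```

Collecting one such dictionary per article page yields the labeled corpus described above.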

Data preprocessing
Preprocessing is performed to minimize noise in the text, which helps to increase the classifier's efficacy. The preprocessing steps clean the text data and prepare it for the machine learning model. Since text documents contain sentences, i.e., sequences of characters or words, they must be broken into tokens to obtain features. Tokenization is the process by which the text is divided into sentences and then words. After tokenization, symbols such as !, ?, <, >, $, and %, as well as numbers, which are not very important for classification, are excluded. Bangla stop-words are also eliminated at the same stage. The list of stop-words used in this research is given in [25].
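The steps above can be sketched as follows; the regular expression and the four-word stop-word set are simplified stand-ins for the full symbol list and the Bangla stop-word list of [25]:

```python
import re

# Illustrative subset; the experiment uses the full Bangla stop-word list from [25].
STOP_WORDS = {"এবং", "ও", "যে", "এই"}

def preprocess(text):
    """Tokenize, then drop punctuation, digits, and stop-words."""
    # Keep runs of characters that are not whitespace, common symbols,
    # or Arabic/Bengali digits.
    tokens = re.findall(r"[^\s!?,.;:$%<>0-9০-৯\"'()\[\]{}]+", text)
    return [t for t in tokens if t not in STOP_WORDS]
```

The output token lists are what the feature extraction stage consumes.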

Feature extraction
Vectorization is the general method of converting a collection of text documents into numerical feature vectors. A count vectorizer transforms a collection of text documents into a matrix of token counts, containing the number of occurrences of each token in each document; this implementation produces a sparse representation of the counts. The purpose of using TF-IDF instead of the raw frequency of a token's occurrence in a document is to scale down the influence of tokens that occur very frequently in a given corpus and are thus empirically less informative than features that occur in a small fraction of the training corpus. It is an information retrieval method that weights the term frequency (TF) by the inverse document frequency (IDF). Every word has its own TF and IDF scores, and the TF-IDF weight of a term is the product of its TF and IDF scores. The rarer the word, the higher its TF-IDF score, and vice versa.
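In the experiments these weights come from library vectorizers (e.g. scikit-learn's `CountVectorizer` and `TfidfVectorizer`, which add smoothing and normalization); as a minimal from-scratch sketch of the raw tf·idf product described above:

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each token of each tokenized document by tf * log(N / df).

    tf is the raw count of the term in the document; df is the number
    of documents that contain the term at least once; N = len(docs).
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weights
```

A term appearing in every document gets weight 0, which is exactly the "scale down very frequent tokens" behavior described above.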

Classifiers
In this study, two common classifiers have been used: the SVM classifier with a linear kernel and logistic regression. SVM has been used effectively in text classification, while logistic regression is a statistical method of data analysis in which one or more variables are used to determine an outcome.

SVM
In essence, SVM is a supervised machine learning approach known as a binary classifier. It uses a hyper-plane to classify data into two types [25]. Support vectors are the data points closest to the hyper-plane that influence its position and orientation; they are used to maximize the margin of the classifier. Removing the support vectors would change the position of the hyper-plane. Mathematically, SVM evaluates the sign function [26]

f(x) = sign(w . x + b) (1)

where w is a weight vector in R^n. By dividing the space R^n into two half-spaces with the maximum margin, SVM finds the hyper-plane in (2):

w . x + b = 0 (2)
Generally, SVM is a binary classifier, but strategies like one-vs-one and one-vs-rest can be used to extend it into a multi-class classifier. In SVM, linear and radial basis function kernels are used for decision making. Since most text classification problems are linearly separable, the linear kernel is preferred for text classification.
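The one-vs-rest extension can be illustrated with a small sketch: given one trained linear decision function per class (weights and bias, e.g. as produced by scikit-learn's `LinearSVC`), the predicted category is the class with the highest score. The `models` layout and class names here are assumptions for illustration, and training is omitted:

```python
def ovr_predict(x, models):
    """One-vs-rest prediction over linear SVM decision functions.

    models maps each class name to a (w, b) pair; the winning class is
    the one whose decision value w . x + b is largest.
    """
    def decision(w, b):
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=lambda c: decision(*models[c]))
```

For example, with two hypothetical classes `models = {"sports": ([1.0, 0.0], 0.0), "politics": ([0.0, 1.0], -0.5)}`, the feature vector `[0.2, 0.9]` is assigned to the class with the larger decision value.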

Logistic regression
Logistic regression uses a logistic function to estimate probabilities for the relationship between one or more independent variables and the categorical dependent variable. The logistic function, also known as the logistic curve, is a common "S"-shaped curve described by (3):

f(v) = L / (1 + e^(-k(v - v0))) (3)

The sigmoid curve is another name for the logistic curve.
Where L is the curve's maximum value, e is the base of the natural logarithm (Euler's number), k is the logistic growth rate or steepness of the curve, and v0 is the v-value of the sigmoid's midpoint.
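A direct transcription of (3), with the parameter names defined above, can serve as a sanity check: at v = v0 the curve passes through L/2, and it saturates toward L for large v.

```python
import math

def logistic(v, L=1.0, k=1.0, v0=0.0):
    """Generalized logistic curve f(v) = L / (1 + e^(-k * (v - v0)))."""
    return L / (1.0 + math.exp(-k * (v - v0)))
```

With L = 1, this is the standard sigmoid used to map a linear score to a class probability.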

RESULTS AND DISCUSSION
The performance of the SVM and logistic regression classifiers on our dataset is discussed in this section. A brief analysis of the dataset is provided first. Then, the performance of SVM and LR on the Bangla text data is analyzed and compared with some similar works in Table 4.

Dataset analysis
Among the 12.5 thousand articles in 20 categories, we used 11,770 articles from the top 12 categories in this research work. A bar chart of the number of news articles in each category is given in Figure 2. Figure 3 shows word clouds generated from all the articles based on TF-IDF and the count vectorizer; these were generated before preprocessing the data.

Performance measures
In text categorization, a variety of evaluation metrics are employed. Precision, recall, and F-measure are among the most widely used performance measures and are the ones considered in our experiments. Precision, recall, and F-measure are measured for each category for SVM and LR, and the F-measure is averaged to assess performance across categories. Micro average and macro average are the two types of average value; the macro average is used here. Table 2 shows how the two classifiers performed on our corpus in terms of precision, recall, and F-measure. Table 3 shows a comparison of the accuracy, average precision, average recall, and average F1-score of the SVM and LR classifiers. From Table 3, it is observed that SVM with a linear kernel achieves an accuracy of 0.84 while LR achieves 0.81, and SVM outperforms LR in average precision, recall, and F1-score. However, Table 2 shows that for some categories LR outperforms SVM in individual precision, recall, and F1-score. Figure 4 shows the confusion matrices for the SVM classifier with the linear kernel and for logistic regression, respectively. Table 4 compares recent works using the SVM classifier with the results of this research. The comparison shows that our work, utilizing the combination of TF-IDF and count vectorizer features with the SVM classifier, achieves better accuracy than other recent works that also used SVM (excerpt from Table 4: Trigram TF-IDF, 0.80; Rahman et al. [28], TF-IDF, 0.82; Yeasmin et al. [5], TF-IDF, 0.83).
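The macro-averaged scores reported in Tables 2 and 3 follow the standard definitions (in practice scikit-learn's `classification_report` computes them); a minimal reference implementation is:

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over all labels.

    Each label's precision/recall/F1 is computed from its one-vs-rest
    true positives, false positives, and false negatives, then the
    per-label scores are averaged with equal weight per class.
    """
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Because every class contributes equally regardless of its article count, the macro average penalizes poor performance on small categories, which is why it is preferred here over the micro average.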

CONCLUSION
In the field of information systems, text categorization is a widely discussed topic. In this paper, we have applied SVM and LR classifiers to our own developed corpus and measured their performance. The proposed methodology supports the assumption that the Bangla language can indeed be classified correctly using SVM and LR with limited resources. The outcome of the suggested approach is promising. Still, more precision could have been attained, and better outcomes could have been obtained if we had been able to deal with all of the news categories and utilize multiple categorization methods to form a constructive conclusion. In the future, we would like to add more categories and compare the results of different classification algorithms.

Figure 1. System architecture of the proposed work

Figure 2. Numbers of the Bangla news articles for each category
Figure 3. Word clouds generated from all the articles based on TF-IDF and the count vectorizer

They used 3,191 text samples per category, whereas we have used 11,770 articles from the top 12 of 20 different categories (shown in Table 1). To the best of our knowledge, they used only SVM, whereas we use both SVM and logistic regression (LR); LR is primarily used to classify observations into a discrete number of categories, with rapid classification of unknown records. In this research paper, we have tried to show a comparative performance analysis using SVM and LR, which can guide future researchers who wish to work with the Bangla language.

Table 1. Category-wise count of the Bangla news articles

Table 2. SVM and LR classifier results for the 12 categories

Table 3. Accuracy, average precision, average recall, and average F1-score results of the SVM and LR classifiers

Table 4. Comparison between this experiment and recent similar research using SVM