Oversampling vs. undersampling in TF-IDF variations for imbalanced Indonesian short texts classification 
	I Nyoman Prayana Trisna, Ni Wayan Emmy Rosiana Dewi, Muhammad Alam Pasirulloh 
	
			
		Abstract 
		
		Although it is often considered a traditional method compared with more modern algorithms, term frequency-inverse document frequency (TF-IDF) still produces good results in a range of text mining tasks. This study assesses the effectiveness of several TF-IDF modifications for short text classification and also addresses the problem of imbalanced datasets. To handle the class imbalance, we combine standard, log-scaled, and boolean TF-IDF with undersampling and oversampling methods in short text classification. Precision, recall, and F-measure are used to evaluate each experiment. The best result is obtained with boolean TF-IDF combined with oversampling. Oversampling methods outperform undersampling methods in every experiment, although in some cases the results with undersampling are still worth considering. Additionally, our study shows that modified TF-IDF, such as the boolean or log-scaled versions, benefits classification performance more than relying solely on standard TF-IDF, particularly when handling imbalanced datasets.
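
		To make the three weighting schemes and the two resampling directions concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the abstract does not give the exact formulas or name the resampling algorithms, so the sketch assumes the common textbook definitions (raw count, 1 + log(count), and a 0/1 indicator for the term-frequency part, with idf = log(N / df)) and plain random oversampling and undersampling. The function names `tf_weight`, `tfidf_vectors`, and `resample`, and the toy sentences, are hypothetical.

```python
import math
import random
from collections import Counter


def tf_weight(count, variant):
    """Term-frequency part for one term in one document (assumed definitions)."""
    if count == 0:
        return 0.0
    if variant == "standard":
        return float(count)            # raw term count
    if variant == "log":
        return 1.0 + math.log(count)   # log-scaled count
    if variant == "boolean":
        return 1.0                     # presence/absence only
    raise ValueError(f"unknown variant: {variant}")


def tfidf_vectors(docs, variant="standard"):
    """Return (vocabulary, vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([tf_weight(counts[tok], variant) * idf[tok] for tok in vocab])
    return vocab, vectors


def resample(samples, labels, strategy, seed=0):
    """Balance classes by random oversampling ("over") or undersampling ("under")."""
    if strategy not in ("over", "under"):
        raise ValueError(f"unknown strategy: {strategy}")
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    sizes = [len(items) for items in by_class.values()]
    target = max(sizes) if strategy == "over" else min(sizes)
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target:
            # Keep `target` items, drawn without replacement.
            chosen = rng.sample(items, target)
        else:
            # Duplicate randomly chosen items until the class reaches `target`.
            chosen = items + rng.choices(items, k=target - len(items))
        out_samples.extend(chosen)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Toy, made-up short texts with an imbalanced label distribution.
docs = ["bagus bagus sekali", "buruk sekali", "bagus", "bagus sekali"]
labels = ["pos", "neg", "pos", "pos"]

vocab, vectors = tfidf_vectors(docs, variant="boolean")
balanced_vectors, balanced_labels = resample(vectors, labels, strategy="over")
```

		In a real experiment, resampling would normally be applied to the training split only, so that duplicated or discarded samples do not leak into the evaluation data.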
		
		 
	
			
		Keywords 
		
		Bahasa Indonesia; imbalanced dataset; oversampling method; short-text classification; term frequency-inverse document frequency; undersampling method
		
		 
	
				
			
	
	
							
		
		DOI: 
http://doi.org/10.12928/telkomnika.v23i2.26510 	
	 
				
		This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
	
TELKOMNIKA Telecommunication, Computing, Electronics and Control, ISSN: 1693-6930, e-ISSN: 2302-9293, Universitas Ahmad Dahlan, 4th Campus, +62 274 564604