Streamed Sampling on Dynamic data as Support for Classification Model

Astried Silvanie, Taufik Djatna, Heru Sukoco

Abstract


Data mining on dynamically changing data faces several problems, such as unknown data size and changing class distribution. Random sampling is commonly applied to extract a general synopsis from a very large database. In this research, Vitter’s reservoir algorithm is used to retrieve k records from the database and place them in the sample. The sample is used as input for a classification task in data mining. The sample is a backing sample, stored as a table containing the id, priority, and timestamp values. Priority indicates the probability of how long a record is retained in the sample. Kullback-Leibler divergence is applied to measure the similarity between the database and sample distributions. The results show that samples can be drawn randomly and continuously while transactions occur. Keeping the Kullback-Leibler divergence within the interval from 0 to 0.0001 is a very good way to maintain a similar class distribution between the database and the sample. The sample stays up to date with new transactions while preserving a similar class distribution. A classifier built from a balanced class distribution showed better performance than one built from an imbalanced distribution.
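The two building blocks named in the abstract can be illustrated with a minimal sketch. It assumes the simplest reservoir variant (Algorithm R from Vitter's paper), which may not be the variant the authors used, and it does not reproduce the backing-sample table with priority and timestamp. The function names, the synthetic two-class stream, and the direction of the divergence (database relative to sample) are assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k records from a stream of unknown length."""
    sample = []
    for n, record in enumerate(stream, start=1):
        if n <= k:
            sample.append(record)      # the first k records fill the reservoir
        else:
            j = rng.randrange(n)       # uniform integer in [0, n)
            if j < k:
                sample[j] = record     # record n enters with probability k / n
    return sample


def class_distribution(labels, classes):
    """Relative frequency of each class, in the order given by `classes`."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in classes]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_i p_i * ln(p_i / q_i); eps avoids division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical data: 10,000 (id, class) records with roughly a 70/30 class split.
rng = random.Random(0)
database = [(i, "a" if rng.random() < 0.7 else "b") for i in range(10_000)]
sample = reservoir_sample(iter(database), k=1_000, rng=rng)

classes = ["a", "b"]
p = class_distribution([label for _, label in database], classes)
q = class_distribution([label for _, label in sample], classes)
divergence = kl_divergence(p, q)
print(f"KL divergence between database and sample: {divergence:.6f}")
```

Under the abstract's criterion, a sample would be considered representative when this value falls between 0 and 0.0001. How the authors react when the value leaves that interval is not stated in the abstract, so the sketch only computes the measure.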



References


Braverman V, Ostrovsky R, Zaniolo C. Optimal Sampling from Sliding Windows. Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. 2009: 147-156.

Park B H, Ostrouchov G, Samatova N F. Sampling streaming data with replacement. Computational Statistics & Data Analysis. 2007; 52(2): 750-762.

Ferrandiz S, Boulle M. Supervised selection of dynamic features, with an application to telecommunication data preparation. Proceedings of the 6th Industrial Conference on Data Mining conference on Advances in Data Mining: applications in Medicine, Web Mining, Marketing, Image and Signal Mining. 2006: 239-249.

Gemulla R, Lehner W. Sampling time-based sliding windows in bounded space. In Proc. of the 2008 ACM SIGMOD Intl. Conf. on Management of Data. 2008: 379-392.

Gibbons P B, Matias Y, Poosala V. Fast incremental maintenance of approximate histograms. In Proc. VLDB. 1997: 466-475.

Hoi S C H, Wang J, Zhao P, Jin R. Online Feature Selection for Mining Big Data. BigMine’12, Proceedings of the 1st international workshop on big data, streams, heterogeneous sources. 2012: 93-100.

Nasereddin H H O. Stream Data Mining. International Journal of Web Applications (IJWA). 2009; 1(43).

Vitter J S. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software (TOMS). 1985; 11(1): 37-57.

Wang Y, Liu S, Feng J, Zhou L. Mining naturally smooth evolution of clusters from dynamic data. SIAM International Conference on Data Mining - SDM. 2007.

Chawla N. Data Mining for Imbalanced Datasets: An Overview. Data Mining and Knowledge Discovery Handbook. 2010: 875-886.

He H, Garcia E A. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering. 2009; 21(9).

Van Hulse J, Khoshgoftaar T M, Napolitano A. Experimental Perspectives on Learning from Imbalanced Data. ICML’07 Proceedings of the 24th International Conference on Machine Learning. 2007: 935-942.

Tang L, Liu H. Bias analysis in text classification for highly skewed data. ICDM ’05: Proceedings of the Fifth IEEE International Conference on Data Mining. 2005: 781-784.

Prati R C, Batista G E A P A, Monard M C. A Study with Class Imbalance and Random Sampling for a Decision Tree Learning System. Artificial Intelligence in Theory and Practice II. 2008; 276: 131-140.

Guo X, Yin Y, Dong C, Yang G, Zhou G. On the Class Imbalance Problem. ICNC’08 Fourth International Conference. 2008; 4: 192-201.

Efren A, Tuna E. On some properties of goodness of fit measures based on statistical entropy. IJRRAS. 2013: 192-205.




DOI: http://doi.org/10.12928/telkomnika.v11i4.1210



Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

TELKOMNIKA Telecommunication, Computing, Electronics and Control
ISSN: 1693-6930, e-ISSN: 2302-9293
