CT-FC: more Comprehensive Traversal Focused Crawler

Siti Maimunah, Husni S Sastramihardja, Dwi H Widyantoro, Kuspriyanto Kuspriyanto

Abstract


People today depend increasingly on information from the WWW, including professionals who must analyze data in their domain to maintain and improve their business. Such data analysis requires information that is both comprehensive and relevant to the domain. A focused crawler, a topic-based Web indexing agent, is used to meet this information need. In raising precision, however, focused crawlers suffer from low recall. Studies of the WWW hyperlink structure indicate that many Web documents are not strongly connected but are related only through co-citation and co-reference. A conventional focused crawler that uses a forward crawling strategy cannot reach such documents. This study proposes a more comprehensive traversal framework. As a proof of concept, CT-FC (a focused crawler built on the new traversal framework) was run on DMOZ data, which is representative of WWW characteristics. The results show that this strategy increases recall significantly.
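The recall problem described in the abstract can be illustrated with a small sketch. It is not the CT-FC algorithm, whose details the abstract does not give. It only compares a forward-only traversal with one that also follows links backward, on a hypothetical toy graph in which two relevant pages are connected only through a common citing page. The graph, the page names, the `RELEVANT` set, and the assumption that backlinks are available to the crawler are all invented for illustration; topical relevance scoring is left out entirely.

```python
from collections import deque

# Hypothetical toy link graph: page -> pages it links to.
# "hub" cites both "a" and "b", so "a" and "b" are co-cited,
# but no forward path leads from "seed" to "b" or "c".
LINKS = {
    "seed": ["a"],
    "a": [],
    "hub": ["a", "b"],
    "b": ["c"],
    "c": [],
}

# Assumed ground truth: the pages that are on topic.
RELEVANT = {"seed", "a", "b", "c"}


def backlinks(links):
    """Invert the link graph: page -> pages that link to it."""
    inverse = {page: [] for page in links}
    for source, targets in links.items():
        for target in targets:
            inverse.setdefault(target, []).append(source)
    return inverse


def crawl(seed, links, use_backlinks=False):
    """Breadth-first traversal from seed; optionally follow links backward too."""
    inverse = backlinks(links) if use_backlinks else {}
    visited = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        neighbours = list(links.get(page, []))
        if use_backlinks:
            neighbours += inverse.get(page, [])
        for neighbour in neighbours:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return visited


def recall(visited, relevant):
    """Fraction of the relevant pages that the crawl reached."""
    return len(visited & relevant) / len(relevant)


forward_only = crawl("seed", LINKS)
with_backlinks = crawl("seed", LINKS, use_backlinks=True)
print(recall(forward_only, RELEVANT))    # 0.5
print(recall(with_backlinks, RELEVANT))  # 1.0
```

In this toy graph the forward-only crawl stops at page `a`, while the traversal that also walks backward reaches `hub` and, through it, the co-cited page `b` and its successor `c`. A real crawler cannot read backlinks from a page itself; it would need an external source such as a link index, which this sketch simply assumes.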







DOI: http://doi.org/10.12928/telkomnika.v10i1.777



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

TELKOMNIKA Telecommunication, Computing, Electronics and Control
ISSN: 1693-6930, e-ISSN: 2302-9293