Focused Crawler Optimization Using Genetic Algorithm

Banu Wirawan Yohanes, Handoko Handoko, Hartanto Kusuma Wardana

Abstract


As the Web continues to grow, searching it for useful information has become more difficult. A focused crawler explores the Web in conformance with a specific topic. This paper discusses the problems caused by local search algorithms: a crawler can become trapped within a limited Web community and overlook suitable Web pages outside its track. A genetic algorithm, as a global search algorithm, is modified to address these problems. It is used to optimize Web crawling and to select more suitable Web pages to be fetched by the crawler. Several evaluation experiments were conducted to examine the effectiveness of the approach. The crawler delivered collections consisting of 3396 Web pages from 5390 visited links, corresponding to a filtering rate of 63% with Roulette-Wheel selection and a precision of 93% across 5 different categories. The results showed that the genetic algorithm enabled the focused crawler to traverse the Web comprehensively, despite its relatively small collections. Furthermore, it showed great potential for building exemplary collections compared to traditional focused crawling methods.
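The abstract names Roulette-Wheel selection but does not spell it out. The following is a minimal sketch of generic fitness-proportionate selection, not the authors' implementation: the function name `roulette_wheel_select`, the example URLs, and the relevance scores are all hypothetical, and the assumption is that each candidate link carries a non-negative fitness score.

```python
import bisect
import random
from itertools import accumulate


def roulette_wheel_select(candidates, fitness, k, rng=None):
    """Draw k candidates (with replacement), each with probability
    proportional to its fitness score."""
    rng = rng or random.Random()
    if len(candidates) != len(fitness):
        raise ValueError("candidates and fitness must have the same length")
    if any(f < 0 for f in fitness):
        raise ValueError("fitness scores must be non-negative")
    cumulative = list(accumulate(fitness))  # running totals: the wheel's slice edges
    total = cumulative[-1] if cumulative else 0
    if total <= 0:
        raise ValueError("at least one fitness score must be positive")
    picks = []
    for _ in range(k):
        spin = rng.random() * total  # point on the wheel in [0, total)
        index = bisect.bisect_right(cumulative, spin)  # slice containing the point
        picks.append(candidates[min(index, len(candidates) - 1)])
    return picks


# Hypothetical candidate links with made-up topic-relevance scores.
links = ["http://example.org/a", "http://example.org/b", "http://example.org/c"]
scores = [0.1, 0.3, 0.6]
next_to_fetch = roulette_wheel_select(links, scores, k=2, rng=random.Random(42))

# Filtering rate from the figures reported in the abstract: pages kept / links visited.
filtering_rate = 3396 / 5390  # about 0.63
```

Because every link with a positive score keeps some chance of being chosen, this kind of selection lets lower-scoring links through occasionally, which is one way a crawler can leave a local Web community instead of always following the best-scoring neighbor.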







DOI: http://doi.org/10.12928/telkomnika.v9i3.730



Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

TELKOMNIKA Telecommunication, Computing, Electronics and Control
ISSN: 1693-6930, e-ISSN: 2302-9293