CT-FC: more Comprehensive Traversal Focused Crawler

Introduction
The rapid growth of information makes it increasingly difficult for general search engines to provide services effectively. With a general search engine, users must select and open a document before they can determine whether the listed information is relevant to their needs. This job can be time-consuming and tedious for the user [1]. Instead of a general search engine, several professional organizations need a domain search engine to meet their information needs. A domain search engine indexes only documents relevant to specific topics. To index this information, the domain search engine uses a focused crawler as an agent that traverses the WWW and downloads documents relevant to the specified topics.
A focused crawler must determine which links to visit in order to maximize the number of relevant documents obtained, and which unimportant links to avoid in order to minimize the number of irrelevant documents. A good strategy is needed to determine the seed pages effectively and to predict, before the actual download, which links deserve to be followed to obtain relevant documents [2]. At present, a conventional focused crawler can only reach relevant documents that are connected through the out-links of downloaded documents. In fact, the hyperlink structure of relevant documents in the WWW has many characteristics, and some relevant documents cannot be reached through the out-links of other relevant documents. Thus, a new strategy is needed to avoid the locality search trap in focused crawling [2].

Related Works
A Web crawler is a program that utilizes the Web graph structure to move from one document to another in order to obtain Web documents and add them, or their representations, to local storage. Thus, the crawling process can be viewed as a graph search problem.
In its simplest form, the search process of a crawling system starts from a set of seed URLs and then uses the out-links of downloaded documents to visit other URLs. This process is repeated, with new out-links extracted from each newly downloaded document. The process ends when the number of documents collected is sufficient or when certain criteria are met. In general, the infrastructure of the crawling process is shown in Figure 1(a) [3]. Building a general search engine raises many problems, such as the need for huge resources, particularly storage and bandwidth for the crawling process and for services to users in various domains. To index information related to specified topics, a focused crawler has a smart component that determines the search strategy on the Web graph. This component leads the crawler to the parts of the graph that are relevant to a particular topic. Figure 1(b) shows the focused crawling infrastructure. Qin [4] categorizes the proposed focused crawling methods into two parts: Web analysis algorithms and Web search algorithms. Web analysis algorithms are used to assess the relevance and quality of Web documents, and Web search algorithms are used to determine the optimal order in which the target URLs are visited.
First-generation crawlers were based on traditional graph algorithms, such as breadth-first search or depth-first search [5]. Starting from a set of seed URLs, the algorithm recursively follows hyperlinks leading to other documents. The main objective of such a crawling system is to search over the Web and download all documents found. Thus, the content of the documents receives little attention.
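The basic crawling loop described above can be illustrated with a minimal sketch. It assumes the Web is replaced by a small in-memory dictionary of out-links (`toy_web`, a hypothetical stand-in), that "downloading" simply means recording the URL, and that the stopping criterion is a document count; none of these names come from the original system.

```python
from collections import deque

def crawl(seed_urls, out_links, max_documents):
    """Breadth-first crawl over a link graph given as a dict (a stand-in for the Web)."""
    frontier = deque(dict.fromkeys(seed_urls))   # URLs waiting to be visited, duplicates dropped
    seen = set(frontier)                         # avoids queueing the same URL twice
    visited = []                                 # URLs in the order they were "downloaded"
    while frontier and len(visited) < max_documents:
        url = frontier.popleft()
        visited.append(url)                      # placeholder for the real download step
        for link in out_links.get(url, []):      # out-links of the downloaded document
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical toy graph: each key lists the out-links of that document.
toy_web = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": ["a"]}
print(crawl(["a"], toy_web, max_documents=4))    # ['a', 'b', 'c', 'd']
```

Replacing the queue with a stack would give the depth-first variant; a focused crawler would instead order the frontier by a relevance estimate.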
In contrast to a general crawler, a focused crawler must efficiently obtain Web documents relevant to a particular topic. Researchers have generally proposed Web content-based search strategies. These strategies derive from text retrieval, which already has a mature theoretical base. Salton [6] proposed a vector space model that represents each document or query by a vector. In this model, each term represents a single dimension, and the weight attached to each dimension represents the term's contribution to the document content. Furthermore, the lexical representation can be used to infer the semantic meaning of a document by using lexical topology. Based on this model, Rijsbergen [7] provided a hypothesis: a document that is close in vector space to a relevant document has a high probability of being relevant. Search engines have traditionally used lexical metrics to rank documents according to their similarity to the query [8].
Because Web document hyperlinks (out-links) point in one direction only, the search of a focused crawler is limited to a top-down strategy, called forward crawling. In fact, many Web documents are organized in a tree structure. When the focused crawler is at a leaf position, this becomes a serious obstacle to finding highly structured or sibling/spouse relevant documents. For example, when a focused crawler finds the main page of a computer science researcher through a hyperlink in the paper list of a conference site, it needs a good strategy for crawling the documents of the other members of the computer science department. Without explicit hyperlinks to the documents of the other department members, a conventional focused crawler cannot move up to the department main page and on to the other members' documents. This condition makes the recall of conventional focused crawling low.
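The vector space model and the similarity ranking mentioned earlier in this section can be sketched as follows. This is only an illustration under simplifying assumptions: raw term frequencies are used as weights (the source does not specify a weighting scheme), whitespace splitting stands in for real tokenization, and the function names are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a text as raw term frequencies, one dimension per term."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0                      # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

query = term_vector("focused crawler algorithm")
doc_a = term_vector("a focused crawler downloads relevant documents")
doc_b = term_vector("recipes for a quick dinner")

# doc_a shares two terms with the query, doc_b shares none.
print(cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b))
```

Documents would then be ranked by decreasing similarity to the query, which is the lexical metric the text refers to.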

Conventional Focused Crawling Precision and Recall Trade-Off
There is a trade-off between the precision and recall of conventional focused crawling: the higher the precision of the crawling result, the lower its recall. Figure 2(a) shows a focused crawling process that does not exclude irrelevant documents. Thus, the crawling result has low precision but high recall. On the other hand, Figure 2(b) shows a focused crawling process that avoids irrelevant documents, which increases precision but lowers recall. This is because of the characteristics of the WWW, in which many relevant documents are connected to each other only indirectly.
Figure 2. (a) When the goal of the crawling system is only higher recall, it downloads all documents, both relevant and irrelevant, until all relevant documents have been downloaded; (b) a focused crawling system cuts the links through irrelevant documents to maintain precision

Many relevant documents are also connected to others through co-citation documents or through in-links of downloaded documents, which gives a conventional focused crawler low recall (Figure 3). The following section describes the WWW characteristics in more detail.
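The trade-off discussed in this section can be made concrete with the standard definitions of precision and recall. The sketch below uses made-up document identifiers; the two crawl results are hypothetical and only mimic the two situations of Figure 2.

```python
def precision_recall(downloaded, relevant):
    """Precision = relevant downloads / all downloads; recall = relevant downloads / all relevant."""
    downloaded, relevant = set(downloaded), set(relevant)
    hits = downloaded & relevant
    precision = len(hits) / len(downloaded) if downloaded else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"r1", "r2", "r3", "r4"}

# (a) A crawler that follows every link: finds everything, but also much noise.
exhaustive = {"r1", "r2", "r3", "r4", "x1", "x2", "x3", "x4"}
# (b) A crawler that cuts links through irrelevant documents: clean but incomplete.
pruned = {"r1", "r2"}

print(precision_recall(exhaustive, relevant))   # (0.5, 1.0)
print(precision_recall(pruned, relevant))       # (1.0, 0.5)
```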

WWW Structure Characteristics
In general, the Web graph characteristics identified by previous researchers can be categorized into the four quadrants of a Cartesian diagram (Figure 4). The horizontal axis describes the type of connection between relevant documents (direct/indirect), and the vertical axis describes the search direction required to obtain the relevant documents (forward/backward).

Figure 4. Four quadrants of WWW characteristics

Relevant documents in quadrant I, which are connected directly and found by a forward search, have the strongly connected characteristic, i.e., there are connections from one document to others and there is a cycle in the inter-link graph. Quadrant II (connected indirectly, found by a forward search) contains relevant documents that have the indirectly connected characteristic, i.e., they are connected through one or several irrelevant documents [9], [10], [11]. Relevant documents in quadrant III are connected directly through in-links of downloaded documents. Quadrant IV contains relevant documents that are connected via co-citation documents [12], [13].

Focused Crawling System
A focused crawler can be considered a Web information search agent. A user query initiates the information search. The user query is expressed in the form of seed URLs relevant to the specified topic. The focused crawler then downloads the documents related to the seed URLs and maps them to an appropriate concept. The purpose of the concept mapping is to understand the concept of the query given by the user and to limit the scope of Web retrieval.
An ontology can be useful for identifying the relationships between concepts. The query concept is combined with an available general ontology to set up a local ontology for the specified topic. If a concept has no link to the query concept, it should be removed from the local ontology. Finally, the lexicon related to each concept in the local ontology can be used as a reference to assess the relevance of documents. Figure 5 shows the focused crawling framework.
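One possible reading of the trimming step is sketched below. It assumes that the general ontology is given as a dictionary of related concepts, that relations are undirected, and that "has a link to the query concept" means being connected through any chain of relations; all three are assumptions for illustration, and the toy ontology is invented.

```python
def local_ontology(general_ontology, query_concept):
    """Keep only the concepts linked, directly or through other concepts, to the query concept."""
    # Build an undirected neighbour map from the relation lists.
    neighbours = {}
    for concept, related in general_ontology.items():
        neighbours.setdefault(concept, set())
        for other in related:
            neighbours[concept].add(other)
            neighbours.setdefault(other, set()).add(concept)
    if query_concept not in neighbours:
        return set()
    # Collect everything connected to the query concept; the rest is trimmed away.
    kept, stack = {query_concept}, [query_concept]
    while stack:
        for other in neighbours[stack.pop()]:
            if other not in kept:
                kept.add(other)
                stack.append(other)
    return kept

# Hypothetical general ontology: 'vehicle' and 'car' have no link to 'pet'.
general = {"pet": ["cat", "dog"], "cat": ["kitten"], "vehicle": ["car"]}
print(sorted(local_ontology(general, "pet")))   # ['cat', 'dog', 'kitten', 'pet']
```

The lexicon of each kept concept would then serve as the reference for relevance assessment.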
Measuring parameters and measurement criteria can be used to assess system optimality. These parameters and criteria may influence the system design. Therefore, the following subsections discuss the measuring parameters and measurement criteria before discussing the focused crawling system in detail.

Increasing Focused Crawling Precision
The operation of a focused crawler is initialized by a given set of seed URLs. The operation proceeds according to the relevance assessment of the downloaded documents and the links obtained. The relevance assessment is based on a query that is an abstraction of the seed URLs.
To increase precision, the focused crawler uses semantic analysis to assess document relevance. The semantic analysis is used to obtain the desired topic concept (Figure 6(a)). As an illustration, suppose a user wants the topic 'pet', and the documents related to the seed URLs discuss cat babies. The keyword 'cat baby' is acquired during pre-processing. Under syntactic analysis, the resulting documents may contain the words 'baby' and/or 'cat'. Based on this syntactic analysis, a document that contains the word 'baby' is judged relevant even if it discusses a human baby. Meanwhile, the focused crawler rejects a document containing the word 'dog' because it does not contain the word 'baby' or 'cat'.
There are two disadvantages of syntactic search: (1) Accepting documents that contain the word 'baby' without considering what kind of baby is meant causes more irrelevant documents to be downloaded. This reduces precision. (2) Rejecting documents that do not contain the words 'baby' or 'cat', even though they belong to the same concept, lowers recall.
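The two failure modes can be reproduced with a toy comparison of keyword matching against concept-based matching. The lexicon for the concept 'pet' below is invented for illustration, and both matchers are deliberately naive sketches rather than the relevance assessment used by the system.

```python
def syntactic_match(document, keywords):
    """Accept a document if it contains any of the query keywords."""
    words = set(document.lower().split())
    return any(keyword in words for keyword in keywords)

def semantic_match(document, concept_lexicon):
    """Accept a document if it contains any lexicon entry of the topic concept."""
    words = set(document.lower().split())
    return bool(words & set(concept_lexicon))

keywords = ["cat", "baby"]                         # from the keyword 'cat baby'
pet_lexicon = ["cat", "kitten", "dog", "puppy"]    # hypothetical lexicon for the concept 'pet'

human_doc = "how to feed a human baby"
dog_doc = "training a young dog"

# Disadvantage (1): the human-baby document is accepted syntactically (lower precision).
print(syntactic_match(human_doc, keywords), semantic_match(human_doc, pet_lexicon))
# Disadvantage (2): the dog document is rejected syntactically (lower recall).
print(syntactic_match(dog_doc, keywords), semantic_match(dog_doc, pet_lexicon))
```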
If a query has one major concept (hereinafter referred to as the topic), then the focused crawling result has high precision because the query does not have multiple meanings (polysemy). The greater the number of concepts related to the query, the lower the precision, since the prediction accuracy is 1/|c| (Figure 6(b)). Thus, to increase precision, the query must be mapped onto exactly one major concept or topic (Q:C = 1:1).
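As a small worked illustration of the 1/|c| relation, assume (this is an assumption, not something stated in the source) that each of the |c| concepts a query maps to is equally likely to be the one the user intended:

```latex
\[
  P(\text{intended concept}) = \frac{1}{|c|}, \qquad
  |c| = 1 \Rightarrow P = 1, \quad
  |c| = 2 \Rightarrow P = 0.5, \quad
  |c| = 4 \Rightarrow P = 0.25 .
\]
```

Under that assumption, only the one-to-one mapping Q:C = 1:1 leaves the prediction accuracy at its maximum.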

Increasing Focused Crawling Recall
To increase focused crawling recall, a local ontology of the query concept has to be generated after the query concepts are determined. The local ontology is generated by substituting the main query concepts into an available general ontology and trimming it to the related concepts of the same topic. Figure 7 illustrates the substitution process. A lexicon list and its combinations are derived from the local ontology to assess document relevance. The derived results also include synonyms of the main topic's lexicon. The completeness of the knowledge of the topic's concepts and synonyms influences the increase in focused crawling recall. Based on the WWW characteristics, besides the two variables above (related concepts and lexicon synonyms), the completeness of the exploration space also influences focused crawling recall (Figure 8). At present, conventional focused crawlers explore only quadrants I and II because the search is done only in the forward direction. Quadrants III and IV have not been explored by conventional focused crawlers, and this study proposes a method to explore all quadrants comprehensively.

Determinants of Precision and Recall
From the descriptions in sections 5.1 and 5.2, it can be concluded that one major variable influences focused crawling precision (the number of query concepts) and three main variables influence focused crawling recall (Table 1). A focused crawler has high precision when the query relates to only one main concept (Q:C = 1:1). More query concepts (Q:C = 1:|c|) decrease the precision. As with the completeness of the concepts and of the lexicon synonyms, the more completely the search space can be explored, the higher the focused crawling recall will be.

Focused Crawling Exploration
There are four types of neighboring documents in the Web crawling search space: parent documents, child documents, sibling documents, and spouse documents, as shown in Figure 9 [14].
Generally, a focused crawler can explore relevant documents only in quadrants I and II. This is because it uses only the out-links of downloaded documents as traversal guidance. A focused crawler can explore relevant documents located in quadrant I because, under the strongly connected characteristic, there are out-links from one relevant document to others. Formula (3) is an algorithm to explore relevant documents in quadrant I.
$$\mathrm{Reachable}(FC_{I}) = \{\, U(q) \mid \text{while } \mathrm{Score}(q) = 1 \text{ do } p = q,\ \mathrm{SUCC}^{*}(p) \,\} \tag{3}$$

Several studies have shown the existence of relevant documents that are connected through one or more (at most twelve) irrelevant documents. Therefore, several focused crawling methods do not immediately cut off routes through irrelevant documents but instead reduce the weight of the out-links encountered. The farther an out-link is from a relevant document, the lower its relevance weight. Formula (4) is an algorithm to explore relevant documents in quadrant II.

$$\mathrm{Reachable}(FC_{II}) = \{\, U(q) \mid \text{while } 0 < \mathrm{Score}(q) < 1 \text{ and } d(q) < 12 \text{ do } p = q,\ \mathrm{SUCC}^{*}(p) \,\} \tag{4}$$

To increase focused crawling recall, relevant documents in quadrants III and IV cannot be explored by using only the out-links of downloaded documents; the backlinks of potential downloaded documents must be used as well. Relevant documents in quadrant III are analogous to the spouse documents in Figure 9. When downloaded relevant documents point to the same child, the child document can be regarded as an authority. If the child document is an authority, then all spouse documents are predicted to be candidate relevant documents and must be downloaded. Formula (5) is an algorithm to find spouse documents and thereby explore relevant documents in quadrant III.
$$\mathrm{Reachable}(FC_{III}) = \{\, U(q) \mid r = \mathrm{SUCC}(p) \text{ and } s = \mathrm{SUCC}(p);\ \text{if } \mathrm{Score}(r) = 1 \text{ or } \mathrm{Score}(s) = 1 \text{ then } (\mathrm{SUCC}^{-1})^{*}(p) \,\} \tag{5}$$

Similarly to quadrant III, relevant documents in quadrant IV are analogous to the sibling documents in Figure 9. When downloaded relevant documents are pointed to by the same parent, the parent document can be regarded as a hub. If the parent document is a hub, then all sibling documents are predicted to be candidate relevant documents and must be downloaded. The algorithm to find sibling documents is conditioned on: if $y = \mathrm{SUCC}^{-1}(p)$ and $y = \mathrm{SUCC}^{-1}$ [...] Formula (6) is an algorithm to explore relevant documents in quadrant IV. Formula (7) is an algorithm to reach relevant documents that are connected to each other either directly or indirectly, through out-links or backlinks. To reach disconnected relevant documents, the focused crawler utilizes the ontology to maximize the result.
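The forward and backward exploration described in this section can be sketched on a toy link graph. This is an illustrative reading, not the authors' implementation: relevance is reduced to a yes/no test, the distance limit is exposed as a hypothetical `max_gap` parameter, and a child or parent is treated as an authority or hub once at least `min_support` downloaded relevant documents share it (the source does not give a threshold).

```python
from collections import deque

def invert(out_links):
    """Build backlinks (in-links) from an out-link map."""
    in_links = {}
    for source, targets in out_links.items():
        for target in targets:
            in_links.setdefault(target, set()).add(source)
    return in_links

def forward_reachable(seeds, out_links, is_relevant, max_gap=12):
    """Quadrants I and II: follow out-links, tolerating at most max_gap
    consecutive irrelevant documents on the way to a relevant one."""
    found, best_gap = set(), {}
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        url, gap = queue.popleft()
        gap = 0 if is_relevant(url) else gap + 1
        if gap > max_gap or best_gap.get(url, max_gap + 1) <= gap:
            continue                       # too far from relevance, or reached before with a gap at least as small
        best_gap[url] = gap
        if gap == 0:
            found.add(url)
        queue.extend((nxt, gap) for nxt in out_links.get(url, []))
    return found

def spouse_candidates(relevant_downloaded, out_links, in_links, min_support=2):
    """Quadrant III: other parents of a child shared by downloaded relevant documents."""
    relevant_downloaded = set(relevant_downloaded)
    candidates = set()
    children = {c for d in relevant_downloaded for c in out_links.get(d, [])}
    for child in children:
        parents = in_links.get(child, set())
        if len(parents & relevant_downloaded) >= min_support:   # child acts as an authority
            candidates |= parents - relevant_downloaded
    return candidates

def sibling_candidates(relevant_downloaded, out_links, in_links, min_support=2):
    """Quadrant IV: other children of a parent shared by downloaded relevant documents."""
    relevant_downloaded = set(relevant_downloaded)
    candidates = set()
    parents = {p for d in relevant_downloaded for p in in_links.get(d, set())}
    for parent in parents:
        kids = set(out_links.get(parent, []))
        if len(kids & relevant_downloaded) >= min_support:      # parent acts as a hub
            candidates |= kids - relevant_downloaded
    return candidates

# Hypothetical toy graph: 'hub' links to p1..p3; p1, p2 and s1 all link to 'auth'.
web = {"hub": ["p1", "p2", "p3"], "p1": ["auth"], "p2": ["auth"],
       "s1": ["auth"], "p3": [], "auth": []}
relevant = {"p1", "p2", "p3", "s1"}
in_links = invert(web)

print(forward_reachable(["p1", "p2"], web, relevant.__contains__))  # forward only: p1 and p2
print(spouse_candidates({"p1", "p2"}, web, in_links))               # {'s1'}
print(sibling_candidates({"p1", "p2"}, web, in_links))              # {'p3'}
```

In this toy graph, forward crawling from `p1` and `p2` never reaches `s1` or `p3`, while the two backlink-based steps recover both.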

Result
Crawling experiments with the CT-FC strategy have been carried out on several topics, including the topic "algorithm". There are 1714 documents relevant to the topic "algorithm" in DMOZ. From these, between 1 and 80 relevant URLs were taken at random and used as seed URLs, and the rest were considered target documents. For comparison, a conventional focused crawling strategy that does not use in-link information was also run. Figure 10 shows a comparison of the average recall of CT-FC and the conventional focused crawler. The conventional focused crawler was given the same variation of seed URLs as CT-FC. CT-FC gives a significant increase in recall over the conventional focused crawler. With a small number of seed URLs, conventional focused crawling produces a recall of only about 0.5, whereas CT-FC quickly reaches a recall above 0.7, which continues to increase rapidly as the number of seed URLs grows.
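The evaluation protocol described above (random seeds drawn from the relevant set, the remainder used as targets) can be sketched as follows. The document identifiers, the seed count, and the function names are hypothetical; the sketch does not reproduce the experiment or its results.

```python
import random

def split_seeds_and_targets(relevant_urls, n_seeds, rng):
    """Draw n_seeds URLs at random as seeds; the remaining relevant URLs are the targets."""
    urls = sorted(relevant_urls)              # sorted so that a seeded RNG gives a repeatable draw
    seeds = set(rng.sample(urls, n_seeds))
    return seeds, set(urls) - seeds

def target_recall(crawled, targets):
    """Fraction of target documents that the crawl actually reached."""
    targets = set(targets)
    return len(set(crawled) & targets) / len(targets) if targets else 0.0

relevant_urls = {f"doc{i}" for i in range(10)}    # stand-in for the relevant documents of a topic
seeds, targets = split_seeds_and_targets(relevant_urls, n_seeds=3, rng=random.Random(0))
print(len(seeds), len(targets))                   # 3 7
```

Averaging `target_recall` over several random draws for each seed count would give a curve of the kind compared in Figure 10.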

Conclusion
With the forward and backward crawling approach, a focused crawler can increase its exploration capability and recall performance. With this ability, the constraints that the Web structure characteristics impose on conventional focused crawlers can be resolved. This is demonstrated by the high crawling recall obtained even when only a small number of seed URLs is given.
This study demonstrates the relevance support from a relevant document for sibling documents through co-citation, and for spouse documents through co-reference. Based on the experimental results, the forward and backward crawling approach makes the focused crawler more stable (not sensitive to the number and quality of seed URLs). Bibliometric concepts also support the good performance of CT-FC, especially in precision, recall, and stability.

Figure 1. Many relevant documents that are connected through (a) co-citation documents and (b) co-referenced documents

Figure 3. Conventional focused crawler may not reach relevant documents connected through co-citations or in-links of downloaded documents

Figure 6. (a) Query to concept mapping. If the mapping produces more than one concept, then there is polysemy or ambiguity in the query meaning; (b) Precision decreases in accordance with the number of query concepts

Table 1. Effect of Q (query), C (concept), L (lexicon) and K (quadrant search) on precision and recall of the focused crawling system (rows: recall; columns: precision; levels: low, high)

Focused crawling recall depends on three variables: (1) Completeness of the knowledge of concepts related to the topic (local ontology, O_L). More concepts in the local ontology (O_L:C = 1:|c|) increase focused crawling recall; (2) Completeness of the lexicon synonyms of the derived concepts contained in the local ontology, because many lexicons have similar meanings (are synonymous). The more synonyms are recognized, the higher the focused crawling recall; and (3) Completeness of the exploration space used to obtain relevant documents (P_D(K)).

Figure 9. Four kinds of document neighbors

Figure 10. Average recall comparison of more Comprehensive Traversal Focused Crawler (CT-FC) and Conventional Focused Crawler (CFC)