Maimunah S., Surabaya Adhi Tama Institute of Technology |
Widyantoro D.H., Bandung Institute of Technology |
Kuspriyanto, Bandung Institute of Technology |
Sastramihardja H.S., Bandung Institute of Technology
Proceedings of the 2011 International Conference on Electrical Engineering and Informatics, ICEEI 2011 | Year: 2011
A focused crawler is an agent that indexes information on a specific topic. To traverse the WWW, a focused crawler predicts the visiting priority of each hyperlink so as to download as many relevant documents as possible while minimizing the number of irrelevant documents downloaded. Many researchers have proposed methods that improve focused-crawling precision by reducing irrelevant downloads. However, there is a trade-off between precision and recall: the higher the precision, the lower the recall. This research studied the conventional focused-crawling search strategy (forward crawling) and the structure of Web documents. The results show that the low recall of conventional focused crawling is caused by certain structural characteristics of the WWW. This research therefore proposes a new focused-crawling strategy that combines bidirectional (forward and backward) crawling with bibliometric concepts (co-citation and co-reference). Bidirectional crawling improves exploration, while the co-citation and co-reference concepts keep the crawl focused. With this strategy, a focused crawler can reach relevant documents that are connected through co-citations, as well as relevant communities that are connected through co-references. Experiments show that a focused crawler using this strategy, named CT-FC (more Comprehensive Traversal Focused Crawler), has better exploration capability, so recall increases while precision remains high. © 2011 IEEE.
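The abstract gives no algorithmic detail, so the following is only a minimal sketch of how bidirectional crawling with co-citation and co-reference might be illustrated; it is not the CT-FC implementation. Everything in it is an assumption made for illustration: the toy graph `OUT_LINKS`, the `RELEVANT` set standing in for a relevance classifier, the function names, and the expansion rules. The rules assumed here are that a relevant page is expanded in both directions, an irrelevant page reached by a backward step is expanded forward only (yielding pages co-cited with the relevant one), and an irrelevant page reached by a forward step is expanded backward only (yielding pages that reference the same target). Co-reference is taken to mean two pages linking to a common page.

```python
from collections import defaultdict, deque

# Hypothetical toy Web graph: page -> pages it links to.
OUT_LINKS = {
    "A": ["B", "X"],   # seed; links to relevant B and irrelevant X
    "H": ["A", "C"],   # irrelevant hub citing both A and C (co-citation)
    "D": ["X"],        # relevant page that also references X (co-reference)
    "X": ["Y"],
    "B": [], "C": [], "Y": [],
}
# Stand-in for a relevance classifier (an assumption of this sketch).
RELEVANT = {"A", "B", "C", "D"}


def invert(out_links):
    """Build the in-link (backlink) map from an out-link map."""
    in_links = defaultdict(set)
    for src, dsts in out_links.items():
        for dst in dsts:
            in_links[dst].add(src)
    return in_links


def crawl(seed, out_links, relevant, bidirectional):
    """Return pages in visiting order under the assumed expansion rules."""
    in_links = invert(out_links)
    seen = {seed}
    queue = deque([(seed, "seed")])  # (page, how it was reached)
    visited = []

    def enqueue(pages, how):
        for page in sorted(pages):
            if page not in seen:
                seen.add(page)
                queue.append((page, how))

    while queue:
        page, how = queue.popleft()
        visited.append(page)
        outs = out_links.get(page, ())
        ins = in_links.get(page, ())
        if page in relevant:
            enqueue(outs, "fwd")           # conventional forward crawling
            if bidirectional:
                enqueue(ins, "bwd")        # backward step to citing pages
        elif bidirectional and how == "bwd":
            enqueue(outs, "stop")          # pages co-cited with a relevant page
        elif bidirectional and how == "fwd":
            enqueue(ins, "stop")           # pages co-referencing the same target
        # Any other irrelevant page is not expanded, which keeps the crawl focused.
    return visited


def recall(visited, relevant):
    """Fraction of relevant pages that the crawl reached."""
    return len(set(visited) & relevant) / len(relevant)


if __name__ == "__main__":
    for mode in (False, True):
        pages = crawl("A", OUT_LINKS, RELEVANT, bidirectional=mode)
        print(mode, pages, recall(pages, RELEVANT))
```

On this toy graph, the forward-only crawl cannot reach C or D because no chain of out-links leads to them from the seed, whereas the bidirectional crawl reaches C through the shared citing page H and D through the shared target X, and still never visits Y. The graph was constructed to show this effect and says nothing about the magnitude of the gains reported in the paper.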