1645-9911

S1645-99112008000100018

00 06 2008

9 337 369

Satisfying Information Needs on the Web: a Survey of Web Information Retrieval^*

Nuno Filipe Escudeiro^[1]•, Alípio Mário Jorge^[2]•

(received 20 March 2008; accepted 22 April 2008)

Summary. From very early on, humankind has felt the need to keep records of its activity so that they can easily be consulted in the future. Our own evolution depends, to a large extent, on this iterative process, in which each iteration builds on these records. The emergence of the web and its success have significantly increased the availability of information, which has quickly become ubiquitous. However, the absence of editorial control gives rise to great heterogeneity in several respects. Traditional information retrieval techniques prove insufficient for this new medium. Web information retrieval is the natural evolution of the information retrieval field toward the web medium. In this paper we present a retrospective and, we hope, comprehensive analysis of this area of human knowledge.

Abstract. Since its early ages, humankind has felt the need to keep records of its achievements that persist through time and can be easily retrieved for later reference. Our own evolution depends largely on this iterative process, where each iteration is based on these records. The advent of the web and its attractiveness greatly increased the availability of information, which rapidly became ubiquitous. However, the lack of editorial control gives rise to high heterogeneity in several respects. Traditional information retrieval techniques face new, challenging problems and prove insufficient to deal with the characteristics of the web. In this paper we present a comprehensive and retrospective overview of web information retrieval.

Keywords: Web information retrieval, search engines.

Full text only available in PDF format.

^* Supported by the POSC/EIA/58367/2004/Site-o-Matic Project (Fundação Ciência e Tecnologia), FEDER, and the Programa de Financiamento Plurianual de Unidades de I&D.

^[1] DEI-ISEP – Deptº de Engenharia Informática, Instituto Superior de Engenharia do Porto; http://www.dei.isep.ipp.pt

^[2] FEP-UP – Faculdade de Economia, Universidade do Porto; http://www.fep.up.pt

• LIAAD, INESC Porto LA – Laboratório de Inteligência Artificial e Análise de Dados; http://www.liaad.up.pt
