Master in Informatics and Computing Engineering
Information Description, Storage and Retrieval
Instance: 2020/2021
—
Lecture #3 :: 09/10/2020 §
Goals §
By the end of this class, the student should be able to: ¶
- identify situations where information retrieval takes place; ¶
- name some retrieval tools, their goals and features; ¶
- enumerate typical information retrieval tasks; ¶
- distinguish between information retrieval and data retrieval; ¶
- define document, collection, information need, query, search result, relevance; ¶
- describe the origin and the milestones in the evolution of information retrieval. ¶
Topics §
Bibliography §
- Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, http://www.mir2ed.org/, Addison-Wesley Professional, 2nd edition, 2011 ¶
- Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, http://nlp.stanford.edu/IR-book/, Cambridge University Press, 2008 ¶
- W. Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice, http://ciir.cs.umass.edu/downloads/SEIRiP.pdf, Pearson, 2009 ¶
Materials §
- Sergey Brin and Lawrence Page, The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst. 30, 1-7 (April 1998), 107-117. DOI=10.1016/S0169-7552(98)00110-X http://infolab.stanford.edu/~backrub/google.html, accessed October 2020 ¶
- Vannevar Bush, As We May Think, The Atlantic Monthly, 176(1):101-108, July 1945, http://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/3881/, accessed October 2020 ¶
- W. Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice, Slides for Chapter 1, Pearson, 2009 ¶
- Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, Chapter 1: Introduction, accessed October 2020 ¶
- Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Chapter 1: Boolean Retrieval, accessed October 2020 ¶
- Other resources (online materials, conferences, research groups): IIR resources, MIR resources, accessed October 2020 ¶
- IR Timeline http://en.wikipedia.org/wiki/Information_retrieval#Timeline, accessed October 2020 ¶
Tasks §
0.1 What is IR? §
What is IR? Is IR what we do every time we query a database, search the web, locate resources in a library catalog, explore a large digital library or browse the site of an online shop? ¶
Modern Information Retrieval, by Baeza-Yates and Ribeiro-Neto, is one of our references http://grupoweb.upf.edu/mir2ed/ ¶
Read Chapter 1: http://grupoweb.upf.edu/mir2ed/pdf/chapter1.pdf, in particular Section 1.2.2 (Information versus Data Retrieval). The next questions are based on it. ¶
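As a starting point for the discussion, the contrast can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the toy collection, the function names and the term-count scoring are all hypothetical and are not taken from the book. Data retrieval is shown as an exact-match lookup that returns a set of records; information retrieval is shown as a ranking in which documents match a free-text query to different degrees.

```python
import re
from collections import Counter

# Hypothetical toy collection; ids and texts are made up for illustration.
DOCS = {
    "d1": "solr is a search engine built on the lucene library",
    "d2": "a library catalog lists the books held by a library",
    "d3": "relational databases answer queries with exact matches",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def data_retrieval(docs, term):
    """Data retrieval: return exactly the ids of the records containing the term."""
    return sorted(doc_id for doc_id, text in docs.items() if term in tokenize(text))


def information_retrieval(docs, query):
    """IR-style ranking: score each document by how often the query terms occur."""
    query_terms = tokenize(query)
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties are broken by id to keep the order stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Calling `data_retrieval(DOCS, "library")` yields an unordered answer set (here listed as ids), while `information_retrieval(DOCS, "library catalog books")` yields a ranked list in which some documents are better matches than others. Real IR systems use far richer scoring than raw term counts.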
0.2 Milestones in IR §
IR has both a “librarian” and a “computational” ancestry. Check a timeline for IR, such as http://en.wikipedia.org/wiki/Information_retrieval#Timeline, pick some salient names, and search for details on their contributions. ¶
0.3 Documents, collections, tasks and users §
One of the principles in IR is that information is embodied in “documents”, and a document can be any chunk of text, image or speech that we can treat individually as an identifiable resource. ¶
Information retrieval systems work on “collections”: sets of documents that are indexed and become candidates for the result list of a query. ¶
Users are at the center of the stage in IR: they provide the “information need”, receive the result lists and provide the relevance judgements. ¶
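To make these terms concrete, here is a minimal sketch of a collection, an index and a query, assuming a tiny in-memory collection and Boolean AND matching (the approach introduced in Chapter 1 of Manning et al.). The names `build_index`, `search_and` and `COLLECTION` are hypothetical and chosen only for this illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(collection):
    """Build an inverted index: each term maps to the sorted ids of its documents."""
    postings = defaultdict(set)
    for doc_id, text in collection.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_and(index, query):
    """Return the ids of the documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical collection: document ids mapped to their text.
COLLECTION = {
    1: "Information retrieval systems index documents",
    2: "Users express an information need as a query",
    3: "A query is matched against the indexed documents",
}
INDEX = build_index(COLLECTION)
```

Here `search_and(INDEX, "information query")` returns only the documents that contain both terms. Whether those results are relevant to the underlying information need is a judgement that remains with the user.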
The evolution of tasks in IR is visible in the organisation of TREC (Text REtrieval Conference): an initiative where IR researchers compare their achievements by measuring how well their systems perform in a specific task (the data and methods are agreed in the corresponding “track”). ¶
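As a rough illustration of what "measuring how well they perform" can mean, the sketch below computes precision and recall for one query from a result list and a set of relevance judgements. It assumes binary judgements and ignores ranking; the function name and the document ids in the usage note are hypothetical, and actual tracks define their own measures.

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for a single query.

    retrieved: ids returned by the system; relevant: ids judged relevant.
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

For example, if a system returns `["d1", "d2", "d3", "d4"]` and the judged relevant documents are `{"d2", "d4", "d7"}`, half of the results are relevant (precision) and two of the three relevant documents were found (recall).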
Go to the TREC site http://trec.nist.gov/ and browse some recent tracks http://trec.nist.gov/tracks.html and some older ones http://trec.nist.gov/proceedings/proceedings.html. Answer the following questions. ¶
0.4 Search Engine Evolution §
Search engines started as tools for exploring scientific literature and later moved to the web. The web as a document collection brought new requirements and many challenges. Google introduced major breakthroughs in the 1990s. The paper by Brin and Page, “The anatomy of a large-scale hypertextual Web search engine” http://infolab.stanford.edu/~backrub/google.html, highlights several innovations that created a new generation of search engines. Check the paper for two of these: ¶
0.5 Search Engine Experimentation §
Search engines are now available as software components that can be integrated to provide search in any software environment. One well-known engine is Solr, which is built on the Lucene IR library and supported by the Apache Software Foundation. ¶
The task is to install a Solr instance and start experimenting with indexing and searching small document collections. You can use the online guides: https://lucene.apache.org/solr/guide/ ¶
You can also use Docker to quickly deploy a Solr instance: https://hub.docker.com/_/solr/ ¶
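Once an instance is running, indexing and searching can be tried over Solr's HTTP interface. The sketch below uses only the Python standard library and rests on several assumptions: Solr listens on the default local address `http://localhost:8983/solr`, and a core already exists (the core name `demo` and the document fields in the usage note are hypothetical). It builds a JSON request for the `/update` handler and a URL for the `/select` handler; nothing is sent until `send` is called.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a local Solr instance on its default port.
SOLR_BASE = "http://localhost:8983/solr"


def build_update_request(core, docs):
    """Build a POST request that indexes docs (a list of dicts) and commits."""
    url = f"{SOLR_BASE}/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_select_url(core, query, rows=10):
    """Build the URL of a search request that asks for a JSON response."""
    params = urllib.parse.urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/{core}/select?{params}"


def send(request_or_url):
    """Send the request to a running Solr instance and decode the JSON reply."""
    with urllib.request.urlopen(request_or_url) as response:
        return json.load(response)
```

With a running instance, `send(build_update_request("demo", [{"id": "1", "title": "Information retrieval"}]))` would index one document and `send(build_select_url("demo", "title:retrieval"))` would search for it. How the `title` field is analysed depends on the schema of the core, so check the Solr guide for the configuration you are using.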
Summary §
- Review of each group's project: data collection and characterization. ¶
- The retrieval process. The components of an information retrieval system. ¶
- Retrieval tasks. Collections, documents and queries. ¶
- Information Retrieval: history from 1940. The links to information science, natural language processing, information extraction, statistics, human-computer interaction. ¶
— MCR, JCL, SSN ¶