2020 §

Master in Informatics and Computing Engineering
Information Description, Storage and Retrieval
Instance: 2020/2021

Lecture #3 :: 09/10/2020 §

Goals §

By the end of this class, the student should be able to: ¶

identify situations where information retrieval takes place; ¶
name some retrieval tools, their goals and features; ¶
enumerate typical information retrieval tasks; ¶
distinguish between information retrieval and data retrieval; ¶
define document, collection, information need, query, search result, relevance; ¶
describe the origin and the milestones in the evolution of information retrieval. ¶

Topics §

Retrieval tasks and systems ¶
- Components of an information retrieval system ¶
Information Retrieval concepts ¶
- Collections, documents, queries, results ¶
- Users, information needs ¶
Evaluation of retrieval systems ¶
- Tasks ¶
- Test collections ¶
- Measures ¶
The history of Information Retrieval (IR) ¶
- relation to text and data mining, machine learning, natural language processing ¶

Bibliography §

Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, http://www.mir2ed.org/, Addison-Wesley Professional, 2nd edition, 2011 ¶
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, http://nlp.stanford.edu/IR-book/, Cambridge University Press, 2008 ¶
W. Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice, http://ciir.cs.umass.edu/downloads/SEIRiP.pdf, Pearson, 2009 ¶

Materials §

Sergey Brin and Lawrence Page, The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst. 30, 1-7 (April 1998), 107-117. DOI=10.1016/S0169-7552(98)00110-X http://infolab.stanford.edu/~backrub/google.html, accessed October 2020 ¶
Vannevar Bush, As We May Think, The Atlantic Monthly, 176(1):101-108, July 1945, http://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/3881/, accessed October 2020 ¶
W. Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice, Slides for Chapter 1, Pearson, 2009 ¶
Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, Chapter 1: Introduction, accessed October 2020 ¶
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Chapter 1: Boolean Retrieval, accessed October 2020 ¶
Other resources (online materials, conferences, research groups): IIR resources, MIR resources, accessed October 2020 ¶
Text Retrieval Conference, TREC, accessed October 2020 ¶
ACM Special Interest Group in Information Retrieval, SIGIR, accessed October 2020 ¶
IR Timeline http://en.wikipedia.org/wiki/Information_retrieval#Timeline, accessed October 2020 ¶

Tasks §

0.1 What is IR? §

What is IR? Are we doing IR every time we query a database, search the web, locate resources in a library catalog, explore a large digital library, or browse the site of an online shop? ¶

Modern Information Retrieval, by Baeza-Yates and Ribeiro-Neto, is one of our references: http://grupoweb.upf.edu/mir2ed/ ¶

Read Chapter 1 of the book, http://grupoweb.upf.edu/mir2ed/pdf/chapter1.pdf, in particular Section 1.2.2 (Information versus Data Retrieval). The following questions are based on it. ¶

Give an example of an IR session: user, goal, system used. ¶
Give an example of a data retrieval session: user, goal, system used. ¶
Result lists are typical of both IR and data retrieval systems. Can these lists contain “wrong” entries? Why or why not? ¶
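To make the contrast concrete, here is a minimal sketch, not taken from the book. It assumes a toy record set built from the three titles in the bibliography, and the function names `data_retrieval` and `information_retrieval` are hypothetical. The first function returns exactly the records that satisfy a condition. The second ranks records by how many query terms they share with the title, so a partially matching record can appear in the list even if the user would not consider it relevant.

```python
import re

# Toy records based on the course bibliography (illustration only).
RECORDS = [
    {"id": 1, "title": "Modern Information Retrieval", "year": 2011},
    {"id": 2, "title": "Introduction to Information Retrieval", "year": 2008},
    {"id": 3, "title": "Search Engines: Information Retrieval in Practice", "year": 2009},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def data_retrieval(records, year):
    """Exact match: every returned record satisfies the condition."""
    return [r["id"] for r in records if r["year"] == year]


def information_retrieval(records, query):
    """Best match: rank records by the number of distinct query terms in the title."""
    query_terms = set(tokenize(query))
    scored = []
    for r in records:
        overlap = len(query_terms & set(tokenize(r["title"])))
        if overlap > 0:
            scored.append((overlap, r["id"]))
    # Highest overlap first; ties broken by record id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [record_id for _, record_id in scored]


print(data_retrieval(RECORDS, 2008))
print(information_retrieval(RECORDS, "modern search engines"))
```

For the query "modern search engines", record 1 is returned because it shares the term "modern", although it is not a book about search engines. Whether that entry is "wrong" depends on the user's information need, which is the point of the last question above.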

0.2 Milestones in IR §

IR has both a “librarian” and a “computational” ancestry. Check a timeline for IR, such as http://en.wikipedia.org/wiki/Information_retrieval#Timeline, for the salient names below, then search for details on their contributions. ¶

Vannevar Bush (1945) ¶
Cyril Cleverdon (1960) ¶
Karen Spärck Jones (1970) ¶
C. J. van Rijsbergen (1980) ¶
Brin & Page (1990) ¶

0.3 Documents, collections, tasks and users §

One of the principles of IR is that information is embodied in “documents”. A document can be any chunk of text, image, or speech that we can treat individually as an identifiable resource. ¶

Information retrieval systems work on “collections”: sets of documents that are indexed and thereby become candidates for the result list of a query. ¶
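As an illustration of what "indexed" can mean, the sketch below builds a small inverted index (a map from each term to the documents that contain it) and uses it to find the candidate documents for a query. It is only one simple possibility, using Boolean AND matching. The collection contents and the names `build_index` and `candidates` are assumptions made for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(collection):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in collection.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def candidates(index, query):
    """Return the documents that contain every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Toy collection: three titles from the reading list (illustration only).
COLLECTION = {
    "d1": "As we may think",
    "d2": "The anatomy of a large-scale hypertextual Web search engine",
    "d3": "Search engines: information retrieval in practice",
}
INDEX = build_index(COLLECTION)

print(candidates(INDEX, "search"))
print(candidates(INDEX, "search engine"))
```

Note that "engine" and "engines" are different terms here, because the sketch applies no stemming.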

Users are at the center of the stage in IR: they provide the “information need”, receive the result lists and provide the relevance judgements. ¶

The evolution of tasks in IR is visible in the organisation of TREC, an initiative where IR researchers compare their achievements by measuring how well their systems perform in a specific task (data and methods are agreed upon in the corresponding “track”). ¶
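As a hint of what "measuring" can look like, the sketch below computes precision and recall, two standard set-based measures, from a result list and a set of relevance judgements. Each TREC track defines its own measures, so this is only an illustration. The function name and the document ids are hypothetical.

```python
def precision_recall(results, relevant):
    """Return (precision, recall) of a result list against the judged-relevant ids."""
    retrieved = set(results)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: four documents returned, three judged relevant by the user.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], {"d2", "d4", "d7"})
print(precision, recall)
```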

Go to the TREC site http://trec.nist.gov/ and browse some recent tracks http://trec.nist.gov/tracks.html and some older ones http://trec.nist.gov/proceedings/proceedings.html. Answer the following questions. ¶

Select one of the 2018 tracks and identify: ¶
1. the task; ¶
2. the collection or collections; ¶
3. what is a document? ¶
4. the intended users. ¶
Repeat for one of the older tracks. ¶

0.4 Search Engine Evolution §

Search engines started as tools for exploring scientific literature and later moved to the web. The web as a document collection brought new requirements and many challenges. Major breakthroughs were introduced by Google in the 1990s. The paper by Brin and Page, “The anatomy of a large-scale hypertextual Web search engine” http://infolab.stanford.edu/~backrub/google.html, highlights several innovations that created a new generation of search engines. Check the paper for two of these: ¶

PageRank, the citation-based page weight function; ¶
Use of the anchor text of links as a description of their target pages. ¶
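To get a feel for PageRank before reading the paper, here is a minimal sketch of an iterative computation. It assumes the commonly used normalised variant, in which ranks sum to 1, with the damping factor 0.85 mentioned in the paper. The handling of pages without outgoing links and the toy link graph `WEB` are assumptions of this sketch, not details taken from the paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by repeated redistribution of rank along links.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a base share, independent of the link structure.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among the pages it cites.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption of this sketch: a page with no outgoing links
                # spreads its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is cited by three pages, "d" by none.
WEB = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
RANKS = pagerank(WEB)
for page, score in sorted(RANKS.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph the most cited page, "c", ends up with the highest rank, and "d", which no page links to, keeps only the base share.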

0.5 Search Engine Experimentation §

Search engines are currently available as software components that can be integrated to provide search in any software environment. One well-known engine is Solr, which is based on the Lucene IR library and supported by the Apache Software Foundation. ¶

The task is to install a Solr instance and start experimenting with indexing and searching small document collections. You can use the online guides: https://lucene.apache.org/solr/guide/ ¶

You can also use Docker to quickly deploy a Solr instance: https://hub.docker.com/_/solr/ ¶
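Once an instance is running and some documents are indexed, queries can be sent to Solr over HTTP. The sketch below assumes a local instance on the default port 8983 and a core with the hypothetical name `mycore`. The helper names `build_select_url` and `search` are also made up for this example. Check the Solr guide for the options available in your version.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(query, core="mycore", host="localhost", port=8983, rows=10):
    """Build the URL of a query to the select handler of a Solr core.

    The core name, host and port are assumptions; adapt them to your setup.
    """
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"http://{host}:{port}/solr/{core}/select?{params}"


def search(query, **kwargs):
    """Send the query to a running Solr instance and return the matching documents."""
    with urlopen(build_select_url(query, **kwargs)) as response:
        payload = json.load(response)
    return payload["response"]["docs"]


# Building the URL needs no running server; search() does.
print(build_select_url("title:retrieval"))
```

Calling `search("title:retrieval")` only works if the core exists and its documents have a `title` field, which depends on how you indexed your collection.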

Summary §

Review of each group's project: data collection and characterization. ¶
The retrieval process. The components of an information retrieval system. ¶
Retrieval tasks. Collections, documents and queries. ¶
Information Retrieval: history from the 1940s. The links to information science, natural language processing, information extraction, statistics, human-computer interaction. ¶

— MCR, JCL, SSN ¶
