


Master in Informatics and Computing Engineering
Information Description, Storage and Retrieval
Instance: 2020/2021


Lecture #3 :: 09/10/2020 §

Goals §

By the end of this class, the student should be able to: 

  • identify situations where information retrieval takes place; 
  • name some retrieval tools, their goals and features; 
  • enumerate typical information retrieval tasks; 
  • distinguish between information retrieval and data retrieval; 
  • define document, collection, information need, query, search result, relevance; 
  • describe the origin and the milestones in the evolution of information retrieval. 

Topics §

  1. Retrieval tasks and systems 
    • Components of an information retrieval system 
  2. Information Retrieval concepts 
    • Collections, documents, queries, results 
    • Users, information needs 
  3. Evaluation of retrieval systems 
    • Tasks 
    • Test collections 
    • Measures 
  4. The history of Information Retrieval (IR) 
    • relation to text and data mining, machine learning, natural language processing 

Bibliography §

  • Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, 2nd edition, Addison-Wesley Professional, 2011 
  • Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008 
  • W. Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice, Pearson, 2009 

Materials §

Tasks §

0.1 What is IR? §

What is IR? Are we doing IR every time we query a database, search the web, locate resources in a library catalog, explore a large digital library or browse the site of an online shop? 

Modern Information Retrieval, by Baeza-Yates and Ribeiro-Neto, is one of our references. 

Read this chapter, in particular Section 1.2.2 (Information versus Data Retrieval). The questions below are based on it. 

  1. Give an example of an IR session: user, goal, system used. 
  2. Give an example of a data retrieval session: user, goal, system used. 
  3. Result lists are typical in IR and in data retrieval systems; can you have “wrong” entries in these lists, why or why not? 
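
As a starting point for the questions above, the contrast between the two kinds of retrieval can be sketched in a few lines. This is a minimal illustration, not a definitive model: the records reuse the titles and years from the bibliography, while the function names and the term-overlap scoring are assumptions made for the example. Data retrieval is shown as an exact-match filter on a structured field, and information retrieval as a best-match ranking over free text.

```python
from collections import Counter

# Toy records built from the bibliography; the "id" values are hypothetical.
RECORDS = [
    {"id": 1, "title": "Modern Information Retrieval", "year": 2011},
    {"id": 2, "title": "Introduction to Information Retrieval", "year": 2008},
    {"id": 3, "title": "Search Engines: Information Retrieval in Practice", "year": 2009},
]


def data_retrieval(records, year):
    """Exact match: a record either satisfies the condition or it does not."""
    return [r["id"] for r in records if r["year"] == year]


def information_retrieval(records, query):
    """Best match: score each title by shared query terms, then rank by score."""
    query_terms = Counter(query.lower().split())
    scored = []
    for r in records:
        title_terms = Counter(r["title"].lower().replace(":", " ").split())
        score = sum(min(n, title_terms[t]) for t, n in query_terms.items())
        if score > 0:
            scored.append((score, r["id"]))
    # Highest score first; ties are broken by id to keep the output stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


print(data_retrieval(RECORDS, 2008))
print(information_retrieval(RECORDS, "introduction to search"))
```

The first call returns only records that satisfy the condition exactly. The second returns an ordered list in which a lower-ranked entry shares a term with the query but may or may not satisfy the user's information need.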

0.2 Milestones in IR §

IR has both a “librarian” and a “computational” ancestry. Check a timeline for IR and, for each of the salient names below, search for details on their contributions. 

  1. Vannevar Bush (1945) 
  2. Cyril Cleverdon (1960) 
  3. Karen Spärck Jones (1970) 
  4. C. J. van Rijsbergen (1980) 
  5. Brin & Page (1990) 

0.3 Documents, collections, tasks and users §

One of the principles in IR is that information is embodied in “documents”, and a document can be any chunk of text, image or speech that we can treat individually as an identifiable resource. 

Information retrieval systems work on “collections”: sets of documents that are indexed and become candidates for the list of results for a query. 
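
To make the idea of an indexed collection concrete, here is a minimal sketch of an inverted index, one common way of indexing a collection. The tiny collection, the document identifiers and the function names are all hypothetical, and tokenisation is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

# Hypothetical three-document collection, invented for this example.
COLLECTION = {
    "d1": "information retrieval systems",
    "d2": "data retrieval in databases",
    "d3": "web search engines",
}


def build_inverted_index(collection):
    """Map each term to the set of identifiers of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def candidates(index, query):
    """Return the documents that contain at least one query term."""
    result = set()
    for term in query.lower().split():
        result |= index.get(term, set())
    return sorted(result)


index = build_inverted_index(COLLECTION)
print(candidates(index, "retrieval"))
print(candidates(index, "web data"))
```

Only documents present in the index can ever appear in a result list, which is why the collection defines the set of candidates.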

Users are at the center of the stage in IR: they provide the “information need”, receive the result lists and provide the relevance judgements. 

The evolution of tasks in IR is visible in the TREC organisation: an initiative where IR researchers get to compare their achievements, measuring how well they perform in a specific task (data and methods are agreed in the corresponding “track”). 
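
Measuring how well a system performs usually combines a ranked result list with a set of relevance judgements. The sketch below computes precision and recall at a cutoff k, two standard measures. The function name, the document identifiers and the judgements are hypothetical, and real evaluation campaigns use richer measures and their own tooling.

```python
def precision_recall(ranked_results, relevant, k):
    """Compute precision and recall over the top-k entries of a ranked list.

    ranked_results: document ids in the order returned by the system.
    relevant: set of ids judged relevant for the information need.
    """
    top_k = ranked_results[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run and judgements, invented for this example.
run = ["d3", "d1", "d7", "d2"]
judged_relevant = {"d1", "d2", "d5"}

for k in (2, 4):
    p, r = precision_recall(run, judged_relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

Precision asks how much of what was returned is relevant, while recall asks how much of what is relevant was returned; "d5" is relevant but never retrieved, so recall stays below 1 at any cutoff.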

Go to the TREC site and browse some recent tracks and some older ones. Then answer the following questions. 

  1. Select one of the 2018 tracks and identify: 
    1. the task; 
    2. the collection or collections; 
    3. what is a document? 
    4. the intended users. 
  2. Repeat for one of the older tracks. 

0.4 Search Engine Evolution §

Search engines started as devices for exploring scientific literature and then evolved to cover the web. The web as a document collection brought new requirements and many challenges. Major breakthroughs were introduced by Google in the 1990s. The paper by Brin and Page, “The anatomy of a large-scale hypertextual Web search engine”, highlights several innovations introduced by Google that created a new generation of search engines. Check the paper for two of these: 

  1. PageRank, the citation-based page weight function; 
  2. Use of anchor text associated with links and their targets. 
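
The first item can be explored with a small experiment before reading the paper. The sketch below is a simplified power-iteration version of PageRank, not the implementation described by Brin and Page: it uses the commonly seen normalised form in which ranks sum to 1, a damping factor of 0.85, and an even redistribution of the rank held by pages without out-links (an assumption of this example). The four-page link graph is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption: rank held by pages with no out-links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Each page passes an equal share of its rank along every out-link.
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Hypothetical link graph: "c" receives links from three pages, "d" from none.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(graph).items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Pages that are linked to by many pages, or by highly ranked ones, accumulate more weight, which is the citation-like intuition behind the function.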

0.5 Search Engine Experimentation §

Search engines are currently available as software components that can be assembled to provide search in any software environment. One well-known engine is Solr, which is based on the Lucene IR library and supported by the Apache Software Foundation. 

This task is to install a Solr instance and start experimenting with indexing and searching in small document collections. You can use the online guides: 

You can also use Docker to quickly deploy a Solr instance: 
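
However the instance is deployed, Solr can be queried over HTTP, so first experiments with searching can be scripted. The sketch below only builds a request URL for the standard select handler and parses a JSON response; it does not contact a server. The core name "demo", the field name "title" and the sample response are hypothetical, and the base URL assumes a local instance on Solr's default port 8983.

```python
import json
from urllib.parse import urlencode


def build_select_url(core, query, rows=10, base="http://localhost:8983/solr"):
    """Build a URL for the select handler of the given core."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base}/{core}/select?{params}"


def extract_hits(response_text):
    """Return (number of matches, list of documents) from a JSON response."""
    body = json.loads(response_text)["response"]
    return body["numFound"], body["docs"]


# Hypothetical core and field names.
print(build_select_url("demo", "title:retrieval", rows=5))

# Hypothetical response body, trimmed to the fields used above.
sample = '{"response": {"numFound": 1, "start": 0, "docs": [{"id": "d1"}]}}'
print(extract_hits(sample))
```

With a running instance, the URL could be fetched with `urllib.request.urlopen` and the response body passed to `extract_hits`.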

Summary §

  • Review of each group's project: data collection and characterization. 
  • The retrieval process. The components of an information retrieval system. 
  • Retrieval tasks. Collections, documents and queries. 
  • Information Retrieval: history from 1940. The links to information science, natural language processing, information extraction, statistics, human-computer interaction. 



teach/dapi/202021/lectures/03.txt · Last modified: 2020/10/01 16:22 by ssn
