
L: 09/10/2020 §

Master in Informatics and Computing Engineering
Information Description, Storage and Retrieval
Instance: 2020/2021


 

Lecture #3 :: 09/10/2020 §

Goals §

By the end of this class, the student should be able to: 

  • identify situations where information retrieval takes place; 
  • name some retrieval tools, their goals and features; 
  • enumerate typical information retrieval tasks; 
  • distinguish between information retrieval and data retrieval; 
  • define document, collection, information need, query, search result, relevance; 
  • describe the origin and the milestones in the evolution of information retrieval. 

Topics §

  1. Retrieval tasks and systems 
    • Components of an information retrieval system 
  2. Information Retrieval concepts 
    • Collections, documents, queries, results 
    • Users, information needs 
  3. Evaluation of retrieval systems 
    • Tasks 
    • Test collections 
    • Measures 
  4. The history of Information Retrieval (IR) 
    • relation to text and data mining, machine learning, natural language processing 

Bibliography §

  • Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, http://www.mir2ed.org/, Addison-Wesley Professional, 2nd edition, 2011 
  • Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, http://nlp.stanford.edu/IR-book/, Cambridge University Press, 2008 
  • W. Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice, http://ciir.cs.umass.edu/downloads/SEIRiP.pdf, Pearson, 2009 

Materials §

Tasks §

0.1 What is IR? §

What is IR? Is IR what we do every time we query a database, search the web, locate resources in a library catalog, explore a large digital library or browse the site of an online shop? 

Modern Information Retrieval, by Baeza-Yates and Ribeiro-Neto, is one of our references http://grupoweb.upf.edu/mir2ed/ 

Read Chapter 1: http://grupoweb.upf.edu/mir2ed/pdf/chapter1.pdf, in particular Section 1.2.2 (Information versus Data Retrieval). The questions below are based on it. 

  1. Give an example of an IR session: user, goal, system used. 
  2. Give an example of a data retrieval session: user, goal, system used. 
  3. Result lists are typical of both IR and data retrieval systems. Can these lists contain “wrong” entries? Why or why not? 

0.2 Milestones in IR §

IR has both a “librarian” and a “computational” ancestry. Check an IR timeline for some salient names (for example http://en.wikipedia.org/wiki/Information_retrieval#Timeline), then search for details on the contributions of each of the following. 

  1. Vannevar Bush (1945) 
  2. Cyril Cleverdon (1960) 
  3. Karen Spärck Jones (1970) 
  4. C. J. van Rijsbergen (1980) 
  5. Brin & Page (1990) 

0.3 Documents, collections, tasks and users §

One of the principles in IR is that information is embodied in “documents”. A document can be any chunk of text, image or speech that we can treat individually as an identifiable resource. 

Information retrieval systems work on “collections”: sets of documents that are indexed and become candidates for the list of results of a query. 
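To make the idea of an indexed collection concrete, here is a minimal sketch of an inverted index over a toy collection. The three documents, their identifiers and the scoring rule (count of distinct query terms matched) are assumptions made for illustration; real engines use far more elaborate tokenisation and ranking.

```python
from collections import defaultdict

def build_index(collection):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Rank documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical toy collection: document id -> text.
collection = {
    "d1": "information retrieval with search engines",
    "d2": "data retrieval from a relational database",
    "d3": "the history of libraries",
}
index = build_index(collection)
results = search(index, "information retrieval")
print(results)
```

Only documents that share at least one term with the query become candidates, and the order of the list reflects a score rather than an exact match.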

Users are at the center of the stage in IR: they provide the “information need”, receive the result lists and provide the relevance judgements. 

The evolution of tasks in IR is visible in the organisation of TREC (Text REtrieval Conference), an initiative where IR researchers compare their results by measuring how well their systems perform on a specific task; the data and methods are agreed in the corresponding “track”. 
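Measuring performance against relevance judgements can be sketched with two standard measures, precision and recall. The document identifiers and judgements below are invented for illustration; each TREC track defines its own official measures, so this is only a starting point.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant by the users or assessors
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Hypothetical result list and relevance judgements for a single query.
retrieved = ["d1", "d2", "d5", "d7"]
relevant = {"d1", "d5", "d9"}
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found.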

Go to the TREC site http://trec.nist.gov/ and browse some recent tracks http://trec.nist.gov/tracks.html and some older ones http://trec.nist.gov/proceedings/proceedings.html. Answer the following questions. 

  1. Select one of the 2018 tracks and identify: 
    1. the task; 
    2. the collection or collections; 
    3. what is a document? 
    4. the intended users. 
  2. Repeat for one of the older tracks. 

0.4 Search Engine Evolution §

Search engines started as tools for exploring scientific literature and later moved to the web. The web as a document collection brought new requirements and many challenges. Google introduced major breakthroughs in the late 1990s. The paper by Brin and Page, “The anatomy of a large-scale hypertextual Web search engine” http://infolab.stanford.edu/~backrub/google.html highlights several innovations that created a new generation of search engines. Check the paper for two of these: 

  1. PageRank, the citation-based page weight function; 
  2. Use of anchor text associated with links and their targets. 
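As a companion to reading the paper, the following is a minimal sketch of a PageRank-style computation by repeated iteration. It uses a commonly taught normalised variant (ranks sum to 1, and pages without outgoing links spread their rank evenly), which is not necessarily the exact formulation in the paper. The four-page link graph, the damping value of 0.85 and the iteration count are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute a rank for every page in a small link graph.

    links: dict mapping a page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share, independent of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank equally to the pages it cites.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: C is cited by A, B and D; nobody cites D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

The page with the most incoming links from well-ranked pages ends up on top, which mirrors the citation-based intuition behind the weight function.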

0.5 Search Engine Experimentation §

Search engines are now available as software components that can be integrated to provide search in any software environment. One well-known engine is Solr, which is based on the Lucene IR library and supported by the Apache Software Foundation. 

The task is to install a Solr instance and start experimenting with indexing and searching small document collections. You can use the online guides: https://lucene.apache.org/solr/guide/ 

You can also use Docker to quickly deploy a Solr instance: https://hub.docker.com/_/solr/ 
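Once an instance is running, indexing and searching can be tried over Solr's HTTP interface. The sketch below only builds the requests and sends nothing until `run_query` is called. It assumes a local instance on port 8983, a core named `notes`, and documents with a `title` field; all three are hypothetical and must be adapted to your own setup and schema.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SOLR_URL = "http://localhost:8983/solr"  # assumption: local instance on the default port
CORE = "notes"                           # hypothetical core name

def build_update_request(docs):
    """Build a POST request that sends a JSON list of documents to be indexed."""
    url = f"{SOLR_URL}/{CORE}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")

def build_select_url(query, rows=10):
    """Build the URL of a search request asking for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_URL}/{CORE}/select?{params}"

def run_query(query):
    """Send the query to Solr and return the matching documents (needs a live server)."""
    with urlopen(build_select_url(query)) as response:
        return json.load(response)["response"]["docs"]

# Hypothetical documents; the field names depend on the schema of your core.
docs = [{"id": "d1", "title": "Information retrieval"}]
request = build_update_request(docs)
select_url = build_select_url("title:retrieval", rows=5)
print(request.full_url)
print(select_url)
```

With a live instance, `urlopen(request)` would submit the documents and `run_query("title:retrieval")` would return the matches.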

Summary §

  • Review of each group's project: data collection and characterization. 
  • The retrieval process. The components of an information retrieval system. 
  • Retrieval tasks. Collections, documents and queries. 
  • Information Retrieval: history since the 1940s. The links to information science, natural language processing, information extraction, statistics and human-computer interaction. 

MCR, JCL, SSN 


teach/dapi/202021/lectures/03.txt · Last modified: 2020/10/01 16:22 by ssn
