====== L: 09/10/2020 ======

**Master in Informatics and Computing Engineering\\
Information Description, Storage and Retrieval\\
Instance: 2020/2021**

\\
---
\\

====== Lecture #3 :: 09/10/2020 ======

===== Goals =====

By the end of this class, the student should be able to:

  * identify situations where information retrieval takes place;
  * name some retrieval tools, their goals and features;
  * enumerate typical information retrieval tasks;
  * distinguish between information retrieval and data retrieval;
  * define document, collection, information need, query, search result, relevance;
  * describe the origin and the milestones in the evolution of information retrieval.

===== Topics =====

  - Retrieval tasks and systems
    * Components of an information retrieval system
  - Information Retrieval concepts
    * Collections, documents, queries, results
    * Users, information needs
  - Evaluation of retrieval systems
    * Tasks
    * Test collections
    * Measures
  - The history of Information Retrieval (IR)
    * Relation to text and data mining, machine learning, natural language processing

===== Bibliography =====

  * Ricardo Baeza-Yates, Berthier Ribeiro-Neto, //Modern Information Retrieval: The Concepts and Technology behind Search//, 2nd edition, Addison-Wesley Professional, 2011, [[http://www.mir2ed.org/]]
  * Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, //Introduction to Information Retrieval//, Cambridge University Press, 2008, [[http://nlp.stanford.edu/IR-book/]]
  * W. Bruce Croft, Donald Metzler, Trevor Strohman, //Search Engines: Information Retrieval in Practice//, Pearson, 2009, [[http://ciir.cs.umass.edu/downloads/SEIRiP.pdf]]

===== Materials =====

  * Sergey Brin and Lawrence Page, "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems, 30(1-7):107-117, April 1998, DOI 10.1016/S0169-7552(98)00110-X, [[http://infolab.stanford.edu/~backrub/google.html]], accessed October 2020
  * Vannevar Bush, "As We May Think", The Atlantic Monthly, 176(1):101-108, July 1945, [[http://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/3881/]], accessed October 2020
  * W. Bruce Croft, Donald Metzler, Trevor Strohman, //Search Engines: Information Retrieval in Practice//, Pearson, 2009, [[http://www.search-engines-book.com/slides/|Slides for Chapter 1]]
  * Ricardo Baeza-Yates, Berthier Ribeiro-Neto, //Modern Information Retrieval: The Concepts and Technology behind Search//, [[http://grupoweb.upf.edu/mir2ed/pdf/chapter1.pdf|Chapter 1: Introduction]], accessed October 2020
  * Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, //Introduction to Information Retrieval//, [[http://nlp.stanford.edu/IR-book/pdf/01bool.pdf|Chapter 1: Boolean Retrieval]], accessed October 2020
  * Other resources (online materials, conferences, research groups): [[http://nlp.stanford.edu/IR-book/information-retrieval.html|IIR resources]], [[http://grupoweb.upf.edu/mir2ed/resources.php|MIR resources]], accessed October 2020
  * Text REtrieval Conference, [[http://trec.nist.gov/|TREC]], accessed October 2020
  * ACM Special Interest Group on Information Retrieval, [[http://sigir.org/|SIGIR]], accessed October 2020
  * IR timeline, [[http://en.wikipedia.org/wiki/Information_retrieval#Timeline]], accessed October 2020

===== Tasks =====

==== - What is IR? ====

What is IR? Is IR what we are doing every time we query a database, search the web, locate resources in a library catalog, explore a large digital library or browse the site of an online shop?
//Modern Information Retrieval//, by Baeza-Yates and Ribeiro-Neto, is one of our references: http://grupoweb.upf.edu/mir2ed/ Read Chapter 1 (http://grupoweb.upf.edu/mir2ed/pdf/chapter1.pdf), namely Section 1.2.2, "Information versus Data Retrieval". The next questions are based on it.

  - Give an example of an IR session: user, goal, system used.
  - Give an example of a data retrieval session: user, goal, system used.
  - Result lists are typical in both IR and data retrieval systems; can these lists contain "wrong" entries? Why or why not?

==== - Milestones in IR ====

IR has both a "librarian" and a "computational" ancestry. Check an IR timeline, such as http://en.wikipedia.org/wiki/Information_retrieval#Timeline, for some salient names, do some searching, and get details on the contributions of the following people.

  - Vannevar Bush (1945)
  - Cyril Cleverdon (1960)
  - Karen Spärck Jones (1970)
  - C. J. van Rijsbergen (1980)
  - Brin & Page (1990)

==== - Documents, collections, tasks and users ====

One of the principles in IR is that information is embodied in "documents": a document can be any chunk of text, image or speech that we take individually as an identifiable resource. Information retrieval systems work on "collections": sets of documents that are indexed and become candidates for the list of results for a query. Users are at the center of the stage in IR: they provide the "information need", get the result lists and provide the relevance judgements.

The evolution of tasks in IR is visible in TREC, an initiative where IR researchers compare their achievements by measuring how well they perform on a specific task (data and methods are agreed in the corresponding "track"). Go to the TREC site http://trec.nist.gov/ and browse some recent tracks (http://trec.nist.gov/tracks.html) and some older ones (http://trec.nist.gov/proceedings/proceedings.html). Answer the following questions.

  - Select one of the 2018 tracks and identify:
    - the task;
    - the collection or collections;
    - what a document is;
    - the intended users.
  - Repeat for one of the older tracks.

==== - Search Engine Evolution ====

Search engines started as devices for exploring scientific literature and then evolved to the web. The web as a document collection brought new requirements and many challenges. Major breakthroughs were introduced by Google in the 1990s. The paper by Brin and Page, "The anatomy of a large-scale hypertextual Web search engine" (http://infolab.stanford.edu/~backrub/google.html), highlights several innovations introduced by Google that created a new generation of search engines. Check the paper for two of these:

  - PageRank, the citation-based page weight function (a small illustrative sketch is included after the Search Engine Experimentation task below);
  - the use of anchor text associated with links and their targets.

==== - Search Engine Experimentation ====

Search engines are currently available as software components that can be assembled to provide search in any software environment. One well-known engine is Solr, based on the Lucene IR library and supported by the Apache Software Foundation.

This task is to install a Solr instance and start experimenting with indexing and searching in small document collections. You can use the online guides (https://lucene.apache.org/solr/guide/) and you can use Docker to quickly deploy a Solr instance (https://hub.docker.com/_/solr/).
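As a starting point, here is a minimal sketch (not part of the assignment) of indexing and querying Solr from Python. It assumes a Solr instance already running on ''localhost:8983'', a core named ''lecture3'' (for example created with ''bin/solr create -c lecture3''), and Solr's default schemaless configuration, so that the illustrative ''title'' and ''text'' fields are created on the fly.

<code python>
import requests

SOLR = "http://localhost:8983/solr/lecture3"  # assumed host, port and core name

# Two toy documents; the field names (id, title, text) are illustrative.
docs = [
    {"id": "1", "title": "As We May Think",
     "text": "Vannevar Bush imagines the memex."},
    {"id": "2", "title": "Anatomy of a Search Engine",
     "text": "Brin and Page describe PageRank and anchor text."},
]

# Index the documents and commit so they become searchable immediately.
r = requests.post(f"{SOLR}/update?commit=true", json=docs)
r.raise_for_status()

# Query the 'text' field for the term "pagerank".
r = requests.get(f"{SOLR}/select", params={"q": "text:pagerank"})
r.raise_for_status()
response = r.json()["response"]

print("hits:", response["numFound"])
for doc in response["docs"]:
    # With field guessing, 'title' may come back as a list of values.
    print(doc["id"], doc.get("title"))
</code>

With the official Docker image, ''docker run -d -p 8983:8983 solr solr-precreate lecture3'' is one way to get such a core running before trying the script; adjust the core name and fields to your own document collection.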
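As a complement to the Search Engine Evolution task above, the sketch below illustrates PageRank as a citation-based page weight: a power iteration over a tiny invented link graph, using the normalised variant of the formula and the damping factor of 0.85 suggested by Brin and Page. The graph, the iteration limit and the convergence threshold are illustrative choices only.

<code python>
# Minimal PageRank by power iteration on a toy link graph (illustrative only).
# links[p] lists the pages that p points to; every page here has out-links,
# so dangling pages need no special handling.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

d = 0.85                            # damping factor
pages = list(links)
n = len(pages)
rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

for _ in range(100):
    new_rank = {p: (1.0 - d) / n for p in pages}
    for p, outgoing in links.items():
        share = d * rank[p] / len(outgoing)
        for q in outgoing:          # p passes its weight to the pages it cites
            new_rank[q] += share
    converged = sum(abs(new_rank[p] - rank[p]) for p in pages) < 1e-8
    rank = new_rank
    if converged:
        break

for p in sorted(rank, key=rank.get, reverse=True):
    print(p, round(rank[p], 3))
</code>

Page C, cited by every other page in the toy graph, ends up with the highest weight, which is the intuition behind treating links as citations.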
===== Summary =====

  * Review of each group's project: data collection and characterization.
  * The retrieval process. The components of an information retrieval system.
  * Retrieval tasks. Collections, documents and queries.
  * Information Retrieval: history from 1940. The links to information science, natural language processing, information extraction, statistics, human-computer interaction.

---
//MCR, JCL, SSN//

[[02|« Previous]] | [[index|Index]] | [[04|Next »]]