====== L: 09/10/2020 ======

**Master in Informatics and Computing Engineering\\
Information Description, Storage and Retrieval\\
Instance: 2020/2021**

\\
---
\\

====== Lecture #3 :: 09/10/2020 ======

===== Goals =====

By the end of this class, the student should be able to:

  * identify situations where information retrieval takes place;
  * name some retrieval tools, their goals and features;
  * enumerate typical information retrieval tasks;
  * distinguish between information retrieval and data retrieval;
  * define document, collection, information need, query, search result, relevance;
  * describe the origin and the milestones in the evolution of information retrieval.

===== Topics =====

  - Retrieval tasks and systems
    * Components of an information retrieval system
  - Information Retrieval concepts
    * Collections, documents, queries, results
    * Users, information needs
  - Evaluation of retrieval systems
    * Tasks
    * Test collections
    * Measures
  - The history of Information Retrieval (IR)
    * Relation to text and data mining, machine learning, natural language processing

===== Bibliography =====

  * Ricardo Baeza-Yates, Berthier Ribeiro-Neto, //Modern Information Retrieval: The Concepts and Technology behind Search//, 2nd edition, Addison-Wesley Professional, 2011, [[http://www.mir2ed.org/]]
  * Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, //Introduction to Information Retrieval//, Cambridge University Press, 2008, [[http://nlp.stanford.edu/IR-book/]]
  * W. Bruce Croft, Donald Metzler, Trevor Strohman, //Search Engines: Information Retrieval in Practice//, Pearson, 2009, [[http://ciir.cs.umass.edu/downloads/SEIRiP.pdf]]

===== Materials =====

  * Sergey Brin and Lawrence Page, "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems, 30(1-7):107-117, April 1998, DOI 10.1016/S0169-7552(98)00110-X, [[http://infolab.stanford.edu/~backrub/google.html]], accessed October 2020
  * Vannevar Bush, "As We May Think", The Atlantic Monthly, 176(1):101-108, July 1945, [[http://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/3881/]], accessed October 2020
  * W. Bruce Croft, Donald Metzler, Trevor Strohman, //Search Engines: Information Retrieval in Practice//, Pearson, 2009, [[http://www.search-engines-book.com/slides/|Slides for Chapter 1]]
  * Ricardo Baeza-Yates, Berthier Ribeiro-Neto, //Modern Information Retrieval: The Concepts and Technology behind Search//, [[http://grupoweb.upf.edu/mir2ed/pdf/chapter1.pdf|Chapter 1: Introduction]], accessed October 2020
  * Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, //Introduction to Information Retrieval//, [[http://nlp.stanford.edu/IR-book/pdf/01bool.pdf|Chapter 1: Boolean Retrieval]], accessed October 2020
  * Other resources (online materials, conferences, research groups): [[http://nlp.stanford.edu/IR-book/information-retrieval.html|IIR resources]], [[http://grupoweb.upf.edu/mir2ed/resources.php|MIR resources]], accessed October 2020
  * Text REtrieval Conference, [[http://trec.nist.gov/|TREC]], accessed October 2020
  * ACM Special Interest Group on Information Retrieval, [[http://sigir.org/|SIGIR]], accessed October 2020
  * IR timeline, [[http://en.wikipedia.org/wiki/Information_retrieval#Timeline]], accessed October 2020

===== Tasks =====

==== - What is IR? ====

What is IR? Is IR what we are doing every time we query a database, search the web, locate resources in a library catalog, explore a large digital library or browse the site of an online shop?
//Modern Information Retrieval//, by Baeza-Yates and Ribeiro-Neto, is one of our references: http://grupoweb.upf.edu/mir2ed/ Read Chapter 1 (http://grupoweb.upf.edu/mir2ed/pdf/chapter1.pdf), namely Section 1.2.2, "Information versus Data Retrieval". The next questions are based on it.

  - Give an example of an IR session: user, goal, system used.
  - Give an example of a data retrieval session: user, goal, system used.
  - Result lists are typical in both IR and data retrieval systems; can these lists contain "wrong" entries? Why or why not?

==== - Milestones in IR ====

IR has both a "librarian" and a "computational" ancestry. Check an IR timeline, such as http://en.wikipedia.org/wiki/Information_retrieval#Timeline, for some salient names, do some searching, and get details on the contributions of the following people.

  - Vannevar Bush (1945)
  - Cyril Cleverdon (1960)
  - Karen Spärck Jones (1970)
  - C. J. van Rijsbergen (1980)
  - Brin & Page (1990)

==== - Documents, collections, tasks and users ====

One of the principles in IR is that information is embodied in "documents": a document can be any chunk of text, image or speech that we take individually as an identifiable resource. Information retrieval systems work on "collections": sets of documents that are indexed and become candidates for the list of results for a query. Users are at the center of the stage in IR: they provide the "information need", get the result lists and provide the relevance judgements.

The evolution of tasks in IR is visible in TREC, an initiative where IR researchers compare their achievements by measuring how well they perform on a specific task (data and methods are agreed in the corresponding "track"). Go to the TREC site http://trec.nist.gov/ and browse some recent tracks (http://trec.nist.gov/tracks.html) and some older ones (http://trec.nist.gov/proceedings/proceedings.html). Answer the following questions.

  - Select one of the 2018 tracks and identify:
    - the task;
    - the collection or collections;
    - what a document is;
    - the intended users.
  - Repeat for one of the older tracks.

==== - Search Engine Evolution ====

Search engines started as devices for exploring scientific literature and then evolved to the web. The web as a document collection brought new requirements and many challenges. Major breakthroughs were introduced by Google in the 1990s. The paper by Brin and Page, "The anatomy of a large-scale hypertextual Web search engine" (http://infolab.stanford.edu/~backrub/google.html), highlights several innovations introduced by Google that created a new generation of search engines. Check the paper for two of these:

  - PageRank, the citation-based page weight function (a small illustrative sketch is included after the Search Engine Experimentation task below);
  - the use of anchor text associated with links and their targets.

==== - Search Engine Experimentation ====

Search engines are currently available as software components that can be assembled to provide search in any software environment. One well-known engine is Solr, based on the Lucene IR library and supported by the Apache Software Foundation.

This task is to install a Solr instance and start experimenting with indexing and searching in small document collections. You can use the online guides (https://lucene.apache.org/solr/guide/) and you can use Docker to quickly deploy a Solr instance (https://hub.docker.com/_/solr/).
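As a starting point, here is a minimal sketch (not part of the assignment) of indexing and querying Solr from Python. It assumes a Solr instance already running on ''localhost:8983'', a core named ''lecture3'' (for example created with ''bin/solr create -c lecture3''), and Solr's default schemaless configuration, so that the illustrative ''title'' and ''text'' fields are created on the fly.

<code python>
import requests

SOLR = "http://localhost:8983/solr/lecture3"  # assumed host, port and core name

# Two toy documents; the field names (id, title, text) are illustrative.
docs = [
    {"id": "1", "title": "As We May Think",
     "text": "Vannevar Bush imagines the memex."},
    {"id": "2", "title": "Anatomy of a Search Engine",
     "text": "Brin and Page describe PageRank and anchor text."},
]

# Index the documents and commit so they become searchable immediately.
r = requests.post(f"{SOLR}/update?commit=true", json=docs)
r.raise_for_status()

# Query the 'text' field for the term "pagerank".
r = requests.get(f"{SOLR}/select", params={"q": "text:pagerank"})
r.raise_for_status()
response = r.json()["response"]

print("hits:", response["numFound"])
for doc in response["docs"]:
    # With field guessing, 'title' may come back as a list of values.
    print(doc["id"], doc.get("title"))
</code>

With the official Docker image, ''docker run -d -p 8983:8983 solr solr-precreate lecture3'' is one way to get such a core running before trying the script; adjust the core name and fields to your own document collection.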
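As a complement to the Search Engine Evolution task above, the sketch below illustrates PageRank as a citation-based page weight: a power iteration over a tiny invented link graph, using the normalised variant of the formula and the damping factor of 0.85 suggested by Brin and Page. The graph, the iteration limit and the convergence threshold are illustrative choices only.

<code python>
# Minimal PageRank by power iteration on a toy link graph (illustrative only).
# links[p] lists the pages that p points to; every page here has out-links,
# so dangling pages need no special handling.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

d = 0.85                            # damping factor
pages = list(links)
n = len(pages)
rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

for _ in range(100):
    new_rank = {p: (1.0 - d) / n for p in pages}
    for p, outgoing in links.items():
        share = d * rank[p] / len(outgoing)
        for q in outgoing:          # p passes its weight to the pages it cites
            new_rank[q] += share
    converged = sum(abs(new_rank[p] - rank[p]) for p in pages) < 1e-8
    rank = new_rank
    if converged:
        break

for p in sorted(rank, key=rank.get, reverse=True):
    print(p, round(rank[p], 3))
</code>

Page C, cited by every other page in the toy graph, ends up with the highest weight, which is the intuition behind treating links as citations.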
===== Summary =====

  * Review of each group's project: data collection and characterization.
  * The retrieval process. The components of an information retrieval system.
  * Retrieval tasks. Collections, documents and queries.
  * Information Retrieval: history from 1940. The links to information science, natural language processing, information extraction, statistics, human-computer interaction.

---
//MCR, JCL, SSN//

[[02|« Previous]] | [[index|Index]] | [[04|Next »]]