Mini-test #1 study guide §
Master in Informatics and Computing Engineering
Information Description, Storage and Retrieval
Instance: 2020/2021
—
The mini-test is planned to be answered in Moodle, but this is not yet confirmed. Either way, this guide should help you organize your study. ¶
This is a set of recommendations about the topics, available materials, and references for Mini-test #1.
The mini-test has an estimated duration of 90 minutes, and some reference materials are available on the desktop machine. It is answered in Moodle and includes multiple-choice and short-answer questions. Some questions concern the student projects, specifically Milestone #2. ¶
The subject of the mini-test is Information Retrieval. ¶
Topics §
Some topics for which there will be questions: ¶
- Concepts: information need, search task, collection, query, results list; ¶
- Search engines, indexing and retrieval; ¶
- Building inverted indexes; ¶
- Vector model: tf and idf calculations to compose the weight of a term in a document; ¶
- Vector model: calculate the score of a document for a query; ¶
- Evaluation: calculate recall and precision, draw P versus R curves (with average interpolated precision), calculate MAP; ¶
- Web retrieval and link analysis: PageRank, hubs and authorities. ¶
Some detailed references §
In the following, BY refers to Modern Information Retrieval, by Baeza-Yates and Ribeiro-Neto; Manning refers to Introduction to Information Retrieval, by Manning et al. ¶
IR tasks and systems §
Information retrieval vs data retrieval, modules in an IR system ¶
Questions: ¶
- What is the difference between information retrieval and data retrieval? ¶
- Give examples of IR and data retrieval systems. ¶
- Give some examples of retrieval tasks evaluated in TREC. ¶
- What are the modules of an IR system? ¶
Ref: BY, Chap. 1 (Intro)
Ref: TREC tracks
Ref: The Anatomy..., Brin & Page ¶
IR concepts §
Concepts: document, information need, relevance, bag of words, inverted index, postings list, term pre-processing. ¶
Questions: ¶
- What is… a document, a collection, a term, a bag of words? ¶
- Define stemming. ¶
- What is… an inverted index, a vocabulary, a postings list? ¶
- What is… an information need, a query, a results list? ¶
- What is a relevant result in a results list? ¶
Ref: Manning, Chap. 1 (Boolean Retrieval) ¶
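To make the inverted index and postings list concepts above concrete, here is a minimal sketch in Python. The three documents, the `tokenize` helper (lowercasing and whitespace splitting only, with no stemming or stop-word removal) and the function names are assumptions made for illustration; they are not taken from the references.

```python
from collections import defaultdict


def tokenize(text):
    # Hypothetical pre-processing: lowercase and split on whitespace only.
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term of the vocabulary to a sorted postings list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping IDs present in both (AND)."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


# Made-up toy collection.
docs = {
    1: "search engines index documents",
    2: "an inverted index maps terms to documents",
    3: "search the index",
}
index = build_inverted_index(docs)
both = intersect(index["search"], index["documents"])
```

Here `index["search"]` is the postings list for the term "search", and `both` holds the documents that answer the Boolean query "search AND documents".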
Vector model §
Term weighting, tf, df, cf, idf, vector model, ranking in the vector model ¶
Questions: ¶
- What is the bag of words model for a document? ¶
- What is… term frequency, collection frequency, document frequency, inverse document frequency? ¶
- How do you calculate tf-idf weights? ¶
- How do you rank documents in the vector model? ¶
Exercises: look at Exercises 6.8, 6.9, 6.10, 6.11, 6.15, 6.16, 6.17 and Examples 6.2, 6.3, 6.4 ¶
Ref: Manning, Chap. 2 (The term vocabulary and postings lists) (2.2) and Chap. 6 (Scoring, term weighting and the vector space model) (6.2, 6.3) ¶
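As a study aid for the tf-idf and ranking questions above, the following sketch uses one common weighting variant: raw term frequency multiplied by idf = log10(N/df), with cosine similarity between the query and document vectors. Other variants exist (for example, log-scaled tf), so check which one an exercise asks for. The toy collection and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def idf(term, docs):
    """Inverse document frequency: log10(N / df); 0.0 if the term is unseen."""
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df) if df else 0.0


def tfidf_vector(tokens, docs):
    """Sparse vector {term: tf * idf} for a bag of words."""
    tf = Counter(tokens)
    return {t: tf[t] * idf(t, docs) for t in tf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up toy collection; each document is a list of tokens.
docs = [
    "car insurance auto insurance".split(),
    "best car car".split(),
    "auto repair".split(),
]
query = "car insurance".split()

q_vec = tfidf_vector(query, docs)
scores = [cosine(q_vec, tfidf_vector(d, docs)) for d in docs]
# Document indexes ordered by decreasing score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The first document contains both query terms and ranks first; the third shares no term with the query, so its score is zero.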
Evaluation §
Precision, recall, P-R curves, MAP, reference collections, relevance judgements ¶
Questions: ¶
- What is… precision, recall, interpolated precision? ¶
- What is… precision at k, R-precision? ¶
- Name the components of a test collection. ¶
- Why is a set of relevance judgements considered a “ground truth” for IR? ¶
- Draw a precision-recall curve that captures how precision evolves down the ranked list of results for a query. ¶
- What is an average 11-point precision-recall graph for a set of queries? ¶
- What is MAP, and how do you calculate it for a set of queries in a test collection? ¶
Exercises: look at Exercises 8.1, 8.4, 8.8, 8.9 ¶
Ref: Manning, Chap. 8
Ref: TREC pages: http://trec.nist.gov/ ¶
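The evaluation measures listed above can be sketched in a few lines of Python. This is a minimal sketch that assumes binary relevance judgements given as a set of document IDs, and interpolated precision defined as the highest precision at any recall level greater than or equal to the requested one. The rankings, judgements and function names are made up for illustration.

```python
def pr_points(ranking, relevant):
    """(recall, precision) measured after each position of the ranked list."""
    points = []
    hits = 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points


def interpolated_precision(points, recall_level):
    """Highest precision at any recall >= recall_level (0.0 if none)."""
    return max((p for r, p in points if r >= recall_level), default=0.0)


def eleven_point(points):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    return [interpolated_precision(points, i / 10) for i in range(11)]


def precision_at_k(ranking, relevant, k):
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document;
    relevant documents that are never retrieved contribute zero."""
    hits = 0
    total = 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranking, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Made-up results and judgements for two queries.
ranking1, relevant1 = ["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d6"}
ranking2, relevant2 = ["a", "b"], {"b"}

points = pr_points(ranking1, relevant1)
curve = eleven_point(points)
r_precision = precision_at_k(ranking1, relevant1, len(relevant1))
map_score = mean_average_precision([(ranking1, relevant1), (ranking2, relevant2)])
```

For the first query, average precision is (1/1 + 2/3 + 0) / 3, because "d6" is relevant but never retrieved. R-precision is precision at k with k equal to the number of relevant documents.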
Web search §
Web information needs, the bowtie model, web search vs enterprise search, multimedia content, ranking functions and ranking signals ¶
Questions: ¶
- What are informational, transactional and navigational information needs? ¶
- Name some differences between web search and enterprise search. ¶
- How do you index images? ¶
- Give examples of ranking signals used by search engines. ¶
- What are the SCC, IN and OUT components in the view of the web as a bowtie? ¶
Examples: look at Manning. ¶
Ref: Manning, Chap. 19 ¶
Link analysis §
Web ranking, anchor text, PageRank, hubs and authorities ¶
Questions: ¶
- What are in-links and out-links for a web page? ¶
- How is anchor text used in web search? ¶
- Calculate PageRank values for a set of linked documents. ¶
- Calculate Hub and Authority values for a set of linked documents. ¶
Ref: Manning, Chap. 21 ¶
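For the two calculation questions above, the following sketch computes PageRank by power iteration with a teleport probability `alpha`, and hub and authority scores by repeated updates with length normalization. It assumes every link target also appears as a key of the `links` dictionary, and that a page with no out-links jumps to a uniformly chosen page. The three-page graph, the default parameter values and the function names are assumptions made for illustration.

```python
import math


def pagerank(links, alpha=0.1, iterations=100):
    """links: {page: [out-link targets]}; alpha: teleport probability."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Teleport share: each page receives alpha / n of the total mass.
        new = {v: alpha / n for v in nodes}
        for v in nodes:
            out = links[v]
            if out:
                share = (1 - alpha) * rank[v] / len(out)
                for w in out:
                    new[w] += share
            else:
                # Dangling page: spread the remaining mass uniformly.
                for w in nodes:
                    new[w] += (1 - alpha) * rank[v] / n
        rank = new
    return rank


def hits(links, iterations=50):
    """Return (hub, authority) score dictionaries, each of unit length."""
    nodes = sorted(links)
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of pages linking to v.
        auth = {v: sum(hub[u] for u in nodes if v in links[u]) for v in nodes}
        # Hub: sum of the authority scores of pages v links to.
        hub = {v: sum(auth[w] for w in links[v]) for v in nodes}
        a_norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {v: x / a_norm for v, x in auth.items()}
        hub = {v: x / h_norm for v, x in hub.items()}
    return hub, auth


# Made-up graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(links)
hub, auth = hits(links)
```

In this graph, C has two in-links and ends up with the highest PageRank and authority score, while A, which links to both other pages, is the strongest hub.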
– MCR + JCL + SSN ¶