====== L: 26/09/2020 ======

**Master in Informatics and Computing Engineering\\
Information Description, Storage and Retrieval\\
Instance: 2020/2021**

----

====== Lecture #1 :: 26/09/2020 ======

===== Goals =====

By the end of this class, the student should be able to:

  * Describe the content, evaluation and bibliography of the course;
  * Identify the key problems in the harvesting, organisation, processing and storage of large data collections;
  * Describe the scope of the projects to be done in the course;
  * List projects that are good examples of using and making data available;
  * List some data sources suitable for the practical work;
  * Select the right tools to collect and store the datasets;
  * Characterise the datasets, identifying some of their properties;
  * Select datasets suitable for the project theme.

===== Topics =====

  - Presentation of the course
    * Goals
    * Program
    * Bibliography
    * Assessment
    * Plan
  - Datasets
    * Data sources
      * Using SQL: legacy system
      * Using an API: Wikipedia
      * Using parsing: "HTML screen scraping"
      * Using a URL: obtaining the collection directly
    * Formats
      * XML (HTML)
      * Text
      * Audio
      * RTF/ODF (OpenDocument Format)
      * PDF
      * CSV
  - Case study: using OpenRefine
    * Collection from CSV
    * Data processing in OpenRefine (cleaning, processing)
    * Initial exploration of the data using OpenRefine (transformations, enrichment)
  - Case study: using R and MySQL
    * Collection from CSV/MySQL/Web
    * Data processing in R
    * Initial exploration of the data in R
  - Case study: using Python
    * Collection from text
    * Data processing in Python
    * Initial exploration of the data using Pandas
  - Case study: using Excel and the file system
    * Collection and storage of files (CSV)
    * Data processing in Excel
    * Initial exploration in Excel
  - Other approaches
    * Apache Tika
  - Common issues
    * Character encoding

===== Bibliography =====

  * [[..:plan]], September 2020
  * [[..:project]], September 2020
  * [[http://www.mir2ed.org|Figure 1.2: High level software architecture of an IR system]], in Ricardo Baeza-Yates, Berthier Ribeiro-Neto, //Modern Information Retrieval: The Concepts and Technology behind Search//, Addison-Wesley Professional, 2nd edition, 2011
  * [[http://www.w3.org/2000/Talks/1206-xml2k-tbl/slide10-0.html|The Semantic Web stack]], in Tim Berners-Lee, //Semantic Web on XML//, W3C, 2000
  * Luís Torgo, //Data Mining with R: Learning with Case Studies//, Chapman & Hall/CRC, 2010, ISBN: 9781439810187
  * Leonard Richardson, //Beautiful Soup: Screen Scraping in Python//, [[http://www.crummy.com/software/BeautifulSoup/|online]], last accessed September 2020
  * OpenRefine, //A free, open source, powerful tool for working with messy data//, [[http://openrefine.org/|online]], last accessed September 2020
  * Project Jupyter, //The Jupyter Notebook//, [[http://jupyter.org/about.html|online]], last accessed September 2020
  * Reuven M. Lerner, //Analyzing Data//, Linux Journal, Aug 15, 2016, [[http://www.linuxjournal.com/content/analyzing-data|online]], last accessed September 2020
  * Reuven M. Lerner, //Pandas//, Linux Journal, Aug 17, 2016, [[http://www.linuxjournal.com/content/pandas|online]], last accessed September 2020
  * Wikipedia, //Character encoding//, [[http://en.wikipedia.org/wiki/Character_encoding|online]], last accessed September 2020

===== Materials =====

  * [[https://forms.gle/MC6j94bFvteRGpnh9|DAPI 20/21 :: Survey to Enrolled Students]]
  * [[https://forms.gle/7CZPz1i2nYUScSFf6|DAPI 20/21 :: Group registration]]
  * [[https://docs.google.com/presentation/d/e/2PACX-1vQq3zeFZdwmIy8mrEAKx9mGxJsoCb7xtuEMpOgLHDKOqNkkI5nx62-tdIn0jLQTTHZxsbEghTYSDk4n/pub#slide=id.p|DAPI 20/21 :: Course Presentation]]
  * [[http://ant.fe.up.pt/|ANT - Pesquisa de Informação na U.Porto]]
  * [[http://contamehistorias.pt|Conta-me Histórias Arquivo.pt]] (Prémio Arquivo.pt 2018)
  * [[https://msramalho.github.io/desarquivo/|Desarquivo]] (Prémio Arquivo.pt 2020)
  * [[http://www.transparenciahackday.org/|Transparência Hackday Portugal]]
  * [[..:datasets|Datasets]]
  * Facebook, //[[https://developers.facebook.com/|Facebook for developers]]//
  * Twitter, //[[https://dev.twitter.com/overview/api|Twitter Developer Documentation]]//
  * Google Developers, //[[https://developers.google.com/sheets/api/quickstart/python|Google Sheets API v4]]//
  * The Apache Software Foundation, //[[http://tika.apache.org/|Apache Tika - a content analysis toolkit]]//
  * The R Foundation, //[[https://www.r-project.org/|The R Project for Statistical Computing]]//

===== Tasks =====

  * Assemble practical work groups;
  * Identify and discuss the themes;
  * Identify the datasets to be used in the project, decide how they will be obtained, and estimate the volume of data that needs to be stored;
  * Try the tools available for processing datasets;
  * Characterise the dataset.

===== Summary =====

  * Introduction to the course: goals, content, bibliography, assessment, practical work and plan.
  * Identification of the main problems in the search, organisation, processing and storage of large datasets.
  * Scope of the practical work and project groups.
  * Exploration of data sources for the practical work.
  * Datasets: data sources and formats.
  * Data collection and processing using OpenRefine, R, Python, MySQL and Excel.
  * Obtaining datasets from the domain chosen for the practical work.
  * Exploratory analysis.

----

//MCR, JCL, SSN//

[[index|Index]] | [[02|Next »]]
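The "Case study: using Python" workflow above (collection from text, processing in Python, initial exploration using Pandas) can be sketched roughly as follows. This is a minimal illustration, not course material: the CSV content, column names and values are invented for the example.

```python
import io

import pandas as pd

# Hypothetical raw CSV as it might arrive from a scrape or export:
# inconsistent capitalisation, stray whitespace and a missing value.
raw = """title,year,downloads
 Modern Information Retrieval ,2011,1500
data mining with r,2010,900
Beautiful Soup ,2015,
"""

# Collection: read the text into a DataFrame.
df = pd.read_csv(io.StringIO(raw))

# Processing: trim whitespace, normalise capitalisation,
# and fill the missing download count before casting to int.
df["title"] = df["title"].str.strip().str.title()
df["downloads"] = df["downloads"].fillna(0).astype(int)

# Initial exploration: shape and a simple summary statistic.
print(df.shape)                # (3, 3)
print(df["downloads"].mean())  # 800.0
```

In the actual project, `pd.read_csv` would point at a collected file or URL, and exploration would continue with calls such as `df.describe()` and `df["year"].value_counts()`.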