====== L: 26/09/2020 ======

**Master in Informatics and Computing Engineering\\
Information Description, Storage and Retrieval\\
Instance: 2020/2021**

----

====== Lecture #1 :: 26/09/2020 ======

===== Goals =====

By the end of this class, the student should be able to:

  * Describe the content, evaluation and bibliography of the course;
  * Identify the key problems in the harvesting, organisation, processing and storage of large data collections;
  * Describe the scope of the projects to be done in the course;
  * List projects that are good examples of using and making data available;
  * List some data sources suitable for the practical work;
  * Select the right tools to collect and store the datasets;
  * Characterise the datasets, identifying some of their properties;
  * Select datasets suitable for the project theme.

===== Topics =====

  - Presentation of the course
    * Goals
    * Program
    * Bibliography
    * Assessment
    * Plan
  - Datasets
    * Data sources
      * Using SQL: legacy system
      * Using an API: Wikipedia
      * Using parsing: "HTML screen scraping"
      * Using a URL: obtaining the collection directly
    * Formats
      * XML (HTML)
      * Text
      * Audio
      * RTF/ODF (OpenDocument Format)
      * PDF
      * CSV
  - Case study: using OpenRefine
    * Collection from CSV
    * Data processing in OpenRefine (cleaning, processing)
    * Initial exploration of the data using OpenRefine (transformations, enrichment)
  - Case study: using R and MySQL
    * Collection from CSV/MySQL/Web
    * Data processing in R
    * Initial exploration of the data in R
  - Case study: using Python
    * Collection from text
    * Data processing in Python
    * Initial exploration of the data using Pandas
  - Case study: using Excel and the file system
    * Collection and storage of files (CSV)
    * Data processing in Excel
    * Initial exploration in Excel
  - Other approaches
    * Apache Tika
  - Common issues
    * Character encoding

===== Bibliography =====

  * [[..:plan]], September 2020
  * [[..:project]], September 2020
  * [[http://www.mir2ed.org|Figure 1.2: High level software architecture of an IR system]], in Ricardo Baeza-Yates, Berthier Ribeiro-Neto, //Modern Information Retrieval: The Concepts and Technology behind Search//, Addison-Wesley Professional, 2nd edition, 2011
  * [[http://www.w3.org/2000/Talks/1206-xml2k-tbl/slide10-0.html|The Semantic Web stack]], in Tim Berners-Lee, //Semantic Web on XML//, W3C, 2000
  * Luís Torgo, //Data Mining with R: Learning with Case Studies//, Chapman & Hall/CRC, 2010, ISBN: 9781439810187
  * Leonard Richardson, //Beautiful Soup: Screen Scraping in Python//, [[http://www.crummy.com/software/BeautifulSoup/|online]], last accessed September 2020
  * OpenRefine, //A free, open source, powerful tool for working with messy data//, [[http://openrefine.org/|online]], last accessed September 2020
  * Project Jupyter, //The Jupyter Notebook//, [[http://jupyter.org/about.html|online]], last accessed September 2020
  * Reuven M. Lerner, //Analyzing Data//, Linux Journal, Aug 15, 2016, [[http://www.linuxjournal.com/content/analyzing-data|online]], last accessed September 2020
  * Reuven M. Lerner, //Pandas//, Linux Journal, Aug 17, 2016, [[http://www.linuxjournal.com/content/pandas|online]], last accessed September 2020
  * Wikipedia, //Character encoding//, [[http://en.wikipedia.org/wiki/Character_encoding|online]], last accessed September 2020

===== Materials =====

  * [[https://forms.gle/MC6j94bFvteRGpnh9|DAPI 20/21 :: Survey to Enrolled Students]]
  * [[https://forms.gle/7CZPz1i2nYUScSFf6|DAPI 20/21 :: Group registration]]
  * [[https://docs.google.com/presentation/d/e/2PACX-1vQq3zeFZdwmIy8mrEAKx9mGxJsoCb7xtuEMpOgLHDKOqNkkI5nx62-tdIn0jLQTTHZxsbEghTYSDk4n/pub#slide=id.p|DAPI 20/21 :: Course Presentation]]
  * [[http://ant.fe.up.pt/|ANT - Pesquisa de Informação na U.Porto]]
  * [[http://contamehistorias.pt|Conta-me Histórias Arquivo.pt]] (Prémio Arquivo.pt 2018)
  * [[https://msramalho.github.io/desarquivo/|Desarquivo]] (Prémio Arquivo.pt 2020)
  * [[http://www.transparenciahackday.org/|Transparência Hackday Portugal]]
  * [[..:datasets|Datasets]]
  * Facebook, //[[https://developers.facebook.com/|Facebook for developers]]//
  * Twitter, //[[https://dev.twitter.com/overview/api|Twitter Developer Documentation]]//
  * Google Developers, //[[https://developers.google.com/sheets/api/quickstart/python|Google Sheets API v4]]//
  * The Apache Software Foundation, //[[http://tika.apache.org/|Apache Tika - a content analysis toolkit]]//
  * The R Foundation, //[[https://www.r-project.org/|The R Project for Statistical Computing]]//

===== Tasks =====

  * Assemble practical work groups;
  * Identify and discuss the themes;
  * Identify the datasets to be used in the project, decide how they will be obtained, and estimate the volume of data that needs to be stored;
  * Try the tools available for processing datasets;
  * Characterise the dataset.

===== Summary =====

  * Introduction to the course: goals, content, bibliography, assessment, practical work and plan.
  * Identification of the main problems in the search, organisation, processing and storage of large datasets.
  * Scope of the practical work and project groups.
  * Exploration of data sources for the practical work.
  * Datasets: data sources and formats.
  * Data collection and processing using OpenRefine, R, Python, MySQL and Excel.
  * Obtaining datasets from the domain chosen for the practical work.
  * Exploratory analysis.

----

//MCR, JCL, SSN//

[[index|Index]] | [[02|Next »]]
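The "Case study: using Python" workflow above (collection from text, processing in Python, initial exploration using Pandas) can be sketched roughly as follows. This is a minimal illustration, not course material: the CSV content, column names and values are invented for the example.

```python
import io

import pandas as pd

# Hypothetical raw CSV as it might arrive from a scrape or export:
# inconsistent capitalisation, stray whitespace and a missing value.
raw = """title,year,downloads
 Modern Information Retrieval ,2011,1500
data mining with r,2010,900
Beautiful Soup ,2015,
"""

# Collection: read the text into a DataFrame.
df = pd.read_csv(io.StringIO(raw))

# Processing: trim whitespace, normalise capitalisation,
# and fill the missing download count before casting to int.
df["title"] = df["title"].str.strip().str.title()
df["downloads"] = df["downloads"].fillna(0).astype(int)

# Initial exploration: shape and a simple summary statistic.
print(df.shape)                # (3, 3)
print(df["downloads"].mean())  # 800.0
```

In the actual project, `pd.read_csv` would point at a collected file or URL, and exploration would continue with calls such as `df.describe()` and `df["year"].value_counts()`.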