Master in Informatics and Computing Engineering
Information Description, Storage and Retrieval
Instance: 2020/2021


 

Lecture #1 :: 26/09/2020 §

Goals §

By the end of this class, the student should be able to: 

Topics §

  1. Presentation of the course 
    • Goals 
    • Program 
    • Bibliography 
    • Assessment 
    • Plan 
  2. Datasets 
    • Data Sources 
      • Using SQL: legacy system 
      • Using an API: Wikipedia 
      • Using parsing: “HTML screen scraping” 
      • Using a URL: obtaining the collection directly 
    • Formats 
      • XML (HTML)
      • Text 
      • Audio 
      • RTF/ODF (OpenDocument Format) 
      • PDF 
      • CSV 
  3. Case study: using OpenRefine 
    • Collection from CSV 
    • Data processing in OpenRefine (cleaning, processing) 
    • Initial exploration of data using OpenRefine (transformations, enrichment) 
  4. Case study: using R and MySQL 
    • Collection from CSV/MySQL/Web 
    • Data processing in R 
    • Initial exploration of the data in R 
  5. Case study: using Python 
    • Collection from text 
    • Data processing in Python 
    • Initial exploration of the data using Pandas 
  6. Case study: using Excel and the file system 
    • Collection and storage of files (CSV) 
    • Data processing in Excel 
    • Initial exploration in Excel 
  7. Other approaches 
    • Apache Tika 
  8. Common Issues 
    • Character encoding 
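
As an illustration of the character encoding issue listed above, the following is a minimal sketch in Python, not material from the lecture itself. It assumes a small CSV collection that was saved in Latin-1 rather than UTF-8; the sample data and the helper name decode_with_fallback are hypothetical and exist only for this example.

```python
import csv
import io

# Hypothetical sample: a tiny CSV collection saved in Latin-1 (ISO-8859-1).
raw_bytes = "id,city\n1,São Paulo\n2,Bragança\n".encode("latin-1")


def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# UTF-8 rejects these bytes, so the helper falls back to Latin-1.
text, used = decode_with_fallback(raw_bytes)
rows = list(csv.DictReader(io.StringIO(text)))

# The opposite mistake is silent: UTF-8 bytes read as Latin-1 never raise
# an error, they just produce garbled text (mojibake).
mojibake = "São Paulo".encode("utf-8").decode("latin-1")

print(used)             # encoding that succeeded
print(rows[0]["city"])  # correctly decoded value
print(mojibake)         # garbled value
```

The candidate order matters: Latin-1 can decode any byte sequence, so it only makes sense as the last fallback. When the encoding of a collection is documented, stating it explicitly is safer than guessing.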

Bibliography §

Materials §

Tasks §

Summary §

MCR, JCL, SSN 
