
L: 26/09/2020 §

Master in Informatics and Computing Engineering
Information Description, Storage and Retrieval
Instance: 2020/2021


 

Lecture #1 :: 26/09/2020 §

Goals §

By the end of this class, the student should be able to: 

  • Describe the content, assessment and bibliography of the course. 
  • Identify the key problems in the harvesting, organization, processing and storage of large data collections. 
  • Describe the scope of the projects to be done in the course. 
  • List projects that are good examples of using and making data available. 
  • List some data sources suitable for the practical work. 
  • Select the right tools to collect and store the datasets. 
  • Characterize the datasets, identifying some of their properties. 
  • Select datasets suitable for the project theme. 

Topics §

  1. Presentation of the course 
    • Goals 
    • Program 
    • Bibliography 
    • Assessment 
    • Plan 
  2. Datasets 
    • Data Sources 
      • Using SQL: legacy system 
      • Using an API: Wikipedia 
      • Using parsing: “HTML screen scraping” 
      • Using a URL: obtaining the collection directly 
    • Formats 
      • XML (HTML) 
      • Text 
      • Audio 
      • RTF/ODF (OpenDocument Format) 
      • PDF 
      • CSV 
  3. Case study: using OpenRefine 
    • Collection from CSV 
    • Processing of data in OpenRefine (cleaning, processing) 
    • Initial exploration of data using OpenRefine (transformations, enrichment) 
  4. Case study: using R and MySQL 
    • Collection from CSV/MySQL/Web 
    • Data processing in R 
    • Initial exploration of the data in R 
  5. Case study: using Python 
    • Collection from text 
    • Data processing in Python 
    • Initial exploration of the data using Pandas 
  6. Case study: using Excel and the file system 
    • Collection and storage of files (CSV) 
    • Data processing in Excel 
    • Initial exploration in Excel 
  7. Other approaches 
    • Apache Tika 
  8. Common Issues 
    • Encoding of characters 
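
The character encoding issue listed under Common Issues can be illustrated with a minimal Python sketch. The sample string and the choice of UTF-8 and Latin-1 are assumptions made for illustration; they are not taken from the lecture materials. The sketch shows that decoding bytes with the wrong codec may either fail silently, producing garbled text, or raise an error, depending on the direction of the mismatch.

```python
# Assumption: the sample text and the two encodings are illustrative only.
text = "Informação"

# The same text occupies more bytes in UTF-8 than it has characters.
utf8_bytes = text.encode("utf-8")

# Reading UTF-8 bytes as Latin-1 does not fail: every byte is a valid
# Latin-1 character, so the result is silently garbled ("mojibake").
garbled = utf8_bytes.decode("latin-1")

# Reading the bytes with the codec they were written in round-trips.
restored = utf8_bytes.decode("utf-8")

# The opposite mismatch is louder: Latin-1 bytes for accented letters
# are not valid UTF-8 sequences, so strict decoding raises an error.
latin1_bytes = text.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

print(garbled)
print(restored)
print(strict_decode_ok)
```

When collecting datasets, it is usually safer to record the encoding declared by the source and pass it explicitly when opening files, rather than relying on the platform default.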

Bibliography §

  • DAPI Plan, September 2020 
  • Project Rules, September 2020 
  • Figure 1.2: High level software architecture of an IR system, in Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, Addison-Wesley Professional 2nd edition, 2011 
  • The Semantic Web stack, in Tim Berners-Lee, Semantic Web on XML, W3C, 2000 
  • Luís Torgo, Data Mining with R: Learning with Case Studies, Chapman & Hall/CRC, 2010, ISBN: 9781439810187 
  • Leonard Richardson, Beautiful Soup, Screen Scraping in Python, online, last accessed September 2020 
  • OpenRefine, A free, open source, powerful tool for working with messy data, online, last accessed September 2020 
  • Project Jupyter, The Jupyter Notebook, online, last accessed September 2020 
  • Reuven M. Lerner, Analyzing Data, Linux Journal, Aug 15 2016, online, last accessed September 2020 
  • Reuven M. Lerner, Pandas, Linux Journal, Aug 17 2016, online, last accessed September 2020 
  • Wikipedia, Character encoding, online, last accessed September 2020 

Materials §

Tasks §

  • Assemble practical work groups 
  • Identify and discuss the themes 
  • Identify the datasets to be used in the project, decide how they will be obtained and the volume of data that needs to be stored 
  • Try the tools available for processing datasets 
  • Characterize the dataset 
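
As a starting point for the last task, the following is a minimal sketch of dataset characterization using only the Python standard library. The sample CSV, its column names and the `characterize` helper are hypothetical; they stand in for whatever dataset a group selects. The sketch counts rows and, for each column, the number of missing and distinct values.

```python
import csv
import io

# Hypothetical sample standing in for a real project dataset.
SAMPLE_CSV = """title,year,language
Os Lusíadas,1572,pt
Mensagem,1934,pt
Ulysses,1922,en
Untitled,,en
"""


def characterize(csv_text):
    """Return the row count and, per column, missing and distinct counts."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    profile = {}
    for col in columns:
        values = [row[col] for row in rows]
        # An empty string is treated as a missing value in this sketch.
        present = [v for v in values if v != ""]
        profile[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return len(rows), profile


row_count, profile = characterize(SAMPLE_CSV)
print(row_count)
for column, stats in profile.items():
    print(column, stats)
```

The same properties can be obtained with Pandas, R or OpenRefine; the standard library is used here only to keep the sketch free of dependencies.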

Summary §

  • Introduction to the course: goals, content, bibliography, assessment, practical work and plan. 
  • Identification of the main problems in the search, organization, processing and storage of large datasets. 
  • Scope of the practical work and project groups. 
  • Exploration of data sources for the practical work. 
  • Datasets. Data sources and formats. 
  • Data collection and processing. Using OpenRefine, R, Python, MySQL, Excel.  
  • Obtaining datasets from the domain chosen for practical work. 
  • Exploratory analysis. 

MCR, JCL, SSN 


teach/dapi/202021/lectures/01.txt · Last modified: 2020/10/01 22:45 by ssn
