Programme outline
Learning objectives
- Identify and explain the 5 V’s (Volume, Velocity, Value, Variety, and Veracity) of big data.
- Perform data discovery by accessing data from multiple sources and importing files in various formats into suitable data structures.
- Create and manipulate NumPy arrays, and produce visualisations using the Matplotlib and Seaborn libraries with Python.
- Conduct data profiling by obtaining descriptive statistics, checking for missing values, and visualising data with common plots.
- Conduct cross-column profiling to analyse relationships between numerical and categorical data through various bivariate and multivariate plots.
Day 1
- Characteristics of Big Data – 5 V’s, Types of Digital Data, Types of Databases
- Role of Data Exploration in Big Data
- Data Discovery – Reading common file formats as DataFrames, reading other file formats, working with databases in Python using sqlite3, accessing data from AWS S3, accessing data from websites via an API.
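As a rough illustration of the data discovery topics above, the following minimal sketch reads CSV text into a DataFrame and round-trips it through an in-memory SQLite database. The sample data, the table name `readings`, and the helper names `load_csv` and `round_trip_sqlite` are assumptions made for illustration and are not taken from the course materials.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
CSV_TEXT = """city,temperature
Singapore,31.5
Jakarta,30.0
Bangkok,33.2
"""


def load_csv(text: str) -> pd.DataFrame:
    """Read CSV text into a DataFrame (pd.read_csv also accepts a file path)."""
    return pd.read_csv(io.StringIO(text))


def round_trip_sqlite(df: pd.DataFrame) -> pd.DataFrame:
    """Write the DataFrame to an in-memory SQLite table, then query it back."""
    conn = sqlite3.connect(":memory:")
    try:
        # "readings" is an illustrative table name.
        df.to_sql("readings", conn, index=False)
        query = (
            "SELECT city, temperature FROM readings "
            "WHERE temperature > 31 ORDER BY city"
        )
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()


csv_df = load_csv(CSV_TEXT)
hot_df = round_trip_sqlite(csv_df)
print(hot_df)
```

An in-memory database keeps the sketch self-contained; passing a file path to `sqlite3.connect` would work with a database stored on disk instead.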
Day 2
- Creating NumPy arrays, plotting using Matplotlib and Seaborn libraries.
- Data Profiling – Column Profiling – Obtaining descriptive statistics, checking for missing values, identifying variable types and unique values, common plots for numerical and categorical data
- Data Profiling – Cross-column Profiling – Common plots for numerical vs numerical, numerical vs categorical, bivariate categorical plot, bivariate comparison scatter plots, multivariate categorical plot
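The profiling steps listed for Day 2 might look something like the sketch below, which uses pandas on a small made-up DataFrame. The column names `age` and `segment` and all values are hypothetical, and plotting is left out to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [23, 35, np.nan, 41, 35],
    "segment": ["A", "B", "B", None, "A"],
})

# Column profiling: descriptive statistics for a numerical column.
summary = df["age"].describe()

# Column profiling: missing values, variable types, and unique values.
missing = df.isna().sum()
dtypes = df.dtypes
unique_counts = df.nunique()  # missing values are not counted by default

# Cross-column profiling: numerical vs categorical, as a grouped mean.
by_segment = df.groupby("segment")["age"].mean()

print(summary)
print(missing)
print(dtypes)
print(unique_counts)
print(by_segment)
```

The grouped mean is a numerical counterpart to a numerical vs categorical plot; rows with a missing `segment` are dropped from the grouping by default.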
Mode of assessment
- Quiz