Large-scale data science for real-world data

COM-490

Course summary

Module 1a

General Introduction to Data Science
Lab
- Jupyter environment
- Collaborative data science with git
- Python 3.x
- Numpy
- Pandas
- Scikit-learn
- Matplotlib

Module 1b

Data formats
Processing large data with Python
⚠ One week remains before the groups must be formed (group choice).

Module 2a

General introduction to big data, best practices and guidelines
Lab
- Distributed file systems
- First steps toward building a Data Lake
⚠ All groups must be formed! Contact us if this is not the case.

Module 2b

Data wrangling and querying with Hadoop
Lab
- Data formats
- Large queries

Module 2c

Integrating scalable data storage and MapReduce processing with Hadoop
Lab
- Advanced queries

⚠ We had to make an exception and reorganize the agenda, so Modules 2c and 3a have been swapped. You can find the slides for Module 3a here.

Module 3a

Introduction to the Spark runtime architecture
Lab
- Python on Spark
- Basic RDD manipulations

⚠ We had to make an exception and reorganize the agenda, so Modules 2c and 3a have been swapped. You can find the materials for Module 2c here.

Module 3b

Spark DataFrames

Module 3c

Advanced Spark
Lab
- Spark optimizations
- Spark data partitioning

Module 4a

Introduction to data stream processing
Lab
- Apache Kafka

Easter Break (no Lecture)

Module 4b

Advanced data stream processing concepts
Lab
- Spark Streaming
- Data stream windows in Spark
- End-to-end stream analytics pipelines with Kafka and Spark

Module 4c

Advanced data stream processing concepts
Analytics on data at rest and data in motion
Lab
- Spark Streaming
- Data stream windows in Spark
- End-to-end stream analytics pipelines with Kafka and Spark
- Building scalable applications on real data

Final project
- Tips and hints
- Q&A office hours

Final project
- Tips and hints
- Q&A office hours