Large-scale data science for real-world data
COM-490
- Announcements (Forum)
- Gitlab for dslab (URL)
- Group choice (Group choice)
- Compute environment (EPFL VPN required) (URL)
Module 1a
- General Introduction to Data Science
- Lab
- Jupyter environment
- Collaborative data science with git
- Python 3.x
- Numpy
- Pandas
- Scikit-learn
- Matplotlib
- Slides (File)
- Solutions to exercises of module-1a are available.... (Text and media area)
- Lecture Video Recording (week1) (URL)
Module 1b
- Data formats
- Processing large data with Python
- ⚠ One week remains before the groups must be formed (group choice).
Module 2a
- General introduction to big data, best practices and guidelines
- Lab
- Distributed file systems
- First steps toward building a Data Lake
- ⚠ All groups must be formed! Contact us if this is not the case.
Module 2b
- Data wrangling and querying with Hadoop
- Lab
- Data formats
- Large queries
Module 2c
- Integrating scalable data storage and MapReduce processing with Hadoop
- Lab
- Advanced queries
⚠ We had to make an exception and reorganize the agenda, so Modules 2c and 3a have been swapped. You can find the slides for Module 3a here.
Module 3a
- Introduction to the Spark runtime architecture
- Lab
- Python on Spark
- Basic RDD manipulations
⚠ We had to make an exception and reorganize the agenda, so Modules 2c and 3a have been swapped. You can find the materials for Module 2c here.
Module 3b
- Spark DataFrames
Module 3c
- Advanced Spark
- Lab
- Spark optimizations
- Spark data partitioning
Module 4a
- Introduction to data stream processing
- Lab
- Apache Kafka
Easter Break (no Lecture)
Module 4b
- Advanced data stream processing concepts
- Lab
- Spark stream
- Data stream windows in Spark
- End-to-end stream analytics pipelines with Kafka and Spark
Module 4c
- Advanced data stream processing concepts
- Analytics on data at rest and data in motion
- Lab
- Spark stream
- Data stream windows in Spark
- End-to-end stream analytics pipelines with Kafka and Spark
- Building scalable applications on real data
- Final project
- Tips and hints
- Q&A office hours