Large-scale data science for real-world data

COM-490

Course summary


Module 1a
  • General Introduction to Data Science
  • Lab
    • Jupyter environment
    • Collaborative data science with git
    • Python 3.x
    • NumPy
    • Pandas
    • Scikit-learn
    • Matplotlib


Module 1b
  • Data formats
  • Processing large data with Python
  • ⚠ One week remains before the groups must be formed (group choice).


Module 2a
  • General introduction to big data, best practices and guidelines
  • Lab
    • Distributed file systems
    • First steps toward building a Data Lake
  • ⚠ All groups must be formed! Contact us if this is not the case.


Module 2b
  • Data wrangling and querying with Hadoop
  • Lab
    • Data formats
    • Large queries


Module 2c
  • Integrating scalable data storage and MapReduce processing with Hadoop
  • Lab
    • Advanced queries

⚠ We had to make an exception and reorganize the agenda, so Modules 2c and 3a have been swapped. You can find the slides for Module 3a here.

Module 3a
  • Introduction to the Spark runtime architecture
  • Lab
    • Python on Spark
    • Basic RDD manipulations

⚠ We had to make an exception and reorganize the agenda, so Modules 2c and 3a have been swapped. You can find the materials for Module 2c here.


Module 3b
  • Spark DataFrames

Module 3c
  • Advanced Spark
  • Lab
    • Spark optimizations
    • Spark data partitioning


Module 4a
  • Introduction to data stream processing
  • Lab
    • Apache Kafka

Easter Break (no Lecture)


Module 4b
  • Advanced data stream processing concepts
  • Lab
    • Spark Streaming
    • Data stream windows in Spark
    • End-to-end stream analytics pipelines with Kafka and Spark


Module 4c
  • Advanced data stream processing concepts
  • Analytics on data at rest and data in motion
  • Lab
    • Spark Streaming
    • Data stream windows in Spark
    • End-to-end stream analytics pipelines with Kafka and Spark
    • Building scalable applications on real data

  • Final project
    • Tips and hints
    • Q&A office hours

  • Final project
    • Tips and hints
    • Q&A office hours