Modeling lab
CH-315
Project description and explanation of Supervised ML for Gas Adsorption in MOFs
The project description

All the material that you need for the exercises is available on GitHub.
There we describe two different ways in which you can run the exercises (we recommend the first: running them on your own computer).
Getting Started with the project
The following content is an introduction that will help you with the project.
Quickstart sklearn
In this exercise, we will use the most popular Python library for machine learning, scikit-learn (sklearn). We encourage you to visit its excellent documentation in case you have questions about any function (and, of course, try SHIFT-TAB in Jupyter to get the docstrings).
All supervised or unsupervised algorithms in sklearn use the same Estimator API (about which a paper was written, https://arxiv.org/abs/1309.0238, which we encourage you to read), which has two main methods: fit(X, y) and predict(X).
So the general use of any estimator, be it a random forest or just a dummy model, is that you first create an instance of your estimator:
estimator_instance = Estimator()  # e.g., rf = RandomForestRegressor()
Then, you can fit this estimator with the fit method:
estimator_instance.fit(X,y) # e.g., rf.fit(X,y)
After fitting, you can use it for prediction:
y_predict = estimator_instance.predict(X)  # e.g., y_predict = rf.predict(X)
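Putting these three steps together, a minimal self-contained sketch could look like the following (shown here on synthetic toy data with a random-forest regressor; the features and targets from the project would take the place of X and y):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# toy data: 100 samples with 5 features and a noisy linear target
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = X @ rng.rand(5) + 0.1 * rng.rand(100)

# hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0)  # create the estimator
rf.fit(X_train, y_train)                                      # fit it on the training set
y_predict = rf.predict(X_test)                                # predict on unseen data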
Another class of objects are Transformers, which take data and transform it. For this reason they also have the transform(X) and fit_transform(X) methods:
transformer_instance = Transformer()  # e.g., scaler = StandardScaler()
transformer_instance.fit(X_train)  # e.g., scaler.fit(X_train)
X_train = transformer_instance.transform(X_train)  # e.g., X_train = scaler.transform(X_train)
X_test = transformer_instance.transform(X_test)  # e.g., X_test = scaler.transform(X_test)
The advantage of using the sklearn objects (in contrast to doing everything by hand) is that they can be efficiently chained into pipelines, which can all be treated in the same way. This also often makes it easier to avoid data leakage: for example, for the scaling operation you should always use the mean and the standard deviation of the training set. Sklearn will remember these if you call the .fit() method on your transformer and apply the correct mean and standard deviation when you then call .transform() (or .predict() in a pipeline).
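To make this concrete, here is a hedged sketch of such a pipeline (reusing the X_train, X_test, y_train names from the sketch above; a random forest does not strictly need scaling, but the pattern is the same for any estimator). The scaler's mean and standard deviation are computed from the training set when fit is called and are applied automatically before every prediction:

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# chain the scaling transformer and the estimator into one object
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(random_state=0)),
])

pipe.fit(X_train, y_train)        # the scaler is fit on (and applied to) the training set only
y_predict = pipe.predict(X_test)  # the training-set mean/std are reused before predicting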
Kaggle Challenge
We set up a Kaggle challenge for the course in 2020. Kaggle is a platform that you might find useful after the course to practice your data science and machine learning skills.