Modeling lab

CH-315

Project description and explanation of Supervised ML for Gas Adsorption in MOFs



The project description

GitHub README

All the material that you need for the exercises is available on GitHub.

There, we describe two different ways in which you can run the exercises (we recommend the first way: running them on your own computer).

Getting Started with the project

The following content is an introduction that will help you get started with the project.

Quickstart sklearn

In this exercise, we will use the most popular Python library for machine learning---sklearn. We encourage you to visit the excellent documentation if you have questions about any function (and, of course, try SHIFT-TAB in Jupyter to get the docstrings).

All supervised and unsupervised algorithms in sklearn use the same Estimator API (about which a paper was written, https://arxiv.org/abs/1309.0238, which we encourage you to read), which has two main methods: fit(X, y) and predict(X).

So the general use of any estimator, be it a random forest or just a dummy model, is that you create an instance of your estimator 

estimator_instance = Estimator () # e.g., rf = RandomForestRegressor()

Then, you can fit this estimator with the fit method

estimator_instance.fit(X,y) # e.g., rf.fit(X,y) 

After fitting, you can use it for prediction 

y_predict = estimator_instance.predict(X) # e.g. y_predict = rf.predict(X) 
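
Putting these steps together, here is a minimal, self-contained sketch of the fit/predict workflow; the random data and the RandomForestRegressor settings are illustrative assumptions, not part of the project dataset.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative random data: 100 samples with 4 features and a noisy linear target
X = np.random.rand(100, 4)
y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + 0.1 * np.random.randn(100)

rf = RandomForestRegressor(n_estimators=100, random_state=0)  # create the estimator instance
rf.fit(X, y)                                                  # fit it to the data
y_predict = rf.predict(X)                                     # predict (here, on the training data itself)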

Another class of objects is Transformers, which take data and transform it. For this reason, they also have the transform(X) and fit_transform(X) methods: 

transformer_instance = Transformer() # e.g., scaler = StandardScaler()

transformer_instance.fit(X_train) # e.g., scaler.fit(X_train)

X_train = transformer_instance.transform(X_train) # e.g., X_train = scaler.transform(X_train)

X_test = transformer_instance.transform(X_test) # e.g., X_test = scaler.transform(X_test)
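
For completeness, here is a runnable sketch of this pattern with a train/test split; the random data, the use of train_test_split, and the split size are illustrative assumptions. Note that fit_transform(X_train) is just a shorthand for fit(X_train) followed by transform(X_train).

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 4)                               # illustrative feature matrix
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)                  # fit on the training set, then transform it
X_test = scaler.transform(X_test)                        # reuse the training mean/std on the test set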

The advantage of using sklearn objects (rather than doing these steps by hand) is that they can be efficiently chained into pipelines, which are then treated just like any other estimator. This also makes it easier to avoid data leakage: for example, the scaling operation should always use the mean and standard deviation of the training set. sklearn remembers these statistics when you call .fit() on your estimator and applies the correct mean and standard deviation when you then call .transform() (or predict() in a pipeline).
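
As a hedged illustration of such chaining, the snippet below builds a Pipeline of a StandardScaler and a RandomForestRegressor; the data and the step names are placeholder assumptions. Calling .fit() on the pipeline fits the scaler on the training set and scales it before fitting the forest, and .predict() reapplies the same scaling to new data.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)                               # placeholder features
y = np.random.rand(100)                                  # placeholder target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestRegressor(random_state=0)),
])
pipe.fit(X_train, y_train)            # scales X_train with training statistics, then fits the forest
y_predict = pipe.predict(X_test)      # applies the training-set scaling to X_test, then predicts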

 

Kaggle Challenge

We set up a Kaggle challenge for the course in 2020. Kaggle is a platform that you might find useful after the course to practice your data science and machine learning skills.