4.0 Ensemble Model Training

What is Ensemble Model Training?

At a high level, the ensemble learning process in our Anvil workflow is:

  1. the training data is bootstrapped, aka randomly sampled with replacement

  2. a user-selected model, e.g. LightGBM Regressor, ChemProp Regressor, etc. is trained on the bootstrapped data

  3. steps 1 and 2 are repeated for a user-specified number of models, n

  4. the predictions of each trained model are combined, e.g. averaged together
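
The four steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Anvil's implementation: the helper names (`fit_line`, `train_bootstrap_ensemble`, `predict_ensemble`) are hypothetical, the toy data is made up, and a simple least-squares line stands in for the user-selected model.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (stand-in for the user-selected model)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample: every sampled x is identical
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def train_bootstrap_ensemble(xs, ys, n_models, seed=0):
    """Train n_models models, each on its own bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):  # step 3: repeat for n models
        idx = [rng.randrange(n) for _ in range(n)]  # step 1: sample with replacement
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))  # step 2
    return models


def predict_ensemble(models, x):
    """Step 4: combine the individual predictions by averaging them."""
    return statistics.fmean(slope * x + intercept for slope, intercept in models)


# Hypothetical toy data: y = 2x + 1 plus Gaussian noise
data_rng = random.Random(42)
xs = [float(x) for x in range(20)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0.0, 0.5) for x in xs]

models = train_bootstrap_ensemble(xs, ys, n_models=5)
mean_pred = predict_ensemble(models, 10.0)
print(f"ensemble prediction at x=10: {mean_pred:.2f}")
```

Because each model sees a slightly different resample, the individual fits differ a little, and averaging them smooths out that variation.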

Why Train an Ensemble of Models?

For complex problems, like predicting ADMET profiles of compounds where data is sparse, training an ensemble of models typically improves the robustness and accuracy of the predictions compared to relying on a single model.

Further, because you are averaging the predictions across n models, you can also calculate their standard deviation, which serves as an estimate of the uncertainty of each prediction. This is essential for evaluating how reliable the ADME model is for real-world applications and for performing active learning with your model. It helps answer questions such as:

If I’m relying on my model’s predictions to pick active compounds, how off can my predictions be?

What degree of error does my model have?

What compounds would be most informative to assay next?
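
As a rough sketch of how the spread across ensemble members can answer these questions, the snippet below computes the mean and standard deviation of the per-model predictions for each compound and ranks the compounds by uncertainty. The compound names and prediction values are hypothetical, and ranking by standard deviation is just one simple active learning heuristic, not necessarily the one used in Anvil.

```python
import statistics

# Hypothetical per-model predictions (5 ensemble members) for three compounds
per_model_predictions = {
    "compound_A": [5.1, 5.0, 5.2, 5.1, 4.9],
    "compound_B": [6.0, 7.2, 5.1, 6.8, 5.5],
    "compound_C": [4.4, 4.6, 4.5, 4.3, 4.7],
}

# Ensemble prediction = mean of the members; uncertainty = their sample standard deviation
summary = {
    name: (statistics.fmean(preds), statistics.stdev(preds))
    for name, preds in per_model_predictions.items()
}

# Compounds the members disagree on most are candidates to assay next
ranked = sorted(summary, key=lambda name: summary[name][1], reverse=True)

for name in ranked:
    mean, std = summary[name]
    print(f"{name}: {mean:.2f} +/- {std:.2f}")
```

Here the members disagree most on `compound_B`, so it would be ranked first for follow-up under this heuristic.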

Requirements

As in 02_Training_Models.ipynb, you will need:

  1. A dataset that has been processed with 01_Curate_ChEMBL_Data.ipynb.

  2. A YAML file with instructions for Anvil and specifically for ensemble model training. We will show you how to create this file in this notebook.

Overview

This notebook will walk you through how to train an ensemble of models with the Anvil workflow with the same CYP3A4 data used in 02_Training_Models.ipynb.

NOTE: For ease of demonstration on CPU, we won’t actually train the model here. We recommend training large models on GPU.

Create the YAML file

As in 02_Training_Models.ipynb, we will use a YAML file containing all the information needed to train the ensemble. The only difference from the usual Anvil recipe is the ensemble section, which is marked with the comment ### ENSEMBLE SECTION ###.

The main features of the ensemble section are:

  • type: the type of ensemble model. Here we use CommitteeRegressor, which takes the mean of the predictions of the individual regressors to produce the final output.

  • n_models: the number of models to train in the ensemble.

  • calibration_method: the method used to calibrate the uncertainty estimates. Two methods are currently available in Anvil: scaling-factor and isotonic regression. To read more about each type of calibration, check out this blog post. You can also disable calibration by setting this to null.

In the example below, we train a 5-model ensemble of LightGBM regressors on the CYP3A4 ChEMBL data.

# This section specifies the input data
data:
  # Specify the dataset file
  resource: ../01_Data_Curation/processed_data/processed_CYP3A4_inhibition.csv
  type: intake
  input_col: OPENADMET_SMILES
  # Specify each (1+) of the target columns, or the column that you're trying to predict
  target_cols:
  - OPENADMET_LOGAC50
  dropna: true

# Additional metadata
metadata:
  authors: Your Name
  email: youremail@email.com
  biotargets:
  - CYP3A4
  build_number: 0
  description: basic regression using a LightGBM model
  driver: sklearn
  name: lgbm_pchembl
  tag: openadmet-chembl
  tags:
  - openadmet
  - test
  - pchembl
  version: v1

# Section specifying training procedure
procedure:
# Featurization specification
  feat:
    # Using concatenated features, which combines multiple featurizers
    # here we use DescriptorFeaturizer and FingerprintFeaturizer for 2D RDKit descriptors and ECFP4 fingerprints
    # See openadmet.models.features
    type: FeatureConcatenator
    # Add parameters for the featurizer. A full description of the featurizer options is in Section 5.
    params:
      featurizers:
        DescriptorFeaturizer:
          descr_type: "desc2d"
        FingerprintFeaturizer:
          fp_type: "ecfp:4"

  # Model specification
  model:
    # Indicate model type
    # See openadmet.models.architecture for all model types
    type: LGBMRegressorModel
    # Specify model parameters
    params:
      alpha: 0.005
      learning_rate: 0.05
      n_estimators: 500

  ### !!!!!!!!!!!!!!!!!!!!!!!! ENSEMBLE SECTION !!!!!!!!!!!!!!!!!!!!!!!! ###
  # Ensemble specification
  ensemble:
    type: CommitteeRegressor
    n_models: 5
    calibration_method: scaling-factor
  ### !!!!!!!!!!!!!!!!!!!!!!!! ENSEMBLE SECTION !!!!!!!!!!!!!!!!!!!!!!!! ###

  # Specify data splits
  split:
    # Specify how data will be split
    # See openadmet.models.split
    type: ShuffleSplitter
    # Specify split parameters
    params:
      random_state: 42
      train_size: 0.7
      val_size: 0.1 # Validation set is needed for uncertainty calibration
      test_size: 0.2 # If you want to compare tree-based models with DL models later, the test sizes should match

  # Specify training configuration
  train:
    # Specify the trainer, here SKLearnBasicTrainer as model has an sklearn interface
    # could also use SKLearnGridSearchTrainer for hyperparameter tuning
    type: SKLearnBasicTrainer


# Section specifying report generation
report:
  # Configure evaluation
  eval:
  # Generate regression metrics
  - type: RegressionMetrics
    params: {}
  # Generate regression plots & do cross validation
  - type: SKLearnRepeatedKFoldCrossValidation
    params:
      axes_labels:
      - True pAC50
      - Predicted pAC50
      max_val: 10
      min_val: 3
      pXC50: true
      n_splits: 5
      n_repeats: 5
      title: True vs Predicted pAC50 on test set
  # Generate uncertainty metrics
  - type: UncertaintyMetrics
    params:
      bins: 100
      resolution: 99
      scaled: True
  # Generate uncertainty calibration plot
  - type: UncertaintyPlots
    params: {}

The command for running anvil is exactly the same as it was before!

openadmet anvil --recipe-path anvil_ensemble.yaml --output-dir ensemble

We have already run this for you and have provided the results of the ensemble model training in ensemble/.

If training deep learning models, we highly recommend training on GPU.

In the ensemble training output, you’ll notice that there is now a new plot: OPENADMET_LOGAC50_uncertainty-calibration-plot.png.

[Figure: uncertainty calibration plot]

This plot helps visualize how miscalibrated or “off” the ensemble’s predicted uncertainty intervals (derived from the calculated standard deviation) are.

If our model's uncertainty intervals were perfectly calibrated - in other words, if the fraction of compounds whose error between \(y_{predicted}\) and \(y_{actual}\) fell within each predicted uncertainty interval exactly matched the expected fraction - then the blue line would fall precisely along the dotted black line.
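
To make the idea of a calibration curve concrete, here is a minimal sketch of how the observed proportions could be computed. It assumes each prediction is treated as a Gaussian with the ensemble mean and standard deviation; the function name `observed_proportions` and the toy values are hypothetical, and this is not the code the Uncertainty Toolbox uses internally.

```python
from statistics import NormalDist


def observed_proportions(y_true, y_pred, y_std, expected):
    """For each expected proportion p, return the fraction of true values that
    fall inside the central p-interval of a Gaussian N(y_pred, y_std**2)."""
    observed = []
    for p in expected:
        # Half-width of the central p-interval, in units of the standard deviation
        z = NormalDist().inv_cdf(0.5 + p / 2)
        hits = sum(abs(t - m) <= z * s for t, m, s in zip(y_true, y_pred, y_std))
        observed.append(hits / len(y_true))
    return observed


# Hypothetical true values, ensemble means, and ensemble standard deviations
y_true = [5.0, 6.2, 4.1, 7.3, 5.5]
y_pred = [5.1, 6.0, 4.6, 7.2, 5.0]
y_std = [0.2, 0.3, 0.25, 0.2, 0.4]

expected = [0.5, 0.9]
observed = observed_proportions(y_true, y_pred, y_std, expected)
for p, o in zip(expected, observed):
    print(f"expected {p:.0%} of compounds inside the interval, observed {o:.0%}")
```

Plotting the observed proportions against the expected ones gives the blue line; for a perfectly calibrated model the two would be equal at every level, which is the dotted diagonal.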

[Figure: uncertainty calibration plot with the underconfidence and overconfidence zones labelled]

If the blue area were above the black line, in the underconfidence zone, this would mean that our model’s uncertainty intervals are too wide; in other words, our model is overestimating its uncertainty.

When the blue area is below the black line, in the overconfidence zone, this means the uncertainty intervals are too narrow; in other words, our model is underestimating its uncertainty.

We can see that our blue line does not fall exactly along the dotted black line, but the shaded blue area, which represents the miscalibration area (where our uncertainty estimates are inaccurate), is reasonably small (0.02).

This is because, in our Anvil ensemble training workflow, we applied a scaling factor to recalibrate our uncertainty estimates and minimize the miscalibration area. This is known as uncertainty calibration, and it empirically adjusts the width of your error bars to match the error distribution observed on a validation set. This is done using the Uncertainty Toolbox package. Learn more by reading the documentation there!
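
The sketch below illustrates the general idea behind scaling-factor recalibration: choose a single multiplier for the predicted standard deviations that minimizes the miscalibration area on a validation set. It is an assumption-laden illustration, not the Anvil or Uncertainty Toolbox implementation: the function names, the grid search over candidate factors, the Gaussian interval assumption, and the synthetic validation data are all hypothetical.

```python
import random
from statistics import NormalDist

# Expected proportions 0.01 ... 0.99 and the matching Gaussian half-widths
EXPECTED = [i / 100 for i in range(1, 100)]
Z_VALUES = [NormalDist().inv_cdf(0.5 + p / 2) for p in EXPECTED]


def miscalibration_area(y_true, y_pred, y_std):
    """Approximate the area between the calibration curve and the diagonal
    as the mean absolute gap between observed and expected proportions."""
    n = len(y_true)
    total = 0.0
    for p, z in zip(EXPECTED, Z_VALUES):
        hits = sum(abs(t - m) <= z * s for t, m, s in zip(y_true, y_pred, y_std))
        total += abs(hits / n - p)
    return total / len(EXPECTED)


def fit_scaling_factor(y_true, y_pred, y_std, candidates):
    """Return the candidate factor that minimizes miscalibration on the validation set."""
    return min(
        candidates,
        key=lambda c: miscalibration_area(y_true, y_pred, [c * s for s in y_std]),
    )


# Synthetic validation set: true errors have std 1.0, but the model claims 0.5
# (an overconfident model, so the fitted factor should be close to 2)
rng = random.Random(0)
val_pred = [rng.uniform(4.0, 8.0) for _ in range(500)]
val_true = [m + rng.gauss(0.0, 1.0) for m in val_pred]
val_std = [0.5] * 500

candidates = [0.5 + 0.1 * i for i in range(36)]  # 0.5, 0.6, ..., 4.0
scale = fit_scaling_factor(val_true, val_pred, val_std, candidates)

area_before = miscalibration_area(val_true, val_pred, val_std)
area_after = miscalibration_area(val_true, val_pred, [scale * s for s in val_std])
print(f"scale={scale:.1f}, area before={area_before:.3f}, after={area_after:.3f}")
```

The fitted factor is then applied to the standard deviations predicted for new compounds, which is why a validation set has to be held out from training.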

Now let’s apply our ensemble model to some unseen compounds!

End of 04_Ensemble_Model_Training ~