Active Learning using Gaussian Process and UCB - RQ1

Abstract:

This study investigates the performance of various active learning (AL) protocols in predicting ligand binding affinities across four protein targets: TYK2, USP7, D2R, and MPRO. We compare different molecular representations (ECFP, MACCS fingerprints, and ChemBERTa embeddings) and explore various AL strategies. Our results indicate that AL protocols can markedly improve the recall of top-scoring compounds relative to random selection, while random selection remains competitive on global regression metrics (R², Spearman). Exploration-heavy strategies often perform well, particularly for diverse datasets.

Introduction:

Active learning has emerged as a powerful tool in computational drug discovery, enabling the identification of potent inhibitors from vast molecular libraries while minimizing the number of compounds that need to be evaluated. However, the effectiveness of AL protocols can vary depending on the nature of the dataset, the molecular representations used, and the specific AL strategies employed. This study aims to provide a comprehensive benchmark of AL protocols across different protein targets and molecular representations, with the goal of identifying robust strategies for ligand binding affinity prediction.

Methods:

Datasets:

We utilized four datasets targeting different proteins:

TYK2 (Tyrosine Kinase 2): 9,997 compounds
USP7 (Ubiquitin-Specific Protease 7): 4,535 compounds
D2R (Dopamine Receptor D2): 2,502 compounds
MPRO (SARS-CoV-2 Main Protease): 665 compounds

Molecular Representations:

Three different molecular representations were evaluated:

ECFP (Extended-Connectivity Fingerprints)
MACCS (Molecular ACCess System) fingerprints
ChemBERTa embeddings

Machine Learning Model:

A Gaussian Process (GP) regression model with a Tanimoto kernel was used for all experiments.
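
To make the model concrete, the following is a minimal NumPy sketch of GP regression with a Tanimoto kernel, plus an upper confidence bound (UCB) score as named in the title. It is an illustration, not the implementation used in this study: the noise level, the constant-mean centering, the `beta` weight, and the function names `tanimoto_kernel`, `gp_posterior`, and `ucb` are all assumptions, and the score assumes that higher target values are better.

```python
import numpy as np

def tanimoto_kernel(A, B):
    """Tanimoto similarity between rows of A (n, d) and rows of B (m, d)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    dot = A @ B.T
    norm_a = (A * A).sum(axis=1)[:, None]
    norm_b = (B * B).sum(axis=1)[None, :]
    denom = norm_a + norm_b - dot
    # Two all-zero fingerprints give 0/0; treat them as identical (assumption).
    safe = np.where(denom > 0, denom, 1.0)
    return np.where(denom > 0, dot / safe, 1.0)

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    """Posterior mean and standard deviation at X_test (noise is an assumed value)."""
    y_train = np.asarray(y_train, dtype=float)
    y_mean = y_train.mean()  # centre targets; constant prior mean is an assumption
    K = tanimoto_kernel(X_train, X_train) + noise * np.eye(len(y_train))
    K_s = tanimoto_kernel(X_train, X_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    mean = K_s.T @ alpha + y_mean
    v = np.linalg.solve(L, K_s)
    # Prior variance is 1 because the Tanimoto similarity of a vector with itself is 1.
    var = np.clip(1.0 - (v * v).sum(axis=0), 0.0, None)
    return mean, np.sqrt(var)

def ucb(mean, std, beta=2.0):
    """Upper confidence bound: larger beta favours uncertain (unexplored) compounds."""
    return np.asarray(mean) + beta * np.asarray(std)
```

With `beta = 0` the score reduces to the predicted mean (pure exploitation), and larger `beta` values shift selection toward compounds whose predictions are uncertain.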

Active Learning Protocols:

We implemented and compared several AL protocols:

Random selection (baseline)
Random-alternate (alternating explore/exploit phases)
Random-sandwich (explore-exploit-explore)
Random-explore-heavy
Random-exploit-heavy
Random-gradual (gradual transition from explore to exploit)
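
The protocol names above suggest different schedules of exploration and exploitation over the AL cycles. The sketch below shows one way such schedules could be expressed as a per-cycle UCB weight. Everything specific here is hypothetical: the phase proportions (thirds, 75%/25%), the `beta_explore` and `beta_exploit` values, and the helper names `beta_schedule` and `select_batch` are illustrative choices, not the settings used in this study. We also assume the "Random-" prefix refers to a randomly chosen initial batch, which is not shown.

```python
def beta_schedule(protocol, n_cycles, beta_explore=4.0, beta_exploit=0.0):
    """Return one UCB weight per AL cycle (hypothetical phase proportions)."""
    hi, lo = beta_explore, beta_exploit
    if protocol == "alternate":
        # explore, exploit, explore, exploit, ...
        return [hi if i % 2 == 0 else lo for i in range(n_cycles)]
    if protocol == "sandwich":
        # explore-exploit-explore, split roughly into thirds
        third = n_cycles // 3
        return [hi] * third + [lo] * (n_cycles - 2 * third) + [hi] * third
    if protocol == "explore-heavy":
        n_hi = (3 * n_cycles) // 4
        return [hi] * n_hi + [lo] * (n_cycles - n_hi)
    if protocol == "exploit-heavy":
        n_hi = n_cycles // 4
        return [hi] * n_hi + [lo] * (n_cycles - n_hi)
    if protocol == "gradual":
        # linear transition from explore to exploit
        if n_cycles == 1:
            return [hi]
        return [hi + (lo - hi) * i / (n_cycles - 1) for i in range(n_cycles)]
    raise ValueError(f"unknown protocol: {protocol}")

def select_batch(mean, std, beta, batch_size):
    """Indices of the batch_size candidates with the highest mean + beta * std."""
    scores = [m + beta * s for m, s in zip(mean, std)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:batch_size]
```

In each cycle, the candidate pool would be scored with the current weight, the selected batch labeled and added to the training set, and the model refit before the next cycle.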

Evaluation Metrics:

R² (coefficient of determination)
Spearman's rank correlation coefficient
RMSE (Root Mean Square Error)
Recall of top 2% and 5% compounds
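
For reference, the metrics listed above can be computed as in the following dependency-free sketch. The function names are illustrative, and `top_k_recall` rests on two assumptions: that higher target values denote better binders, and that the size of the top set is the rounded fraction of the library (at least one compound).

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination; negative when worse than predicting the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def _average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(y_true, y_pred):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = _average_ranks(y_true), _average_ranks(y_pred)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((a - ma) * (b - mb) for a, b in zip(ra, rb))
    va = sum((a - ma) ** 2 for a in ra)
    vb = sum((b - mb) ** 2 for b in rb)
    if va == 0 or vb == 0:
        return float("nan")  # undefined when one input is constant
    return cov / math.sqrt(va * vb)

def top_k_recall(y_true, selected_idx, fraction=0.02):
    """Fraction of the true top compounds that appear among the selected indices."""
    n_top = max(1, round(fraction * len(y_true)))
    ranked = sorted(range(len(y_true)), key=lambda i: y_true[i], reverse=True)
    top = set(ranked[:n_top])
    return len(top & set(selected_idx)) / n_top
```

Because R² compares the model against a constant prediction of the mean, it can fall below zero, as some of the values reported in the Results do.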

Results:

Performance across datasets:

Performance differed markedly across datasets, likely due to the varying sizes and diversities of the compound libraries. Comparing the largest and smallest datasets:

TYK2 (largest dataset): R² = 0.563, Spearman correlation = 0.783
MPRO (smallest dataset): R² = 0.142, Spearman correlation = 0.379

Comparison of AL protocols:

For the ECFP representation:

a) TYK2:

Best performing: Random (R² = 0.563, Spearman = 0.783)
Worst performing: Random-exploit-heavy (R² = 0.441, Spearman = 0.640)

b) USP7:

Best performing: Random (R² = 0.705, Spearman = 0.837)
Worst performing: Random-exploit-heavy (R² = 0.302, Spearman = 0.752)

c) D2R:

Best performing: Random (R² = 0.274, Spearman = 0.492)
Worst performing: Random-exploit-heavy (R² = -0.340, Spearman = 0.346)

d) MPRO:

Best performing: Random-alternate (R² = 0.221, Spearman = 0.461)
Worst performing: Random-exploit-heavy (R² = -0.057, Spearman = 0.262)

Recall of top compounds:

Recall of the top 2% compounds was far higher under the AL protocols than under random selection, with the exploit-heavy strategy giving the highest recall in this example. On the TYK2 dataset with ECFP:

Random-explore-heavy: 37.5% recall
Random-exploit-heavy: 54.0% recall
Random: 4.0% recall

Discussion:

The performance of different molecular representations varied across datasets and protocols, with some interesting patterns emerging:

ChemBERTa Embeddings: Contrary to our initial expectations, ChemBERTa embeddings did not outperform the fingerprint representations in the examples reported here:
- For USP7: ChemBERTa was competitive with ECFP and MACCS but scored slightly lower on both metrics. For example, using the random protocol:
  ChemBERTa: R² = 0.679, Spearman = 0.760
  ECFP: R² = 0.705, Spearman = 0.837
  MACCS: R² = 0.696, Spearman = 0.778
- For MPRO: ChemBERTa was close to MACCS but below ECFP. For instance, with the random-alternate protocol:
  ChemBERTa: R² = 0.147, Spearman = 0.337
  ECFP: R² = 0.221, Spearman = 0.461
  MACCS: R² = 0.156, Spearman = 0.373
- For TYK2 and D2R: ChemBERTa generally underperformed compared to ECFP and MACCS.

ECFP (Extended-Connectivity Fingerprints): ECFP showed strong performance across all datasets, particularly excelling for TYK2 and D2R.
MACCS Fingerprints: MACCS fingerprints generally performed better than ChemBERTa for TYK2 and D2R. In the USP7 and MPRO examples above, MACCS fell between ECFP and ChemBERTa on both R² and Spearman correlation.

Dataset Influence: The performance of AL protocols varied significantly across datasets. Larger, more homogeneous datasets like TYK2 yielded better results, while smaller, diverse datasets like MPRO proved more challenging.

AL Strategies: Exploration-heavy strategies often performed best, particularly for diverse datasets. This suggests that thoroughly sampling the chemical space is crucial for building robust predictive models.

Exploitation vs. Exploration: While exploit-heavy strategies sometimes showed high recall of top compounds, they often led to poorer overall model performance (lower R² and Spearman correlation). This indicates a trade-off between identifying top binders and building a generally predictive model.

Dataset Size Impact: The stark performance difference between the largest (TYK2) and smallest (MPRO) datasets highlights the importance of having a sufficiently large compound library for effective AL.