Results and Plots
Datasets: TYK2, USP7, MPRO, D2R
Kernels: RBF Kernel, RQ Kernel, Matern Kernel, Linear Kernel, Tanimoto Kernel
Fingerprints: MACCS, ECFP, ChemBERTa tokens
Protocols: random, ucb-alternate, ucb-sandwich, ucb-explore-heavy, ucb-exploit-heavy, ucb-gradual, ucb-balanced
Cycles: 0 to 10 (11 total cycles)
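For orientation, the settings above can be written out as a grid of runs. This is only a sketch: it assumes every dataset, fingerprint, kernel, and protocol combination was run, which these notes do not confirm, and the function name `experiment_grid` and the dict keys are hypothetical.

```python
from itertools import product

DATASETS = ["TYK2", "USP7", "MPRO", "D2R"]
KERNELS = ["RBF", "RQ", "Matern", "Linear", "Tanimoto"]
FINGERPRINTS = ["MACCS", "ECFP", "ChemBERTa"]
PROTOCOLS = [
    "random", "ucb-alternate", "ucb-sandwich", "ucb-explore-heavy",
    "ucb-exploit-heavy", "ucb-gradual", "ucb-balanced",
]
N_CYCLES = 11  # cycles 0 through 10


def experiment_grid():
    """Yield one config dict per (dataset, fingerprint, kernel, protocol).

    Assumption: the full Cartesian product is run; the notes do not say so.
    """
    for dataset, fingerprint, kernel, protocol in product(
        DATASETS, FINGERPRINTS, KERNELS, PROTOCOLS
    ):
        yield {
            "dataset": dataset,
            "fingerprint": fingerprint,
            "kernel": kernel,
            "protocol": protocol,
            "n_cycles": N_CYCLES,
        }


grid = list(experiment_grid())
print(len(grid))  # 4 * 3 * 5 * 7 = 420
```

Each heatmap discussed below would then correspond to one fingerprint, with one cell per dataset and kernel.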
ECFP:
ECFP with Tanimoto Kernel

ECFP with RQ Kernel

ECFP with Linear Kernel

- The most effective protocol varies by dataset:
  - TYK2: ucb-exploit-heavy and ucb-gradual
  - USP7: ucb-exploit-heavy
  - D2R: ucb-gradual
  - MPRO: ucb-gradual
- The magnitude of improvement over random selection varies:
  - USP7 shows the largest improvements (e.g., 83.38% vs. 19.46% for top 2%)
  - MPRO shows the smallest improvements (e.g., 29.10% vs. 14.55% for top 2%)
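The percentages above compare a UCB protocol against random selection. As a rough sketch, assuming "top 2%" means the share of the best 2% of compounds (ranked by measured value, with higher taken as better) that a protocol has selected, the metric and the size of the gap could be computed as follows. The function names are hypothetical; only the four percentages come from the results above.

```python
def top_fraction_recall(true_values, selected_idx, fraction=0.02):
    """Percentage of the top `fraction` of compounds found in `selected_idx`.

    Assumption: higher values are better; flip the sort for "lower is better".
    """
    n_top = max(1, round(len(true_values) * fraction))
    ranked = sorted(range(len(true_values)),
                    key=lambda i: true_values[i], reverse=True)
    top = set(ranked[:n_top])
    return 100.0 * len(top & set(selected_idx)) / n_top


def improvement(protocol_pct, random_pct):
    """Return (absolute gain in percentage points, ratio over random)."""
    return protocol_pct - random_pct, protocol_pct / random_pct


# Percentages quoted above (best protocol vs. random, top 2%).
for name, best, rand in [("USP7", 83.38, 19.46), ("MPRO", 29.10, 14.55)]:
    gain, ratio = improvement(best, rand)
    print(f"{name}: +{gain:.2f} points, {ratio:.2f}x random")
# USP7: +63.92 points, 4.28x random
# MPRO: +14.55 points, 2.00x random
```

Read this way, the best protocol recovers roughly four times as many top compounds as random selection on USP7, and about twice as many on MPRO.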
ECFP with Matern Kernel

ECFP with RBF Kernel


- D2R Dataset:
  - Performance range: 0.220 to 0.600
  - Best performing: Linear and Tanimoto kernels (0.600) with ucb-gradual strategy
  - Worst performing: KernelRBF (0.220) with ucb-alternate strategy
  - Moderate performance: Matern kernel (0.400) with ucb-balanced strategy
  - Overall, this dataset shows lower performance compared to others
- MPRO Dataset:
  - Performance range: 0.291 to 0.994
  - Best performing: Matern and KernelRBF kernels (0.994) with ucb-exploit-heavy and ucb-balanced strategies respectively
  - Worst performing: Linear and Tanimoto kernels (0.291) with ucb-gradual strategy
  - Shows the highest overall performance and the widest range of performance values
  - Strategies like ucb-exploit-heavy and ucb-balanced seem particularly effective here
- TYK2 Dataset:
  - Performance range: 0.040 to 0.475
  - Best performing: Linear and Tanimoto kernels (0.475) with ucb-exploit-heavy strategy
  - Worst performing: Matern and KernelRBF kernels (0.040) with random strategy
  - Shows the lowest overall performance across all datasets
  - The random strategy performs poorly here, suggesting that more intelligent strategies are crucial for this dataset
- USP7 Dataset:
  - Performance range: 0.556 to 0.973
  - Best performing: Matern and KernelRBF kernels (0.973) with ucb-alternate strategy
  - Worst performing: RQ kernel (0.556) with ucb-explore-heavy strategy
  - Shows consistently high performance across most kernels and strategies
  - The ucb-alternate and ucb-exploit-heavy strategies seem particularly effective here
Overall observations:
- Dataset difficulty: TYK2 appears to be the most challenging dataset, while MPRO and USP7 allow for higher performance across various kernels and strategies.
- Strategy effectiveness: The effectiveness of strategies varies across datasets. For instance, ucb-gradual works well for D2R but poorly for MPRO.
- Kernel performance: The Matern and RBF kernels often show similar performance and are top performers in MPRO and USP7. The Linear and Tanimoto kernels perform identically across all datasets.
- Performance consistency: USP7 shows the most consistent high performance across different kernels and strategies, while other datasets show more variation.
- Random strategy: Consistently performs poorly across all datasets where it was tested, highlighting the importance of more sophisticated strategies.
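To make the contrast between random and UCB-based selection concrete, here is a minimal sketch of one selection step. It assumes the model supplies a posterior mean and standard deviation per candidate and that higher predicted values are better. The per-cycle schedules behind the named protocols (ucb-gradual, ucb-sandwich, and so on) are not given in these notes, so the `beta` values below are purely illustrative and the function names are hypothetical.

```python
import random


def ucb_scores(mean, std, beta):
    """Upper confidence bound: predicted mean plus beta times uncertainty."""
    return [m + beta * s for m, s in zip(mean, std)]


def select_top(scores, batch_size):
    """Indices of the `batch_size` highest-scoring candidates."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:batch_size]


def select_random(n_candidates, batch_size, seed=0):
    """Baseline: pick candidates uniformly at random, without replacement."""
    return random.Random(seed).sample(range(n_candidates), batch_size)


# Toy posterior for four candidates (made-up numbers, not from the experiments).
mean = [0.9, 0.5, 0.2, 0.6]
std = [0.05, 0.1, 0.8, 0.3]

exploit = select_top(ucb_scores(mean, std, beta=0.0), batch_size=2)
explore = select_top(ucb_scores(mean, std, beta=2.0), batch_size=2)
print(exploit)  # [0, 3]: highest predicted means
print(explore)  # [2, 3]: uncertain candidates move up
```

A small `beta` favors candidates the model already rates highly (exploitation), while a large `beta` favors uncertain ones (exploration); a protocol would presumably vary this trade-off from cycle to cycle.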
ChemBERTa tokens:
ChemBERTa with Tanimoto Gauche

ChemBERTa with Linear Kernel

Overall, the results are almost the same as with the Tanimoto kernel; the only differences are between selection protocols.
ChemBERTa with RBF Kernel

Abysmal results on TYK2 (random performs better than all of the UCB selection protocols) and D2R. Excellent results on MPRO and USP7.
ChemBERTa with Matern Kernel

Similar to RBF Kernel
ChemBERTa with RQ Kernel

- D2R Dataset:
  - Performance range: 0.140 to 0.440
  - Best performing: Tanimoto kernel (0.440) with ucb-sandwich strategy
  - Worst performing: KernelRBF kernel (0.140) with ucb-alternate strategy
  - Linear kernel shows moderate performance (0.420) with ucb-sandwich strategy
  - Overall, this dataset shows relatively low performance across kernels
- MPRO Dataset:
  - Performance range: 0.242 to 0.994
  - Best performing: Matern and KernelRBF kernels (0.994) with ucb-exploit-heavy strategy
  - Worst performing: RQ kernel (0.242) with ucb-explore-heavy strategy
  - Linear and Tanimoto kernels show good performance (0.606) with ucb-exploit-heavy strategy
  - This dataset shows the highest overall performance and the widest range of values
- TYK2 Dataset:
  - Performance range: 0.040 to 0.380
  - Best performing: Linear kernel (0.380) with ucb-balanced strategy
  - Worst performing: KernelRBF kernel (0.040) with random strategy
  - Matern kernel shows very poor performance (0.045) with ucb-balanced strategy
  - This dataset shows the lowest overall performance across all datasets
- USP7 Dataset:
  - Performance range: 0.195 to 0.973
  - Best performing: Matern and KernelRBF kernels (0.973) with ucb-alternate strategy
  - Worst performing: RQ kernel (0.195) with random strategy
  - Linear and Tanimoto kernels show moderate performance (0.500) with ucb-balanced strategy
  - This dataset shows high performance for some kernels but also significant variation
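Per-dataset summaries like the ones above (range, best, worst) can be derived mechanically from the heatmap entries. The sketch below assumes one record per kernel with its reported protocol and score. It uses only the ChemBERTa values quoted above for D2R and TYK2, which is not the full heatmap, and the `Entry` and `summarize` names are hypothetical.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    dataset: str
    kernel: str
    protocol: str
    score: float


# Only the ChemBERTa entries quoted in the notes above, not the full heatmap.
ENTRIES = [
    Entry("D2R", "Tanimoto", "ucb-sandwich", 0.440),
    Entry("D2R", "Linear", "ucb-sandwich", 0.420),
    Entry("D2R", "KernelRBF", "ucb-alternate", 0.140),
    Entry("TYK2", "Linear", "ucb-balanced", 0.380),
    Entry("TYK2", "Matern", "ucb-balanced", 0.045),
    Entry("TYK2", "KernelRBF", "random", 0.040),
]


def summarize(entries):
    """Group entries by dataset and report best, worst, and score range."""
    by_dataset = {}
    for entry in entries:
        by_dataset.setdefault(entry.dataset, []).append(entry)
    summary = {}
    for dataset, rows in by_dataset.items():
        best = max(rows, key=lambda e: e.score)
        worst = min(rows, key=lambda e: e.score)
        summary[dataset] = {
            "best": best,
            "worst": worst,
            "range": (worst.score, best.score),
        }
    return summary


for dataset, info in summarize(ENTRIES).items():
    low, high = info["range"]
    print(f"{dataset}: {low:.3f} to {high:.3f}, "
          f"best {info['best'].kernel} ({info['best'].protocol})")
```

Generating the summaries from the raw table in this way would help keep the ranges and best/worst entries consistent with the heatmaps.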
Overall observations:
- Dataset difficulty: TYK2 appears to be the most challenging dataset, while MPRO allows for the highest performance across various kernels and strategies.
- Strategy effectiveness: The ucb-exploit-heavy strategy works particularly well for MPRO, while ucb-alternate is effective for USP7 with certain kernels.
- Kernel performance: The Matern and KernelRBF kernels often show similar performance and are top performers in MPRO and USP7. The Linear and Tanimoto kernels show identical or very similar performance across all datasets.
- Performance consistency: MPRO reaches the highest scores with several kernels, although its range (0.242 to 0.994) is also the widest; the other datasets vary as well.
- Random strategy: Performs poorly across all datasets where it was tested, particularly for KernelRBF on TYK2 and the RQ kernel on USP7.
- RQ kernel: Consistently underperforms compared to other kernels across all datasets.
The MPRO dataset seems to be the most amenable to high performance across multiple kernels and strategies.
MACCS:
RQ Kernel

Matern Kernel

Tanimoto Kernel

Linear Kernel

RBF Kernel


- D2R Dataset:
  - Performance range: 0.240 to 0.480
  - Best performing: Linear and Tanimoto kernels (0.480) with ucb-balanced strategy
  - Worst performing: Matern kernel (0.240) with ucb-explore-heavy strategy
  - RQ kernel shows moderate performance (0.320) with ucb-balanced strategy
  - Overall, this dataset shows moderate performance across kernels
- MPRO Dataset:
  - Performance range: 0.388 to 0.776
  - Best performing: KernelRBF (0.776) with ucb-exploit-heavy strategy
  - Worst performing: Linear and Tanimoto kernels (0.388) with ucb-alternate strategy
  - Matern kernel shows good performance (0.582) with ucb-gradual strategy
  - This dataset shows the highest overall performance among all datasets
- TYK2 Dataset:
  - Performance range: 0.075 to 0.375
  - Best performing: Linear and Tanimoto kernels (0.375) with ucb-balanced strategy
  - Worst performing: KernelRBF (0.075) with ucb-sandwich strategy
  - Matern kernel shows poor performance (0.140) with ucb-sandwich strategy
  - This dataset shows the lowest overall performance across all datasets
- USP7 Dataset:
  - Performance range: 0.250 to 0.723
  - Best performing: KernelRBF (0.723) with ucb-alternate strategy
  - Worst performing: RQ kernel (0.250) with ucb-explore-heavy strategy
  - Linear and Tanimoto kernels show good performance (0.639) with ucb-balanced strategy
  - This dataset shows high performance for some kernels but also significant variation
Overall observations:
- Dataset difficulty: TYK2 appears to be the most challenging dataset, while MPRO allows for the highest performance across various kernels and strategies.
- Strategy effectiveness: The effectiveness of strategies varies across datasets. For instance, ucb-balanced works well for D2R and USP7 with Linear and Tanimoto kernels, while ucb-exploit-heavy is effective for MPRO with KernelRBF.
- Kernel performance:
  - The Linear and Tanimoto kernels consistently show identical performance across all datasets.
  - KernelRBF shows high variability, performing best on MPRO and USP7 but worst on TYK2.
  - The Matern kernel generally underperforms compared to other kernels, except on MPRO.
- Performance consistency: MPRO shows the most consistent high performance across different kernels, while other datasets, especially TYK2, show more variation.
- RQ kernel: Consistently underperforms compared to other kernels across all datasets, never achieving the best performance for any dataset.
- Best strategies: ucb-balanced and ucb-alternate appear more frequently among the top-performing combinations, suggesting they might be more robust across different datasets and kernels.
The MPRO dataset seems to be the most amenable to high performance across multiple kernels and strategies, while TYK2 presents the most challenges. The consistent performance of Linear and Tanimoto kernels across datasets suggests they might be more reliable choices when working with MACCS fingerprints.
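Since the Linear and Tanimoto kernels are reported with identical scores, it may help to see how the two are defined on binary fingerprints such as MACCS keys. The sketch below uses the standard definitions on plain Python lists. It is not the implementation used in these experiments, and returning 1.0 for two all-zero vectors is one common convention, chosen here as an assumption.

```python
def linear_kernel(x, y):
    """Inner product <x, y>; for bit vectors this counts the shared on-bits."""
    return sum(a * b for a, b in zip(x, y))


def tanimoto_kernel(x, y):
    """Tanimoto similarity: <x, y> / (<x, x> + <y, y> - <x, y>)."""
    dot = linear_kernel(x, y)
    denom = linear_kernel(x, x) + linear_kernel(y, y) - dot
    # Assumption: two empty fingerprints are treated as identical.
    return dot / denom if denom else 1.0


# Toy 5-bit fingerprints (made up for illustration).
x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1]

print(linear_kernel(x, y))    # 2   (two shared on-bits)
print(tanimoto_kernel(x, y))  # 0.5 (2 shared / 4 bits set in either)
print(tanimoto_kernel(x, x))  # 1.0
```

The two kernels share the same numerator but are normalized differently, so the identical scores reported above are an empirical observation rather than something the definitions guarantee.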
Overall analysis:
Based on the kernel performance heatmaps for the three fingerprint types (MACCS, ChemBERTa, and ECFP):
- Dataset Performance:
  - MPRO consistently shows the highest performance across all three fingerprint types.
  - TYK2 consistently shows the lowest performance across all three fingerprint types.
  - USP7 and D2R generally show moderate performance, with USP7 often performing better than D2R.
- Kernel Performance:
  - Linear and Tanimoto kernels show identical performance within each fingerprint type, suggesting they may be functionally equivalent for these tasks.
  - KernelRBF (Radial Basis Function) shows high variability across datasets and fingerprint types, often achieving the highest performance (e.g., MPRO in MACCS and ChemBERTa) but also the lowest in some cases (e.g., TYK2 in MACCS).
  - Matern kernel generally underperforms compared to other kernels, especially in MACCS and ChemBERTa.
  - RQ (Rational Quadratic) kernel consistently underperforms across all datasets and fingerprint types.
- Strategy Effectiveness:
  - ucb-exploit-heavy and ucb-alternate strategies frequently appear among top-performing combinations, especially for MPRO and USP7 datasets.
  - ucb-balanced strategy often performs well with Linear and Tanimoto kernels across different fingerprint types.
  - Random strategy, when used, consistently shows poor performance, particularly evident in ChemBERTa and ECFP for the TYK2 dataset.
- Fingerprint Type Comparison:
  - ECFP shows the highest peak performances, with values reaching 0.994 for MPRO dataset.
  - ChemBERTa shows similar peak performances to ECFP, also reaching 0.994 for MPRO dataset.
  - MACCS shows lower peak performances compared to ECFP and ChemBERTa, with a maximum of 0.776 for MPRO dataset.
- Consistency:
  - ECFP and ChemBERTa show more consistent high performances across different kernels for well-performing datasets (MPRO and USP7).
  - MACCS shows more variability in performance across kernels and datasets.
- Best Overall Combinations:
  - For MPRO: KernelRBF performs best for all three fingerprint types, with ucb-exploit-heavy for MACCS and ChemBERTa and ucb-balanced for ECFP (where Matern with ucb-exploit-heavy ties at 0.994).
  - For USP7: KernelRBF or Matern kernel with ucb-alternate strategy generally performs well, especially in ChemBERTa and ECFP.
  - For D2R and TYK2: Linear or Tanimoto kernel with ucb-balanced or ucb-gradual strategy often performs best, though performance is generally lower than for MPRO and USP7.