Results and Plots

Datasets: TYK2, USP7, MPRO, D2R
Kernels: RBF Kernel, RQ Kernel, Matern Kernel, Linear Kernel, Tanimoto Kernel
Fingerprints: MACCS, ECFP, ChemBERTa tokens
Protocols: random, ucb-alternate, ucb-sandwich, ucb-explore-heavy, ucb-exploit-heavy, ucb-gradual, ucb-balanced
Cycles: 0 to 10 (11 total cycles)
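
The settings above describe a factorial grid of experiments. As a rough sketch (the variable names are illustrative, and it assumes every combination was run, which these notes do not confirm), the grid can be enumerated like this:

```python
from itertools import product

# Values copied from the configuration listed above.
datasets = ["TYK2", "USP7", "MPRO", "D2R"]
kernels = ["RBF", "RQ", "Matern", "Linear", "Tanimoto"]
fingerprints = ["MACCS", "ECFP", "ChemBERTa"]
protocols = [
    "random", "ucb-alternate", "ucb-sandwich", "ucb-explore-heavy",
    "ucb-exploit-heavy", "ucb-gradual", "ucb-balanced",
]
n_cycles = len(range(0, 11))  # cycles 0 to 10 inclusive

# Assumption: a full factorial design (every combination evaluated).
grid = list(product(datasets, fingerprints, kernels, protocols))
print(len(grid), "combinations,", n_cycles, "cycles each")
```

Under that assumption there are 4 x 3 x 5 x 7 = 420 dataset/fingerprint/kernel/protocol combinations.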

ECFP:

ECFP with Tanimoto Kernel

ECFP with RQ Kernel

ECFP with Linear Kernel

The most effective protocol varies by dataset:
- TYK2: UCB-Exploit-Heavy and UCB-Gradual
- USP7: UCB-Exploit-Heavy
- D2R: UCB-Gradual
- MPRO: UCB-Gradual
The magnitude of improvement over random selection varies:
- USP7 shows the largest improvements (e.g., 83.38% vs. 19.46% for random selection in the top 2%)
- MPRO shows the smallest improvements (e.g., 29.10% vs. 14.55% for random selection in the top 2%)
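
To make the size of these improvements concrete, the quoted pairs can be turned into absolute and relative gains. This is a small sketch that assumes the second number in each pair is the random-selection baseline; the dictionary and function names are illustrative.

```python
# (best protocol %, baseline %) for the top 2%, as quoted above.
# Assumption: the second value in each pair is the random baseline.
reported = {"USP7": (83.38, 19.46), "MPRO": (29.10, 14.55)}

def improvement(best, baseline):
    """Return (absolute gain in percentage points, ratio over baseline)."""
    return best - baseline, best / baseline

gains = {name: improvement(*pair) for name, pair in reported.items()}
for name, (abs_gain, ratio) in gains.items():
    print(f"{name}: +{abs_gain:.2f} points, {ratio:.2f}x the baseline")
```

With these figures the gain on USP7 is roughly 64 percentage points (about 4.3 times the baseline), while on MPRO it is 14.55 points (2 times the baseline).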

ECFP with Matern Kernel

ECFP with RBF Kernel

D2R Dataset:

Performance range: 0.220 to 0.600
Best performing: Linear and Tanimoto kernels (0.600) with ucb-gradual strategy
Worst performing: KernelRBF (0.220) with ucb-alternate strategy
Moderate performance: Matern kernel (0.400) with ucb-balanced strategy
Overall, this dataset shows lower performance than MPRO and USP7

MPRO Dataset:

Performance range: 0.291 to 0.994
Best performing: Matern and KernelRBF kernels (0.994) with ucb-exploit-heavy and ucb-balanced strategies respectively
Worst performing: Linear and Tanimoto kernels (0.291) with ucb-gradual strategy
Shows the highest overall performance and the widest range of performance values
Strategies like ucb-exploit-heavy and ucb-balanced seem particularly effective here

TYK2 Dataset:

Performance range: 0.040 to 0.475
Best performing: Linear and Tanimoto kernels (0.475) with ucb-exploit-heavy strategy
Worst performing: Matern and KernelRBF kernels (0.040) with random strategy
Shows the lowest overall performance across all datasets
The random strategy performs poorly here, suggesting that more intelligent strategies are crucial for this dataset

USP7 Dataset:

Performance range: 0.556 to 0.973
Best performing: Matern and KernelRBF kernels (0.973) with ucb-alternate strategy
Worst performing: RQ kernel (0.556) with ucb-explore-heavy strategy
Shows consistently high performance across most kernels and strategies
The ucb-alternate and ucb-exploit-heavy strategies seem particularly effective here

Overall observations:

Dataset difficulty: TYK2 appears to be the most challenging dataset, while MPRO and USP7 allow for higher performance across various kernels and strategies.
Strategy effectiveness: The effectiveness of strategies varies across datasets. For instance, ucb-gradual works well for D2R but poorly for MPRO.
Kernel performance: The Matern and RBF kernels often show similar performance and are the top performers on MPRO and USP7. The Linear and Tanimoto kernels perform identically across all datasets.
Performance consistency: USP7 shows the most consistent high performance across different kernels and strategies, while other datasets show more variation.
Random strategy: Consistently performs poorly across all datasets where it was tested, highlighting the importance of more sophisticated strategies.
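
The comparisons of difficulty and consistency above can be checked directly from the reported ranges. The sketch below uses only the ECFP minimum and maximum values listed per dataset; the variable names are illustrative.

```python
# (min, max) of the reported performance range per dataset, ECFP fingerprints.
ecfp_ranges = {
    "D2R": (0.220, 0.600),
    "MPRO": (0.291, 0.994),
    "TYK2": (0.040, 0.475),
    "USP7": (0.556, 0.973),
}

# Spread between the best and worst reported value for each dataset.
spread = {name: round(hi - lo, 3) for name, (lo, hi) in ecfp_ranges.items()}

widest = max(spread, key=spread.get)
highest_floor = max(ecfp_ranges, key=lambda name: ecfp_ranges[name][0])
lowest_peak = min(ecfp_ranges, key=lambda name: ecfp_ranges[name][1])

print(spread)
print("widest range:", widest)
print("highest minimum:", highest_floor)
print("lowest maximum:", lowest_peak)
```

On these numbers MPRO has the widest range, USP7 has the highest minimum (in line with its consistently high performance), and TYK2 has the lowest maximum.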

ChemBERTa Tokens:

ChemBERTa with Tanimoto Kernel (GAUCHE)

ChemBERTa with Linear Kernel

Overall, the results are nearly identical to those of the Tanimoto kernel. The only differences are in which selection protocols perform best.

ChemBERTa with RBF Kernel

Very poor results on TYK2 (random performs better than all the other selection protocols) and on D2R. Excellent results on MPRO and USP7.

ChemBERTa with Matern Kernel

Results are similar to those of the RBF kernel.

ChemBERTa with RQ Kernel

D2R Dataset:

Performance range: 0.140 to 0.440
Best performing: Tanimoto kernel (0.440) with ucb-sandwich strategy
Worst performing: KernelRBF kernel (0.140) with ucb-alternate strategy
Linear kernel shows moderate performance (0.420) with ucb-sandwich strategy
Overall, this dataset shows relatively low performance across kernels

MPRO Dataset:

Performance range: 0.242 to 0.994
Best performing: Matern and KernelRBF kernels (0.994) with ucb-exploit-heavy strategy
Worst performing: RQ kernel (0.242) with ucb-explore-heavy strategy
Linear and Tanimoto kernels show good performance (0.606) with ucb-exploit-heavy strategy
This dataset shows the highest overall performance and a wide range of values (0.752), second only to USP7 (0.778)

TYK2 Dataset:

Performance range: 0.040 to 0.380
Best performing: Linear kernel (0.380) with ucb-balanced strategy
Worst performing: KernelRBF kernel (0.040) with random strategy
Matern kernel shows very poor performance (0.045) with ucb-balanced strategy
This dataset shows the lowest overall performance across all datasets

USP7 Dataset:

Performance range: 0.195 to 0.973
Best performing: Matern and KernelRBF kernels (0.973) with ucb-alternate strategy
Worst performing: RQ kernel (0.195) with random strategy
Linear and Tanimoto kernels show moderate performance (0.500) with ucb-balanced strategy
This dataset shows high performance for some kernels but also significant variation

Overall observations:

Dataset difficulty: TYK2 appears to be the most challenging dataset, while MPRO allows for the highest performance across various kernels and strategies.
Strategy effectiveness: The ucb-exploit-heavy strategy works particularly well for MPRO, while ucb-alternate is effective for USP7 with certain kernels.
Kernel performance: The Matern and KernelRBF kernels often show similar performance and are top performers in MPRO and USP7. The Linear and Tanimoto kernels show identical or very similar performance across all datasets.
Performance consistency: MPRO shows the most consistent high performance across different kernels, while other datasets show more variation.
Random strategy: Performs poorly across all datasets where it was tested, particularly for the KernelRBF on TYK2 and RQ kernel on USP7.
RQ kernel: Consistently underperforms compared to other kernels across all datasets.

The MPRO dataset seems to be the most amenable to high performance across multiple kernels and strategies.

MACCS:

RQ Kernel

Matern Kernel

Tanimoto Kernel

Linear Kernel

RBF Kernel

D2R Dataset:

Performance range: 0.240 to 0.480
Best performing: Linear and Tanimoto kernels (0.480) with ucb-balanced strategy
Worst performing: Matern kernel (0.240) with ucb-explore-heavy strategy
RQ kernel shows moderate performance (0.320) with ucb-balanced strategy
Overall, this dataset shows moderate performance across kernels

MPRO Dataset:

Performance range: 0.388 to 0.776
Best performing: KernelRBF (0.776) with ucb-exploit-heavy strategy
Worst performing: Linear and Tanimoto kernels (0.388) with ucb-alternate strategy
Matern kernel shows good performance (0.582) with ucb-gradual strategy
This dataset shows the highest overall performance among all datasets

TYK2 Dataset:

Performance range: 0.075 to 0.375
Best performing: Linear and Tanimoto kernels (0.375) with ucb-balanced strategy
Worst performing: KernelRBF (0.075) with ucb-sandwich strategy
Matern kernel shows poor performance (0.140) with ucb-sandwich strategy
This dataset shows the lowest overall performance across all datasets

USP7 Dataset:

Performance range: 0.250 to 0.723
Best performing: KernelRBF (0.723) with ucb-alternate strategy
Worst performing: RQ kernel (0.250) with ucb-explore-heavy strategy
Linear and Tanimoto kernels show good performance (0.639) with ucb-balanced strategy
This dataset shows high performance for some kernels but also significant variation

Overall observations:

Dataset difficulty: TYK2 appears to be the most challenging dataset, while MPRO allows for the highest performance across various kernels and strategies.
Strategy effectiveness: The effectiveness of strategies varies across datasets. For instance, ucb-balanced works well for D2R and USP7 with Linear and Tanimoto kernels, while ucb-exploit-heavy is effective for MPRO with KernelRBF.
Kernel performance:
- The Linear and Tanimoto kernels consistently show identical performance across all datasets.
- KernelRBF shows high variability, performing best on MPRO and USP7 but worst on TYK2.
- The Matern kernel generally underperforms compared to other kernels, except on MPRO.
Performance consistency: MPRO shows the most consistent high performance across different kernels, while other datasets, especially TYK2, show more variation.
RQ kernel: Consistently underperforms compared to other kernels across all datasets, never achieving the best performance for any dataset.
Best strategies: ucb-balanced and ucb-alternate appear more frequently among the top-performing combinations, suggesting they might be more robust across different datasets and kernels.

The MPRO dataset seems to be the most amenable to high performance across multiple kernels and strategies, while TYK2 presents the most challenges. The consistent performance of Linear and Tanimoto kernels across datasets suggests they might be more reliable choices when working with MACCS fingerprints.

Overall analysis:

The following analysis is based on the three kernel performance heatmaps for MACCS, ChemBERTa, and ECFP:

Dataset Performance:
- MPRO consistently shows the highest performance across all three fingerprint types.
- TYK2 consistently shows the lowest performance across all three fingerprint types.
- USP7 generally ranks second, with peaks of 0.973 for ECFP and ChemBERTa and 0.723 for MACCS, while D2R shows moderate performance, with peaks between 0.440 and 0.600.
Kernel Performance:
- Linear and Tanimoto kernels show identical performance with ECFP and MACCS and nearly identical performance with ChemBERTa (e.g., 0.420 vs. 0.440 on D2R), suggesting they may be functionally equivalent for these tasks.
- KernelRBF (Radial Basis Function) shows high variability across datasets and fingerprint types, often achieving the highest performance (e.g., MPRO in MACCS and CHEMBERTA) but also the lowest in some cases (e.g., TYK2 in MACCS).
- Matern kernel generally underperforms with MACCS (except on MPRO) and on TYK2 with ChemBERTa (0.045), although it ties with KernelRBF for the best results on MPRO and USP7 with ECFP and ChemBERTa.
- RQ (Rational Quadratic) kernel consistently underperforms across all datasets and fingerprint types.
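
For reference, the Linear and Tanimoto kernels compared above are both built from the same inner product. The sketch below is a plain-Python illustration on toy binary fingerprints, not the implementation used in these experiments; the function names, the example vectors, and the convention of returning 1.0 for two all-zero vectors are assumptions. It does not by itself explain why the two kernels give matching results here.

```python
def linear_kernel(x, y):
    """Inner product <x, y>; for binary fingerprints this counts shared on-bits."""
    return sum(a * b for a, b in zip(x, y))

def tanimoto_kernel(x, y):
    """Tanimoto similarity: <x, y> / (<x, x> + <y, y> - <x, y>)."""
    xy = linear_kernel(x, y)
    denom = linear_kernel(x, x) + linear_kernel(y, y) - xy
    # Assumed convention: two all-zero vectors are treated as identical.
    return xy / denom if denom else 1.0

# Toy 5-bit fingerprints (hypothetical, for illustration only).
a = [1, 1, 0, 1, 0]
b = [1, 0, 0, 1, 1]

print(linear_kernel(a, b))    # shared on-bits
print(tanimoto_kernel(a, b))  # shared on-bits / bits set in either vector
```

Here the two vectors share 2 on-bits out of 4 bits set in either vector, so the linear kernel gives 2 and the Tanimoto kernel gives 0.5.
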
Strategy Effectiveness:
- ucb-exploit-heavy and ucb-alternate strategies frequently appear among top-performing combinations, especially for MPRO and USP7 datasets.
- ucb-balanced strategy often performs well with Linear and Tanimoto kernels across different fingerprint types.
- Random strategy, when used, consistently shows poor performance, particularly evident in CHEMBERTA and ECFP for the TYK2 dataset.
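
These notes do not define how the individual ucb-* protocols differ; their names suggest different schedules of exploration and exploitation across cycles. As a generic sketch of an upper confidence bound (UCB) selection step, assuming a model that returns a predictive mean and standard deviation per candidate, the code below scores candidates as mean + beta * std. The function names, the beta values, and the toy predictions are all hypothetical.

```python
def ucb_scores(means, stds, beta):
    """Upper confidence bound: predicted mean plus beta times predicted std."""
    return [m + beta * s for m, s in zip(means, stds)]

def select_batch(means, stds, beta, batch_size):
    """Return indices of the batch_size candidates with the highest UCB score."""
    scores = ucb_scores(means, stds, beta)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]

# Toy predictions for four candidates (hypothetical values).
means = [0.9, 0.5, 0.7, 0.2]
stds = [0.05, 0.60, 0.10, 0.90]

exploit = select_batch(means, stds, beta=0.0, batch_size=2)  # pure exploitation
explore = select_batch(means, stds, beta=2.0, batch_size=2)  # favors uncertainty
print(exploit, explore)
```

With beta = 0 the step picks the candidates with the highest predicted means (indices 0 and 2); with beta = 2 it picks the most uncertain ones (indices 3 and 1). A protocol could then vary beta from cycle to cycle, although the actual schedules used here are not specified.
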
Fingerprint Type Comparison:
- ECFP shows the highest peak performances, with values reaching 0.994 for MPRO dataset.
- CHEMBERTA shows similar peak performances to ECFP, also reaching 0.994 for MPRO dataset.
- MACCS shows lower peak performances compared to ECFP and CHEMBERTA, with a maximum of 0.776 for MPRO dataset.
Consistency:
- ECFP and ChemBERTa show more consistently high performance across kernels on the well-performing datasets (MPRO and USP7).
- MACCS shows more variability in performance across kernels and datasets.
Best Overall Combinations:
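- For MPRO: KernelRBF achieves the best result with every fingerprint type, using ucb-exploit-heavy with MACCS and ChemBERTa and ucb-balanced with ECFP (where Matern with ucb-exploit-heavy ties it).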
- For USP7: KernelRBF or Matern kernel with ucb-alternate strategy generally performs well, especially in CHEMBERTA and ECFP.
- For D2R and TYK2: Linear or Tanimoto kernel with ucb-balanced or ucb-gradual strategy often performs best, though performance is generally lower than for MPRO and USP7.