Results and Plots

Datasets: TYK2, USP7, MPRO, D2R
Kernels: RBF Kernel, RQ Kernel, Matern Kernel, Linear Kernel, Tanimoto Kernel
Fingerprints: MACCS, ECFP, ChemBERTa tokens
Protocols: random, ucb-alternate, ucb-sandwich, ucb-explore-heavy, ucb-exploit-heavy, ucb-gradual, ucb-balanced
Cycles: 0 to 10 (11 total cycles)
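
To make the size of this setup concrete, the sketch below enumerates the full grid implied by the lists above. It assumes every combination was run, which these notes do not confirm (some observations mention strategies only "where tested"); the names `DATASETS`, `experiment_grid`, and so on are illustrative and not taken from the actual experiment code.

```python
from itertools import product

DATASETS = ["TYK2", "USP7", "MPRO", "D2R"]
KERNELS = ["RBF", "RQ", "Matern", "Linear", "Tanimoto"]
FINGERPRINTS = ["MACCS", "ECFP", "ChemBERTa"]
PROTOCOLS = [
    "random", "ucb-alternate", "ucb-sandwich", "ucb-explore-heavy",
    "ucb-exploit-heavy", "ucb-gradual", "ucb-balanced",
]
CYCLES = range(11)  # cycles 0 to 10 inclusive


def experiment_grid():
    """Return every (dataset, fingerprint, kernel, protocol) combination."""
    return list(product(DATASETS, FINGERPRINTS, KERNELS, PROTOCOLS))


grid = experiment_grid()
print(len(grid))                # number of runs in the full grid
print(len(grid) * len(CYCLES))  # number of (run, cycle) evaluations
```

Under that assumption the grid has 4 × 3 × 5 × 7 = 420 runs, each evaluated over 11 cycles.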

ECFP:

ECFP with Tanimoto Kernel

ECFP with Tanimoto Kernel.png

ECFP with RQ Kernel

Pasted image 20240924131742.png

ECFP with Linear Kernel

ECFP with Linear Kernel.png

ECFP with Matern Kernel

ECFP with Matern Kernel.png

ECFP with RBF Kernel

ECFP with RBF Kernel.png

Kernel Performance ECFP.png|700

  1. D2R Dataset:
  1. MPRO Dataset:
  1. TYK2 Dataset:
  1. USP7 Dataset:

Overall observations:

  1. Dataset difficulty: TYK2 appears to be the most challenging dataset, while MPRO and USP7 allow for higher performance across various kernels and strategies.

  2. Strategy effectiveness: The effectiveness of strategies varies across datasets. For instance, ucb-gradual works well for D2R but poorly for MPRO.

  3. Kernel performance: The Matern and RBF kernels often perform similarly and are the top performers on MPRO and USP7. The Linear and Tanimoto kernels perform identically across all datasets.

  4. Performance consistency: USP7 shows the most consistent high performance across different kernels and strategies, while other datasets show more variation.

  5. Random strategy: Consistently performs poorly across all datasets where it was tested, highlighting the importance of more sophisticated strategies.
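
The identical Linear and Tanimoto results noted above may deserve a sanity check, since the two kernels generally rank the same pairs of binary fingerprints differently. The sketch below uses textbook definitions in plain Python on toy bit vectors; it is not the kernel code used for these runs, and the convention for two all-zero vectors is an assumption made here.

```python
def linear_kernel(x, y):
    """Inner product of two binary fingerprints (count of shared on-bits)."""
    return sum(a * b for a, b in zip(x, y))


def tanimoto_kernel(x, y):
    """Tanimoto similarity: <x, y> / (<x, x> + <y, y> - <x, y>)."""
    dot = linear_kernel(x, y)
    denom = linear_kernel(x, x) + linear_kernel(y, y) - dot
    # Assumed convention: two all-zero fingerprints count as identical.
    return 1.0 if denom == 0 else dot / denom


# Toy 8-bit fingerprints (illustrative only, not real molecules).
a = [1, 1, 0, 0, 1, 0, 0, 0]
b = [1, 0, 0, 0, 1, 1, 0, 0]
c = [1, 1, 1, 1, 1, 1, 1, 1]

print(linear_kernel(a, b), tanimoto_kernel(a, b))  # 2 shared bits, similarity 0.5
print(linear_kernel(a, c), tanimoto_kernel(a, c))  # 3 shared bits, similarity 0.375
```

Here the linear kernel rates `c` as closer to `a` than `b` is, while the Tanimoto kernel rates `b` as closer. If both kernels really produce identical results in every run, it could be worth confirming that each configuration instantiates the kernel it is labelled with.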

ChemBERTa Tokens:

ChemBERTa with Tanimoto Kernel (GAUCHE)

ChemBERTa with Tanimoto Kernel.png

ChemBERTa with Linear Kernel

ChemBERTa with Linear Kernel.png

Overall, the results are nearly identical to those of the Tanimoto Kernel; the only differences are in the selection protocols.

ChemBERTa with RBF Kernel

Pasted image 20240924131930.png

Abysmal results on TYK2 (random performs better than all of the UCB selection protocols) and D2R. Excellent on MPRO and USP7.

ChemBERTa with Matern Kernel

Pasted image 20240924131917.png

Similar to the RBF Kernel.

ChemBERTa with RQ Kernel

Pasted image 20240924132003.png

chemberta_heatmap_with_protocols (1).png|700

  1. D2R Dataset:
  1. MPRO Dataset:
  1. TYK2 Dataset:
  1. USP7 Dataset:

Overall observations:

  1. Dataset difficulty: TYK2 appears to be the most challenging dataset, while MPRO allows for the highest performance across various kernels and strategies.

  2. Strategy effectiveness: The ucb-exploit-heavy strategy works particularly well for MPRO, while ucb-alternate is effective for USP7 with certain kernels.

  3. Kernel performance: The Matern and RBF kernels often show similar performance and are top performers on MPRO and USP7. The Linear and Tanimoto kernels show identical or very similar performance across all datasets.

  4. Performance consistency: MPRO shows the most consistent high performance across different kernels, while other datasets show more variation.

  5. Random strategy: Performs poorly across all datasets where it was tested, particularly with the RBF kernel on TYK2 and the RQ kernel on USP7.

  6. RQ kernel: Consistently underperforms compared to other kernels across all datasets.

The MPRO dataset seems to be the most amenable to high performance across multiple kernels and strategies.
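
For reference when comparing the RBF, Matern, and RQ results, the standard textbook forms of the three stationary kernels are given below, with $r = \lVert x - x' \rVert$, output scale $\sigma^2$, and lengthscale $\ell$. The hyperparameters used in these runs are not recorded in these notes, and the Matern smoothness $\nu = 5/2$ is shown only as a common default, not as the confirmed setting.

$$
\begin{aligned}
k_{\mathrm{RBF}}(r) &= \sigma^2 \exp\!\left(-\frac{r^2}{2\ell^2}\right) \\
k_{\mathrm{Matern},\,\nu=5/2}(r) &= \sigma^2 \left(1 + \frac{\sqrt{5}\,r}{\ell} + \frac{5r^2}{3\ell^2}\right) \exp\!\left(-\frac{\sqrt{5}\,r}{\ell}\right) \\
k_{\mathrm{RQ}}(r) &= \sigma^2 \left(1 + \frac{r^2}{2\alpha\ell^2}\right)^{-\alpha}
\end{aligned}
$$

The Matern kernel approaches the RBF kernel as $\nu \to \infty$, and the RQ kernel approaches it as $\alpha \to \infty$, which is consistent with Matern and RBF often behaving similarly.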

MACCS:

RQ Kernel

Pasted image 20240924132023.png

Matern Kernel

Pasted image 20240924132056.png

Tanimoto Kernel

MACCS Tanimoto.png

Linear Kernel

MACCS Linear Kernel.png

RBF Kernel

MACCS RBF Kernel.png

maccs_heatmap_with_protocols.png|700

  1. D2R Dataset:
  1. MPRO Dataset:
  1. TYK2 Dataset:
  1. USP7 Dataset:

Overall observations:

  1. Dataset difficulty: TYK2 appears to be the most challenging dataset, while MPRO allows for the highest performance across various kernels and strategies.

  2. Strategy effectiveness: The effectiveness of strategies varies across datasets. For instance, ucb-balanced works well for D2R and USP7 with the Linear and Tanimoto kernels, while ucb-exploit-heavy is effective for MPRO with the RBF kernel.

  3. Kernel performance:

    • The Linear and Tanimoto kernels consistently show identical performance across all datasets.
    • The RBF kernel shows high variability, performing best on MPRO and USP7 but worst on TYK2.
    • The Matern kernel generally underperforms compared to other kernels, except on MPRO.
  4. Performance consistency: MPRO shows the most consistent high performance across different kernels, while other datasets, especially TYK2, show more variation.

  5. RQ kernel: Consistently underperforms compared to other kernels across all datasets, never achieving the best performance for any dataset.

  6. Best strategies: ucb-balanced and ucb-alternate appear more frequently among the top-performing combinations, suggesting they might be more robust across different datasets and kernels.

The MPRO dataset seems to be the most amenable to high performance across multiple kernels and strategies, while TYK2 presents the most challenges. The consistent performance of Linear and Tanimoto kernels across datasets suggests they might be more reliable choices when working with MACCS fingerprints.
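
The protocol names above suggest that each ucb-* strategy varies the exploration weight of an upper confidence bound acquisition across cycles. The sketch below shows that general idea in plain Python. The schedule in `alternate_beta`, its beta values, the batch size, and the toy posterior are all hypothetical; the actual definitions of ucb-alternate, ucb-sandwich, and the other protocols are not given in these notes.

```python
import random


def ucb_scores(means, stds, beta):
    """UCB acquisition: predicted mean plus beta times predicted std."""
    return [m + beta * s for m, s in zip(means, stds)]


def select_batch(means, stds, beta, batch_size):
    """Return indices of the batch_size candidates with the highest UCB score."""
    scores = ucb_scores(means, stds, beta)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


def select_random(n_candidates, batch_size, seed=0):
    """Baseline: pick candidates uniformly at random without replacement."""
    return random.Random(seed).sample(range(n_candidates), batch_size)


def alternate_beta(cycle, explore_beta=2.0, exploit_beta=0.0):
    """HYPOTHETICAL schedule: explore on even cycles, exploit on odd ones."""
    return explore_beta if cycle % 2 == 0 else exploit_beta


# Toy GP posterior over four candidates (illustrative numbers only).
means = [0.9, 0.5, 0.7, 0.2]
stds = [0.05, 0.40, 0.10, 0.60]

for cycle in range(3):
    beta = alternate_beta(cycle)
    print(cycle, beta, select_batch(means, stds, beta, batch_size=2))
```

With a large beta the batch favours uncertain candidates (indices 3 and 1 here); with beta set to zero it favours the highest predicted means (indices 0 and 2). A schedule that mixes the two across the 11 cycles is one plausible reading of how the protocols differ.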

Overall analysis:

The following analysis is based on the three kernel performance heatmaps for MACCS, ChemBERTa, and ECFP:

  1. Dataset Performance:

    • MPRO consistently shows the highest performance across all three fingerprint types.
    • TYK2 consistently shows the lowest performance across all three fingerprint types.
    • USP7 and D2R generally show moderate performance, with USP7 often performing better than D2R.
  2. Kernel Performance:

    • Linear and Tanimoto kernels show identical performance within each fingerprint type, suggesting they may be functionally equivalent for these tasks.
    • The RBF (Radial Basis Function) kernel shows high variability across datasets and fingerprint types, often achieving the highest performance (e.g., MPRO with MACCS and ChemBERTa) but also the lowest in some cases (e.g., TYK2 with MACCS).
    • The Matern kernel generally underperforms compared to other kernels, especially with MACCS and ChemBERTa.
    • The RQ (Rational Quadratic) kernel consistently underperforms across all datasets and fingerprint types.
  3. Strategy Effectiveness:

    • ucb-exploit-heavy and ucb-alternate strategies frequently appear among top-performing combinations, especially for MPRO and USP7 datasets.
    • ucb-balanced strategy often performs well with Linear and Tanimoto kernels across different fingerprint types.
    • Random strategy, when used, consistently shows poor performance, most visibly with ChemBERTa and ECFP on the TYK2 dataset.
  4. Fingerprint Type Comparison:

    • ECFP shows the highest peak performance, reaching 0.994 on the MPRO dataset.
    • ChemBERTa matches ECFP's peak, also reaching 0.994 on the MPRO dataset.
    • MACCS shows a lower peak than ECFP and ChemBERTa, with a maximum of 0.776 on the MPRO dataset.
  5. Consistency:

    • ECFP and ChemBERTa show more consistently high performance across different kernels on the well-performing datasets (MPRO and USP7).
    • MACCS shows more variability in performance across kernels and datasets.
  6. Best Overall Combinations:

    • For MPRO: the RBF kernel with the ucb-exploit-heavy strategy performs best across all fingerprint types.
    • For USP7: the RBF or Matern kernel with the ucb-alternate strategy generally performs well, especially with ChemBERTa and ECFP.
    • For D2R and TYK2: Linear or Tanimoto kernel with ucb-balanced or ucb-gradual strategy often performs best, though performance is generally lower than for MPRO and USP7.