Related Works in Active Learning for Ligand Binding Affinity Prediction

Work	Goal	Datasets	Models	Representations	Kernels / Acquisition functions	Protocols	Pros	Cons
Gorantla et al. (2024)	Benchmark active learning (AL) protocols for affinity prediction using Gaussian process (GP) and Chemprop models.	TYK2, USP7, D2R, MPRO	GP and Chemprop	GP: ECFP8 and Morgan fingerprints; Chemprop: SMILES strings	Tanimoto	Various batch sizes, with random selection and uncertainty-based exploration strategies.	Provides a baseline for comparing AL protocols. Uses diverse datasets and models.	Limited exploration of kernels and representations.
Thompson et al. (2022)	Optimize AL for free energy calculations by exploring the impact of different design choices.	TYK2	EN, RF, GBR, MLP, GPR	RDKit Morgan fingerprints	Greedy, HS, PI, EI, UCB	Explored initial sample selection strategy, ML model, acquisition function, and the number of molecules sampled per iteration.	Generates an exhaustive relative binding free energy (RBFE) dataset for 10,000 congeneric molecules. Provides insights for AL design.	Limited diversity in the chemical library.
Konze et al. (2019)	Explore synthetically tractable chemical space and optimize potency of CDK2 inhibitors using AL and free energy calculations.	CDK2	AutoQSAR	AutoQSAR	Greedy	5 iterations with 1,000 molecules per iteration.	One of the first applications of AL to RBFE. Investigated a large chemical library.	Limited exploration of ML models and acquisition functions.
Gusev et al. (2023)	Optimize lead compounds based on RBFE calculations using AL.	Multiple	AutoQSAR	AutoQSAR	Mixed	8 iterations with 30-45 molecules sampled per iteration.	Demonstrated the use of AL in lead optimization.	Limited information about the specific AL parameters and datasets used.
Khalak et al. (2022)	Explore chemical space using AL and alchemical free energies.	Multiple	Ensemble MLP	RDKit 2D and 3D descriptors	Narrowing, Greedy, Random, Mixed, Uncertainty	7 iterations with 100 molecules sampled per iteration.	Investigated a variety of acquisition functions and compared them to random selection.	Limited exploration of ML models.
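
Several rows above name acquisition functions (Greedy, PI, EI, UCB). As a reading aid, here is a minimal sketch of their textbook forms. It assumes a maximization setting (a higher predicted affinity score is better) and a surrogate model that returns a predictive mean and standard deviation per molecule. The function names, the `beta` default, and the `select_batch` helper are illustrative choices, not taken from any of the cited papers.

```python
import math


def _norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _norm_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def greedy(mean, std, best):
    """Pure exploitation: rank by predicted mean only."""
    return mean


def ucb(mean, std, best, beta=2.0):
    """Upper confidence bound; beta (assumed value) trades off exploration."""
    return mean + beta * std


def probability_of_improvement(mean, std, best):
    """Probability that the candidate beats the best value seen so far."""
    if std <= 0.0:
        return 1.0 if mean > best else 0.0
    return _norm_cdf((mean - best) / std)


def expected_improvement(mean, std, best):
    """Expected amount by which the candidate beats the best value so far."""
    if std <= 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * _norm_cdf(z) + std * _norm_pdf(z)


def select_batch(means, stds, best, score_fn, batch_size):
    """Return indices of the top-scoring candidates under score_fn."""
    scores = [score_fn(m, s, best) for m, s in zip(means, stds)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]
```

With this sketch, greedy selection favors the candidate with the highest predicted mean, while UCB can prefer a candidate with a lower mean but a much larger predictive uncertainty.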

Our Research (based on the conversation transcript):

Expanding the AL benchmarking study by Gorantla et al. (2024):
- Using the same datasets (TYK2, USP7, D2R, MPRO).
- Incorporating a wider range of AL protocols, kernels, and representations.
- Adding ChemBERTa fine-tuning to the models.
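
For orientation, an AL protocol of the kind being benchmarked (initial sample, batch size, number of iterations, random versus uncertainty-based selection) can be sketched as below. This is a toy illustration only: the nearest-neighbor surrogate is a stand-in for a GP or Chemprop model, molecules are represented by distinct 1-D floats, and every name and parameter is hypothetical.

```python
import random


def nearest_neighbor_surrogate(labeled, x):
    """Toy stand-in for a real model: predict the label of the closest
    labeled point and use the distance to it as an uncertainty proxy."""
    nearest_x, nearest_y = min(labeled, key=lambda pair: abs(pair[0] - x))
    return nearest_y, abs(nearest_x - x)


def active_learning_loop(pool, oracle, n_initial, batch_size, n_iterations,
                         strategy, seed=0):
    """Run a batch AL loop and return the list of (x, y) labeled pairs.

    pool must contain distinct values; oracle(x) returns the true label
    (in practice an expensive experiment or free energy calculation).
    """
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)

    # Random initial sample.
    labeled = [(x, oracle(x)) for x in unlabeled[:n_initial]]
    unlabeled = unlabeled[n_initial:]

    for _ in range(n_iterations):
        if not unlabeled:
            break
        if strategy == "random":
            batch = rng.sample(unlabeled, min(batch_size, len(unlabeled)))
        elif strategy == "uncertainty":
            # Rank once per iteration by uncertainty; the surrogate is not
            # updated within a batch.
            ranked = sorted(
                unlabeled,
                key=lambda x: nearest_neighbor_surrogate(labeled, x)[1],
                reverse=True,
            )
            batch = ranked[:batch_size]
        else:
            raise ValueError(f"unknown strategy: {strategy}")

        labeled.extend((x, oracle(x)) for x in batch)
        chosen = set(batch)
        unlabeled = [x for x in unlabeled if x not in chosen]

    return labeled


# Example: 50 candidate "molecules" and a made-up affinity landscape.
pool = [i / 10 for i in range(50)]
labeled = active_learning_loop(
    pool, oracle=lambda x: -(x - 2.5) ** 2,
    n_initial=5, batch_size=3, n_iterations=4, strategy="uncertainty",
)
```

Changing `batch_size`, `n_iterations`, or `strategy` in this sketch corresponds to varying the protocol; a real benchmark would swap in the actual model and datasets.
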
Findings:
- TYK2 remains challenging for AL.
- No single best-performing protocol or kernel across all datasets.
- Linear and Tanimoto kernels show consistent performance.
- Matern and RBF kernels are high-risk, high-reward.
- RQ kernel consistently underperforms.
- Different molecular representations capture different molecular properties.
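
To make the kernel comparison concrete, the sketch below computes linear, Tanimoto, and RBF similarities for binary fingerprints represented as sets of on-bit indices. These are the standard textbook definitions, not the exact implementations used in the experiments; the function names, the `length_scale` default, the example fingerprints, and the convention that two empty fingerprints have Tanimoto similarity 1.0 are all assumptions.

```python
import math


def linear_kernel(a, b):
    """Dot product of two binary vectors = number of shared on-bits."""
    return float(len(a & b))


def tanimoto_kernel(a, b):
    """Shared on-bits divided by total distinct on-bits (Jaccard index)."""
    union = len(a | b)
    if union == 0:
        return 1.0  # assumed convention for two empty fingerprints
    return len(a & b) / union


def rbf_kernel(a, b, length_scale=1.0):
    """RBF kernel; for binary vectors the squared Euclidean distance is the
    number of bits set in exactly one of the two fingerprints."""
    sq_dist = len(a ^ b)
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))


# Two hypothetical fingerprints sharing three of five distinct on-bits.
fp_a = frozenset({1, 5, 9, 42})
fp_b = frozenset({5, 9, 42, 77})

print(tanimoto_kernel(fp_a, fp_b))  # 3 shared / 5 distinct = 0.6
print(linear_kernel(fp_a, fp_b))    # 3.0
```

The Tanimoto value is bounded in [0, 1] regardless of fingerprint length, whereas the RBF value depends strongly on `length_scale`. That sensitivity is one plausible reason, offered here as a conjecture, why RBF-style kernels could behave less consistently across datasets.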

Potential Pros:

Builds upon a recent benchmarking study.
Explores a broader range of AL parameters.
Investigates a fine-tuned chemical language model (ChemBERTa) for affinity prediction.

Potential Cons:

TYK2 dataset still poses challenges.
Lack of a universally applicable AL strategy.

Next steps (from the conversation):

Create a draft paper with detailed results.
Analyze the embedding vectors to understand dataset-level differences.
Develop heuristics for protocol selection based on the embedding vectors.
Explore the applicability of Gaussian process models for AL.
Investigate the performance of GP on the BAL datasets.

Overall, this research aims to improve the understanding of active learning for ligand binding affinity prediction by extending existing benchmarks and addressing their limitations.