Document Type

Article

Publication Date

1-28-2026

Abstract

Scaffold-aware artificial intelligence (AI) models enable systematic exploration of chemical space conditioned on protein-interacting ligands, yet the representational principles governing their behavior remain poorly understood. The computational representation of structurally complex kinase small molecules remains a formidable challenge due to the high conservation of ATP active site architecture across the kinome and the topological complexity of structural scaffolds in current generative AI frameworks. In this study, we present a diagnostic, modular, and chemistry-first generative framework for the design of targeted SRC kinase ligands by integrating ChemVAE-based latent space modeling, a chemically interpretable structural similarity metric (Kinase Likelihood Score), Bayesian optimization, and cluster-guided local neighborhood sampling. Using a comprehensive dataset of protein kinase ligands, we examine scaffold topology, latent-space geometry, and model-driven generative trajectories. We show that chemically distinct scaffolds can converge toward overlapping latent representations, revealing intrinsic degeneracy in scaffold encoding, while specific topological motifs function as organizing anchors that constrain generative diversification. The results demonstrate that kinase scaffolds spanning 37 protein kinase families spontaneously organize into a coherent, low-dimensional manifold in latent space, with SRC-like scaffolds acting as a structural “hub” that enables rational scaffold transformation. Our local sampling approach successfully converts scaffolds from other kinase families (notably LCK) into novel SRC-like chemotypes, with LCK-derived molecules accounting for ~40% of high-similarity outputs. However, both generative strategies reveal a critical limitation: SMILES-based representations systematically fail to recover multi-ring aromatic systems, a topological hallmark of kinase chemotypes, despite ring count being a top feature in our structural similarity metric. This “representation gap” demonstrates that no amount of scoring refinement can compensate for a generative engine that cannot access topologically constrained regions. By diagnosing these constraints within a transparent pipeline and reframing scaffold-aware ligand design as a problem of molecular representation, our work provides a conceptual framework for interpreting generative model behavior and for guiding the incorporation of structural priors into future molecular AI architectures.

Comments

This article was originally published in Biomolecules, volume 16, issue 2, in 2026. https://doi.org/10.3390/biom16020209

biomolecules-16-00209-s001.zip (46644 kB)
Supplementary Materials

Peer Reviewed

Yes

Copyright

The authors

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
