Abstract:
Recent advances in convolutional neural networks have inspired the application of deep learning to other
disciplines. Even though image processing and natural language processing have turned out to be the most
successful, there are many other domains that have also benefited; among them, life sciences in general
and chemistry and drug design in particular. In line with this observation, since 2018 the
scientific community has seen a surge of methodologies for generating diverse molecular
libraries using machine learning. However, to date, attention mechanisms have not been employed for
the problem of de novo molecular generation. Here we employ a variant of the transformer, an architecture
recently developed for natural language processing, for this purpose. Our results indicate that the
adapted Transmol model is indeed applicable for the task of generating molecular libraries and leads to
statistically significant increases in some of the core metrics of the MOSES benchmark. The presented
model can be tuned to either input-guided or diversity-driven generation modes by applying a standard
one-seed and a novel two-seed approach, respectively. Accordingly, the one-seed approach is best
suited for the targeted generation of focused libraries composed of close analogues of the seed
structure, while the two-seed approach allows us to dive deeper into under-explored regions of
chemical space by attempting to generate molecules that resemble both seeds. To gain more
insight into the scope of the one-seed approach, we devised a new validation workflow that involves
the recreation of known ligands for an important biological target, the vitamin D receptor. To further benefit
the chemical community, the Transmol algorithm has been incorporated into our cheML.io web
database of ML-generated molecules as a second-generation on-demand methodology.