Abstract:
Drug discovery and development is a complex, time-consuming, and costly multi-step
process. One of its key steps is the identification of binding sites between proteins
and ligands. Various data resources can support high-throughput in silico screening,
the search for drug candidates in which a molecule with the required chemical and
biological properties is sought for a specific target so that it can be used in the
downstream stages of drug development. These large databases in combination with big data
processing methods, such as data mining, data fusion, and data integration, are attracting
considerable attention from scientific communities studying the automation of drug
design. Big data processing methods are efficient enough to screen molecular properties
and support drug design for millions of chemical compounds. In particular, the automated analysis
and prediction of protein binding sites and potential ligand conformations can
accelerate the drug development process. However, current automated drug design still
has many gaps, particularly in the processing and analysis of protein-ligand
interactions, where improvements would increase the accuracy of the final prediction.
One area for development is the removal of inaccuracies and errors introduced when the
training sets of compounds are initially created and processed. These problems are critical to solve because
mistakes can lead to serious health and economic consequences, such as harm to patients
and severe financial losses for stakeholders. The proposed approach considers a
multi-step pipeline for protein-ligand interaction analysis based on a large dataset
of compounds from the Protein Data Bank. This pipeline is ultimately intended to give
specialists in molecular chemistry a practical way to manipulate large-scale chemical
data with familiar software, without consuming large amounts of computing power.