AntiMicrobial-Knowledge Graph data resource: Machine learning model to identify new antibiotics

The AntiMicrobial Knowledge Graph (KG) is an exhaustive data warehouse of experimentally validated antimicrobial chemicals covering Gram-positive, Gram-negative, and acid-fast bacteria as well as fungi. The AntiMicrobial-KG connects data from different compound libraries to aid the identification of new antimicrobial compounds. The machine learning (ML) models have been trained on publicly available in vitro data, which represent the first aggregated MIC (minimum inhibitory concentration) dataset in a FAIR-compliant format. The models are customizable and open source. They are also interpretable, making it possible to decipher the physicochemical properties required for antibacterial and antifungal activity, which supports chemical optimization in antimicrobial drug discovery.

The model has been published in the Journal of Chemical Information and Modeling, and all resources are available for download from the AntiMicrobial-KG website at SciLifeLab, where users can search the database and use the pre-trained models for compound activity prediction. The AntiMicrobial-KG can also be downloaded and run locally.

  • The antimicrobial knowledge graph (AntiMicrobial-KG) is a repository for collecting and visualizing public in vitro antibacterial assay data. Using this data, we built ML models that efficiently scan compound libraries to identify compounds with the potential to exhibit antimicrobial activity. The strategy involved training seven classic ML models across six compound fingerprint representations.
  • Gadiya Y, Genilloud O, Bilitewski U, et al. Predicting Antimicrobial Class Specificity of Small Molecules Using Machine Learning. Journal of Chemical Information and Modeling. Published online February 23, 2025. 2025;65(5):2416-2431. doi:10.1021/acs.jcim.4c02347
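The training strategy in the first bullet above (seven models across six fingerprint representations) amounts to a grid of 42 model/fingerprint combinations. The following is a minimal sketch of such a grid, not the authors' pipeline: the model and fingerprint names are placeholders because they are not listed here, and `toy_evaluate` is a hypothetical stand-in for whatever cross-validated scoring is actually used.

```python
from itertools import product

# Placeholder names (assumption): the text gives the counts, not the actual
# models or fingerprints, so generic labels are used instead.
MODELS = [f"model_{i}" for i in range(1, 8)]              # seven models
FINGERPRINTS = [f"fingerprint_{i}" for i in range(1, 7)]  # six fingerprints


def run_grid(evaluate):
    """Score every model/fingerprint pair and return (best_pair, all_scores)."""
    scores = {(m, fp): evaluate(m, fp) for m, fp in product(MODELS, FINGERPRINTS)}
    best = max(scores, key=scores.get)
    return best, scores


def toy_evaluate(model, fingerprint):
    """Hypothetical scorer: returns a deterministic dummy value in [0, 0.9]."""
    model_idx = int(model.split("_")[1])
    fp_idx = int(fingerprint.split("_")[1])
    return ((model_idx * 7 + fp_idx * 3) % 10) / 10


best, scores = run_grid(toy_evaluate)
print(len(scores), "combinations evaluated; best:", best)
```

In practice `toy_evaluate` would be replaced by a function that featurizes the compounds with the given fingerprint, trains the given model, and returns a validation metric.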

The construction of the AntiMicrobial-KG involved collecting minimum inhibitory concentration (MIC) data from three different public data resources. The resulting knowledge graph can be accessed at https://antimicrobial-kg.serve.scilifelab.se/.
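Combining MIC data from several resources usually means reconciling repeated measurements of the same compound. The sketch below shows one way this could be done, by taking the median MIC per compound and pathogen class. It is an illustration under stated assumptions, not the project's actual aggregation code: the record fields, source names, compound identifiers, unit (µg/mL), and the choice of the median are all hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records (assumption): field names, identifiers, and values are
# illustrative only and do not come from the AntiMicrobial-KG.
records = [
    {"source": "source_a", "compound": "CMPD-1", "pathogen_class": "Gram-positive", "mic_ug_ml": 2.0},
    {"source": "source_b", "compound": "CMPD-1", "pathogen_class": "Gram-positive", "mic_ug_ml": 4.0},
    {"source": "source_c", "compound": "CMPD-1", "pathogen_class": "Gram-positive", "mic_ug_ml": 8.0},
    {"source": "source_a", "compound": "CMPD-1", "pathogen_class": "fungi", "mic_ug_ml": 64.0},
    {"source": "source_b", "compound": "CMPD-2", "pathogen_class": "Gram-negative", "mic_ug_ml": 16.0},
    {"source": "source_c", "compound": "CMPD-2", "pathogen_class": "Gram-negative", "mic_ug_ml": 16.0},
]


def aggregate_mic(rows):
    """Collapse repeated measurements to a median MIC per (compound, pathogen class)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["compound"], row["pathogen_class"])].append(row["mic_ug_ml"])
    return {key: median(values) for key, values in grouped.items()}


aggregated = aggregate_mic(records)
for (compound, pathogen_class), mic in sorted(aggregated.items()):
    print(compound, pathogen_class, mic)
```

The median is used here only because it is robust to a single outlying measurement; other rules (for example, keeping the lowest reported MIC) are equally plausible.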

Data and models are also available for download from the IMI-COMBINE GitHub repository and Zenodo.

How was the AntiMicrobial-KG developed?

The AntiMicrobial-KG database was developed within the framework of the Innovative Medicines Initiative (IMI) AMR Accelerator program's Scientific Interest Group on Machine Learning, coordinated by the COMBINE project. Using the data in the AntiMicrobial-KG, the AMR Accelerator projects built ML models that efficiently scan compound libraries to identify compounds with the potential to exhibit antimicrobial activity. Using Random Forest and XGBoost algorithms, they developed classification models for four classes of microorganisms (Gram-positive, Gram-negative, acid-fast, and fungi) that outperform existing models. The ML model has been tested on the EU-OPENSCREEN screening library to demonstrate its applicability in a laboratory setting.

The ML model is trained on a large aggregation of public bioassay datasets. The model and code can be downloaded and retrained on other datasets, expanding the applicability of the model and the confidence of its predictions for research and development. The AntiMicrobial-KG and the models are built with Python scripts for model training, exploratory analysis, and KG generation, all available on GitHub.

How does the AntiMicrobial-KG add value?

Selecting compounds based on model predictions could decrease the experimental cost of antimicrobial screening. Because the model is pre-trained, it can be used to build a compound library from scratch, enriched with chemicals that are more likely to show activity in vitro. Instead of running high-throughput screens across entire compound libraries to find an active compound, researchers can use the model in early screening to filter each library into a smaller subset with a higher probability of activity.
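The filtering step can be sketched as a simple threshold on predicted activity probabilities. This is a minimal illustration, assuming the model's output is available as one probability per compound. The compound identifiers, probabilities, and the 0.5 cutoff are hypothetical, and the function is not part of the published code.

```python
def select_for_screening(predictions, threshold=0.5):
    """Return (compound, probability) pairs at or above the threshold, most confident first."""
    hits = [(cid, p) for cid, p in predictions.items() if p >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical predicted probabilities of activity for a five-compound library.
predictions = {
    "CMPD-1": 0.91,
    "CMPD-2": 0.12,
    "CMPD-3": 0.67,
    "CMPD-4": 0.05,
    "CMPD-5": 0.48,
}

subset = select_for_screening(predictions, threshold=0.5)
fraction_screened = len(subset) / len(predictions)
print(f"Screening {len(subset)} of {len(predictions)} compounds ({fraction_screened:.0%})")
```

Raising the threshold shrinks the subset and the screening cost further, at the risk of discarding true actives that the model scores lower.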