Tencent AI Lab Large-Scale Pre-trained Graph Neural Network Model for Molecular Representation

Introduction

AI-aided drug discovery suffers from two thorny problems:

  1. Insufficient labeled data for drug molecules;
  2. Insufficient generalization capability of existing models: a model trained on one molecular dataset rarely generalizes to another.

To address these problems, we developed GROVER, a new Transformer-based graph neural network encoder that captures the rich implicit structural and semantic information in molecules. We designed a self-supervised training strategy to pre-train GROVER on a large corpus of unlabeled molecular data (about 10M unlabeled molecules). We hope that releasing the GROVER models will help boost the performance of drug discovery applications such as molecular property prediction and virtual screening.

Highlights

GROVER encodes rich structural information about molecules through its self-supervised pre-training tasks. It also produces atom feature vectors and molecular fingerprints, which can serve as inputs to downstream tasks. GROVER is built on deep graph neural networks and is differentiable end to end, so it is easy to fine-tune together with a specific drug discovery task to achieve better performance.

Results on Molecular Property Prediction

Model        BBBP  SIDER  ClinTox  BACE  Tox21  ToxCast
GROVERbase
GROVERlarge

Model        FreeSolv  ESOL  Lipo  QM7  QM8
GROVERbase
GROVERlarge

How to Use

Researchers and practitioners can use GROVER models in two ways:

  1. Without fine-tuning: Use the output of GROVER directly as molecular fingerprints.
  2. With fine-tuning: Use the GROVER model as a building block in drug development projects that need to encode drug molecules in an end-to-end fashion.
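
The first way (no fine-tuning) amounts to treating each fingerprint as a fixed feature vector and fitting a small model on top of it. The following is only an illustrative sketch, not the method used in the paper: the fingerprints and labels are random stand-ins, the 16-dimensional size is arbitrary, and the helper names `fit_ridge` and `predict` are hypothetical. It uses closed-form ridge regression in NumPy as the downstream model.

```python
import numpy as np


def fit_ridge(features, targets, alpha=1.0):
    """Closed-form ridge regression with an unpenalized bias term."""
    # Append a column of ones so the last weight acts as the bias.
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    penalty = alpha * np.eye(design.shape[1])
    penalty[-1, -1] = 0.0  # do not regularize the bias
    return np.linalg.solve(design.T @ design + penalty, design.T @ targets)


def predict(features, weights):
    """Apply the fitted weights (including bias) to new feature vectors."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    return design @ weights


# Random stand-ins for pre-computed fingerprints and a measured property.
rng = np.random.default_rng(0)
fingerprints = rng.normal(size=(200, 16))
hidden_weights = rng.normal(size=16)
labels = fingerprints @ hidden_weights + 0.01 * rng.normal(size=200)

# Fit on the first 150 molecules, evaluate on the remaining 50.
weights = fit_ridge(fingerprints[:150], labels[:150], alpha=1e-3)
predictions = predict(fingerprints[150:], weights)
rmse = float(np.sqrt(np.mean((predictions - labels[150:]) ** 2)))
print("held-out RMSE:", rmse)
```

In practice the random arrays would be replaced by real fingerprints and measured property values, and any standard regressor or classifier could take the place of ridge regression.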

 

Downloads

The downloadable resources come in two parts.

  1. The pre-trained models: We provide the pre-trained models used in the paper.

  2. The fingerprints of 2M molecules, for researchers without machine learning experience. Because the npz file of generated fingerprints for all 2M molecules is very large (13 GB), we release the fingerprints of 700K molecules collected from the public MoleculeNet benchmarks. The fingerprints are split into four separate npz files, each containing up to 200K fingerprints.

Basic Usage

The pretrained model

Please refer to our GitHub page for more details.

Fingerprints

Here is the example code snippet to load the pre-computed fingerprints.
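
As a minimal sketch, assuming the released files are standard NumPy npz archives: the array names stored inside them are not stated on this page, so the code below lists whatever names the archive contains instead of assuming one. The function names `load_fingerprints` and `describe`, and the file name in the usage note, are hypothetical.

```python
import numpy as np


def load_fingerprints(npz_path):
    """Read every array stored in an npz archive into a dict keyed by array name."""
    # np.load refuses pickled object arrays by default; if a file stores
    # such arrays, allow_pickle=True would be needed (only for trusted files).
    with np.load(npz_path) as archive:
        return {name: archive[name] for name in archive.files}


def describe(arrays):
    """Summarize each loaded array as (shape, dtype name)."""
    return {name: (arr.shape, str(arr.dtype)) for name, arr in arrays.items()}
```

For example, `describe(load_fingerprints("fingerprints_part1.npz"))` (a placeholder file name) would report the name, shape, and dtype of each stored array, which is a reasonable first check before feeding the fingerprints to a downstream model.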

Reference

Acknowledgment

The GPU resources and distributed training optimization are supported by the Tencent Jizhi (机智) Team and the TEG Operations Management Department. For more information, please refer to link.

Contacts

If you have any questions, please contact us: Link
