Published August 24, 2022 | Version v1
Publication
PeTriBERT : Augmenting BERT with tridimensional encoding for inverse protein folding and design
Contributors
Others:
- Scientific Data Management (ZENITH) ; Centre Inria d'Université Côte d'Azur (CRISAM) ; Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM) ; Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)
- Bionomeex [Montpellier]
- Institut des Sciences des Plantes de Montpellier (IPSIM) ; Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche pour l'Agriculture, l'Alimentation et l'Environnement (INRAE)-Institut Agro Montpellier ; Institut national d'enseignement supérieur pour l'agriculture, l'alimentation et l'environnement (Institut Agro)-Institut national d'enseignement supérieur pour l'agriculture, l'alimentation et l'environnement (Institut Agro)-Université de Montpellier (UM)
Description
Proteins are biology's workhorses. Since the recent breakthrough of novel folding methods, the amount of available structural data has been increasing, closing the gap between data-driven sequence-based and structure-based methods. In this work, we focus on the inverse folding problem, which consists of predicting an amino-acid primary sequence from a protein's 3D structure. For this purpose, we introduce a simple Transformer model from Natural Language Processing, augmented with 3D structural data. We call the resulting model PeTriBERT: Proteins embedded in tridimensional representation in a BERT model. We train this small 40-million-parameter model on more than 350,000 protein sequences retrieved from the newly available AlphaFoldDB database. Using PeTriBERT, we are able to generate in silico totally new proteins with a GFP-like structure. Nine out of ten of these GFP structural homologues show no resemblance when searched with BLAST against the whole entry proteome database. This shows that PeTriBERT indeed captures protein folding rules and can become a valuable tool for de novo protein design.
Additional details
Identifiers
- URL
- https://hal.inrae.fr/hal-03759515
- URN
- urn:oai:HAL:hal-03759515v1
Origin repository
- Origin repository
- UNICA