Published November 2021 | Version v1
Journal article

Extending Approximate Bayesian Computation with Supervised Machine Learning to infer demographic history from genetic polymorphisms using DIYABC Random Forest

Others:
Institut Montpelliérain Alexander Grothendieck (IMAG) ; Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)
Institut Sophia Agrobiotech (ISA) ; Université Nice Sophia Antipolis (1965 - 2019) (UNS) ; COMUE Université Côte d'Azur (2015-2019) (COMUE UCA)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche pour l'Agriculture, l'Alimentation et l'Environnement (INRAE)-Université Côte d'Azur (UCA)
Centre de Biologie pour la Gestion des Populations (UMR CBGP) ; Centre de Coopération Internationale en Recherche Agronomique pour le Développement (Cirad)-Université de Montpellier (UM)-Institut de Recherche pour le Développement (IRD [France-Sud])-Institut National de Recherche pour l'Agriculture, l'Alimentation et l'Environnement (INRAE)-Institut Agro - Montpellier SupAgro ; Institut national d'enseignement supérieur pour l'agriculture, l'alimentation et l'environnement (Institut Agro)
This work was supported by funds from the French Agence Nationale de la Recherche (projects SWING ANR-16-CE02-0015, GANDHI ANR-20-CE02-0018, and ABSint ANR-18-CE40-0034), the INRAE scientific division SPE (AAP-SPE 2016), and the LabEx NUMEV (ANR-10-LABX-0020).
ANR-16-CE02-0015,SWING,Worldwide invasion of the spotted-wing drosophila: genetics, plasticity and evolutionary potential(2016)
ANR-20-CE02-0018,GANDHI,Invasion genomics of the ladybird Harmonia axyridis(2020)
ANR-18-CE40-0034,ABSint,Approximate Bayesian solutions for inference on large datasets and complex models(2018)
ANR-10-LABX-0020,NUMEV,Digital and Hardware Solutions and Modeling for the Environment and Life Sciences(2010)

Description

Simulation-based methods such as Approximate Bayesian Computation (ABC) are well suited to the analysis of complex scenarios of the genetic history of populations and species. In this context, supervised machine learning (SML) methods provide attractive statistical solutions for efficient inference about scenario choice and parameter estimation. The Random Forest (RF) methodology is a powerful ensemble SML method used for classification and regression problems. RF allows inferences to be made at a low computational cost, without preliminary selection of the relevant components of the ABC summary statistics and without having to derive ABC tolerance levels. We have implemented a set of RF algorithms to process inferences using simulated datasets generated from an extended version of the population genetic simulator implemented in DIYABC v2.1.0. The resulting computer package, named DIYABC Random Forest v1.0, integrates two functionalities into a user-friendly interface: the simulation, under custom evolutionary scenarios, of different types of molecular data (microsatellites, DNA sequences or SNPs), and RF treatments, including statistical tools to evaluate the power and accuracy of inferences. We illustrate the functionalities of DIYABC Random Forest v1.0 for both scenario choice and parameter estimation through the analysis of pseudo-observed and real datasets corresponding to pool-sequencing and individual-sequencing SNP datasets. Because of the properties inherent to the implemented RF methods and the large feature vector (including various summary statistics and their linear combinations) available for SNP data, DIYABC Random Forest v1.0 can efficiently contribute to the analysis of large SNP datasets to make inferences about complex population genetic histories.
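
The scenario-choice workflow summarised in the abstract (simulate datasets under competing scenarios, reduce them to summary statistics, train a random forest on the resulting reference table, then classify the observed data) can be sketched on a toy problem. The Python sketch below is purely illustrative and is not the code or interface of DIYABC Random Forest. Everything in it is an assumption made for the example: the two scenarios (a large versus a small population), the Gaussian approximation of genetic drift, the three summary statistics, all function names, and the use of bagged depth-1 trees with random feature selection as a deliberately tiny stand-in for full-depth random forest trees.

```python
import random
import statistics

def simulate(scenario, rng, n_loci=30, n_gen=20):
    """Toy simulator (hypothetical): neutral drift of allele frequencies at
    independent loci, using a Gaussian approximation of binomial sampling."""
    # Draw the population size from a scenario-specific uniform prior.
    n = rng.randint(200, 400) if scenario == 0 else rng.randint(10, 20)
    freqs = []
    for _ in range(n_loci):
        p = 0.5
        for _ in range(n_gen):
            sd = (p * (1 - p) / (2 * n)) ** 0.5
            p = min(1.0, max(0.0, rng.gauss(p, sd)))  # clip to [0, 1]
        freqs.append(p)
    return freqs

def summaries(freqs):
    """Summary statistics: mean heterozygosity, variance of allele
    frequencies, and proportion of fixed loci."""
    het = [2 * p * (1 - p) for p in freqs]
    fixed = sum(p in (0.0, 1.0) for p in freqs) / len(freqs)
    return [statistics.mean(het), statistics.pvariance(freqs), fixed]

def majority(labels, default):
    return max(set(labels), key=labels.count) if labels else default

def fit_stump(X, y, rng, mtry=2):
    """Depth-1 tree: best threshold split among `mtry` random features."""
    best = None
    for j in rng.sample(range(len(X[0])), mtry):
        values = sorted({row[j] for row in X})
        # Candidate thresholds are midpoints between consecutive values.
        cuts = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
        for t in cuts:
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            ll, rl = majority(left, 0), majority(right, 1)
            errors = sum(lab != ll for lab in left) + sum(lab != rl for lab in right)
            if best is None or errors < best[0]:
                best = (errors, (j, t, ll, rl))
    return best[1]

def predict_stump(stump, x):
    j, t, ll, rl = stump
    return ll if x[j] <= t else rl

def fit_forest(X, y, rng, n_trees=50):
    """Bagging: one stump per bootstrap sample, plus the out-of-bag error."""
    n = len(X)
    forest, oob_votes = [], [[0, 0] for _ in range(n)]
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx], rng)
        forest.append(stump)
        for i in set(range(n)) - set(idx):  # rows this stump never saw
            oob_votes[i][predict_stump(stump, X[i])] += 1
    scored = [(v, lab) for v, lab in zip(oob_votes, y) if sum(v) > 0]
    oob_error = sum((v[1] > v[0]) != bool(lab) for v, lab in scored) / len(scored)
    return forest, oob_error

def predict_forest(forest, x):
    """Return the majority-vote scenario and the share of votes for scenario 1."""
    votes = sum(predict_stump(s, x) for s in forest)
    return int(2 * votes > len(forest)), votes / len(forest)

rng = random.Random(42)
labels = [rng.randrange(2) for _ in range(120)]           # scenario indices
table = [summaries(simulate(s, rng)) for s in labels]     # reference table
forest, oob_error = fit_forest(table, labels, rng)
observed = summaries(simulate(1, rng))                    # pseudo-observed data
chosen, vote_share = predict_forest(forest, observed)
print(f"OOB error: {oob_error:.3f}, chosen scenario: {chosen}, vote share: {vote_share:.2f}")
```

In this sketch the out-of-bag error gives a rough idea of how well the two toy scenarios can be told apart from the chosen summary statistics, in the spirit of the power evaluation mentioned in the abstract. The vote share is only the raw fraction of trees voting for scenario 1 and should not be read as a posterior probability.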

Abstract

International audience

Additional details

Created:
December 4, 2022
Modified:
November 29, 2023