Choice of V for V-Fold Cross-Validation in Least-Squares Density Estimation
- Creators
- Arlot, Sylvain
- Lerasle, Matthieu
- Others:
- Laboratoire de Mathématiques d'Orsay (LM-Orsay) ; Université Paris-Sud - Paris 11 (UP11)-Centre National de la Recherche Scientifique (CNRS)
- Statistical Machine Learning and Parsimony (SIERRA) ; Département d'informatique - ENS Paris (DI-ENS) ; École normale supérieure - Paris (ENS-PSL) ; Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS-PSL) ; Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria Paris-Rocquencourt ; Institut National de Recherche en Informatique et en Automatique (Inria)
- Département d'informatique - ENS Paris (DI-ENS) ; École normale supérieure - Paris (ENS-PSL) ; Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)
- Laboratoire Jean Alexandre Dieudonné (JAD) ; Université Nice Sophia Antipolis (1965 - 2019) (UNS) ; COMUE Université Côte d'Azur (2015-2019) (COMUE UCA)-COMUE Université Côte d'Azur (2015-2019) (COMUE UCA)-Centre National de la Recherche Scientifique (CNRS)-Université Côte d'Azur (UCA)
- GARGANTUA project (CNRS, Mastodons)
- ANR-11-BS01-0010,Calibration,Calibration statistique(2011)
- ANR-09-JCJC-0027,DETECT,DETECT: Nouvelles approches statistiques pour la vision artificielle et la bioinformatique(2009)
Description
This paper studies V-fold cross-validation for model selection in least-squares density estimation. The goal is to provide theoretical grounds for choosing V in order to minimize the least-squares loss of the selected estimator. We first prove a non-asymptotic oracle inequality for V-fold cross-validation and its bias-corrected version (V-fold penalization). In particular, this result implies that V-fold penalization is asymptotically optimal in the nonparametric case. Then, we compute the variance of V-fold cross-validation and related criteria, as well as the variance of key quantities for model selection performance. We show that these variances depend on V like 1+4/(V-1), at least in some particular cases, suggesting that the performance increases much from V=2 to V=5 or 10, and then is almost constant. Overall, this can explain the common advice to take V=5---at least in our setting and when the computational power is limited---, as supported by some simulation experiments. An oracle inequality and exact formulas for the variance are also proved for Monte-Carlo cross-validation, also known as repeated cross-validation, where the parameter V is replaced by the number B of random splits of the data.
Additional details
- URL
- https://hal.science/hal-00743931
- URN
- urn:oai:HAL:hal-00743931v3
- Origin repository
- UNICA