Published October 9, 2015 | Version v1
Publication

Choice of V for V-Fold Cross-Validation in Least-Squares Density Estimation

Others:
Laboratoire de Mathématiques d'Orsay (LM-Orsay) ; Université Paris-Sud - Paris 11 (UP11)-Centre National de la Recherche Scientifique (CNRS)
Statistical Machine Learning and Parsimony (SIERRA) ; Département d'informatique - ENS Paris (DI-ENS) ; École normale supérieure - Paris (ENS-PSL) ; Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria Paris-Rocquencourt ; Institut National de Recherche en Informatique et en Automatique (Inria)
Département d'informatique - ENS Paris (DI-ENS) ; École normale supérieure - Paris (ENS-PSL) ; Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)
Laboratoire Jean Alexandre Dieudonné (JAD) ; Université Nice Sophia Antipolis (1965 - 2019) (UNS) ; COMUE Université Côte d'Azur (2015-2019) (COMUE UCA)-Centre National de la Recherche Scientifique (CNRS)-Université Côte d'Azur (UCA)
GARGANTUA project (CNRS, Mastodons)
ANR-11-BS01-0010, Calibration, Statistical calibration (2011)
ANR-09-JCJC-0027, DETECT, DETECT: New statistical approaches for computer vision and bioinformatics (2009)

Description

This paper studies V-fold cross-validation for model selection in least-squares density estimation. The goal is to provide theoretical grounds for choosing V in order to minimize the least-squares loss of the selected estimator. We first prove a non-asymptotic oracle inequality for V-fold cross-validation and its bias-corrected version (V-fold penalization). In particular, this result implies that V-fold penalization is asymptotically optimal in the nonparametric case. We then compute the variance of V-fold cross-validation and related criteria, as well as the variance of key quantities for model selection performance. We show that these variances depend on V like 1+4/(V-1), at least in some particular cases, suggesting that performance improves considerably from V=2 to V=5 or 10 and is then almost constant. Overall, this can explain the common advice to take V=5, at least in our setting and when computational power is limited, as supported by simulation experiments. An oracle inequality and exact formulas for the variance are also proved for Monte-Carlo cross-validation (also known as repeated cross-validation), where the parameter V is replaced by the number B of random splits of the data.
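As a minimal sketch of the abstract's central quantity, the snippet below tabulates the relative variance factor 1+4/(V-1) for a few values of V. The function name is illustrative (not from the paper); the numbers show the pattern the abstract describes: a large drop from V=2 to V=5 or 10, then near-constant behavior.

```python
# Illustrative sketch: the relative variance factor 1 + 4/(V-1) from the
# abstract, which (in the cases covered by the paper) governs how the
# variance of V-fold criteria depends on V.
def vfold_variance_factor(V):
    """Return the relative variance factor 1 + 4/(V-1) for V >= 2."""
    if V < 2:
        raise ValueError("V-fold cross-validation requires V >= 2")
    return 1.0 + 4.0 / (V - 1)

for V in (2, 5, 10, 20, 100):
    print(f"V = {V:3d}  ->  factor = {vfold_variance_factor(V):.3f}")
# The factor falls from 5.0 at V=2 to 2.0 at V=5 and about 1.44 at V=10,
# after which further increases in V change it only marginally.
```

This makes concrete why V=5 is a reasonable compromise when computation is limited: most of the variance reduction is already achieved by then.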

Additional details

Created:
March 25, 2023
Modified:
December 1, 2023