Published December 17, 2021 | Version v1
Publication

Model-based Clustering with Missing Not At Random Data

Description

Traditional methods for handling missing values are not designed for clustering, and they rarely apply to the general case of Missing Not At Random (MNAR) values, which is nevertheless frequent in practice. This paper proposes to embed MNAR data directly within model-based clustering algorithms. We introduce a mixture model for different types of data (continuous, count, categorical and mixed) that jointly models the data distribution and the MNAR mechanism. Eight different MNAR models are proposed, in which the missingness may depend on the underlying (unknown) classes and/or on the values of the missing variables themselves. We prove the identifiability of the parameters of both the data distribution and the mechanism, whatever the type of data and the mechanism, and propose an EM or Stochastic EM algorithm to estimate them. The code is available at https://github.com/AudeSportisse/Clustering-MNAR. We also prove that MNAR models in which the missingness depends on the class membership have the useful property that statistical inference can be carried out on the data matrix concatenated with the mask by considering a MAR mechanism instead. Finally, we perform empirical evaluations of the proposed sub-models on synthetic data, and we illustrate the relevance of our method on a medical register, the TraumaBase® dataset.
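As a rough illustration of the "data matrix concatenated with the mask" idea mentioned above, the following sketch simulates a two-class Gaussian mixture in which the probability of a value being missing depends only on the class, then builds the augmented matrix. It is not taken from the paper or its repository: the sizes, class means, per-class missingness probabilities, and all variable names are assumptions chosen for illustration, and no clustering algorithm is fitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative settings (assumed, not from the paper)
n, d, K = 500, 2, 2
means = np.array([[0.0, 0.0], [3.0, 3.0]])  # one mean vector per class
miss_prob = np.array([0.1, 0.6])            # missingness probability per class

# Latent class labels and complete data from a Gaussian mixture
z = rng.integers(0, K, size=n)
y_full = means[z] + rng.normal(size=(n, d))

# Class-dependent missingness: the mask depends on z only, not on y
mask = rng.random((n, d)) < miss_prob[z][:, None]  # True means "missing"

# Observed data with NaN in place of missing entries
y_obs = np.where(mask, np.nan, y_full)

# Data matrix concatenated with the mask (d data columns, d indicator columns)
augmented = np.hstack([y_obs, mask.astype(float)])

print(augmented.shape)
print("missing rate per class:", [mask[z == k].mean() for k in range(K)])
```

Because the mask columns carry information about the class in this setting, a clustering method applied to the augmented matrix can exploit the missingness pattern itself, which is the intuition behind the property stated in the abstract.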

Additional details

Identifiers

URL
https://hal.archives-ouvertes.fr/hal-03494674
URN
urn:oai:HAL:hal-03494674v2

Origin repository
UNICA