Published March 2015 | Version v1
Report

Beyond Two-sample-tests: Localizing Data Discrepancies in High-dimensional Spaces

Description

Comparing two sets of multivariate samples is a central problem in data analysis. From a statistical standpoint, the simplest way to perform such a comparison is to resort to a non-parametric two-sample test (TST), which checks whether the two sets can be seen as i.i.d. samples of an identical unknown distribution (the null hypothesis). If the null is rejected, one wishes to identify regions accounting for this difference. This paper presents a two-stage method providing feedback on this difference, based upon a combination of statistical learning (regression) and computational topology methods.

Consider two populations, each given as a point cloud in R^d. In the first step, we assign a label to each set and we compute, for each sample point, a discrepancy measure based on comparing an estimate of the conditional probability distribution of the label given a position with the global unconditional label distribution. In the second step, we study the height function defined at each point by the aforementioned estimated discrepancy. Topological persistence is used to identify persistent local minima of this height function, their basins defining regions of points with high discrepancy and in spatial proximity.

Experiments are reported both on synthetic and real data (satellite images and handwritten digit images), ranging in dimension from d = 2 to d = 784, illustrating the ability of our method to localize discrepancies.

More generally, the ability to provide feedback downstream of a TST may prove of ubiquitous interest in exploratory statistics and data science.
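As an illustration of the first step only, the following is a minimal sketch in plain Python. It is not the report's implementation: the abstract does not name the regressor or the discrepancy measure, so the k-nearest-neighbour estimate, the Kullback-Leibler divergence between Bernoulli distributions, and every function name and parameter below are assumptions chosen for illustration. The second step (topological persistence on the resulting height function) is not sketched here.

```python
import math


def knn_label_fraction(points, labels, k):
    """Estimate P(label = 1 | position) at each sample point as the fraction
    of label-1 points among its k nearest neighbours (the point itself
    included). Brute force, O(n^2 log n); an assumed stand-in regressor."""
    fractions = []
    for p in points:
        order = sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))
        fractions.append(sum(labels[j] for j in order[:k]) / k)
    return fractions


def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)), with the convention 0 * log(0) = 0.
    Requires 0 < q < 1."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


def discrepancy(points, labels, k=10):
    """Per-point discrepancy: local label distribution versus the global one.
    labels[i] is 0 or 1 according to the population points[i] comes from;
    both populations must be non-empty."""
    q = sum(labels) / len(labels)  # global, unconditional P(label = 1)
    return [bernoulli_kl(p, q) for p in knn_label_fraction(points, labels, k)]
```

Under these assumptions, points lying in a region occupied by only one population receive a local label fraction of 0 or 1 and therefore a large discrepancy, while points in well-mixed regions receive values near zero.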

Additional details

Created: March 25, 2023
Modified: November 29, 2023