A Flexible Structured-based Representation for XML Document Mining

Vercoustre, Anne-Marie; Fegas, Mounir; Gul, Saba; Lechevallier, Yves

Published November 2005 | Version v1

Conference paper | Metadata-only

Contributors

Others:

Usage-centered design, analysis and improvement of information systems (AxIS) ; Centre Inria d'Université Côte d'Azur (CRISAM) ; Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Inria Paris-Rocquencourt ; Institut National de Recherche en Informatique et en Automatique (Inria)
Norbert Fuhr
Mounia Lalmas
Saadia Malik
Gabriella Kazai

This paper reports on the INRIA group's approach to XML mining while participating in the INEX XML Mining track 2005. We use a flexible representation of XML documents that can take into account either the structure only or both the structure and the content. Our approach consists of representing each XML document by a set of its sub-paths, defined according to some criteria (length, root beginning, leaf ending). By considering those sub-paths as words, we can use standard methods for vocabulary reduction, and simple clustering methods such as K-means that scale well. We actually use an implementation of the clustering algorithm known as "dynamic clouds" that can work with distinct groups of independent variables put in separate variables. This is useful in our model since embedded sub-paths are not independent: we split potentially dependent paths into separate variables, so that each variable contains only independent paths. Experiments with the INEX collections show good results for the structure-only collections, but our approach did not scale well to large structure-and-content collections.
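To make the sub-path representation concrete, here is a minimal Python sketch of how a document could be turned into a set of tag sub-paths treated as words. It is not the authors' implementation: the function names, the parameters (min_len, max_len, from_root, to_leaf) and the reading of the three criteria as "bounded length", "must start at the root" and "must end at a leaf" are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(elem, prefix=()):
    """Yield every root-to-leaf path of the tree as a tuple of tag names."""
    path = prefix + (elem.tag,)
    children = list(elem)
    if not children:
        yield path
    for child in children:
        yield from root_to_leaf_paths(child, path)


def sub_paths(xml_text, min_len=1, max_len=3, from_root=False, to_leaf=False):
    """Return the set of contiguous tag sub-paths of one XML document.

    Hypothetical criteria (assumed, not taken from the paper):
      min_len, max_len : bounds on the number of tags in a sub-path
      from_root        : keep only sub-paths that start at the document root
      to_leaf          : keep only sub-paths that end at a leaf element
    """
    root = ET.fromstring(xml_text)
    words = set()
    for path in root_to_leaf_paths(root):
        n = len(path)
        for start in range(n):
            if from_root and start != 0:
                break
            last_end = min(start + max_len, n)
            for end in range(start + min_len, last_end + 1):
                if to_leaf and end != n:
                    continue
                # Each sub-path becomes one "word" of the document.
                words.add("/".join(path[start:end]))
    return words


doc = "<article><title/><body><sec><p/></sec></body></article>"

# All sub-paths made of exactly two tags.
pairs = sub_paths(doc, min_len=2, max_len=2)

# Sub-paths of up to three tags that start at the root.
rooted = sub_paths(doc, min_len=1, max_len=3, from_root=True)
```

Each resulting set can then be fed to an ordinary bag-of-words pipeline (vocabulary reduction, then clustering). Note that "article/body" and "article/body/sec" both appear in the rooted set: such embedded sub-paths are the dependent features that the abstract says are split into separate variables.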

Abstract

This is the authors' version. To access the final version, go to the publisher's site through the DOI, or see http://www.springerlink.com

Additional details

URL: https://inria.hal.science/inria-00000839
URN: urn:oai:HAL:inria-00000839v2

Origin repository: UNICA
