Published 2007 | Version v1
Book section
Mining XML Documents
Contributors
Others:
- Groupe de Recherche en Apprentissage Automatique (GRAppA - LIFL) ; Université de Lille, Sciences et Technologies-Université de Lille, Sciences Humaines et Sociales-Centre National de la Recherche Scientifique (CNRS)
- Machine Learning and Information Retrieval (MALIRE) ; Laboratoire d'Informatique de Paris 6 (LIP6) ; Université Pierre et Marie Curie - Paris 6 (UPMC)-Centre National de la Recherche Scientifique (CNRS)
- Laboratoire Logiciels Systèmes Réseaux (LSR - IMAG) ; Université Joseph Fourier - Grenoble 1 (UJF)-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)
- Institute of Statistical Mathematics ; University of Graduate Studies
- INRIA Rocquencourt ; Institut National de Recherche en Informatique et en Automatique (Inria)
- Usage-centered design, analysis and improvement of information systems (AxIS) ; Centre Inria d'Université Côte d'Azur (CRISAM) ; Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Inria Paris-Rocquencourt ; Institut National de Recherche en Informatique et en Automatique (Inria)
- P. Poncelet
- F. Masseglia
- M. Teisseire
Description
XML documents are becoming ubiquitous because their rich and flexible format can be used for a variety of applications. Given the increasing size of XML collections as information sources, mining techniques that traditionally exist for text collections or databases need to be adapted, and new methods need to be invented, to exploit the particular structure of XML documents. XML documents can be seen as trees, which are well known to be complex structures. This chapter describes various ways of using and simplifying this tree structure to model documents and support efficient mining algorithms. We focus on three mining tasks: classification and clustering, which are standard for text collections, and the discovery of frequent tree structures, which is especially important for heterogeneous collections. This chapter presents some recent approaches and algorithms to support these tasks, together with experimental evaluations on a variety of large XML collections.
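As a rough illustration of what simplifying the tree structure can mean, the sketch below flattens each XML document into its set of root-to-node tag paths and keeps the paths that occur in at least a minimum number of documents. This is an assumed, deliberately simple stand-in for frequent structure discovery, not the chapter's algorithms; the names `tag_paths`, `frequent_paths`, and `min_support` and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths of one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        # Extend the current path with each child element's tag.
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def frequent_paths(docs, min_support):
    """Keep the tag paths that appear in at least min_support documents."""
    counts = Counter()
    for doc in docs:
        # Each document contributes a path at most once (set semantics).
        counts.update(tag_paths(doc))
    return {path for path, count in counts.items() if count >= min_support}


# Hypothetical toy collection with slightly different structures.
docs = [
    "<article><title>A</title><body><sec/></body></article>",
    "<article><title>B</title><body><p/></body></article>",
    "<article><body><sec/></body></article>",
]
result = frequent_paths(docs, min_support=2)
print(sorted(result))
```

Paths discard sibling order and branching, so they are much cheaper to count than subtrees but capture less structure. The same path sets could also serve as simple features for classification or clustering.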
Additional details
Identifiers
- URL
- https://inria.hal.science/inria-00188899
- URN
- urn:oai:HAL:inria-00188899v1
Origin repository
- Origin repository
- UNICA