Published July 2023 | Version v1
Conference paper

On the Robustness of Text Vectorizers

Others:
Université Côte d'Azur (UCA)
Modèles et algorithmes pour l'intelligence artificielle (MAASAI) ; Inria Sophia Antipolis - Méditerranée (CRISAM) ; Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Nice Sophia Antipolis (1965 - 2019) (UNS)-Laboratoire Jean Alexandre Dieudonné (LJAD) ; Université Nice Sophia Antipolis (1965 - 2019) (UNS)-Centre National de la Recherche Scientifique (CNRS)-Université Côte d'Azur (UCA)-Centre National de la Recherche Scientifique (CNRS)-Université Côte d'Azur (UCA)-Scalable and Pervasive softwARe and Knowledge Systems (Laboratoire I3S - SPARKS) ; Laboratoire d'Informatique, Signaux, et Systèmes de Sophia Antipolis (I3S) ; Université Nice Sophia Antipolis (1965 - 2019) (UNS)-Centre National de la Recherche Scientifique (CNRS)-Université Côte d'Azur (UCA)-Université Nice Sophia Antipolis (1965 - 2019) (UNS)-Centre National de la Recherche Scientifique (CNRS)-Université Côte d'Azur (UCA)-Laboratoire d'Informatique, Signaux, et Systèmes de Sophia Antipolis (I3S) ; Université Nice Sophia Antipolis (1965 - 2019) (UNS)-Centre National de la Recherche Scientifique (CNRS)-Université Côte d'Azur (UCA)-Centre National de la Recherche Scientifique (CNRS)
ANR-18-CE40-0005,GraVa,Méthodes variationnelles pour les signaux sur graphe(2018)
ANR-21-CE45-0010,NEMATIC,Analyse Modelisation et Simulation Multi-échelle(2021)
ANR-21-CE23-0005,NIM-ML,Méthodes d'explicabilité de nouvelle génération(2021)
European Project: H2020-951911,H2020-EU.2.1.1. - INDUSTRIAL LEADERSHIP - Leadership in enabling and industrial technologies - Information and Communication Technologies (ICT),AI4Media(2020)

Description

A fundamental issue in machine learning is the robustness of the model with respect to changes in the input. In natural language processing, models typically contain a first embedding layer, transforming a sequence of tokens into vector representations. While the robustness with respect to changes of continuous inputs is well-understood, the situation is less clear when considering discrete changes, for instance replacing a word by another in an input sentence. Our work formally proves that popular embedding schemes, such as concatenation, TF-IDF, and Paragraph Vector (a.k.a. doc2vec), exhibit robustness in the H\"older or Lipschitz sense with respect to the Hamming distance. We provide quantitative bounds for these schemes and demonstrate how the constants involved are affected by the length of the document. These findings are exemplified through a series of numerical examples.

Abstract

Accepted to ICML 2023. 33 pages, 10 figures

Abstract

International audience

Additional details

Created:
January 26, 2024
Modified:
January 26, 2024