Published July 9, 2020 | Version v1
Publication

A modular approach for lexical normalization applied to Spanish tweets

Description

Twitter is a social media platform with widespread success where millions of people continuously express ideas and opinions about a myriad of topics. It is a huge and interesting source of data but most of these texts are usually written hastily and very abbreviated, rendering them unsuitable for traditional Natural Language Processing (NLP). The two main contributions of this work are: the characterization of the textual error phenomena in Twitter and the proposal of a modular normalization system that improves the textual quality of tweets. Instead of focusing on a single technique, we propose an extensible normalization system that relies on the combination of several independent ''expert modules'', each one addressing an very specific error phenomenon in its own way, thus increasing module accuracy and lowering the module building costs. Broadly speaking, the system resembles to an ''expert board'': modules independently propose correction candidates for each Out of Vocabulary (OOV) word, rank the candidates and the best one is selected. In order to evaluate our proposal, we perform several experiments using texts from Twitter written in Spanish about a specific topic. The flexibility of defining resources at different language levels (core language, domain, genre) combined with the modular architecture lead to lower costs and a good performance: requiring a minimal effort for building the resources and achieving more than 82% of accuracy compared to the 31% yielded by the baseline.

Abstract

Ministerio de Economía y Competitividad TIN2012-38536-C03-02

Abstract

Junta de Andalucía P11-TIC-7684 MO

Additional details

Created:
March 27, 2023
Modified:
November 30, 2023