Published March 11, 2022 | Version v1
Publication
PolaritySpam: Propagating Content-based Information Through a Web-Graph to Detect Web Spam
Description
Spam web pages have become a problem for Information Retrieval systems
because they degrade the quality of the results these systems return. In this work
we tackle the problem of detecting these pages with a propagation algorithm that, taking
as input a web graph, chooses a set of spam and not-spam web pages in order to spread
their spam likelihood over the rest of the network. Thus we take advantage of the links
between pages to obtain a ranking of pages according to their relevance and their spam
likelihood. Our intuition is to give a high reputation to those pages related to
relevant ones, and a high spam likelihood to the pages linked to spam web pages.
We introduce the novelty of including the content of the web pages in the computation of
an a priori estimation of the spam likelihood of the pages, and propagate this information.
Our graph-based algorithm computes two scores for each node in the graph. Intuitively,
these values represent how bad or good (spam-like or not) a web page is, according to its
textual content and its relations in the graph. The experimental results show that our
method outperforms other techniques for spam detection.
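
The two-score propagation described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a personalized-PageRank-style update, scores flowing along outgoing links, a damping factor of 0.85, and content-based priors supplied as a plain dictionary. The names `personalized_rank`, `polarity_scores`, `links`, and `prior_spam` are hypothetical and chosen only for illustration.

```python
def personalized_rank(links, seed, damping=0.85, iterations=50):
    """Propagate a seed distribution over a graph (assumed update rule).

    links: dict mapping each node to the list of nodes it links to.
    seed:  dict mapping every node to a non-negative a priori weight.
    """
    nodes = list(seed)
    total = sum(seed.values())
    if total == 0:
        return {n: 0.0 for n in nodes}
    # Normalise the seed so the scores form a distribution.
    base = {n: seed[n] / total for n in nodes}
    score = dict(base)
    for _ in range(iterations):
        received = {n: 0.0 for n in nodes}
        dangling = 0.0
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = score[u] / len(targets)
                for v in targets:
                    received[v] += share
            else:
                # Nodes without out-links return their mass to the seed set.
                dangling += score[u]
        score = {
            n: (1 - damping) * base[n] + damping * (received[n] + dangling * base[n])
            for n in nodes
        }
    return score


def polarity_scores(links, prior_spam):
    """Return {node: (good_score, bad_score)}.

    prior_spam: dict mapping selected pages to an a priori spam likelihood
    in [0, 1], e.g. estimated from their textual content (assumption).
    Pages without a prior seed neither score.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    nodes |= set(prior_spam)
    bad_seed = {n: prior_spam.get(n, 0.0) for n in nodes}
    good_seed = {n: (1.0 - prior_spam[n]) if n in prior_spam else 0.0 for n in nodes}
    good = personalized_rank(links, good_seed)
    bad = personalized_rank(links, bad_seed)
    return {n: (good[n], bad[n]) for n in nodes}


# Toy graph: a likely-spam page links to "a", a likely-good page links to "b".
toy_links = {"spam1": ["a"], "good1": ["b"], "a": [], "b": []}
toy_prior = {"spam1": 0.9, "good1": 0.1}
toy_result = polarity_scores(toy_links, toy_prior)
```

In this toy graph, page `a` is reached only from the likely-spam seed, so its bad score exceeds its good score, while page `b` shows the opposite pattern. Comparing the two scores per page is one possible way to turn them into a ranking or a spam decision.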
Abstract
Ministerio de Educación y Ciencia HUM2007-66607-C04-04
Additional details
Identifiers
- URL
- https://idus.us.es/handle//11441/130681
- URN
- urn:oai:idus.us.es:11441/130681