Published March 11, 2022 | Version v1
Publication

PolaritySpam: Propagating Content-based Information Through a Web-Graph to Detect Web Spam

Description

Spam web pages have become a problem for Information Retrieval systems due to the negative effects that this phenomenon can cause in their results. In this work we tackle the problem of detecting these pages with a propagation algorithm that, taking as input a web graph, chooses a set of spam and not-spam web pages in order to spread their spam likelihood over the rest of the network. Thus we take advantage of the links between pages to obtain a ranking of pages according to their relevance and their spam likelihood. Our intuition consists in giving a high reputation to those pages related to relevant ones, and giving a high spam likelihood to the pages linked to spam web pages. We introduce the novelty of including the content of the web pages in the computation of an a priori estimation of the spam likelihood of the pages, and propagate this information. Our graph-based algorithm computes two scores for each node in the graph. Intuitively, these values represent how bad or good (spam-like or not) is a web page, according to its textual content and its relations in the graph. The experimental results show that our method outperforms other techniques for spam detection

Abstract

Ministerio de Educación y Ciencia HUM2007-66607-C04-04

Additional details

Created:
March 25, 2023
Modified:
November 30, 2023