Published August 5, 2020 | Version v1
Publication

Spam detection with a content-based random-walk algorithm

Description

In this work we tackle the problem of spam detection on the Web. Spam web pages have become a problem for Web search engines because of the negative effects they can have on retrieval results. Our approach is based on a random-walk algorithm that ranks pages according to their relevance and their spam likelihood. The novelty of our approach is that it takes the content of the web pages into account, both to characterize the web graph and to obtain an a priori estimate of the spam likelihood of each page. Our graph-based algorithm computes two scores for each node in the graph. Intuitively, these values represent how bad or good (spam-like or not) a web page is, according to its textual content and its relations in the graph. Our experiments show that the proposed technique outperforms other link-based techniques for spam detection.
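The description does not give the update rules, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a content classifier has already assigned each page a prior spam likelihood in [0, 1], and it runs two personalized PageRank-style walks: one that teleports in proportion to the "good" prior and one that teleports in proportion to the "spam" prior. The function name `two_score_walk`, the parameters `damping` and `iterations`, and the toy graph are all hypothetical.

```python
def two_score_walk(links, spam_prior, damping=0.85, iterations=50):
    """Return (good, bad) score dicts for every node in `links`.

    links: dict mapping each page to the list of pages it links to
           (every target must also be a key of `links`).
    spam_prior: dict mapping each page to an assumed content-based
                spam likelihood in [0, 1].
    """
    nodes = sorted(links)

    def normalize(weights):
        # Turn raw weights into a probability distribution over nodes.
        total = sum(weights.values())
        if total == 0:
            return {n: 1.0 / len(nodes) for n in nodes}
        return {n: w / total for n, w in weights.items()}

    def walk(teleport):
        # Power iteration for a random walk that restarts at `teleport`.
        score = dict(teleport)
        for _ in range(iterations):
            nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
            for src in nodes:
                targets = links[src]
                if targets:
                    share = damping * score[src] / len(targets)
                    for dst in targets:
                        nxt[dst] += share
                else:
                    # Dangling page: hand its mass back to the prior.
                    for n in nodes:
                        nxt[n] += damping * score[src] * teleport[n]
            score = nxt
        return score

    good_prior = normalize({n: 1.0 - spam_prior[n] for n in nodes})
    bad_prior = normalize({n: spam_prior[n] for n in nodes})
    return walk(good_prior), walk(bad_prior)


# Toy graph (hypothetical): "a" and "b" look legitimate, "s" and "t" look spammy.
links = {"a": ["b"], "b": ["a"], "s": ["a", "t"], "t": ["s"]}
spam_prior = {"a": 0.1, "b": 0.1, "s": 0.9, "t": 0.9}
good, bad = two_score_walk(links, spam_prior)
```

Comparing `good[page]` with `bad[page]` then gives a simple spam-like-or-not signal per page. How the two scores are actually coupled and combined in the published method is not stated here, so that comparison is an assumption of this sketch as well.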

Abstract

Ministerio de Educación y Ciencia HUM2007-66607-C04-04

Additional details

Created:
December 5, 2022
Modified:
November 27, 2023