Published August 5, 2020
| Version v1
Publication
Spam detection with a content-based random-walk algorithm
Description
In this work we tackle the problem of spam detection on the
Web. Spam web pages have become a problem for Web search
engines because of the negative effects this phenomenon can
have on their retrieval results. Our approach is based on a
random-walk algorithm that obtains a ranking of pages
according to their relevance and their spam likelihood. We
introduce the novelty of taking into account the content of the
web pages to characterize the web graph and to obtain an a
priori estimation of the spam likelihood of the web pages. Our
graph-based algorithm computes two scores for each node in the
graph. Intuitively, these values represent how bad or good
(spam-like or not) a web page is, according to its textual content
and its relations in the graph. Our experiments show that our
proposed technique outperforms other link-based techniques
for spam detection.
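The idea of computing two scores per node can be illustrated with a minimal sketch. It is not the authors' algorithm, whose details are not given in this record. It assumes a generic personalized random walk (PageRank-style) run twice, with the restart distribution biased once toward pages with a low a priori spam likelihood and once toward pages with a high one. The function name `two_score_walk`, the `damping` and `iterations` parameters, the toy graph, and the prior values are all hypothetical.

```python
def two_score_walk(links, spam_prior, damping=0.85, iterations=50):
    """Return (good, bad) score dicts for every node in the graph.

    links:      dict mapping a page to the list of pages it links to.
    spam_prior: dict mapping a page to an assumed a priori spam
                likelihood in [0, 1], e.g. derived from its content.
    Both walks are ordinary personalized random walks; this is an
    illustrative assumption, not the method of the publication.
    """
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})

    def normalise(weights):
        total = sum(weights.values())
        if total == 0:
            # Fall back to a uniform restart distribution.
            return {n: 1.0 / len(nodes) for n in nodes}
        return {n: w / total for n, w in weights.items()}

    def walk(restart):
        score = dict(restart)
        for _ in range(iterations):
            nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
            for src in nodes:
                targets = links.get(src, [])
                if targets:
                    share = damping * score[src] / len(targets)
                    for t in targets:
                        nxt[t] += share
                else:
                    # Dangling page: hand its mass back via the restart vector.
                    for n in nodes:
                        nxt[n] += damping * score[src] * restart[n]
            score = nxt
        return score

    # Pages missing from spam_prior get a neutral prior of 0.5.
    good_restart = normalise({n: 1.0 - spam_prior.get(n, 0.5) for n in nodes})
    bad_restart = normalise({n: spam_prior.get(n, 0.5) for n in nodes})
    return walk(good_restart), walk(bad_restart)


# Toy graph: "a" and "b" look legitimate, "s1" and "s2" look like spam.
links = {"a": ["b"], "b": ["a"], "s1": ["s2", "a"], "s2": ["s1"]}
spam_prior = {"a": 0.1, "b": 0.1, "s1": 0.9, "s2": 0.9}
good, bad = two_score_walk(links, spam_prior)
for page in sorted(good):
    print(page, round(good[page], 3), round(bad[page], 3))
```

In this toy setting a page whose bad score exceeds its good score would be flagged as likely spam; how the two scores are actually combined into a ranking is left open here.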
Abstract
Ministerio de Educación y Ciencia HUM2007-66607-C04-04
Additional details
Identifiers
- URL
- https://idus.us.es/handle//11441/100111
- URN
- urn:oai:idus.us.es:11441/100111