PolaritySpam: Propagating Content-based Information Through a Web-Graph to Detect Web Spam
Description
Spam web pages have become a problem for Information Retrieval systems because they degrade the quality of search results. In this work we tackle the detection of these pages with a propagation algorithm that takes a web graph as input and selects a set of spam and non-spam pages whose spam likelihood is then spread over the rest of the network. We thus exploit the links between pages to rank them according to both their relevance and their spam likelihood. Our intuition is to give a high reputation to pages related to relevant ones, and a high spam likelihood to pages linked to spam pages. The novelty of our approach is that the content of the web pages is used to compute an a priori estimate of each page's spam likelihood, and this information is then propagated. Our graph-based algorithm computes two scores for each node in the graph. Intuitively, these values represent how bad or good (spam-like or not) a web page is, according to its textual content and its relations in the graph. The experimental results show that our method outperforms other techniques for spam detection.
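As a rough illustration of the idea of propagating two scores through a web graph, the following minimal sketch uses a PageRank-style update with content-based priors. It is not the algorithm of the paper, whose exact formulation is not given in this record. The function name `propagate_scores`, the `damping` and `iterations` parameters, the direction of propagation (along out-links), and the example priors are all assumptions made for illustration.

```python
def propagate_scores(links, prior_good, prior_bad, damping=0.85, iterations=50):
    """Propagate a 'good' and a 'bad' score over a directed graph.

    links      -- dict mapping each page to the list of pages it links to
    prior_good -- dict of a priori non-spam evidence per page (hypothetical,
                  e.g. derived from the page content)
    prior_bad  -- dict of a priori spam evidence per page (hypothetical)
    """
    # Collect every page that appears as a source or as a target.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    def normalise(prior):
        # Scale the priors so that they sum to 1; fall back to uniform.
        total = sum(prior.get(n, 0.0) for n in nodes)
        if total == 0:
            return {n: 1.0 / len(nodes) for n in nodes}
        return {n: prior.get(n, 0.0) / total for n in nodes}

    base_good = normalise(prior_good)
    base_bad = normalise(prior_bad)
    good, bad = dict(base_good), dict(base_bad)

    for _ in range(iterations):
        # Each page keeps a fraction of its content-based prior...
        new_good = {n: (1 - damping) * base_good[n] for n in nodes}
        new_bad = {n: (1 - damping) * base_bad[n] for n in nodes}
        # ...and receives an equal share of the scores of pages linking to it.
        # Pages without out-links pass nothing on (a simplification).
        for source, targets in links.items():
            if not targets:
                continue
            for target in targets:
                new_good[target] += damping * good[source] / len(targets)
                new_bad[target] += damping * bad[source] / len(targets)
        good, bad = new_good, new_bad

    return good, bad


# Toy graph: "a" and "b" link to each other, as do "s" and "t".
example_links = {"a": ["b"], "b": ["a"], "s": ["t"], "t": ["s"]}
good, bad = propagate_scores(example_links, {"a": 1.0}, {"s": 1.0})
```

In this toy graph the good score seeded at "a" reaches only "b", and the bad score seeded at "s" reaches only "t", which mirrors the intuition that a page inherits its reputation from the pages it is connected to.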
Abstract
Ministerio de Educación y Ciencia HUM2007-66607-C04-04
Additional details
- URL
- https://idus.us.es/handle//11441/130681
- URN
- urn:oai:idus.us.es:11441/130681
- Origin repository
- USE