Efficient and Effective Spam Filtering and Re-ranking for Large Web Datasets

Gordon V. Cormack, Mark D. Smucker, and Charles L. A. Clarke
University of Waterloo

The TREC 2009 web ad hoc and relevance feedback tasks used a new document collection, the ClueWeb09 dataset, which was crawled from the general Web in early 2009. This dataset contains 1 billion web pages, a substantial fraction of which are spam — pages designed to deceive search engines so as to deliver an unwanted payload. We examine the effect of spam on the results of the TREC 2009 web ad hoc and relevance feedback tasks, which used the ClueWeb09 dataset.

We show that a simple content-based classifier with minimal training is efficient enough to rank the “spamminess” of every page in the dataset using a standard personal computer in 48 hours, and effective enough to yield significant and substantive improvements in the fixed-cutoff precision (estP10) as well as rank measures (estR Precision, StatMAP, MAP) of nearly all submitted runs. Moreover, using a set of “honeypot” queries the labeling of training data may be reduced to an entirely automatic process. The results of classical information retrieval methods are particularly enhanced by filtering — from among the worst to among the best.

Get the paper at arxiv.org

(Commentary from Mounia Lalmas)

The ClueWeb09 dataset was created to support research in information retrieval and related technologies. It was crawled from the general web in early 2009 and consists of 1 billion web pages in ten languages. The dataset is used by several tracks of the TREC conference, e.g. the Web track and the Session track. Because the dataset is a direct crawl from the web, it is likely to contain a good proportion of spam, both self-promotion (e.g. word stuffing) and mutual promotion (e.g. link farms). The aim of this paper is to examine the effect of spam on retrieval effectiveness, and to see what can be done about it.

This examination is very important: many researchers around the world who participate in TREC are concerned with developing effective retrieval strategies, not with spam, important as that issue is. As the paper demonstrates, their approaches are nonetheless likely to perform poorly, not necessarily because the approaches, models, or strategies are ineffective (although there is always room for improvement), but because they did not account, at all or properly, for the amount of spam in the dataset.

Previous TREC evaluation tracks using web-based datasets have used corpora with little spam. When spam was identified as an issue, e.g. with the Blog track, its impact was not thoroughly examined. The authors claim, and I quote, “the use of the ClueWeb09 dataset places the spam issue front and center at TREC for the first time.” I fully agree, and I know that we are going to be seriously confronted with it, as we will be using the dataset for our own work.

The paper provides concrete answers to the spam issue, as encountered with the ClueWeb09 dataset, which I believe will be of great use and interest to the IR/TREC research community:

  • It provides a complete methodology to label a large dataset, here ClueWeb09, with minimal computation and training. Each generated label is a percentile score that can be used as input to classify a page as “spam” or “not spam”, or for other tasks such as re-ranking.
  • Several complete sets of spam labels, available for download at durum0.uwaterloo.ca/clueweb09spam.
  • Extensive experimental results showing a significant and substantive positive impact on effectiveness results, when the labels are used to remove/act upon “spammy” documents. This was demonstrated using the runs officially submitted by participants to the TREC Web ad hoc and relevance feedback tasks.
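As a rough illustration of how such percentile scores might be applied to a submitted run, here is a minimal Python sketch. It is not the authors' code. It assumes that a lower percentile means a spammier page, that a cutoff of 70 is a reasonable starting point, and that pages without a score are kept; the function names and document IDs are hypothetical.

```python
from typing import Dict, List

# Assumption: percentile scores run from 0 to 99 and LOWER means spammier.
# Assumption: documents with no score are treated as non-spam (percentile 100).
DEFAULT_THRESHOLD = 70  # hypothetical cutoff, tune per task


def filter_run(ranked_docs: List[str],
               spam_percentile: Dict[str, int],
               threshold: int = DEFAULT_THRESHOLD) -> List[str]:
    """Remove documents scoring below the threshold, keeping the run order."""
    return [d for d in ranked_docs
            if spam_percentile.get(d, 100) >= threshold]


def demote_spam(ranked_docs: List[str],
                spam_percentile: Dict[str, int],
                threshold: int = DEFAULT_THRESHOLD) -> List[str]:
    """Re-rank: move documents below the threshold to the end of the run."""
    keep = [d for d in ranked_docs if spam_percentile.get(d, 100) >= threshold]
    spam = [d for d in ranked_docs if spam_percentile.get(d, 100) < threshold]
    return keep + spam


# Hypothetical run and scores, for illustration only.
run = ["d1", "d2", "d3", "d4"]
scores = {"d1": 5, "d2": 90, "d3": 70}
filtered = filter_run(run, scores)
reranked = demote_spam(run, scores)
```

Filtering removes suspected spam outright, while demotion keeps it at the bottom of the ranking; which is preferable depends on how much recall matters for the task.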

This paper contains an extensive and systematic study of spam in a large real-world dataset, and the first quantitative results of the impact of spam filtering on retrieval effectiveness. The methodology and the produced sets of spam labels can be used by others (1) to ‘clean’ ClueWeb09 or a similar dataset, to whatever level of cleanness is desired, so that they can concentrate on their main research aims; and (2) as benchmarks for comparable studies.

  1. May 13, 2010 at 7:44 am

    Two comments:


    {{{ Spam has been identified as an issue in the Blog Track [20], but its impact has not been systematically studied.}}}

    Mounia has also reprised the same claim almost verbatim.

    We believe that these claims do not cover the work done by the TREC Blog track organisers in examining spam in the blogosphere. In particular, see the following three uncited papers:

    * TREC 2006 Blog Overview examined the impact of spam on opinion-finding runs
    * SIGIR 2009 poster examined how this was correlated with system performance
    * ICWSM 2008 paper examined which topics (for both opinion finding and blog distillation) were most affected by spam.

    The assumed splogs that were inserted into the Blogs06 corpus during its creation are now available at:

    The above dataset has been released to facilitate and encourage the development of appropriate techniques for splog identification. More information can be found in:

    Is Spam an Issue for Opinionated Blog Post Search?
    C. Macdonald, I. Ounis and I. Soboroff. In Proceedings of SIGIR 2009. Boston, USA.


    {{{Several complete sets of spam labels, available for download at durum0.uwaterloo.ca/clueweb09spam.}}}

    Actually, these are spam scores, not the classification ground truth labels used in the paper.

  2. Ian
    May 13, 2010 at 9:34 am

    To be fair, I think that the blog track looked at things a little bit differently. Because spam was injected, the prevalence of spam was manually controlled, and additionally, uninjected spam was not identified.

    I would like to use these labels to run a spam scorer on the Blogs08 collection. I have Gord’s code so it should be easy to set up (and I need to do it for the TREC assessors anyway!)

  3. May 13, 2010 at 10:06 am

    The claim in the paper implies that there was no work done to assess the effect of spam within the blog track framework. This is simply incorrect. You could argue about the approach adopted within the Blog track, but in that case the issue should be discussed and appropriate citations included in the paper (as per my first comment), so that readers have a more accurate picture and can make up their own minds.

    Since at least one reader (Mounia Lalmas) got the wrong impression and then propagated it, it is only fair to raise the issue, so that it is clarified.

    In fact, in addition to the organisers’ investigation, several groups deployed various mechanisms to combat spam within the Blog track.

  4. Gordon Cormack
    May 13, 2010 at 11:22 am


    0. Thanks for the feedback.

    1. Happy to clarify/expand on the blog track efforts. Am I correct in assuming that your general result was that from 17% known spam blogs, systems retrieved about 10% spam? When that spam was removed, MAP scores increased from in the neighbourhood of 0.32 to 0.34. Removing spam didn’t cause a substantial number of inversions in performance (Kendall Tau 0.96), so from the point of view of ranking systems, it appeared to be not much of a factor.

    2. I suppose “scores” would be a better word than labels. Does the world at large take “label” to mean human-assigned label? We have preserved our two small sets of human-assigned labels. One thought is to use them for a secondary assessment of the TREC 2010 spam task submissions. Are they more valuable kept in escrow or published?
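For readers unfamiliar with the Kendall tau figure cited in point 1 above, the following Python sketch shows one common way to compute it when comparing system rankings before and after spam removal. It is a plain tau-a over pairs of systems, written from the standard definition and not taken from either paper; the system names and MAP values are made up for illustration.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(before: Dict[str, float], after: Dict[str, float]) -> float:
    """Kendall tau-a between two scorings of the same systems.

    Each pair of systems counts as concordant if both scorings order the
    pair the same way, and discordant (an "inversion") if they disagree.
    Tied pairs count as neither.
    """
    systems = sorted(set(before) & set(after))
    concordant = discordant = 0
    for a, b in combinations(systems, 2):
        sign = (before[a] - before[b]) * (after[a] - after[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 1.0


# Hypothetical MAP scores for three systems, before and after filtering.
map_before = {"sysA": 0.30, "sysB": 0.32, "sysC": 0.35}
map_after = {"sysA": 0.33, "sysB": 0.32, "sysC": 0.37}
tau = kendall_tau(map_before, map_after)
```

A value near 1 means that removing spam barely changes the relative order of systems, even if absolute scores rise; a value near 0 means the ordering is largely reshuffled.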

  5. February 16, 2012 at 3:43 pm

    Enjoyed your paper. I’m amazed that such a simple approach works.

    In your code in “http://arxiv.org/pdf/1004.5168v1.pdf”

    h = b %

    should read

    h = b % P;

    in both places.

    John Nagle

  6. February 17, 2012 at 12:37 am

    So why does this work? It’s simple enough. It looks at successive sequences of four bytes (not even characters for Unicode), hashes them into about a million buckets, and computes some simple stats. What common factor do spam pages have which trips such a filter?

    That’s easy. Ads. Every page with a Google ad has something like this:

    google_ad_client = "pub-5268721215000735";
    google_ad_slot = "4222970967";
    google_ad_width = 250;
    google_ad_height = 250;

    Each ad vendor has some stylized form of advertising code, which is cut and pasted with minor modifications into each page with ads. That’s exactly the sort of thing a 4-byte recognizer will pick up. Comparable results could probably be obtained by counting links to known ad sites.

    This is similar to signature-based virus recognition, which was the state of the art in anti-virus tools about a decade ago. Virus writers have since advanced to “polymorphic” viruses, where each copy is different. Modern viruses use encryption, compression, and randomization to spread their bit patterns over a wide space to prevent simple signature recognition from detecting them. The same techniques could be used on spam web pages to defeat this class of filter.

    John Nagle
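The mechanism described in the comment above (overlapping four-byte sequences hashed into about a million buckets) can be sketched in Python as follows. This is an illustrative reconstruction, not the code from the paper: the bucket count, the learning rate, and the logistic-regression-style update standing in for the "simple stats" are all assumptions, and the two sample pages are invented.

```python
import math
from typing import Dict, Set

NUM_BUCKETS = 1_000_003  # assumed modulus, roughly one million buckets


def four_byte_features(page: bytes, buckets: int = NUM_BUCKETS) -> Set[int]:
    """Hash every overlapping 4-byte window of the raw page into a bucket."""
    feats = set()
    for i in range(len(page) - 3):
        b = int.from_bytes(page[i:i + 4], "big")  # raw bytes, not characters
        feats.add(b % buckets)
    return feats


def score(weights: Dict[int, float], feats: Set[int]) -> float:
    """Sum the learned weights of the buckets present; higher means spammier."""
    return sum(weights.get(h, 0.0) for h in feats)


def train_one(weights: Dict[int, float], feats: Set[int],
              is_spam: bool, rate: float = 0.002) -> None:
    """One online gradient step of logistic regression (assumed learner)."""
    z = max(-30.0, min(30.0, score(weights, feats)))  # clamp to avoid overflow
    p = 1.0 / (1.0 + math.exp(-z))
    step = rate * ((1.0 if is_spam else 0.0) - p)
    for h in feats:
        weights[h] = weights.get(h, 0.0) + step


# Invented training pages, for illustration only.
spam_page = b'google_ad_client = "pub-0"; google_ad_slot = "1";'
ham_page = b"plain article text about information retrieval"

weights: Dict[int, float] = {}
for _ in range(50):
    train_one(weights, four_byte_features(spam_page), is_spam=True)
    train_one(weights, four_byte_features(ham_page), is_spam=False)

spam_score = score(weights, four_byte_features(spam_page))
ham_score = score(weights, four_byte_features(ham_page))
```

Because the features are raw byte windows, boilerplate such as repeated ad snippets maps to the same buckets on every page that contains it, which is consistent with the explanation offered in the comment.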
