The missing search engine

The internet needs an editor, but that is against the spirit of the ‘net.¹ The next best thing would be a search engine that is better at eliminating spam sites from search results. Google is wonderful, but the number of spam sites seems to be overwhelming it. There are many sites that scrape content² from other sites or simply mirror mailing lists. Over the past few months these sites have often filled the first page of my Google searches. This wastes time. The recent increase can probably be attributed to decreases in the cost of web hosting, the fact that modern programming languages and libraries make site scraping easy, and the proliferation of newly underemployed techies over the past two years.³

A related category is sites that abuse AJAX to create a popup window covering the content. Because the content is still on the page, it still shows up in searches, and unfortunately some of these sites rank high in Google results. But I’d never consider actually paying to log in. If I really want the info, I could just look at the source, or (locally) hack the code to disable the popup, but why waste time with that? It would be better if these sites did not show up in the first few pages of results.

It will be impressive if someone manages to come up with an algorithm that solves this problem without hiding legitimate sites. One problem to solve is to find the earliest post that all the others link to, and make it rank well in the search results. This can be hard, because sites often plagiarize without citing sources. Getting around this would require a fair amount of processing power. For all the link sites and scraper sites that just copy the text, checking a simple hash of the first paragraph would be a good start toward eliminating the cruft. This is a bit too simple to work well or for long. Simply pulling part of a paragraph would defeat a hash. Another way to defeat it would be to replace a few words with synonyms, which is easy to automate. One way to deal with this is to use a thesaurus-based index instead of indexing individual words. This would have the advantage of lowering the number of words that must be indexed. Based on some search results, I suspect Google already does this.
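
To make the hash idea and its weaknesses concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the SYNONYMS table is a made-up stand-in for a real thesaurus, the function names are hypothetical, and nothing here is meant to describe how Google or any real search engine works.

```python
import hashlib
import re

# Hypothetical, tiny thesaurus: each word maps to one canonical word.
# A real system would need a far larger table (this one is an assumption).
SYNONYMS = {
    "quick": "fast",
    "rapid": "fast",
    "speedy": "fast",
    "big": "large",
    "huge": "large",
}


def first_paragraph(text):
    """Return the first non-empty paragraph; blank lines separate paragraphs."""
    for para in re.split(r"\n\s*\n", text.strip()):
        if para.strip():
            return para.strip()
    return ""


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def _digest(words):
    """Hash a list of words into a hex fingerprint."""
    return hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()


def exact_fingerprint(text):
    """Simple hash of the first paragraph (case and punctuation ignored)."""
    return _digest(tokens(first_paragraph(text)))


def thesaurus_fingerprint(text):
    """Same hash, but every word is first replaced by its canonical synonym."""
    words = [SYNONYMS.get(w, w) for w in tokens(first_paragraph(text))]
    return _digest(words)


original = "The quick fox made a big jump.\n\nRest of the original post."
verbatim = "The quick fox made a big jump.\n\nAds and a different footer."
reworded = "The rapid fox made a huge jump.\n\nAds and a different footer."
partial = "The quick fox made\n\nAds and a different footer."

# A verbatim copy is caught by the simple hash.
print(exact_fingerprint(original) == exact_fingerprint(verbatim))
# Swapping in synonyms defeats the simple hash...
print(exact_fingerprint(original) == exact_fingerprint(reworded))
# ...but not the thesaurus-based one.
print(thesaurus_fingerprint(original) == thesaurus_fingerprint(reworded))
# Copying only part of the paragraph still defeats both.
print(thesaurus_fingerprint(original) == thesaurus_fingerprint(partial))
```

The four comparisons show the progression described above: the plain hash catches the straight copy, the synonym swap gets past it, the thesaurus version catches the synonym swap, and the partial copy slips past both. That last case would need something beyond a single hash per paragraph.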

This would be an interesting problem to work on, but it isn’t one that I’ll ever get around to. I hope this post will help encourage someone else to work on it. So please, write a search engine that can show the earliest sources and has the option to not include ad sites! Until someone does, I’m using the partial solution of adding some of the worst offenders to the exclude list of a Google custom search.

Notes:

¹ The founding principle of the World Wide Web: anyone from anywhere can post anything useful to be freely shared. Paywalls and sites that exist to show ads are clearly against this principle.
² Scraping content means writing a program to automatically copy parts of someone else's site. It has legitimate uses, but is often abused.
³ I originally wrote this post in September 2009. It is still hard for educated people to find a decent job.

Postscript (2010-11-30): The NYTimes posted a story that is a good example of why search engines that are more spam-proof are needed.