Interesting article on search engine spam in eWEEK. It covers a Microsoft Research anti-spam project and cites a figure that 93% of all blog comments are spam.
"The Redmond, Wash., software giant’s Cybersecurity and Systems Management Research Group has taken the wraps off Strider Search Defender, an experimental project that automates the discovery of search spammers through non-content analysis."
"The Web is so badly spammed, you can find a spam site on just about every search query," said Yi-Min Wang, the researcher heading up the project at Microsoft, in an interview with eWEEK. "We think this approach can pinpoint the big spammers and use their own tactic against them."
"According to data from Automattic Kismet, a tool that helps bloggers thwart comment spammers, a whopping 93 percent of all blog comments are spam. With Strider Search Defender, Wang’s team is taking a context-based approach that uses URL-redirection analysis to pinpoint spammers."
The method spammers use is detailed below:
"During the early stages of the Microsoft research, Wang discovered that successful large-scale spammers create a huge number of ‘doorway pages’ on reputable domains to trick search engine users into clicking on a fake site. It is well-known that Google’s BlogSpot, Yahoo’s GeoCities and AOL’s Hometown services are all used by spammers to create doorway pages.
The doorway pages are then spammed to millions of forums, blog comments and archived newsgroups, pushing the page up the search engine results for certain target keywords. A user clicking on a doorway-page link in search listings gets redirected to a target page controlled by the spammer or, in some cases, Wang explained, the browser is instructed to either redirect to or fetch ad listings operated by the spammer."
Microsoft’s solution:
"The Microsoft Research team is now proposing to treat each spam page as a dynamic program rather than a static page and use a "monkey program" to analyze the traffic resulting from visiting each page with an actual browser. "By identifying those domains that serve target pages for a large number of doorway pages, we can catch major spammers’ domains together with all their doorway pages and doorway domains," Wang explained.
But won’t that take an enormous amount of bandwidth and time? How can any process keep ahead of all the content that’s being created every moment?
"Strider Search Defender starts with a seed list of confirmed spam URLs and uses a homegrown tool called Spam Hunter to run link queries on search engines. This is an automated process that pinpoints the forums and guest books on which the known spam URLs were posted. On these pages, additional spam links are scrapped to automatically generate a list of spam URLs. To filter out false positives, Microsoft feeds the list of potential spam URLs to the Strider URL Tracer, a tool released earlier this year by Microsoft to help trademark owners find typo-squatting domains of their Web sites."
Apparently it’s the old 80/20 rule: a handful of domains are feeding most of the spam.
"In one scenario, Wang said the Spam Hunter collected more than 17,000 BlogSpot URLs and fed them into the URL Tracer. The group was able to identify the top 25 target-page domains that are behind the Google-hosted splogs. The top six are particularly active, Wang said, identifying them as s-e-arch.com, speedsearcher.net, abcsearcher.com, eash.info, paysefeed.net and veryfastsearch.com, which collectively were responsible for approximately 45 percent of the BlogSpot URLs."
Wang said the Strider Search Defender project has already helped to remove junk results from MSN Search. "The more widely spammed a URL is, the easier it is for the Spam Hunter to find it. Once a spammed forum is identified, it becomes a ‘HoneyForum’ that can be used to capture new spam URLs in new comment postings," he said. "Ideally, since there is a delay between spamming and its effect on search engine results, our spam hunter should be able to identify new spam URLs and notify the search engine before the URLs enter top search results."
I guess it goes back to being careful about who links to your site and who you’re linking out to. As search engines become more aggressive in weeding out spam, a site can get tagged by a spam filter without its owner realizing it. In fact, that happened with MSN to a site I worked with in the past: it dropped out of the index for several weeks until the issue was resolved.