
Well, if you run a reasonably popular website, you are eventually going to bump into this problem. For those of you who are not familiar with website content scrapers: they are web bots just like search engine bots, with the main difference that the content “stolen” from your website gets republished on other websites, in full, in part, or even auto-modified. This type of attack is usually quite successful because the new website will start ranking and taking away some of your traffic.

As far as I know, Google and Yahoo are still trying to figure out a good way to filter their results by improving their duplicate content filters. You can imagine how complicated it is to separate out an “authorized content copy”, such as the manufacturer’s description on the products you sell. Under the current algorithms, a fair amount of duplicate text, nearly a full paragraph, can be shown without any ranking penalty.

Now it’s up to us to defend our content and block all scrapers once and for all!

Also, I’d like to mention that this script MUST NOT be used if you’re offering any feeds. Some monitoring/statistics services may stop working as well! I’m using it on almost 99% of my sites nowadays, and it has solved my content scraper problem.
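If you do serve feeds and still want the filter, one possible workaround (a minimal sketch; the /feed, /rss and /atom paths are just placeholders for wherever your feeds actually live, and it assumes the filter sits in its own included file) is to bail out of the check for those URLs:

// Hypothetical feed whitelist: skip the user-agent check for feed URLs,
// since feed readers rarely send browser-like user agents.
$feed_paths = array('/feed', '/rss', '/atom');
foreach ($feed_paths as $path) {
    if (strpos($_SERVER['REQUEST_URI'], $path) === 0) {
        return; // inside an included file, return skips the rest of the filter
    }
}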

My personal solution is to filter the HTTP User-Agent header (and hope it’s not forged), then check it against a list of known/accepted user agents, or kill the connection with a 403 Forbidden message:

$session["user_agent"]="  ".$_SERVER['HTTP_USER_AGENT'];
$browser=strtoupper($session["user_agent"]);
switch($browser){
        case (strpos($browser, "MSIE")>0) : { $session["browser"]="MSIE"; break; }	
        case (strpos($browser, "FIREFOX")>0) : { $session["browser"]="FIREFOX"; break; }					
        case (strpos($browser, "SAFARI")>0) : { $session["browser"]="SAFARI"; break; }
        case (strpos($browser, "CHROME")>0) : { $session["browser"]="CHROME"; break; }					
	case (strpos($browser, "OPERA")>0) : { $session["browser"]="OPERA"; break; }			
	case (strpos($browser, "MOZILLA")>0) : { $session["browser"]="MOZILLA"; break; }						
	case (strpos($browser, "GOOGLE")>0) : { $session["browser"]="ENGINE"; break; }					
	case (strpos($browser, "YAHOO!")>0) : { $session["browser"]="ENGINE"; break; }
	case (strpos($browser, "MSNBOT")>0) : { $session["browser"]="ENGINE"; break; }
       case (strpos($browser, "SITEMAPS")>0) : { $session["browser"]="ENGINE"; break; }	
        case (strpos($browser, "SITEMAPS")>0) : { $session["browser"]="ENGINE"; break; }
	default : {
	     // THIS IS UNKNOWN BROWSER. MOST PROBABLY SPAM BOT!
	     header('HTTP/1.1 403 Forbidden');
	     die("
THIS BROWSER VERSION IS CURRENTLY NOT SUPPORTED!

Please try using Internet Explorer, FireFox or Safari
"); break; } }

You can see that this is not the most elegant way of preventing bad bots, but it’s the most reliable one I’ve found. If I find a service or a frequently updated list of Google datacenter IPs, I’ll let you know and modify this version to work with it, but that’s all I’ve got for now. If you want to use this code, put it in the header part of your PHP script, before any output, so the 403 Forbidden header can actually be sent. Note that if a legitimate visitor with a new browser or a PDA gets blocked, they will see the error message, so it’s worth including a support email or phone number in it. If you ever get such a call, add that browser to the list, because we need all the quality visitors we can get!
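By the way, until such an IP list shows up, one technique worth trying (a sketch I haven’t battle-tested, but Google itself documents the approach) is verifying a claimed Googlebot by reverse DNS: the IP should resolve to a *.googlebot.com or *.google.com hostname, and that hostname should resolve back to the same IP. The is_real_googlebot() helper below is my own invention, not part of any library:

function is_real_googlebot($ip) {
    // Reverse DNS: genuine Googlebot IPs resolve to *.googlebot.com or *.google.com
    $host = gethostbyaddr($ip);
    if ($host === false || !preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }
    // Forward-confirm: the hostname must resolve back to the original IP
    return gethostbyname($host) === $ip;
}

// Example: allow a visitor claiming to be Googlebot only if DNS confirms it
// ($browser comes from the filter above)
if (strpos($browser, "GOOGLEBOT") !== false && !is_real_googlebot($_SERVER['REMOTE_ADDR'])) {
    header('HTTP/1.1 403 Forbidden');
    die("Forged Googlebot user agent.");
}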

Well, I think that’s more than enough to share with you today, so I’ll call it a day!

P.S. I forgot to mention that this type of filter also prevents 99% of all spam bots, link bots, exploit bots and so on. You get the picture.