Before switching my site to a CMS (content management system), I used a hidden link and a PHP script to detect bad robot and email-scraping behavior. At first I played it safe and didn't automatically block anything that hit the hidden page, and later I decided to stick with that approach: receive only the email alerts and manually add blocks to '.htaccess' when needed. Now that I've switched to a CMS, Drupal at this time, I wondered how I might continue to use this bot trap system.
Exactly what are "bad bots" and "email scrapers"? Most people use the term "bad bot" for an automated program, a robot, that reads (aka spiders) your web site but does not obey what you put in the site's 'robots.txt' file. The program can be one like those search engines, such as Google, use to index your site. It could also be a program used by companies that sell services claiming to protect another company's image or brand names. Most of us don't mind these programs, so long as they behave, i.e. they don't hammer the site (by requesting pages as fast as they can) and they don't go where the 'robots.txt' file tells them not to go. "Email scrapers" are just another type of robot program, but their main goal is to copy email addresses off any page of your web site for use in sending spam.
The first thing most folks do is see if anyone has already done the same thing; unless you are learning, why do it all over again? A little searching turned up the http:BL module, which uses part of Project Honey Pot. While it is a very nice module, and I encourage folks to contribute to Project Honey Pot if they can, I felt it was a little overkill for me at this time. I may wind up switching to it later, however, and encourage you to check it out if you are using Drupal.
My original setup was based on a PHP script at KLOTH.net - http://www.kloth.net/internet/bottrap.php. Here I'll show you how I've adapted it for use in Drupal. I considered making a module, but haven't gotten around to it.
The first thing to do is create the trap page. This page is where the hidden link points, and it is what writes the offender's IP to a text file. I also have mine set to send an email on the first access only; if it sent one on every access, you'd be setting yourself up for a potential email flood.
I made a separate directory called 'bot-trap' and placed a PHP file in it called 'index.php'. The file starts with the usual '<html> <head> ... <title>Trapped</title> </head> <body>', followed by some body text giving a human-friendly summary of what the page does. After that comes the following PHP code, wrapped in '<?php' and '?>' tags:
// PHP version must be 4.2.0 or greater
extract($_SERVER);
$badbot = 0;
/* scan the blacklist.dat file for addresses of SPAM robots,
   to prevent filling it up with duplicates */
$filename = "../files/blacklist.dat";
$fp = fopen($filename, "r") or die("Error opening file ... <br />\n");
echo "<p>Checking to see if you are in our list...";
while ($line = fgets($fp, 255)) {
  $u = explode(" ", $line);
  if (strcmp(trim($u[0]), trim($REMOTE_ADDR)) == 0) {
    $badbot = 1;
    break;
  }
}
fclose($fp);
if ($badbot == 0) {
  /* we just saw a new bad bot, not yet listed! */
  echo "<p>You are not currently in our list.";
  /* format the date-time like an Apache log entry */
  $tmestamp = time();
  $datum = date("d/M/Y:H:i:s O", $tmestamp);
  /* send an email alert */
  $from = "badbot-watch@[your_domain]";
  $to = "webmaster@[your_domain]";
  $subject = "[your_domain] alert: bad robot";
  $msg = "A bad robot hit /bot-trap/index.php $datum\n";
  $msg .= "address is $REMOTE_ADDR, agent is $HTTP_USER_AGENT\n";
  mail($to, $subject, $msg, "From: $from");
  /* append the bad bot's address data to the blacklist log file */
  $fp = fopen($filename, 'a+');
  fputs($fp, "$REMOTE_ADDR - - [$datum] \"$REQUEST_METHOD /bot-trap/index.php $SERVER_PROTOCOL\" 200 - \"$HTTP_REFERER\" $HTTP_USER_AGENT\n");
  fclose($fp);
  echo "<p>The following information has been recorded<br>";
  echo "ip address is " . htmlentities($REMOTE_ADDR);
  echo "<br>request method is " . htmlentities($REQUEST_METHOD);
  echo "<br>referring url is " . htmlentities($HTTP_REFERER);
  echo "<br>server protocol is " . htmlentities($SERVER_PROTOCOL);
  echo "<br>user agent is " . htmlentities($HTTP_USER_AGENT);
} else {
  echo "<p>Your IP ($REMOTE_ADDR) is already in our list.";
}
And it ends with '</body> </html>'.
Adjust the location of the file in '$filename' (covered below) and the email parameters '$from' and '$to' as necessary.
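For example, on a hypothetical site at example.com (a placeholder domain, as are the addresses), with the 'bot-trap' directory sitting alongside the 'files' directory, those lines might read:

$filename = "../files/blacklist.dat";
$from = "badbot-watch@example.com";
$to = "webmaster@example.com";
$subject = "example.com alert: bad robot";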
The next step is to create an empty 'blacklist.dat' file for the captured information to be written to. Since you most likely already have a 'files' directory that you allow web-server-executed scripts to write to, I suggest placing it there, mostly for consistency. However, the file itself must be made writable as well.
If you have shell access to your web directory:
$ cd [to_drupal_directory]
$ cd files
$ touch blacklist.dat
$ chmod o+w blacklist.dat
Note that if your web server is configured to run PHP as your user, you will probably not need to perform the chmod step.
If you do not have shell access to your web directory and you use Windows, you can use Notepad to make an empty text file called 'blacklist.dat' and upload that. Then use your web host's tools to change the file's permissions so that 'other' can write to the file. If, when configuring Drupal, you did not need to make the files directory writable by 'other', then you will not need to make the 'blacklist.dat' file writable.
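If you'd rather let the web server create the file for you, a throwaway PHP script works too. This is just a sketch; the script name and the 0666 permission (same effect as making the file writable by 'other') are my own choices, not part of the original setup. Upload it next to the trap page as, say, 'make-blacklist.php', request it once in your browser, then delete it:

<?php
/* one-off helper: create an empty, writable blacklist.dat */
$filename = "../files/blacklist.dat";
if (file_exists($filename)) {
  echo "$filename already exists.";
} else {
  touch($filename) or die("Could not create $filename");
  chmod($filename, 0666); /* rw for owner, group, and other */
  echo "Created $filename.";
}
?>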
You should now be able to test the trap page. This will capture your IP and you will need to remove it from the 'blacklist.dat' file afterwards.
Access 'http://[your_domain]/bot-trap/index.php'. You should see a page with whatever text you placed before the PHP code and then you should see some text like:
Checking to see if you are in our list...
You are not currently in our list.
The following information has been recorded
ip address is [your_ip]
request method is GET
referring url is
server protocol is HTTP/1.1
user agent is [your_browser_info]
And then you should also receive an email, if you kept that part of the code. Accessing the page again, before you remove your IP from 'blacklist.dat', will instead show:
Checking to see if you are in our list...
Your IP ([your_ip]) is already in our list.
And you will not receive an email. To remove your IP from the file, you can use an editor such as vi if you have shell access. Alternatively, delete the 'blacklist.dat' file and re-create it as a blank file (don't forget the permissions if necessary).
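If you'd rather not edit the file by hand, a throwaway PHP script can filter a single address out. A minimal sketch, where the IP 203.0.113.7 is a placeholder for your own:

<?php
/* one-off helper: remove one IP's entries from blacklist.dat */
$remove = "203.0.113.7";
$filename = "../files/blacklist.dat";
$lines = file($filename);  /* read all entries into an array */
$fp = fopen($filename, "w") or die("Error opening file ... <br />\n");
foreach ($lines as $line) {
  $u = explode(" ", $line);
  /* keep every line whose first field is not the target IP */
  if (strcmp(trim($u[0]), $remove) != 0) {
    fputs($fp, $line);
  }
}
fclose($fp);
echo "Done.";
?>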
If you get an error opening the file, check the path in '$filename'. It should be relative to the location of this PHP page.
If your IP is not recorded in 'blacklist.dat', make sure the permissions are set correctly on the file.
In your site's 'robots.txt' file, disallow access to the trap page so that good bots won't access it. Shortened example:
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
# ...
# Paths (clean URLs)
# ...
Disallow: /bot-trap/
# ...
You'll probably notice I disallowed anything in the bot-trap directory. Call it a bit of a teaser to some bots that read the robots.txt file looking for hidden directories.
Create a transparent gif file that is 1x1 pixels (mine is 2x2 just because I felt like it). Call this file 'bot-trap.gif' and upload it to your 'files' directory. If you want to place it elsewhere, be sure to adjust the '<img src="...">' below.
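Any image editor can produce this, but if your PHP install happens to include the GD extension with GIF write support (an assumption; check 'phpinfo()'), a throwaway script can generate it as well:

<?php
/* one-off helper: generate a 1x1 transparent gif with GD */
$im = imagecreate(1, 1);
$white = imagecolorallocate($im, 255, 255, 255);
imagecolortransparent($im, $white); /* make the single color transparent */
imagegif($im, "../files/bot-trap.gif");
imagedestroy($im);
echo "Created bot-trap.gif";
?>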
The next step is getting Drupal to read this file, check whether the currently accessing IP is in it, and do something if so. Read through all of the following steps first, so you know how to keep yourself able to at least log in and disable this if your own IP should be recorded.
- Log in to your Drupal site with the rights to create a block and use PHP code.
- Go to Administer -> Site building -> Blocks.
- Select "Add Block", at the top.
- For "Block description", I called this "Bot Trap".
- For "Block body", inside of '<?php' and '?>' tags place the following code:
/* From http://www.kloth.net/internet/bottrap.php */
extract($_SERVER); // PHP version must be 4.2.0 or greater
$badbot = 0;
/* look for the IP address in the blacklist file */
$filepath = variable_get('file_directory_path', 'files'); // Drupal function
$filename = $filepath . '/blacklist.dat';
$fp = fopen($filename, "r");
if ($fp) {
  while ($line = fgets($fp, 255)) {
    $u = explode(" ", $line);
    if (strcmp(trim($u[0]), trim($REMOTE_ADDR)) == 0) {
      $badbot = 1;
      break;
    }
  }
  fclose($fp);
}
if ($badbot > 0) {
  /* this is a bad bot, reject it */
  sleep(30); /* slow down the bot a bit */
  echo '<p>Access to this site has been reduced, as your current IP was previously involved in site abuse. If you are on a dynamic IP, please use the Contact form to request removal.</p>';
} else {
  /* else give the link to the bot-trap page, with the transparent gif file */
  echo '<p><a href="/bot-trap/index.php"><img src="/files/bot-trap.gif" border="0"></a></p>';
}
- Set the "Input format" to "PHP code".
- Save the block.
- Before activating the block, click "configure" to the right of the block name to add additional settings.
- Leave "Block title" empty.
- Under "Show block for specific roles", you'll want to check "anonymous". No need running this for logged in users.
- Under "Show block on specific pages", I decided to select "Show on every page except the listed pages" and then put in the "Pages" box:
admin/*
user/login
users/*
contact
This was to make sure I could still log in (user/login) and disable the block (admin/*), and also so that anonymous visitors could use the contact page. You might wonder why, since the code only stalls the page for 30 seconds. Call it good coding practice if you want.
- Save the block.
- You can now choose to activate the block. I place it in the footer, due to the text I print when the accessing IP is in the list. You might change that text, in which case it might work well in your header.
Testing that your block works can be tricky. If you've followed the above steps and excluded the login and admin pages from the block, you should be fairly safe - not that I'm guaranteeing anything. If you were not able to remove your IP after testing the trap page earlier, then you shouldn't have come this far. You have been warned!
If you did check "anonymous" for "Show block for specific roles" and you want to test from your current computer, just fire up another browser while staying logged in with your primary browser. This does not mean running another instance of IE or Firefox if that is what is already running; it means using IE if Firefox is what you already have running. Since clicking that little 1x1 gif is hard (you could use [tab] to reach it), I usually temporarily place some text after the '<img src="...">' and before the '</a>' in the block code. Alternatively, you could manually point your browser at the '/bot-trap/index.php' URL as you did in the earlier test, or even manually add your IP to the 'blacklist.dat' file if you are comfortable editing it directly.
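For example, the link line in the block might temporarily become (the "test trap link" text is arbitrary):

echo '<p><a href="/bot-trap/index.php"><img src="/files/bot-trap.gif" border="0">test trap link</a></p>';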
Once your IP is in the 'blacklist.dat' file, accessing any page as an anonymous user, except those excluded, should take about 30 seconds. After you remove your IP, access speed as anonymous should return to normal. Don't forget to remove any temporary text you placed in the trap link, if you did so.
Every so often, depending on how frequently you get hits, you'll want to remove older entries from the 'blacklist.dat' file. This is primarily for speed. Any IPs you want to keep blocked should be transferred to '.htaccess', again for speed. If you run across any unique user agents, those can also be blocked via '.htaccess'. You could choose to use a SQL table to store the hit information; I chose the text file method so that I don't have to remember the SQL calls to remove an entry.
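As a sketch of what those '.htaccess' entries might look like, assuming Apache with mod_setenvif available (the IP and the user agent string below are placeholders):

# block one IP, and one known-bad user agent, outright
SetEnvIfNoCase User-Agent "EvilScraper" bad_bot
<Limit GET POST HEAD>
  Order Allow,Deny
  Allow from all
  Deny from 203.0.113.7
  Deny from env=bad_bot
</Limit>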
You might wonder why I chose to delay page access rather than blocking it outright. As I said earlier, on my old static-page site I only used the trap page to get the email information and then manually added anything I felt worth it to the '.htaccess' file. Call the delay splitting the difference between that and outright blocking. It will at least slow down those bots that hammer the site. For email scrapers I use other methods of protecting email addresses. Adjust the delay by changing the 'sleep()' value in the block code.
If you really want to block, then see the KLOTH.net page and their blacklist.php include file code.