It is important to ensure that Google does not index sites whilst they are still on a staging environment, but you cannot lock them down completely - how else would your clients proof them? So I run a simple global rewrite rule in Apache that redirects all requests for robots.txt to a central disallow-all response. This works great, and Google appears to honour the rule as one would hope.
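For reference, the central response is just the standard disallow-all robots.txt:
User-agent: *
Disallow: /
On the Apache side, a hedged sketch of the global rule might look like this - it assumes mod_rewrite is enabled and uses a made-up shared path of /var/www/shared/robots.txt:
RewriteEngine On
RewriteRule ^/robots\.txt$ /var/www/shared/robots.txt [L]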
What happens, though, when that central file changes? One fateful night it did, on an old server I manage. Someone had altered the file and replaced it with an allow-all rule. A site from the server started to appear in Google's listings; thankfully it was picked up quickly, removed through Google Webmaster Tools, and the original robots.txt put back in place to protect against future indexing.
This left me needing a quick and dirty little monitoring script to keep an eye on the file. It really didn't need to be anything fancy - just email me when the file changes so I can investigate what or who changed it, tell them to desist, and revert its contents.
To do this I employed sha512sum and mail inside a simple cron job that would regularly compare the file's hash against the known good hash. If the hashes do not match then the script will email a short message to let me know to check into it.
Now of course you could just have the cron job revert the contents automatically, but I wanted to look into why it was happening first. If you're really worried you could replace the contents of the file and then email yourself; in this case it wasn't so important.
There are plenty of command line tools to help you get a hash for a file - handy when you've downloaded something and want to verify its integrity. Before GitHub it used to be common for open source projects to list hashes beside their downloads. Anyway, there are a number of choices; the longer the hash, the less prone it is to collisions (two different files producing the same hash) - their usage is compared briefly after the list:
md5sum
sha1sum
sha256sum
sha512sum
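Usage is identical across all four; each prints the digest followed by the file name, and only the digest length changes:
md5sum robots.txt       # 128-bit digest, 32 hex characters
sha1sum robots.txt      # 160-bit digest, 40 hex characters
sha256sum robots.txt    # 256-bit digest, 64 hex characters
sha512sum robots.txt    # 512-bit digest, 128 hex characters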
By default they’ll spit out the hash(es) onto the command line (STDOUT), but we’re going to redirect them to a file so we can refer to them later.
sha512sum robots.txt index.html > cron_sums.txt
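The resulting cron_sums.txt simply lists one digest and file name per line; the digests below are shown as placeholders rather than real values:
<128-hex-character-digest>  robots.txt
<128-hex-character-digest>  index.html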
This will create a text file containing two hash values that we can later use to verify the files in question. If you later take another hash of the files and it doesn't match the one in cron_sums.txt, then that file has changed. There is a handy switch you can pass to sha512sum that makes this process much easier.
sha512sum --status -c cron_sums.txt
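Without --status the check prints a per-file report; with it, the only signal is the exit code, which you can poke at directly if you want to see it working:
sha512sum -c cron_sums.txt             # prints lines such as "robots.txt: OK"
sha512sum --status -c cron_sums.txt
echo $?                                # 0 when every hash matches, non-zero otherwise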
The above command's exit status code can be used to generate a human-readable message using a simple || (or) operator on the command line.
sha512sum --status -c cron_sums.txt && echo "Success" || echo "Failed"
The above command is pretty self-explanatory so I won't bother working through it, and will move on to sending the email instead. This will be done using the venerable mail command.
sha512sum --status -c cron_sums.txt && echo "Success" || echo "Failed" | mail -s "File hashes didn't match" [email protected]
Here the failure message is piped to mail and dutifully sent through to your inbox - the pipe only applies to the final echo, so the success message simply lands on STDOUT. This works, but I'd only like to be disturbed when it goes wrong - I don't care if it succeeds, and once this runs under cron any leftover output on STDOUT will typically be emailed to the local user anyway. To keep the happy path quiet we'll redirect the success output to the /dev/null blackhole.
sha512sum --status -c cron_sums.txt && echo "Success" > /dev/null || echo "Failed" | mail -s "File hashes didn't match" [email protected]
So we've worked out the bash command we want cron to run for us every minute of the day. Let's tell cron about it! Execute crontab -e on the command line to open the crontab in your default editor, then add the following cron job to the file.
* * * * * /usr/bin/sha512sum --status -c cron_sums.txt && echo "Success" > /dev/null || echo "Failed" | mail -s "File hashes didn't match" [email protected]
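If crontab syntax is new to you, the five leading fields are the schedule:
# field order: minute hour day-of-month month day-of-week command
# "* * * * *" runs the check every minute; "*/5 * * * *" would run it every five minutes instead.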
It is worth noting that the paths in the cron_sums.txt file are all relative, so you may need to change into the directory containing the files you want to check before running the sha512sum command - cron runs jobs from the user's home directory by default.
* * * * * cd /var/www; /usr/bin/sha512sum --status -c cron_sums.txt && echo "Success" > /dev/null || echo "Failed" | mail -s "File hashes didn't match" [email protected]
It isn’t pretty and it certainly doesn’t scale (although you could email a list/forwarding group), but it does serve as a quick and dirty fix to warn you of file inconsistency.
As a bonus, to automatically revert the file as well, you could add the following to the crontab. The braces group the alert and the revert together so that both only run when the check fails, and the printf writes back the standard disallow-all rules.
* * * * * cd /var/www; /usr/bin/sha512sum --status -c cron_sums.txt && echo "Success" > /dev/null || { echo "Failed" | mail -s "File hashes didn't match" [email protected]; printf 'User-agent: *\nDisallow: /\n' > robots.txt; }
Whilst this is a very simple one-liner example, you could of course use the same principles to write a simple little bash script that is triggered on failure instead - a sketch of one follows the cron entry below.
* * * * * cd /var/www; /usr/bin/sha512sum --status -c cron_sums.txt && echo "Success" > /dev/null || /usr/scripts/files_changed.sh
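A minimal sketch of what such a script might contain - the file name comes from the cron entry above, and the paths, address and revert content are simply carried over from the earlier examples:
#!/usr/bin/env bash
# /usr/scripts/files_changed.sh - run by cron whenever the hashes stop matching.
cd /var/www || exit 1
# Alert first so there is a record of the event...
echo "Failed" | mail -s "File hashes didn't match" [email protected]
# ...then restore the known-good disallow-all robots.txt.
printf 'User-agent: *\nDisallow: /\n' > robots.txt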