Posts Tagged ‘uptime’

Automatically update iptables rules for Pingdom monitors.

March 14th, 2012

Pingdom is an awesome service that tracks the uptime, downtime, and performance of websites (you can see an example of the public stats of this server here). If you run a firewall on your system, you need to whitelist Pingdom’s servers or their monitors will fail. Because their servers may change at any time, it is better to automate the whitelisting by polling the RSS feed of their monitoring servers at a reasonable, responsible interval.

It’s been done before, but this is how I have chosen to do it.

First, a little PHP helper script to extract the IP addresses of the monitors marked Active. For this example, let’s save it as pingdom.com.php alongside our bash script, which will be executed by cron.

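<?php
// Capture each IPv4 address that is followed by the word "Active" on the same line of the feed.
preg_match_all(
    '/((\d+\.){3}\d+).*?Active/',
    file_get_contents('https://www.pingdom.com/rss/probe_servers.xml'),
    $ips
);

// $ips[1] holds the first capture group: the addresses themselves.
echo implode("\n", $ips[1]);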

Then, our bash script which is called from cron:

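#!/bin/bash
IPTABLES=/sbin/iptables          # assumed path; adjust for your system
cd "$(dirname "$0")" || exit 1   # cron does not start in the script's directory
for ip in $(/usr/bin/php pingdom.com.php); do
    "$IPTABLES" -A INPUT -s "$ip" -p icmp -j ACCEPT
done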

Of course, this is just an example and you will need to modify the firewall rule(s) according to your needs.
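One thing to watch for: because the loop appends with -A, every cron run adds the same rules to INPUT again. A possible way around this, sketched below, is to keep the Pingdom rules in a dedicated chain that is flushed and refilled on each run. This is only a sketch under assumptions, not a drop-in solution: the chain name PINGDOM, the /sbin/iptables path, and the function names valid_ips and refresh_chain are all made up for illustration.

```shell
#!/bin/bash
# Sketch: rebuild a dedicated chain on every run so rules never pile up.
# Assumptions: the iptables path and the chain name "PINGDOM" are illustrative.
IPTABLES="${IPTABLES:-/sbin/iptables}"
CHAIN="${CHAIN:-PINGDOM}"

# Keep only input lines that look like dotted-quad IPv4 addresses.
valid_ips() {
    grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Read IP addresses on stdin, then flush and refill the chain.
refresh_chain() {
    local ips ip
    ips=$(valid_ips) || true
    # Bail out if nothing usable came in, so a failed download
    # does not wipe the existing whitelist.
    [ -n "$ips" ] || return 1
    # Create the chain if it does not exist yet (errors if it does; ignored).
    "$IPTABLES" -N "$CHAIN" 2>/dev/null || true
    "$IPTABLES" -F "$CHAIN"
    for ip in $ips; do
        "$IPTABLES" -A "$CHAIN" -s "$ip" -p icmp -j ACCEPT
    done
}
```

With these functions in the cron script, the update becomes /usr/bin/php pingdom.com.php | refresh_chain, and INPUT needs a single jump to the chain, added once by hand (for example iptables -I INPUT -j PINGDOM). Note that there is a brief moment between the flush and the refill during which the chain is empty.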

Clickatell Outage

March 13th, 2012

Clickatell have had an outage and have sent their clients an e-mail about it, the gist of which I paste here:

We regret to inform you that Clickatell’s system experienced a total outage due to complete power failure at the data centre which hosts Clickatell’s services. The outage occurred on Sunday the 11th of March 2012 from 09:50 GMT+2 and was resolved at 15:00 GMT+2. During this time all Clickatell services were unavailable. No messages were accepted for delivery and the system was unreachable.

Root Cause:
Clickatell’s services are hosted at a third party data centre. Electrical contractors caused a power outage throughout the data centre’s building while performing routine investigative maintenance on the UPS systems.

As part of their maintenance they bypassed the entire UPS system in order to safely work on it, and powered all infrastructure through one of their standby generators. During the maintenance this generator failed, and the entire building lost power. This meant both our live and standby systems lost power simultaneously.

Clickatell have never struck me as a small company, and I was always under the impression that they are used by some of the largest corporations in the world, such as Google. In light of that, how on earth could they have allowed such a thing to happen?