We use Nagios NMS here at Zocalo Data Systems to monitor all our production servers and equipment for problems and failures. It's a very versatile open source monitoring and notification platform that constantly tests our systems to make sure they are delivering services to our customers. The guiding mantra is "If that fails, how will we know about it before the customer calls us?" So we check everything we can: UPSes, hard drives, CPU utilization, you name it. If something misbehaves, we get an email telling us. The rub came one day when we realized, "Hey, what if the thing that's failed is email? Or the WAN link?" Oops! Suddenly everything would fall silent and we'd think all was well. Not a good situation at all! So we developed an alternate notification channel for Nagios to use whenever the SMTP or WAN service checks fail.
When devising this alternate system, we realized it had to rely upon no resources common to email. We couldn't use the internet at all, and we needed another way to get an automated message to our techs besides email. We also wanted a system that had a low chance of failure, so that meant looking at lower-tech ideas with fewer "moving parts" and less room for error. We settled on a solution using analog modems and classic text pagers.
Pagers are considered dinosaurs. In the '80s, most techs wore them like status symbols; now they're more likely to hide them discreetly under their shirts so they don't get teased. But they make a great alternate system for several reasons: they use a completely separate communications infrastructure from cell phones and the internet, that infrastructure is much more mature (i.e. more thorough coverage than, say, 3G, and less likely to lose messages), and pages can be sent using a basic analog modem and a POTS (plain old telephone service) line. Believe it or not, there's still a market for these things too. Doctors and some other professionals still use them, so the service won't be going away in the near future.
For our system, we used one of the classic external US Robotics modems with the 25-pin RS-232 port. To enable Nagios to send pages, we needed additional software that speaks Telocator Alphanumeric Protocol (TAP), the protocol for sending automated pages. We found that a small, simple package called QPage fit the bill quite nicely. It has a queuing system, so Nagios can throw all kinds of pages at it without overwhelming it. To get Nagios to use this alternate channel, we first had to define a base contact object to be used by any contacts that expect to be reached by pager. We defined it as follows:
define contact{
name pager-base
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r
host_notification_options d,r
service_notification_commands service-by-pager ; command to trigger page
host_notification_commands host-by-pager ; command to trigger page
email foobar@yahoo.com ; ignored since we aren't sending emails
register 0
}
Then we defined a contact and a contact group as follows:
define contact{
use pager-base
contact_name johnsmith
pager johnsmith-qp ; GOTTA HAVE THIS!
alias John Smith QPage
}
define contactgroup {
contactgroup_name pagers
alias Nagios QPage Contacts
members johnsmith
}
Next, we define the SMTP service check, using the pager contact group to override the default contacts that would be emailed.
define service {
use local-service
host_name NMS1
service_description SMTP
check_command check_smtp!20!60
notification_interval 60 ; How often do you want to get paged?
contact_groups pagers ; Use our new contact group
}
Lastly, we need to define the commands that will kick off the page. We used one command for host alerts and a separate command for service alerts, but I think they could be the same if you like.
define command {
command_name host-by-pager
command_line echo -e "From foobar\n\n$NOTIFICATIONTYPE$: host $HOSTNAME$ is $HOSTSTATE$ Info: $HOSTOUTPUT$ ($LONGDATETIME$)" | /usr/local/qpage/qpage -s localhost -P $CONTACTPAGER$ -f "" -m
}
define command {
command_name service-by-pager
command_line echo -e "From foobar\n\n$NOTIFICATIONTYPE$: $SERVICEDESC$ is $SERVICESTATE$ on $HOSTNAME$ Info: $SERVICEOUTPUT$ ($LONGDATETIME$)" | /usr/local/qpage/qpage -s localhost -P $CONTACTPAGER$ -f "" -m
}
As you can see, the qpage command accepts the pager text on stdin. We also pass it the Nagios macro $CONTACTPAGER$, which comes from the "pager" attribute in the contact object defined above. This maps to a pager definition in the qpage configuration file at /etc/qpage.cf. This definition would look something like:
...
pager=johnsmith-qp
pagerid=5551212 ; John's pager number
...
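Before trusting Nagios to trigger it, the whole chain (qpage, modem, phone line, paging service) can be exercised by hand with the same pipeline the notification commands use. What follows is a minimal sketch rather than part of our production setup: the function names, the "nagios-test" sender, and the QPAGE override variable are made up for illustration, while the qpage path and flags are copied from the command definitions above. It also assumes the message is meant to be a "From" line, a blank line, and then the page text.

```shell
#!/bin/sh
# Hypothetical helper for sending a page by hand.
# QPAGE is an illustrative override so the binary can be swapped out.
QPAGE="${QPAGE:-/usr/local/qpage/qpage}"

# build_page SENDER TEXT: print a "From" line, a blank line, then the text.
# printf is used instead of "echo -e" because it behaves the same in every shell.
build_page() {
    printf 'From %s\n\n%s\n' "$1" "$2"
}

# send_page PAGER TEXT: hand TEXT to qpage for the named entry in qpage.cf,
# using the same flags as the Nagios notification commands.
send_page() {
    build_page "nagios-test" "$2" | "$QPAGE" -s localhost -P "$1" -f "" -m
}
```

With those functions loaded, something like send_page johnsmith-qp "Test page from NMS1" should make John's pager go off within a minute or two. If it doesn't, the problem is in qpage, the modem, or the line, and not in Nagios.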
Now, for any service whose failure might prevent an email from being sent, just configure it to use the pager contact group!
The solution has a couple of weaknesses. The pager vs. email choice is an either-or choice for each service: you will not receive pages for services that are not configured to use the pager contact group, so if you lose email, you won't see those notifications. But if you get the "SMTP is down" page, hopefully that'll be good enough to get your attention. Another weakness is that the qpage command returns successfully when your page request is queued, not when the page is actually sent. This makes for efficient command execution, but it means Nagios gets no confirmation that the page was sent successfully. The phone line could have been disconnected, or the modem turned off, and you would never know. To mitigate this, we created a "Pager" service on our Nagios host that sends a test page once per week.
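One way such a "Pager" check could be built is as a tiny plugin script. The sketch below is an illustration, not the script we actually run: the check_pager function, the message text, and the QPAGE override variable are all hypothetical, and only the qpage path and flags come from the notification commands shown earlier. It follows the usual Nagios plugin convention of exiting 0 for OK and 2 for CRITICAL.

```shell
#!/bin/sh
# Hypothetical plugin sketch: queue a test page and report what qpage said.
# QPAGE is an illustrative override so the binary can be swapped out.
QPAGE="${QPAGE:-/usr/local/qpage/qpage}"

# check_pager PAGER: queue a test page for PAGER and print a one-line status.
# Returns 0 (OK) if qpage accepted the page, 2 (CRITICAL) otherwise.
check_pager() {
    if printf 'From nagios\n\nWeekly test page (%s)\n' "$(date)" |
        "$QPAGE" -s localhost -P "$1" -f "" -m >/dev/null 2>&1
    then
        echo "PAGER OK - test page queued for $1"
        return 0
    else
        echo "PAGER CRITICAL - qpage did not accept the test page for $1"
        return 2
    fi
}
```

Saved as a script that ends with check_pager "$1"; exit $?, this could be wired to a service whose check interval is one week (10080 minutes with the default one-minute interval length). Keep in mind that an OK result only means the page was queued. The real test is a human noticing whether the weekly page actually arrived.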
It's not perfect, but this solution has been running very well for us at two locations for quite some time now. And it's cheap too! We got the modems for less than $5 apiece on eBay, and the pager and service run us $15 per month. The extra POTS line is the most expensive part at about $45 per month, but that's still worth it for the peace of mind it has brought us, as we've experienced several email outages over that time.