T-SQL Tuesday #83 Systems Monitoring Systems

T-SQL Tuesday

For T-SQL Tuesday #83, Andy Mallon (b|t) asks "What's a problem you've been dealing with throughout your IT career?". That's a great question! To answer it, I'll pose a few questions of my own.

What do you do when you have an important IT system that needs to be running smoothly and continuously? Naturally, you monitor it. Whether it's a commercial solution, open-source, or something home grown, your monitoring system plays a very important role. What do you do when the monitoring system fails? Ah! An age old problem.

I was reminded of this recently in my little SQL Server world. I have a number of garden variety alerts set up, plus some other more custom monitoring stuff, which is mostly tied to DDL triggers and event notifications. The one thing all of them have in common is database mail. You can probably guess where I'm going with this. Yep, database mail stopped working. A couple weeks passed before I realized it. Fortunately, out of all the alerts I should have been notified about, none of them were serious.

How would I prevent this happening in the future? I guess I could build another system to monitor my monitoring system. Something like System C, which monitors System B, which monitors System A. But where would that end? System D? System E? Where should the line be drawn? I don't know that there's a right answer here, although admittedly, the farther into the alphabet you get, the more absurd it sounds.

I've decided that I'll draw the line with System C. And as much as I like automating all the things, System C will be me. In short, I'll be manually running a query on a daily basis. Like many DBAs, I have a "daily checklist" script of queries I run on all of my production environments. Adding another query to the script is not a burden:

SELECT m.recipients, m.subject, m.body, 
    m.send_request_date, m.send_request_user, m.sent_status
FROM msdb.dbo.sysmail_allitems m
WHERE m.send_request_date > CURRENT_TIMESTAMP - 3
AND m.sent_status <> 'sent'
ORDER BY m.send_request_date DESC

Nothing super special here. It's a simple query on sysmail_allitems to show me any mail items that failed to be sent over the last three days. Note the four possible values for sent_status: sent, unsent, retrying, and failed.

But back to the larger issue. I remember some old coworkers fretting over their software, running as a Windows service, failing unexpectedly. One of them joked they should create a second Windows service whose job was to restart the original service if it ever stopped. Years later, I took that idea myself and wrote a vbscript that would restart any Windows services that had stopped. Of course I used the Windows task scheduler to run the script periodically. And the task scheduler runs as a Windows service. Kind of a poetic irony, eh?

Systems monitored by systems, monitored by other systems, any of which can fail. It's an age old problem. I know I'm not the answer. I get sick and miss work. I can be forgetful. I (gasp!) sometimes write buggy code. I'm fallible. But everything else is too.