Screwtape

Screwtape at


I don't have any production responsibilities right now, but when I did, it was always a challenge to figure out what led up to a particular failure and what might have caused it—lots of grepping of logs, and applying hard-won knowledge of all the intricate steps of whatever process had failed. Having some kind of correlator tool to automatically record that these log messages came from that process, as part of answering this query would have been great.

The video actually goes one step further, demonstrating a system that not only finds all the details of a particular problematic query, but can find and summarize all the other queries contending for the same resources, which is amazing: usually when something fails, it's not because it actually broke, it's because some other thing changed the environment in an unforeseen way.

Unfortunately, this is more a design pattern than an actual product (proprietary or open source), but I'm definitely going to keep it in mind the next time I have to add monitoring to something.