Jun 9

Written by: Brian Connell

Last week, I visited a Tier 1 European Telco and spent some time with their Technical Operations director. We discussed the growing trend towards outsourcing, and in particular the difficulty that executives have in trusting an external company to run all of their IT systems and provide at least the same level of service performance as before the outsourcing occurred. A pretty basic requirement, I’d say. But it appears that traditional monitoring tools, focussed on infrastructure and the like, are really designed for “reactionary” organizations: that is, organizations that wait until a problem has occurred and then rush to fix it. It struck me that this type of monitoring is binary in nature. It’s alive = everything is OK. It’s not alive = react and fix it. When you think about it, traditional monitoring nearly always fits this pattern. This model also carries through to performance monitoring. Each single event is examined and compared to a threshold, some value that equates to “Below the line = OK – but above the line = react and fix”. So I equate traditional monitoring with single-event, threshold-based monitoring for reactionary organizations.
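
To make the pattern concrete, here is a minimal sketch in Python of single-event threshold monitoring. The 2.0-second threshold, the function name and the sample response times are all hypothetical, chosen only for illustration and not taken from any particular tool.

```python
# Hypothetical threshold: a response time above 2.0 seconds triggers a reaction.
THRESHOLD_SECONDS = 2.0


def check_event(response_time_seconds: float) -> str:
    """Judge one measurement in isolation, with no memory of earlier ones."""
    if response_time_seconds > THRESHOLD_SECONDS:
        return "react and fix"
    return "OK"


# Made-up samples from a service that is steadily getting slower.
samples = [0.4, 0.7, 1.1, 1.6, 1.9, 2.3]
statuses = [check_event(s) for s in samples]

for sample, status in zip(samples, statuses):
    print(f"{sample:.1f}s -> {status}")
```

Every reading below the line is reported as OK, so the steady slide towards the threshold goes unnoticed until the final sample crosses it.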

The executive in question described the ideal solution as one that could predict what was going to happen next. I don’t think he meant a crystal ball for taking to the racetrack or picking next week’s Lotto numbers. What he described was a solution that would inherently understand what was considered “normal” and what was considered “abnormal” from a business activity point of view. By looking for patterns of abnormal activity over shorter periods of time, disasters can be avoided by spotting early warning signs and fixing smaller problems before they turn into giant ones. His ideal solution has a “Big Brother” feel to it, where all behaviour is analysed and examined and hit teams are immediately dispatched to fix anomalies.
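
One simple way to picture “normal” versus “abnormal” is a baseline learned from past business activity, with new readings flagged when they stray too far from it. The sketch below is only an assumption about how such a check might look: the orders-per-minute metric, the sample numbers and the three-standard-deviation tolerance are all hypothetical, and a real solution would need something far richer.

```python
from statistics import mean, stdev


def find_abnormal(history, recent, tolerance=3.0):
    """Return the recent readings that sit more than `tolerance`
    standard deviations away from the historical average."""
    baseline = mean(history)
    spread = stdev(history)
    return [value for value in recent if abs(value - baseline) > tolerance * spread]


# Made-up business metric: orders completed per minute on a normal day.
history = [100, 104, 98, 101, 97, 103, 99, 102]

# Made-up recent readings: the systems are all still "alive",
# but the business activity is drifting away from normal.
recent = [101, 96, 80, 62]

abnormal = find_abnormal(history, recent)
print(abnormal)
```

Nothing here says which server is at fault. It only says that the business is behaving unusually, which is the top-down starting point the executive was asking for.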

It made sense to me and gave me a lot of food for thought. What I like is the top-down approach, which made me ask the question “Why do we monitor bottom-up? How do I know for sure that the technical problem is causing a business impact?”

Do I even care about a technical problem that doesn’t cause any business impact? Very Zen-like. Reminds me of trees falling in forests.

Brian Connell

CTO

2 Responses

  1. Dave Allen Says:

    Interesting post. Keep ’em coming!

    Re collecting data that doesn’t cause a business impact, I guess the falling tree doesn’t matter as long as all the trees in the forest don’t fall!

  2. Brian Connell Says:

    Hi Dave, that’s pretty much the way we see it. We often find that IT operations are cut off from the real business processes and activities, and rarely understand what metrics can correlate technology performance with business performance. So we end up with a situation whereby they only focus on key technical metrics like 99.9% uptime, or creating a performance SLA that averages all performance over an hour.

    It’s a bit like driving a delivery truck remotely by only monitoring the dashboard instruments such as the speedometer and the fuel gauge. You end up waiting for the customer to call to tell you that the wrong parcel was delivered, or that it went to the wrong destination, etc.
