Monitoring is often overlooked in IT services. Among service providers it tends to be an afterthought, and in the IT department it is often under-budgeted or doesn’t get the attention it deserves until something goes wrong.
For the most part, I break monitoring into six key points:
- The Business View
- The Human Layer
- The Application Layer
- The Network
- The Server
- Making Sense of it all!!
The business view aims to distill into a short exec summary what the platform we’re monitoring does for the customer. Is it their entire IT estate? Is it a critical website? Is it a marketing site or landing page with a key marketing campaign behind it? Who are the key users of the system? What is the reputational risk or financial impact of a failure?
The Human layer involves looking at the things that eyeballs see. What does a page look like in a browser? Does it render as expected? How about a user checkout journey through an e-commerce site? It took 3 minutes 32 seconds on average for the last two months; now it takes 5 minutes 15 seconds, which is well outside the normal variation. Why is that the case? How is this affecting users? The human layer is, in our opinion, the most important but quite often the most forgotten. When we get this right for customers, it simply becomes part of their QA process. Some companies have got this right with their DevOps/Continuous Delivery process, some still have work to do, and others have legacy code and want to place some intelligence around it to improve the current customer experience.
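As a rough illustration of the "outside our standard deviation" idea, here is a minimal sketch that compares the latest journey time against its history. The function name, the sample timings and the three-sigma threshold are all assumptions made for the example; a real check would be fed by whatever synthetic-journey tool you already run.

```python
from statistics import mean, stdev


def is_outside_normal_range(history_seconds, latest_seconds, max_sigmas=3.0):
    """Return True when the latest journey time is more than `max_sigmas`
    sample standard deviations away from the historical mean."""
    if len(history_seconds) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history_seconds)
    spread = stdev(history_seconds)
    if spread == 0:
        # Perfectly flat history: any change at all counts as a deviation.
        return latest_seconds != baseline
    return abs(latest_seconds - baseline) / spread > max_sigmas


# Hypothetical checkout timings in seconds, hovering around 3 min 32 s (212 s).
history = [210, 212, 214, 211, 213]
print(is_outside_normal_range(history, 315))  # 5 min 15 s -> True
print(is_outside_normal_range(history, 213))  # within normal variation -> False
```

A fixed sigma threshold is the simplest option; journeys with strong daily or weekly patterns may need a baseline per time window instead.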
The Application layer involves understanding how key applications are functioning and responding. Are TCP ports giving the correct responses? Do application health checks give a good response in a timely manner? Can we see the processes running? Do we have any dead PIDs? Are the logs coming out of the application OK, or are we seeing increased rates of errors that we could monitor for, such as a rising count of HTTP 4xx/5xx responses? With SQL, can we perform counts on a table and see whether they stay at a steady size or have changed dramatically?
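Two of these application checks are easy to sketch with the Python standard library: whether a TCP port accepts connections, and what share of recent responses were HTTP 4xx/5xx. The function names and the 5% alert threshold are assumptions for illustration, and the status codes are expected to come from logs you have already parsed.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def http_error_rate(status_codes):
    """Fraction of responses whose status code is 4xx or 5xx."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 400 <= code <= 599)
    return errors / len(status_codes)


def error_rate_alert(status_codes, max_rate=0.05):
    """Return True when the error rate exceeds the (assumed) 5% threshold."""
    return http_error_rate(status_codes) > max_rate
```

For example, `error_rate_alert([200, 200, 404, 500])` returns `True`, because half of the responses are errors.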
The Network can involve a great deal depending on the level of detail, but at a basic level it means ensuring key services are working. Are key SSL certificates valid, or close to expiry? Do DNS records resolve correctly? Are time services delivering the correct response? What about latency, for example between two different cloud regions? How about bandwidth: are we using the same amount of traffic that we normally do, or is there a spike in outbound traffic? If so, why, and from where? Are key services like load balancers (hosted or cloud-based) doing what they’re supposed to?
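The certificate and DNS questions can be answered with a few lines of standard-library Python. This is a sketch under assumptions: the function names are ours, and the hostnames and expected addresses you pass in are whatever matters in your own estate.

```python
import socket
import ssl
import time


def fetch_cert_not_after(host, port=443, timeout=5.0):
    """Return the certificate's 'notAfter' string for host:port.
    With the default context, an already-expired or otherwise invalid
    certificate raises ssl.SSLCertVerificationError during the handshake."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days between `now` (epoch seconds, defaults to the current time)
    and the certificate's 'notAfter' timestamp."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def resolves_to(hostname, expected_addresses):
    """Return True if hostname resolves to at least one expected address."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return False
    found = {info[4][0] for info in infos}
    return bool(found & set(expected_addresses))
```

A typical use would be to alert when `days_until_expiry(fetch_cert_not_after(host))` drops below a renewal window such as 30 days; that figure is an assumption, not a rule.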
Then we get to the Server. This is where most people monitor today. Typically, we see CPU usage, RAM usage, disk partition usage and checks across key operating system services. This is still an essential area to understand. Metrics are key here, and making sense of them is super important.
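For a feel of what server metrics look like in code, here is a minimal standard-library sketch covering disk usage and CPU load, plus a helper that lists which metrics breach their thresholds. The names and threshold values are assumptions; RAM usage is left out because it needs a platform-specific source or a third-party library.

```python
import os
import shutil


def disk_percent_used(path="/"):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def load_per_cpu():
    """1-minute load average divided by CPU count (Unix-like systems only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def breached(metrics, thresholds):
    """Return the sorted names of metrics whose value exceeds its threshold."""
    return sorted(
        name for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    )


if __name__ == "__main__":
    # Hypothetical thresholds; tune them to your own baseline.
    limits = {"disk_pct": 85.0, "load_per_cpu": 1.0}
    current = {"disk_pct": disk_percent_used("/"), "load_per_cpu": load_per_cpu()}
    print(breached(current, limits))
```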
Finally, the complex bit is making sense of it all. We’ve seen several vendors claim that if you buy their tool, you’ll never need anything else for monitoring ever again. The truth, of course, is that good monitoring, especially in complex enterprise environments, needs a collection of monitoring tools to capture data from different services and technologies.
The trick is to admit that and design it into your strategy. We often amalgamate complex information into dashboard views that are useful to engineering teams or business teams. The aim for companies like ours is to remediate issues in the lower Server, Network and Application areas before they spill over to the Human/Business dashboard, where they actually start to impact users.
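One simple way to amalgamate results from several tools is to reduce every check to a layer and a status, then show the worst status per layer. The sketch below assumes a three-level severity scale and made-up check names; it is one possible roll-up, not a description of any particular dashboard product.

```python
from collections import defaultdict

# Assumed severity scale: higher numbers are worse.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


def roll_up(checks):
    """Given (layer, check_name, status) tuples, return the worst status per layer."""
    worst = defaultdict(lambda: "ok")
    for layer, _name, status in checks:
        # Reading worst[layer] also registers the layer, so all-ok layers appear.
        if SEVERITY[status] > SEVERITY[worst[layer]]:
            worst[layer] = status
    return dict(worst)


# Hypothetical results gathered from different monitoring tools.
results = [
    ("server", "cpu", "ok"),
    ("server", "disk", "warning"),
    ("network", "dns", "ok"),
    ("application", "health_check", "critical"),
]
print(roll_up(results))
```

Catching the `warning` on the server layer here is exactly the kind of early signal that lets you act before the Human/Business view turns red.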
My perspective is that monitoring isn’t that expensive to get right, and that having detailed telemetry across your IT services is critical. It helps not only with day-to-day operation, but also with capacity planning and forecasting business spend.
That data also helps when negotiating with vendors: you have a better idea of how much capacity and licensing you actually need, so the monitoring services tend to pay for themselves.
If this is an area you’re currently struggling with, have a chat with us. We’re a small consultancy, and we provide monitoring and platform management to a number of customers across on-premise, hosted and cloud systems where they’ve been failed by existing suppliers.