How we do monitoring…
I’ve been talking to a few people about how we do monitoring at Factory Internet and why it’s different to normal monitoring. I think sometimes you need to know why something is done a certain way to really understand it.
Firstly, part of my obsession with monitoring comes from having worked with and consulted for several service providers that I don’t think ever reached their potential in this space. In the world of Hosting, Managed Hosting and then Cloud, monitoring always seemed to be either a value add from the Cloud Provider or something that you could do yourself.
In the ITIL days of service management, companies could sort of get away with this because there would be a very clear demarcation between operations and development people. The downside of course is that operations tend to run 24×7 and development tend to run 9×5 for the most part. In this world, if something important to the operation of the system was only known by the development team, they’d be woken up to sort out a production issue. Every now and then, this can be acceptable, but if it happens a lot, the delivery windows that the development team are aligned to start to slide. That means you’re slower to get changes out the door and customer confidence in your operation starts to decline.
Of course, that world didn’t quite work and DevOps started to emerge from the ashes of failed development projects. This method of operation became even more obvious when services were App/Cloud based and an entire user base could be quickly and seamlessly updated. The process had to change to match the new way of working.
Monitoring for the most part seems to have stayed in the dark ages. I’ve met companies where a vendor has sold them a single pane of glass monitoring system which monitors precisely... well, sod all really. I’ve been actively working within DevOps teams for a long time now (~7 years), and arguably longer if we count how some of the start-up service providers I’ve worked with operated. In that model and mindset, you cannot simply have one monitoring system. It won’t do everything, it won’t monitor every angle and it won’t fit every use case.
The trick of course is to use several, but spend the time integrating them well. At Factory Internet, we’ve an adaptable monitoring platform that we bend at will for customer needs and use cases. We can monitor iLo/Drac Server Cards, Cisco/Arista Switches, Blade Servers, Windows Operating Systems, Linux Operating Systems, the insides of JVMs, SQL Servers, Query Performance, Logs, Log Analytics, Page Speed, Page Objects, DNS, SSL, Cloud API Usage, Cloud Service Logins and more. We can monitor systems that are On-Premise, with a VPN, without a VPN, Not Connected to the Internet, Connected to the Internet, Hosted or in the Cloud. I don’t think there is one tool on the market that excels at all of that stuff, but when you use a mixture of API Enabled tools, you can then build common reporting and dashboard layers, common alerting layers and common query layers. The trick is to have the information easily searchable, but ultimately still give the engineers the flexibility of delving into a tool should it be required.
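To make the idea of a common alerting and query layer a bit more concrete, here is a minimal sketch in Python. It is not our actual platform: the tool names (`tool_a`, `tool_b`), their payload fields and the severity mapping are all made up for illustration. The point is only that each tool gets a small adapter that translates its own format into one shared shape, and everything downstream searches that shared shape.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Alert:
    """Common shape that every tool's events are translated into."""
    source: str    # which monitoring tool raised it
    host: str
    severity: int  # 0 = info, 1 = warning, 2 = critical
    message: str


def from_tool_a(payload: dict) -> Alert:
    # Hypothetical tool that reports a text level.
    levels = {"INFO": 0, "WARN": 1, "CRIT": 2}
    return Alert("tool_a", payload["hostname"], levels[payload["level"]], payload["text"])


def from_tool_b(payload: dict) -> Alert:
    # Hypothetical tool that reports a numeric priority, 1 being most urgent.
    priority = payload["priority"]
    severity = 2 if priority <= 1 else 1 if priority <= 3 else 0
    return Alert("tool_b", payload["node"], severity, payload["summary"])


# One adapter per tool; adding a tool means adding one entry here.
ADAPTERS = {"tool_a": from_tool_a, "tool_b": from_tool_b}


def normalize(events: Iterable[Tuple[str, dict]]) -> List[Alert]:
    """Translate (tool name, raw payload) pairs into common Alert records."""
    return [ADAPTERS[tool](payload) for tool, payload in events]


def search(alerts: Iterable[Alert], host: Optional[str] = None, min_severity: int = 0) -> List[Alert]:
    """Common query layer: filter by host and severity, most severe first."""
    matches = [
        a for a in alerts
        if a.severity >= min_severity and (host is None or a.host == host)
    ]
    return sorted(matches, key=lambda a: -a.severity)
```

With something like this in place, a dashboard or on-call alerting rule only ever talks to `search`, while engineers can still open the original tool (recorded in `source`) when they need the full detail.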
At Factory Internet, we wish the world would monitor this way. In DevOps, Site Reliability Engineers and Engineering teams are doing it this way, but in traditional IT, this just doesn’t happen. The insights we’ve gained for customers have been massive. Typically when we perform an assessment of an environment, we find a bunch of errors and problems to fix right away, and some of these problems could have caused significant downtime. Until we’d rolled our tooling across their environment, there was no way of knowing.
The story doesn’t end there though. We treat monitoring as an ongoing thing. It’s not something you set up and forget, it’s something you treat as live telemetry: the output of, and the insight into, what your business is doing electronically. It can give insights to marketing teams, it can help you plan capacity and capex spend, and it can help you understand your customers’ busiest times and when you need to support them the most.
If you want to talk more about how data-driven telemetry and monitoring can help digitally transform your business, get in touch!