Embracing Growth: Lessons From a Global IT Outage

By Eric Egolf, CEO

It has been a little over a week since the CrowdStrike-related incident and the chaos that ensued. Many articles have already been written about it, so I don’t want to spend too much time rehashing what’s already out there. But I do want to give a quick synopsis of the situation, share some of the lessons we at CIO Solutions learned, and highlight some of the conversations we expect to continue unfolding in the industry.

An Overview of What Happened

At its core, the cause of the incident was simple. This was not an external threat or breach of any kind; it was a software update that CrowdStrike, a leading security software provider, released for its product. This was a specific kind of update: one that affected a driver, not just the application software. It’s common for security vendors to push updates this way, which means customers don’t control whether or when they are applied. Pushing an update at the driver level ensures it reaches everyone and is usually done to keep customers secure.

This particular update involved a faulty driver at the kernel level (the core of the Windows operating system). Once the update was applied, the system was unusable until someone manually intervened to roll the update back. In many cases, that manual fix required hands on keyboards and IT personnel in front of computers, a key reason it took so long to resolve.

The fallout was huge. The disruption was widespread (an estimated 8.5 million Windows devices) and affected the operations of organizations around the globe. The cost of the damage is estimated to be in the billions.

An event like this has never been experienced before. I like to think in terms of what we can learn from it: the insights we at CIO Solutions can gain to enhance our response abilities, the advancements vendors might make from this experience, and the overall industry knowledge that will now shape future conversations.

What We Learned At CIO Solutions

I can tell you that when our staff left work on Thursday evening, they had no idea what they were in for. When the issue was detected, our teams were called back in for a two-day, round-the-clock sprint of high-octane, high-stress, high-stakes work. It was not an experience they would choose to relive.

As with any first-time event, we uncovered some areas for improvement. Most of these growth opportunities are in the areas of prioritization and documentation.

Given the circumstances, I believe we did a pretty good job prioritizing which systems to focus on first so we could effectively divide and conquer the remediation work. But the prioritization was intuitive and reactive, making it more ad hoc than it would have been if we had had time to plan our approach proactively and intentionally.

Likewise, for an event of this scale, our normal help desk documentation system was not ideal. With thousands of tasks being added to the list and changing rapidly, there are likely more robust ways to keep track of the work, progress, and accountability in an incident like this. With the experience of this unique scenario now under our belt, we can continue to explore and evaluate these learning opportunities.

Vendor & Industry Lessons

On a more macro level, there are a lot of lessons to be learned for vendors and the industry overall. One of them is how vendors empower IT admins. Any vendor that pushes software updates to systems, whether in the security space or not, is going to need to rethink the tools it gives IT admins to control those updates.

Another thing we’ve seen time and time again is vendors that experience a devastating event and come back stronger as a result. Again, we’ve never seen anything at this scale, so the story will continue to play out in a unique way. Regardless, the vendors involved will be rethinking the checks and balances in their quality assurance processes. They will be forced to reexamine how they test updates before release, as well as better ways to stagger the rollout of those updates.
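To make "staggering updates" concrete, here is a minimal sketch of one common approach: a ring-based rollout that stops as soon as an early ring shows too many failures. This is an illustration only, not how CrowdStrike or any particular vendor does it. The ring sizes, the 2% failure threshold, and the names rollout_bucket, ring_for, staged_rollout, and apply_update are all hypothetical.

```python
import hashlib

# Hypothetical cumulative share of the fleet covered by each ring:
# 1% first, then up to 10%, then 50%, then everyone.
RINGS = [0.01, 0.10, 0.50, 1.00]


def rollout_bucket(device_id: str) -> float:
    """Map a device ID to a stable value between 0 and 1."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def ring_for(device_id: str) -> int:
    """Return the index of the first ring whose cutoff covers this device."""
    bucket = rollout_bucket(device_id)
    for ring, cutoff in enumerate(RINGS):
        if bucket < cutoff:
            return ring
    return len(RINGS) - 1


def staged_rollout(devices, apply_update, max_failure_rate=0.02):
    """Update one ring at a time; halt if a ring fails too often.

    apply_update is a caller-supplied function (an assumption of this
    sketch) that returns True when a device updated cleanly.
    Returns the list of ring indexes that completed.
    """
    completed = []
    for ring in range(len(RINGS)):
        members = [d for d in devices if ring_for(d) == ring]
        if members:
            failures = sum(1 for d in members if not apply_update(d))
            if failures / len(members) > max_failure_rate:
                break  # stop before the bad update reaches later rings
        completed.append(ring)
    return completed
```

Because each device's ring comes from a hash of its ID, the assignment stays the same from one update to the next without having to store it anywhere. The point of the design is that a bad update is caught while it has reached only a small slice of the fleet.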

More broadly, questions around secure third-party access will be part of the future conversation. As part of a 2009 agreement with the European Commission, Microsoft adopted interoperability provisions that effectively allowed third parties (in this case, CrowdStrike) access to the “kernel” level. This level of access means third-party security tools like CrowdStrike can affect Windows devices at a deeper operational level. The ability to access this level of Windows devices was a core piece of this perfect storm, and the reason Windows devices specifically were impacted. It’s worth noting that Apple has no such access-level requirement in the EU and operates in a different ecosystem. Whatever this ends up looking like, there will likely be conversations around regulatory requirements, and an evolution toward better, more secure ways to ensure interoperability and grant third-party access.

In Conclusion

The silver lining for us at CIO Solutions is that any team worth its salt comes together in adversity, and we truly got to see that in real time. This experience connected our team even more and reminded everyone how agile, capable, and dedicated their colleagues are. This type of event has never been seen before, and they worked together under pressure to create the playbook on the fly. I have to give our team an A+ for teamwork, creativity, and tenacity.

As for how vendors will recover and what new processes and requirements we can expect to emerge in the industry, that’s still unclear at this point. Only time will tell what ultimately shakes out from this event. One thing is for sure: the industry overall will continue demanding discussions and answers around these core issues. Hopefully, we will see more solutions that give IT departments and service providers the controls they need, while also ensuring that even mistakes by a vendor’s own people don’t have the unchecked ability to cause such widespread havoc.