To avoid the next major outage, we need continuous testing

Thu, 17th Oct 2024

Clark Whitty, Director of Product Management, Automated Test & Assurance, Spirent

In the months since the massive CrowdStrike/Microsoft outage last July, we've learned a great deal about what went wrong. A major cybersecurity provider pushed out a flawed update to its widely deployed enterprise endpoint protection product. Despite being (erroneously) cleared for release, the update caused Windows systems across the globe to crash and prevented them from recovering through a normal reboot. In just minutes, 8.5 million devices worldwide were showing the dreaded "blue screen of death." Many would stay that way for hours.

There are several important lessons we can draw from this incident, but let's focus on two of them. First, even well-intentioned software updates can lead to catastrophic outages that break business continuity. In this case, thousands of organizations worldwide had to halt operations. Financial firms, government agencies, and airlines were particularly affected, with thousands of flights canceled and countless business transactions disrupted. Direct losses among Fortune 500 enterprises alone are projected to exceed $5 billion.

The second big lesson: there is still no substitute for proactive, comprehensive testing. As the enterprise software stack grows more complex and interdependent, we can't assume that any new update or version release—for internal products or the hundreds of third-party components an organization might use—will be safe to deploy. What we can do, however, is ensure that proactive, automated testing is embedded into change management tooling. This won't eliminate software conflicts. But implemented effectively, continuous testing (CT) ensures that when an update does fail, you know about it long before it affects your users.

The Promise and Peril of Automation
Use the term "DevOps" outside the technology industry, and most people will have no clue what you're talking about. Yet the advent of continuous integration/continuous delivery (CI/CD) models and the ability to perpetually expand a product's capabilities via ongoing software updates has been downright revolutionary. Based on independent DevOps surveys, organizations with the most advanced CI/CD practices deploy 208x more frequently, with 106x faster lead times than those without.

At the same time, though, automating software changes is like driving a Formula 1 car. The faster you're moving, the more serious the consequences if something goes wrong. Put simply, the more you accelerate software delivery, the more important it becomes to have effective safety measures in place. That starts with proactive, comprehensive testing that's tightly integrated with DevOps toolchains, executing automatically as part of the CI/CD pipeline.

Unfortunately, there is no single, standardized way to implement CI/CD, and different organizations take very different approaches. In too many cases, the testing elements of DevOps don't get the emphasis they should. Test cases aren't as comprehensive as they should be, or they're not conducted early or often enough—especially given the downsides of a serious issue being missed. Bottom line, the way you approach CT can make or break your CI/CD implementation.

Inside Continuous Testing
An effective CT framework embeds testing directly into the delivery pipeline and automatically invokes test cases throughout the software lifecycle, from early development through release. Such a framework employs a much larger pool of tests than standard QA testing and executes them more frequently. It establishes multiple pass/fail data points aligned with predefined requirements. And it's fully orchestrated and integrated with the CI/CD pipeline, ideally via cloud-based lab and testing infrastructure that can scale elastically as needed.
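
To make the idea of pass/fail data points concrete, here is a minimal Python sketch of a release gate. It is an illustration only, not a description of any particular product: the requirement names, stage labels, thresholds, and measurement keys are all assumptions invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Requirement:
    name: str                       # hypothetical requirement label
    stage: str                      # hypothetical pipeline stage where it runs
    check: Callable[[dict], bool]   # returns True when the requirement is met

def run_gate(requirements, measurements):
    """Evaluate every requirement and return (all_passed, per-requirement results)."""
    results = {r.name: bool(r.check(measurements)) for r in requirements}
    return all(results.values()), results

# Illustrative requirements; the thresholds are assumptions, not recommendations.
REQUIREMENTS = [
    Requirement("unit tests pass", "build", lambda m: m["unit_failures"] == 0),
    Requirement("boot succeeds after update", "integration", lambda m: m["boot_ok"]),
    Requirement("p95 latency under 200 ms", "staging", lambda m: m["p95_latency_ms"] < 200),
]

# Made-up measurements for a candidate update that breaks the boot check.
sample = {"unit_failures": 0, "boot_ok": False, "p95_latency_ms": 150}
passed, results = run_gate(REQUIREMENTS, sample)
for name, ok in results.items():
    print(f"{'PASS' if ok else 'FAIL'}  {name}")
print("release allowed" if passed else "release blocked")
```

In a real pipeline, each check would be fed by an automated test run rather than a hard-coded dictionary, and a failed gate would stop the update from moving to the next stage.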

When organizations adopt continuous automated testing, they can get immediate feedback on new software updates or version releases and better identify potential conflicts and risks. By building up the CT components of CI/CD, organizations can:

Detect problems earlier: Massive outages are not caused only by malicious attacks. As the CrowdStrike incident demonstrates, simple errors (in this case, a glitch in the tool used to perform validation checks) can be catastrophic. When organizations have a large pool of repeatable tests, automatically and repeatedly executing across the delivery pipeline, they can identify most issues long before they're pushed into production.
Improve stability and security of IT systems: When organizations automate testing, they gain new capabilities to monitor performance and detect potential issues more accurately. By capturing baseline KPIs of network performance and security posture, for instance, they can quickly identify when those KPIs are drifting after a change.
Accelerate time-to-delivery: Automated testing, especially when implemented with on-demand lab-as-a-service (LaaS) and testing-as-a-service (TaaS) solutions, can provide an immediate agility boost. Organizations can scale testing resources wherever and whenever they're needed and push updates more quickly without undue risk.
Improve overall effectiveness: Organizations that implement CT effectively don't just move more quickly; they operate with higher productivity and deliver better quality and compliance. They can also cut costs substantially by identifying problems early in the development cycle and avoiding regulatory fines and lawsuits.
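
The baseline-KPI idea in the list above can be sketched in a few lines of Python. This is a hedged illustration, not a prescribed method: the KPI names, the sample values, and the 10 percent tolerance are assumptions chosen for the example.

```python
def find_drift(baseline, current, tolerance=0.10):
    """Return KPIs whose relative change from baseline exceeds the tolerance."""
    drifted = {}
    for kpi, base in baseline.items():
        if kpi not in current or base == 0:
            continue  # skip KPIs that cannot be compared
        change = (current[kpi] - base) / base
        if abs(change) > tolerance:
            drifted[kpi] = round(change, 3)
    return drifted

# Hypothetical KPI snapshots captured before and after a change.
baseline = {"latency_ms": 40.0, "throughput_mbps": 950.0, "blocked_attack_rate": 0.99}
after_change = {"latency_ms": 58.0, "throughput_mbps": 940.0, "blocked_attack_rate": 0.98}

for kpi, change in find_drift(baseline, after_change).items():
    print(f"{kpi} drifted by {change:+.1%} relative to baseline")
```

With these made-up numbers, only latency_ms is flagged; the other two KPIs move by roughly one percent and stay inside the tolerance. A single relative threshold is the simplest possible rule, and a production setup would likely tune tolerances per KPI.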

Envisioning Effective CT
There are as many different DevOps frameworks as there are companies, so each CI/CD pipeline is unique. Nevertheless, effective CT implementations share core elements. The most successful software organizations employ testing that is:

Comprehensive: Effective CT processes start with a large pool of repeatable tests, and automatically execute them across many short cycles. Along those lines, organizations should be able to quickly spin up a wide range of operating systems that software might be deployed on to validate impact and quality.
Controlled: Organizations should only deploy updates under sufficient control to ensure that all changes are authorized, intentional, and perform as expected. That principle should apply to all third-party elements of the software stack, as well as internal products, to mitigate supply chain risks.
Continuous: Automated testing should be invoked throughout the software delivery pipeline, and changes should be validated via active testing once they are implemented. By testing patches and version updates with synthetic traffic under realistic loads, organizations can identify problems in the post-implementation phase quickly, before they affect users and customers.
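
As a rough illustration of the "comprehensive" principle above, the Python sketch below runs every test suite against every operating system target and collects the failing combinations. The target names, suite names, and the fake_runner function are hypothetical stand-ins; a real implementation would call out to lab or test infrastructure instead.

```python
from itertools import product

# Hypothetical targets and suites, chosen only for illustration.
OS_TARGETS = ["windows-10", "windows-11", "windows-server-2022"]
TEST_SUITES = ["boot", "update-install", "rollback"]

def run_matrix(targets, suites, run_test):
    """Run every suite on every target and return the failing (target, suite) pairs."""
    failures = []
    for target, suite in product(targets, suites):
        if not run_test(target, suite):
            failures.append((target, suite))
    return failures

def fake_runner(target, suite):
    """Stand-in for a real lab call: pretend exactly one combination fails."""
    return not (target == "windows-11" and suite == "boot")

failures = run_matrix(OS_TARGETS, TEST_SUITES, fake_runner)
total = len(OS_TARGETS) * len(TEST_SUITES)
print(f"{total} combinations run, {len(failures)} failing: {failures}")
```

Passing the runner in as a parameter keeps the matrix logic separate from whatever actually provisions the systems, so the same loop can drive a local simulation or an elastic cloud lab.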

Looking Ahead
The days of most enterprise software getting updates two or three times a year are over, and they're not coming back. We're officially in a brave new DevOps world, where organizations can continually bring ever-more innovative software capabilities to users and customers. But if we want to fully realize the benefits of this newfound agility, we have no choice but to make sure we're implementing software delivery pipelines as safely and responsibly as possible. Make sure you're not shortchanging the CT aspects of CI/CD.
