Wednesday, January 22, 2025

Q&A: Lessons NOT learned from CrowdStrike and other incidents

When an event like the CrowdStrike failure literally brings the world to its knees, there’s a lot to unpack there. Why did it happen? How did it happen? Could it have been prevented? 

On the most recent episode of our weekly podcast, What the Dev?, we spoke with Arthur Hicken, chief evangelist at the testing company Parasoft, about all of that and whether we’ll learn from the incident. 

Here’s an edited and abridged version of that conversation:

AH: I think that is the key topic right now: lessons not learned — not that it’s been long enough for us to prove that we haven’t learned anything. But sometimes I think, “Oh, this is going to be the one or we’re going to get better, we’re going to do things better.” And then other times, I look back at statements from Dijkstra in the 70s and go, maybe we’re not gonna learn now. My favorite Dijkstra quote is “if debugging is the act of removing bugs from software, then programming is the act of putting them in.” And it’s a good, funny statement, but I think it’s also key to one of the important things that went wrong with CrowdStrike. 

We have this mentality now, and there’s a lot of different names for it — fail fast, run fast, break fast —  that certainly makes sense in a prototyping era, or in a place where nothing matters when failure happens. Obviously, it matters. Even with a video game, you can lose a ton of money, right? But you generally don’t kill people when a video game is broken because it did a bad update. 

David Rubinstein, editor-in-chief of SD Times: You talk about how we keep having these catastrophic failures, and we keep not learning from them. But aren’t they all a little different in certain ways, like you had Log4j that you thought would be the thing that oh, people are now definitely going to pay more attention now. And then we get CrowdStrike, but they’re not all the same type of problem?

AH: Yeah, that is true, I would say, Log4j was kind of insidious, partly because we didn’t recognize how many people use this thing. Logging is one of those less worried about topics. I think there is a similarity in Log4j and in CrowdStrike, and that is we have become complacent where software is built without an understanding of what the rigors are for quality, right? With Log4j, we didn’t know who built it, for what purpose, and what it was suitable for. And with CrowdStrike, perhaps they hadn’t really thought about what if your antivirus software makes your computer go belly up on you? And what if that computer is doing scheduling for hospitals or 911 services or things like that? 

And so, what we’ve seen is that safety critical systems are being impacted by software that never thought about it. And one of the things to think about is, can we learn something from how we build safety critical software or what I like to call good software? Software meant to be reliable, robust, meant to operate under bad conditions. 

I think that’s a really interesting point. Would it have hurt CrowdStrike to have built their software to better standards? And the answer is it wouldn’t. And I posit that if they were building better software, speed would not be impacted negatively and they’d spend less time testing and finding things.

DR: You’re talking about safety critical, you know, back in the day that seemed to be the purview of what they were calling embedded systems that really couldn’t fail. They were running planes and medical devices and things that really were life and death. So is it possible that maybe some of those principles could be carried over into today’s software development? Or is it that you needed to have those specific RTOSs to ensure that kind of thing?

AH: There’s certainly something to be said for a proper hardware and software stack. But even in the absence of that, you have your standard laptop with no OS of choice on it and you can still build software that is robust. I have a little slide up on my other monitor from a joint webinar with CERT a couple of years ago, and one of the studies that we used there is that 64% of vulnerabilities in NIST are programming errors. And 51% of those are what they like to call classic errors. I look at what we just saw in CrowdStrike as a classic error. A buffer overflow, reading null pointers on initialized things, integer overflows, these are what they call classic errors. 

And they obviously had an effect.  We don’t have full visibility into what went wrong, right? We get what they tell us. But it appears that there’s a buffer overflow that was caused by reading a config file, and one can argue about the effort and performance impact of protecting against buffer overflows, like paying attention to every piece of data. On the other hand, how long has that buffer overflow been sitting in that code? To me a piece of code that’s responding to an arbitrary configuration file is something you have to check. You just have to check this. 

The question that keeps me up at night, like if I was on the team at CrowdStrike, is okay, we find it, we fix it, then it’s like, where else is this exact problem? Are we going to go and look and find six other or 60 other or 600 other potential bugs sitting in the code only exposed because of an external input?

DR: How much of this comes down to technical debt, where you have these things that linger in the code that never get cleaned up, and things are just kind of built on top of them? And now we’re in an environment where if a developer is actually looking to eliminate that and not writing new code, they’re seen as not being productive. How much of that is feeding into these problems that we’re having?

AH: That’s a problem with our current common belief about what technical debt is, right? I mean the original metaphor is solid, the idea that stupid things you’re doing or things that you failed to do now will come back to haunt you in the future. But simply running some kind of static analyzer and calling every undealt with issue technical debt is not helpful. And not every tool can find buffer overflows that don’t yet exist. There are certainly static analyzers that can look for design patterns that would allow or enforce design patterns that would disallow buffer overflow. In other words, looking for the existence of a size check. And those are the kinds of things that when people are dealing with technical debt, they tend to call false positives. Good design patterns are almost always viewed as false positives by developers. 

So again, it’s that we have to change the way we think, we have to build better software. Dodge said back in, I think it was the 1920s, you can’t test quality into a product. And the mentality in the software industry is if we just test it a little more, we can somehow find the bugs. There are some things that are very difficult to protect against. Buffer overflow, integer overflow, uninitialized memory, null pointer dereferencing, these are not rocket science.


You may also like…

Lessons learned from CrowdStrike outages on releasing software updates

Software testing’s chaotic conundrum: Navigating the Three-Body Problem of speed, quality, and cost

Q&A: Solving the issue of stale feature flags

Related Articles

Latest Articles