Warning: really long post. More of an article. This is a reasonably new and barely tested train of thought, so all feedback is most welcome! Also I do not know as much as I’d like about Ops, so please feel free to educate me.
A capability is more than just being able to do something.
The word which describes being able to do something is ability. I do sometimes use this while describing what a capability is, but there are connotations there that are missing.
When we say that someone is capable, we don’t just mean that they can do something. We mean that they can do it well, competently; that we can trust them to do it; that we can rely on them for that particular ability. The etymology comes from the Latin meaning “able to grasp or hold”. I like to think of a system with a capability as not just being able to do something, but seizing the ability, grasping it, triumphantly and with purpose.
Usually when I talk about capabilities, I’m talking about the ability of a system to enable a user or stakeholder to do something. The capability to book a trade. The capability to generate a report. These are subtly different to stakeholder’s goals. The trader doesn’t want to book a trade; he wants to earn his bonus on the trades he books. The auditor is responsible for governance and risk management; his goal is not to read the report, but to ensure that a company is behaving responsibly. The capabilities are routes to that.
We’re most familiar with discrete capabilities.
In order to deliver capabilities, we determine what features we’re going to try out, and at that point we start moving more firmly into the solution space. Capabilities are a great way of exploring problems and opportunities, without getting too far into the detail of how we’ll solve or exploit them.
Features, though, are discrete in nature. Want to book a trade? Here’s the Trade Booking feature. Auditing? Here’s the Risk Report Generation feature. For each of these features, a user starts from a given context (counterparties set up, trades for the year available), performs an event (using the feature) and looks for an outcome (the trade contract, the report).
For anyone into BDD, that will be instantly familiar:
Given a context
When an event happens
Then an outcome should occur.
But what about those capabilities which can’t easily be expressed in that easy scenario form?
- The capability to support peak time loads.
- The capability to prevent SQL injection attacks.
- The capability to be easily changed.
Those are, of course, elements of performance, security and quality, also known as non-functionals.
A system which requires these continuous capabilities must always have them, no matter how much it’s changed. There is no particular event which triggers their outcome; no context in which their behavior is exercised, save that the system is operational (or in the case of code quality, merely existent).
Because of that, it’s harder to test that they’ve been delivered. Any test needs to be carried out regularly, as any changes to the system stand a chance of changing the outcome, and often the result of the test isn’t a pass or fail, but a comparison which notifies at some particular threshold. If the system has become less performant, we can always make judgement calls about whether to release it or not. Become a bit hard to change? Oh, well. Slightly less performant than we were hoping for? Let’s fix it later. Open to SQL injection attacks? Um… okay, that’s slightly different. Let’s come back to that one later, too.
There’s a name for a test that we perform regularly.
It’s called monitoring.
The capability to monitor something is discrete.
While it’s hard to describe the actual capabilities, most monitoring scenarios can be expressed easily. Let’s have a conversation with my pixie, and see what happens.
Thistle: So you want this system to be performant, right?
Thistle: And we’re going to monitor the performance.
Thistle: What do you want the monitor to do?
Me: I want it to tell me if the system stops being able to support the peak load.
Thistle: Okay. Give me an example.
Me: Well, let’s say our peak load is 10 page impressions / second, and we want to be able to serve that in less than 1 second per page. I don’t know what the actual figures are but we’re finding out, so this is just an example…
Thistle: That’s fine; I wanted an example.
Me: Okay. So if the load starts ramping up, and it takes longer than 1 second to serve a page, I want to be notified.
Thistle: So if I write this as a scenario… scuse me, I’m just going to change your three “ifs” to “givens” and a “when”, and add your silent “then”. So it looks like this:
Given a peak load of 10 pages per second
And a threshold of 1 second
When it takes longer than 1 second to serve a page
Then Liz should be notified.
Thistle: Do you want this just in production, or do you need to know about the performance while we’re writing it too?
Me: Well, I need to know about it in production, because then we’ll turn off some of the ads, but that will make us less money so ideally this will never happen. You guys should make sure it supports peak load before it goes live.
Thistle: Okay, so we can use overnight performance tests as a way of monitoring our changes in development. For yours, though, if we’re already sure we’re going to deliver something that meets your peak load requirements, why do we need to worry?
Me: Well, I might end up with more than the peak loads we had before. If the site is really successful, for instance.
Thistle: Ah, okay. So you want to be notified if the page load time rises, regardless of load?
Thistle: So your monitoring tool looks like:
Given the system is running
When it takes longer than 1 second to serve a page
Then Liz should be notifed.
Me: That would work.
Me: Are you going to make this magically work, then?
Thistle: Will you pay me in kittens?
Me: What? No!
Thistle: Probably best to get the dev team to do it.
Some things are hard to monitor.
There are some aspects of software which are really hard to monitor. Normally these are the kind of things which result in chaos if the capability is negated, though that’s a symptom rather than a cause of the difficulty. Security is an obvious one, but hardware reliability and data integrity spring to mind. And remember that time Google started storing what the Google Car heard over open wifi networks? Even “being legal” is a capability!
When a capability is hard to monitor, here are my basic guidelines:
- Learn from other people’s mistakes
- Add constraints
- Add safety nets
- Monitor proxy measurements.
Learning from other people’s mistakes means using industry standards. Persistence libraries are already secured against SQL attacks. Password libraries will salt and hash and encrypt authentication for you.
Constraints might include ensuring that nobody has permission for anything unless it’s explicitly given, refusing service to a particular IP address, or in the worst case scenario, shutting down a server.
Safety nets might include backups, rollback ability, or redundancy like RAID arrays. Reddit’s cute little error aliens ameliorate the effects of performance problems by making us smile anyway (and then hit F5 repeatedly). Being in development is safe; best to fail then, if we can find any problems, or bring experts in who can help us to do so.
Proxy measurements include things like port monitoring (you don’t know what someone’s doing, but you know they shouldn’t be trying to do it there), CRC complexity (it doesn’t tell you how easy it is to change the code, but it’s strongly correlated), bug counts, or the heartbeat from microservices (thanks, Dan and Fred George).
Unfortunately, there will always be instances where we don’t know what we don’t know. If a capability is hard to monitor, you can expect that at some point it will fail, especially if there are nefarious individuals out there determined to make it happen.
It’s worth spending a bit of time to stop them, I think.
Complexity estimation still applies.
The numbers to which I’m referring are explored in more detail over here.
If the capability is one that’s commonly delivered, most of the time you won’t need to worry about it too much. Ops are used to handling permissions. Devs are used to protecting against SQL attacks. Websites these days are generally secure. Use the libraries, the packages, the standard installations, and it should Just Work. (These are 1s and 2s, where everyone or someone in the team knows how to do it). Of course, the usual rules about complacency tipping from simplicity into chaos still apply too.
For capabilities which are particular to industries, domain expertise may still be required. Get the experts in. Have the conversations. (These are 3s, where someone in your company knows how to do it – or you can get that person into your company).
For any capability which is new, try it out. If it’s discrete, look for anything else that might break. If it’s continuous, make sure you try out your monitoring. And look for anything else that might break! (These are the 4s and 5s, where nobody in the world has done it, or someone has, but it was probably your competitor, and you have no idea how they did it.) It’s rare that a continuous capability is a differentiator, but it does sometimes happen.
Above all, and especially for 3s to 5s, make sure that you know who the stakeholder is. For the 4s and 5s it should be easy; it will be the person who championed the project, and your biggest challenge will be explaining the continuous capability in terms of its business benefits. (The number of times I’ve seen legacy replacement systems, intended to be easy to change, lose sight of that vision as sadlines approach…) For 3s, though, the stakeholders are often left out, as if somehow continuous capabilities will be magically delivered. Often they’ve never been happy. And making them happy will be new, so it might end up being a project in and of itself.
That’s mostly because continuous capabilities are difficult to test, so a lot of managers cross their fingers and hope that somehow the team will take these nebulous concerns into account. But now you know.
If you can’t test it, monitor it.
As best you can.