A lot of questions need to be asked over RBS’s computer problems – but if we want to stop this happening again, we need to listen to the answers.
An easy answer. But not a useful one.
So there we have it. For anyone who questions the value of software testing, here is a prime example of what happens when you let a bug slip through. I know we’ve already moved on to another banking scandal, but in case you’ve forgotten: many Natwest customers failed to get paid owing to a botched system upgrade. This has led to all sorts of consequences, and the obvious question of how this could be allowed to happen.
Except that when people ask this question, I fear most of them have already decided on the answer, which is that RBS is a bank and therefore Big and Evil and responsible for everything bad in the world from Rabies to Satan to Geordie Shore. That answer might make people feel better, but it does little to stop this happening again. In practice, what went wrong is likely to have little to do with the credit crunch or banking practices and a lot to do with the boring old fact that any bank – no matter how responsibly it borrows and lends – runs on a highly business-critical IT system where any fault can be disastrous.
An easy claim from a software tester would be that RBS, as Natwest’s owner, must have gone cheap on the testing. I suspect it won't be that simple. By its very nature, a banking IT system is going to be very complex – it has to be capable of handling thousands of transactions every second whilst keeping itself totally secure from hackers – so it would benefit from as much testing as possible. But, as any ISEB-qualified tester can tell you: exhaustive testing is impossible. There is always a balance between testing and finance, and testing has to be prioritised and targeted. This is taken for granted all the time, and it’s only when things go wrong that we ask why.
The fact remains, however, that something went seriously wrong. The Treasury Select Committee is already asking what happened, as is the FSA, so we should get more details soon. But how much we learn will depend on whether the right questions are asked. So here are my suggestions:
- Was the upgrade necessary? Chances are, it was. Security loopholes are uncovered all the time, and a security update for a banking system can’t wait. But if it was an update for the sake of updating, that would be a different matter.
- Were they using out-of-date software? I can’t comment on what banking software is and isn’t used, but I know of numerous systems that doggedly stick to Windows XP or Internet Explorer 6 in spite of being horribly error-prone in a modern IT environment. A business that becomes dependent on out-of-date components, and fails to bite the bullet and upgrade when it needs to, only has itself to blame when the testing can’t keep up with the bugs.
- Was enough time allowed for testing? As a rule of thumb, every day of development should be matched by at least one day of testing. A common mistake, when software uses commercial off-the-shelf products as back-end components, is to do little testing in the belief that the commercial product is bound to work fine. In my experience, that gamble usually backfires.
- Was everything tested that should have been tested? This might seem obvious, but it’s not unusual to concentrate on easy feature tests without paying much attention to more problematic areas such as performance or integration.
- Was the timescale realistic? I ask this only because a common response to a software project overrunning is to cut the testing time. That is a stupid thing to do, but if the budget and timescale have been set in stone the project manager might have had no other option.
- Did they carry on monitoring the update after it was implemented? Software that worked perfectly in the test environment can still fail in the live environment. Since it took them three days to identify the cause of the problem, they have some explaining to do here.
- Was the testing correctly prioritised by risk? To state the obvious, when an area of the software is known to be likely to break, or the consequences of a component going wrong will be severe, you need to concentrate testing on that area (and not spend your time doing endless repetitive tests of low-risk areas) – there’s a rough sketch of this idea after the list. What’s not so obvious is identifying the high-risk areas in the first place. And this brings me to a pertinent question.
- Did the people in charge of the testing properly understand the job? This is where RBS may have a case to answer. The Unite union has suggested that RBS’s outsourcing of its IT work abroad was to blame. I don’t believe in assuming off-shored work is cheaper, more expensive, sloppier, better quality, faster, slower or any other silly generalisation. But when you suddenly outsource your IT work to another country, you lose most of your in-house expertise – quite possibly the people who knew what the risks were and how to avoid them. In the worst-case scenario, the work may have ended up with people whose idea of testing is telling you everything’s fine.
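To make the risk-prioritisation point a little more concrete, here is a rough sketch in Python. The areas and scores are entirely made up for illustration – I obviously have no idea what RBS’s actual components look like – but the idea is simply that a likelihood-times-impact score gives you a defensible order in which to spend a limited testing budget.

```python
# A rough sketch of risk-based test prioritisation. The areas and scores
# below are entirely hypothetical - they are not RBS's actual components.

from dataclasses import dataclass


@dataclass
class TestArea:
    name: str
    likelihood: int  # how likely this area is to break, 1 (rare) to 5 (frequent)
    impact: int      # how bad a failure would be, 1 (cosmetic) to 5 (disastrous)

    @property
    def risk(self) -> int:
        # The usual shorthand: risk = likelihood x impact
        return self.likelihood * self.impact


areas = [
    TestArea("overnight batch payment processing", likelihood=4, impact=5),
    TestArea("scheduler upgrade path", likelihood=3, impact=5),
    TestArea("statement PDF formatting", likelihood=2, impact=2),
    TestArea("login page cosmetics", likelihood=1, impact=1),
]

# Spend the limited testing budget on the highest-risk areas first.
for area in sorted(areas, key=lambda a: a.risk, reverse=True):
    print(f"{area.name}: risk score {area.risk}")
```

The scores are only as good as the judgement of the people assigning them – which is precisely the in-house expertise that tends to disappear when the work is suddenly moved elsewhere.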
However, it might be that RBS has perfect answers for all of the above. That would still not guarantee that nothing can go wrong. As exhaustive testing is impossible, there is always a chance that an untested area thought to be low-risk goes disastrously wrong anyway, and there is no foolproof way of stopping this. So I have two final, very important questions:
- Did they have a fall-back plan for a fault making it into the live environment? No matter how good your test plan is, you always have to think “What’s the worst that could happen?” The wrong answer is “But it definitely won’t happen.” The #1 mistake of the Titanic was not the design flaws that allowed the ship to sink, but the foolish assumption that as the ship was unsinkable there was no need to provide enough lifeboats. Did RBS do a Titanic and assume their tested upgrade couldn’t possibly go wrong? I doubt they would have been stupid enough to have no plan at all, but this leads me on to the other important question.
- If they had a contingency plan, was it credible? In far too many cases, contingency plans are made for reviewing, signing off and shelving but not actually implementing. When the sole purpose of a contingency plan is to allow you to say “Yes, we have a contingency plan,” … well, you can imagine the rest.
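For what it’s worth, here is a rough sketch of the difference between a contingency plan that exists on paper and one you can actually run. Every name in it is a hypothetical placeholder rather than anything a real bank uses; the point is simply that the rollback path is executable and gets exercised, not just reviewed and shelved.

```python
# A rough sketch of a contingency plan that is executable rather than just
# signed off and shelved. Every name here is a hypothetical placeholder -
# this is not any real bank's deployment tooling.

import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("release")


def deploy_upgrade() -> None:
    log.info("Applying upgrade to the live environment...")
    # ...the actual upgrade steps would go here...


def smoke_test_passes() -> bool:
    # In reality this would run end-to-end checks after the release:
    # can payments be submitted, do batch jobs complete, do balances update?
    log.info("Running post-release smoke checks...")
    return False  # simulate the worst case so the rollback path gets exercised


def roll_back() -> None:
    # The rollback itself needs to have been rehearsed in the test
    # environment, otherwise it is a document, not a plan.
    log.warning("Smoke checks failed - rolling back to the previous release.")


if __name__ == "__main__":
    deploy_upgrade()
    if smoke_test_passes():
        log.info("Upgrade verified in the live environment.")
    else:
        roll_back()
```

The deliberately failing smoke check is the whole point: if the rollback path has never been run before the day you need it, it isn’t really a plan.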
But all of these questions rely on an attitude of “What went wrong?” first, and “Who went wrong?” a long way second. Unfortunately, there are already signs of the latter option being favoured. I’ve seen what happens when people blame each other for IT problems, and it’s not a pretty sight.
Whatever story RBS offers, there are valuable lessons to be learned.
I only hope someone’s interested in learning these lessons.