
Systems Risk

"Systems Risk" is in the position that Operational Risk was in a decade ago (pre-Basel II): everyone knows that Information Technology is a major issue in Financial Services, but the industry has not found satisfactory ways of analysing and measuring the associated risks. Many business surveys point to IT being of vital interest to Boards and senior management, but we (the IT profession) keep screwing up - I would argue because, in part, neither the IT function nor business has yet learned how to manage risk.

 


February 13, 2008

Ghosts of New Year's Past

As the ball drops in Times Square and fireworks explode around the world, New Year's Eve is a time of hope for the future and an opportunity to reflect upon the past. For IT professionals, one particular New Year's Eve will be remembered for a very long time.

At midnight on the 31st December 1999, nothing happened!

After spending many billions of dollars on the so-called Y2K problem, there was, initially at least, relief that the sky had not fallen in after all. Charitably, it could be claimed that all of the money, time and effort spent by businesses on replacing computer systems to fix the "Millennium Bug" had averted a major catastrophe.

As time passes, however, most experts (and the man in the street) would argue that the money had largely been wasted. Uncharitably, Y2K was a gigantic fraud perpetrated by the IT community on business and the general public.

With time to reflect on the events leading up to 2000, it is apparent that Y2K was a massive failure of risk management, resulting from a serious breakdown in the "Risk Management Process". In particular (using COSO terminology), the processes of Risk Assessment and Risk Response were seriously flawed.

In retrospect, the process of Risk Identification for Y2K worked.

There was indeed a risk that computer systems might not work, or work erratically, as the year tripped over from '99' to '00'. In particular, for those systems that used only two digits to represent dates, it might appear that time was running backwards. There was indeed the potential for problems to occur even in non-critical systems.
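
The arithmetic behind "time running backwards" can be shown in a few lines of Python. This is only a sketch: the function names and the birth year are hypothetical and describe no particular system.

```python
def years_elapsed_two_digit(start_yy: int, end_yy: int) -> int:
    """Elapsed years when only the last two digits of each year are stored."""
    return end_yy - start_yy


def years_elapsed_four_digit(start_year: int, end_year: int) -> int:
    """Elapsed years when the full four-digit year is stored."""
    return end_year - start_year


# Hypothetical customer born in 1925, with the age calculated in 2000.
age_two_digit = years_elapsed_two_digit(25, 0)         # '25' to '00'
age_four_digit = years_elapsed_four_digit(1925, 2000)  # 1925 to 2000

print(age_two_digit)   # -25
print(age_four_digit)  # 75
```

With two-digit years the customer appears to be minus 25 years old; with four-digit years the same subtraction gives 75.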

The next stage in a robust risk management process, Risk Assessment (or Risk Evaluation in AS/NZS 4360 terminology), is where the Y2K process began to break down.

It is not too difficult to envisage scenarios where computer microprocessors might go awry and cause planes to fall from the sky, power plants to shut down and even nuclear missile attacks to be triggered inadvertently. Driven by a media frenzy, and the emergence of millennial sects claiming that the world was about to end, the public were spooked and politicians and regulators had to be seen to act: how did not matter, just do something!

A rational, considered, mature Risk Assessment of the impact of the Y2K date "rollover" would have quickly concluded that, for most businesses, the risk of a catastrophe occurring, to them, was very low. Yes, the military and critical service providers would have to do a very detailed risk assessment of the impact of the date change on their systems and remedy any problems that were identified. However, for most firms, the risks were unlikely to be large and, since firms are regularly embarrassed by sending incorrect bills and statements to their customers, the damage would hardly be catastrophic.

However, the scary cat was out of the bag and business managers were spooked by the possibility (admittedly very low) of a major disaster.

[Lessons: (1) Do not let the tail of a probability distribution dominate business actions. (2) Risks are specific to a setting; just because another business, or the government, may have a potentially large problem, does not mean that your firm has a similar sized risk.]

Having failed to assess the risks in Y2K adequately, firms then compounded the problem by mismanaging their Risk Responses (or Risk Treatments in AS/NZS 4360 terminology).

Risk Responses can be classified under four generic "strategies" : Avoid, Reduce, Share and Accept/Retain, colloquially known as the 4Ts - Terminate, Treat, Transfer and Take. Responding to an identified risk involves employing one or more of these generic strategies to achieve the level of risk acceptable to management - their so-called "risk appetite".

Now, while the IT community deserves to shoulder the bulk of blame for the Y2K fiasco, business management must also share some of the responsibility. In order to adequately respond to risks, there must be some decision on how much residual/retained risk is acceptable, after a risk treatment is completed.

Entirely eliminating risk, even if feasible, can be a very expensive exercise.

In order to determine how much money should be spent on treating risks, management should ideally provide clear guidance on risks that are acceptable, i.e. they should articulate their "risk appetite".

A recent study [1] by the Financial Services Authority (FSA), the UK financial services regulator, highlighted the difficulties of developing a quantifiable risk appetite, but nonetheless a considered risk appetite is a necessary pre-condition for managing risk effectively. The absence of clear direction from management, combined with the lack of a good assessment of the actual risk, meant that IT departments were, in effect, given an open checkbook to just make the Y2K problem go away.

Before looking at other risk response strategies, it should be noted that, in this case, "Avoid" was simply not an option - the year 2000 was coming and there was no way around the problem.

Looking next at a "Retain" risk response strategy for Y2K, it is obvious that, while the impact of the Y2K date-rollover had the potential to be embarrassing for many firms, e.g. selling insurance to a minus 25 year old man, it would hardly be catastrophic. It should be noted here that "Retain" is not "Ignore", but is the conscious, considered acceptance of the possibility of some damage given a rigorous assessment of the range of problems that could possibly occur. In many risk situations, Retain is often the most rational risk response decision that can be taken by management.

The "Share" response strategy is often thought of merely as insurance, which, given the substantial premium costs, was never really a viable option for Y2K. However, some of the most successful "Sharing" risk response strategies involve working with suppliers, customers and even competitors to minimize risks. For example, outsourcing (i.e. sharing with suppliers) is a valid response to the risks of cost escalation in IT and financial services operations.

But in this instance, natural partners, such as hardware and software suppliers, did not have the motivation to share the risk. In fact, they had the opposite incentive: to sell the latest hardware and software to their existing customers, generating several more years of locked-in profits. Likewise, consulting firms did not have an incentive to play down the Millennium Bug problem, since they would benefit enormously from staffing Y2K projects.

[Lessons: (1) one firm's risk is another firm's opportunity and (2) it is difficult to align incentives between different stakeholders when managing risk.]

Without rational analysis and lacking guidance as to what level of risk would be tolerable, IT departments were inevitably driven towards employing strategies to "Reduce" Y2K risks.

Even at this advanced stage of risk mismanagement, there were strategies for reducing risk that would not be overly expensive. Reducing risk involves reducing the "likelihood" of the risk event occurring and/or reducing the "impact" of risk events should they happen.

For a low probability event, reducing its impact is often less expensive than trying to reduce its probability/likelihood. An example is disaster recovery planning, where the costs of making a building completely fire/earthquake proof are so prohibitive that it makes economic sense to reduce any adverse outcomes by using a "back up" building which is unlikely to be impacted by a single disastrous event (though 9/11 showed that multiple adverse events can happen simultaneously).
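
The trade-off can be sketched in Python using the common simplification that expected loss is likelihood times impact. Every figure below is invented purely for illustration and is not drawn from any Y2K data; the function name is likewise hypothetical.

```python
def expected_loss(likelihood: float, impact: float) -> float:
    """Expected loss under the simple model: likelihood x impact."""
    return likelihood * impact


# Hypothetical figures, for illustration only.
likelihood = 0.01      # chance of the event occurring in the period
impact = 10_000_000    # loss if the event happens with no controls in place

baseline = expected_loss(likelihood, impact)

# Option A: spend heavily on prevention to halve the likelihood.
# Option B: spend far less on a backup arrangement that halves the impact.
cost_a = 2_000_000
cost_b = 20_000

total_a = cost_a + expected_loss(likelihood / 2, impact)
total_b = cost_b + expected_loss(likelihood, impact / 2)

print(f"Do nothing:        {baseline:,.0f}")
print(f"Reduce likelihood: {total_a:,.0f}")
print(f"Reduce impact:     {total_b:,.0f}")
```

Both options halve the expected loss, so under these assumed figures the cheaper control wins. A simple expected-value model like this ignores how management feels about extreme outcomes, so it is a starting point for discussion rather than a decision rule.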

Reducing the impact of a risk event means analyzing all of the possible "outcomes" of the event and putting in place "reactive controls" to minimize the total losses. In the case of Y2K, this would have meant, for most firms, taking corrective action only after problems had occurred, for example fixing and re-running any faulty statement processing/billing systems and, of course, apologizing to customers.

However, without rational, considered risk response analysis, IT departments around the world embarked upon strategies to "eliminate" the event happening (i.e. reduce the likelihood/probability to zero). They were encouraged in this by hardware and software suppliers who maintained that only their latest products could be certified as "Y2K compliant".

However, modern systems environments are a complex mix of purchased and homegrown systems, and wholesale hardware/software replacement involves huge and expensive testing projects, often staffed by external consulting companies. IT departments often justified these expensive projects as the necessary refurbishment of their overly expensive systems environment. Using Y2K as an excuse, IT departments went "missing in action", spending vast sums of money on updating "legacy" systems, while ignoring businesses pleading for new systems.

[Lessons: (1) do not overlay risk response projects with other "good ideas"; and (2) when treating risk, beware the "pet project".]

At the highest level, Y2K was a risk management disaster, and the IT community lost its credibility with business by wasting money and time in fixing problems that simply did not exist.

Not that Y2K projects were completely worthless. They did give an impetus to the concept of BCP (Business Continuity Planning) that, arguably, allowed critical service providers and firms to recover more quickly after 9/11.

Y2K was closely followed by yet another risk management disaster - the dot-com bust, where the other side of the risk equation was apparent. In contrast to Y2K, where risks were over-estimated, in the dot-com boom opportunities were greatly exaggerated. In this case, however, the IT community was not entirely to blame; investor greed was largely responsible for over-valuing high-tech companies when the potential for achieving meaningful investment returns was slight. Even a cursory assessment of many of these "opportunities" would quickly have uncovered potential problems.

Why is it important to look back at Y2K and try to draw some useful lessons?

For regulated financial institutions, the new Basel II regulations require that firms set aside capital to cover "the risk of loss resulting from inadequate or failed internal processes, people and systems or from external events". Under Basel II, regulated institutions must take "systems risk" seriously.

Unfortunately, given the continuing incidence of high-profile failed IT projects, there is little evidence that firms have improved the overall management of "systems risk", although there are bright spots in a few areas, such as BCP and Information Security. Given the often-professed statements by senior management that financial services relies on "good systems", and the sheer bottom-line size of systems expenditure, it is surprising that there has been so little emphasis on managing systems initiatives to minimize the potential for failures or, equally important, to maximize opportunities for success.

Business management regularly identifies "IT risk" as a major concern [2], but there appear to be few serious attempts to treat systems risks in the same way as other strategic and business risks. This is partly because there are few "risk frameworks" that fully address the complexity of systems management in modern financial institutions. The author hopes to expand on this topic in later web-logs.


References:
[1] "Operational Risk Appetite", FSA, April 2007, www.fsa.gov.uk
[2] "Best Practice Risk Management - A function comes of age", EIU, May 2007, www.eiu.com

Posted by pjmcconnell at February 13, 2008 01:38 AM
