Understandably, this is a single concept. Factors such as change management, which contributes directly to the cause of these events, and the management of losses inherent to a business function are debates that have to be omitted from this material. Including them would turn a straightforward explanation into a convoluted paper, so let’s keep it a quick, easy read.
The first problem we face is a geographic one. Most central departments, such as an information technology department whose systems fail, bear no direct impact from the fault. When a system fails it may be the technology department that is culpable, but it is certainly the processing area(s) that lose. This is one of the drivers of poor facilities management, as many policies are designed with the technology in mind and not the service it supports. IT staff are also often unaware of the outcome of their actions and hence make decisions without considering the risk they pose. Quite simply, this is usually because they are unaware of the true exposure rather than because of any political malice. Measuring exposure from such events is also difficult because the probability of the event, the frequency of loss and the magnitude of loss are separated from each other.
In particular, where losses are written off and attributed to a product or processing area without capturing causality, accumulating all loss capital against a single source becomes a complex task, because many different business lines may be affected at different times and in disparate ways by the same failed system.
___________________________
A Possible Solution
One solution to this problem is to accept that the three variables, (1) magnitude of loss, (2) frequency of loss and (3) probability of event, are found and measured separately. We then put structures in place to gather each variable where it is found and, finally, formulate a method to combine them.
Hypothetically, say we have one IT system that is used by two separate business units, each of which sits on a different divisional line. The two business units process different products, each carrying its own inherent risk of delivery failure when this system is unavailable.
[Variable 1] Deriving Magnitude
Magnitude of loss from unavailability is actually the most complex variable to dimension, and acquiring the information needs to be tackled in two ways. The first and most obvious place to look is loss data, which would make the whole exercise statistically straightforward. However, this assumes that such losses have been recorded correctly, that they show the full extent of the exposure and that they are representative of the event we are trying to describe. Whether such an event has occurred at all is one very important question: simply assuming that it cannot occur because it has not, without recording that assertion, would leave gaps. Then, of course, how many events are there?
A second and more robust way of dimensioning loss potential is to describe what is at risk. The value can be found in the products or services themselves: the very products that staff process to meet specific service agreements and that are impacted by the unavailability of our system. This information is readily available by looking at the normal mode of operation, which assumes full system availability. The analyst captures the value of products processed in that mode by questioning the business unit.
Grouping the observed product values allows them to be fitted to a normal distribution, so that the variance of the potential extent of loss can be understood. However, the entire product value is unlikely to be the total exposure. The true figure can be higher or lower depending on the product, and business unit management will need to be consulted for a more accurate adjustment.
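The magnitude step can be sketched in a few lines of Python. This is a minimal illustration only: the product values and the exposure factor below are hypothetical figures invented for the example, and the function names are not taken from any source.

```python
import statistics

# Hypothetical values of products processed in normal operation mode
# (illustrative figures, not real data).
product_values = [12_000, 15_500, 9_800, 14_200, 11_300, 16_900, 13_400, 10_700]

# Hypothetical adjustment agreed with business unit management: the share of
# a product's value that is actually exposed when delivery fails.
exposure_factor = 0.15


def fit_normal(values):
    """Return the sample mean and sample standard deviation of the values."""
    return statistics.mean(values), statistics.stdev(values)


def adjusted_magnitude(values, factor):
    """Scale the fitted normal parameters by the exposure factor."""
    mean, stdev = fit_normal(values)
    return mean * factor, stdev * factor


mean_loss, stdev_loss = adjusted_magnitude(product_values, exposure_factor)
print(f"Mean exposure per item: {mean_loss:.2f}")
print(f"Std dev of exposure per item: {stdev_loss:.2f}")
```

A factor above 1 would represent products whose failure costs more than their face value, for example through fines or rework.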
In this way the business unit is able to explain why the delivery of these products would fail, what the cost is and what that expense is made up of. Even where internal data is good, such knowledge is often concealed because the loss amount only shows a consolidation of costs, fees, fines and rework effort.
[Variable 2] Frequency of the Event
Just as with the magnitude dimension, the number of losses recorded from a single event in a business unit would be proportionate to the expected number of items processed in normal operation mode combined with the expected downtime from the event. That is, in each hour of operation a variable number of processing units will arrive and queue up, and these counts can be fitted with a Poisson or negative binomial distribution to find the rate (lambda) of arrivals. These “frequency of normal events” are the processing units at risk. Assuming no remedial action is taken, staff need to disclose the maximum outage time that can be endured before the first unit converts to a loss, and that becomes our threshold of downtime.
So, just like our magnitude of loss, the frequency over time is adjusted by a buffer. Some business units will suffer losses from outages lasting minutes, while others can sustain hours of downtime before significant losses are experienced. Without accurate internal loss data correlated to system outages, a good way to understand the potential number of events from a single outage is again to investigate at a business unit or EGM level.
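As a rough sketch of the frequency step, the Python below estimates a Poisson arrival rate from hourly counts and applies a downtime threshold. The hourly counts, the threshold and the function names are hypothetical. It also assumes that a unit converts to a loss only if it has waited longer than the threshold by the time service is restored, which is one possible reading of the buffer described above.

```python
import math
import statistics

# Hypothetical hourly counts of processing units arriving in normal operation.
hourly_arrivals = [42, 38, 51, 45, 40, 47, 44, 39]

# Hypothetical threshold disclosed by staff: hours of outage tolerated
# before the first unit converts to a loss.
downtime_threshold_hours = 0.5


def estimate_lambda(counts):
    """Maximum-likelihood Poisson rate: the mean number of arrivals per hour."""
    return statistics.mean(counts)


def poisson_pmf(k, rate):
    """Probability of exactly k arrivals in one hour under a Poisson model."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def expected_units_at_risk(rate, outage_hours, threshold_hours):
    """Expected number of units that wait longer than the threshold.

    Assumption: a unit arriving s hours into an outage of length d is lost
    only if d - s exceeds the threshold, so only arrivals in the first
    (d - threshold) hours convert to losses.
    """
    exposed_hours = max(0.0, outage_hours - threshold_hours)
    return rate * exposed_hours


rate = estimate_lambda(hourly_arrivals)
at_risk = expected_units_at_risk(rate, 2.0, downtime_threshold_hours)
print(f"Estimated arrivals per hour: {rate:.2f}")
print(f"Expected units at risk in a 2 hour outage: {at_risk:.2f}")
```

An outage shorter than the threshold yields zero units at risk, which is the buffered behaviour the business unit describes.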
By investigating risk in this manner we are able to establish a hypothetical 'definition of up time' before loss occurs, for each business unit and product in turn. While this seems blatantly obvious, it is an important point to note. By defining our up time we are able to tie potential events to the reliability of the IT system in question; we are also able to set the IT system's SLA properly and qualify our exposures. From a management perspective we may also choose to investigate best practice for introducing alternative solutions at a business unit level while outages are being resolved. Certainly all these activities need to be planned and recorded so that improved processes and controls can be used as methods for capital reduction at a later stage.
[Variable 3] Modeling Probability of Event
To follow this analytical process through, the probability of event should be treated as a closed system with its own set of causal factors, separate from the frequency and magnitude of losses experienced in the business unit. I am not stating that such an event (there could be many) is endogenous, but it is easier to model in isolation from business unit activity, which has its own nature as we have shown. All that said, the investigative route is similar to that for the business units, except that it focuses entirely on Mean Time Between Failures (MTBF), the causes of those failures and the Mean Time To Repair (MTTR) them.
For a system where the breakdown failure rate is constant with respect to time or over a random period, the calculation of reliability can be simply represented as follows:
$R = e^{-\lambda t} = e^{-t/\mathrm{MTBF}}$
Where
- $R$ is reliability
- $t$ is mission time in hours and can be taken over any period, for example an entire week
- $\lambda = 1/\mathrm{MTBF}$ is the (average) failure rate per hour
- MTBF is mean time between breakdown failures in hours
- $e \approx 2.71828$ is the base of the natural logarithm
For example, if $\lambda = 0.01$ per hour (that is, a failure rate of 1 per 100 hours) and $t = 10$ hours, then $R = e^{-0.1} \approx 0.905$. That is, the system has roughly a 90% chance of operating continuously for a 10 hour period.
Where the mission time equals the MTBF, the reliability formula reduces to $R = e^{-1} \approx 0.368$, that is, close to a 37% chance of completing the mission without failure.
Example from R2A
Unreliability can be calculated as $1 - R = 1 - e^{-\lambda t}$
For $t = 1$ and very small $\lambda$ (say $10^{-7}$): $\lambda \approx 1 - e^{-\lambda t}$
At this point the probability of failure and the failure rate become numerically equivalent to the failure frequency. Using the above formula for unreliability, a single year can be broken into separate time intervals, and each interval can be investigated directly against the business unit. To truly understand the drivers of the probability of event, the analyst may also want to track the contributing factors that cause system outages.
The model should, though, theoretically work if all cases are aggregated to give a total probability based on MTBF + MTTR. If the analyst wants to be able to qualify and manage the operational event, however, this type of information will be critical as a precursor to planning maintenance that reduces failures.
[More information can be found by searching on “Reliability Centered Maintenance” and it has many authors over the last fifteen years]
This Excerpt was derived from R2A
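The reliability formulae in this section can be checked numerically with a short Python sketch. The function names are my own and are not part of the R2A material; the inputs are the figures used in the worked examples.

```python
import math


def reliability(failure_rate, mission_hours):
    """R = exp(-lambda * t): probability of no failure during the mission time."""
    return math.exp(-failure_rate * mission_hours)


def unreliability(failure_rate, mission_hours):
    """1 - R: probability of at least one failure during the mission time.

    math.expm1 keeps precision when lambda * t is very small.
    """
    return -math.expm1(-failure_rate * mission_hours)


# Worked example: lambda = 0.01 per hour, t = 10 hours.
print(round(reliability(0.01, 10), 3))       # 0.905
# Mission time equal to the MTBF (here 100 hours).
print(round(reliability(1 / 100, 100), 3))   # 0.368
# Very small failure rate: unreliability is approximately lambda * t.
print(unreliability(1e-7, 1))
```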
[Variable 1,2,3] The Convolution Process
The final step in our piece is, of course, to combine all the components. That is done by connecting the three variables in a single model or, to be precise, by merging the probability of event function with the frequency and magnitude distributions. While this seems quite a complex problem, particularly as the variables belong to different families of curves, it can all be resolved using a simulation process such as Monte Carlo.
Monte Carlo also has the advantage of allowing the analyst to step through its iterations to see where specific limits are creating the worst outcomes. The report of hypothetical losses can then be passed to the business units for their input and approval. This final qualitative step is an important part of allowing the business unit to sign off on its opVar number. It also encourages business unit managers to understand why events occur and, eventually, what to look for as an indicator of fault.
Most importantly, Monte Carlo simulation creates a distribution of losses. These can be ranked and expressed as a probability distribution function, an approach that has been accepted by the regulator as an appropriate method for estimating expected and unexpected loss.
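One way the convolution might look in practice is sketched below in Python, using only the standard library. Everything numeric here is hypothetical. The sketch also makes several modelling assumptions that the text does not specify: outages arrive at the constant rate 1/MTBF, outage durations are exponential with mean MTTR, units arrive as a Poisson process, per-unit losses are normal, and unexpected loss is taken as the 99.9th percentile less the mean.

```python
import random
import statistics


def poisson_count(rng, rate, horizon):
    """Count events of a constant-rate process within the horizon by
    accumulating exponential inter-arrival times."""
    count, elapsed = 0, rng.expovariate(rate)
    while elapsed < horizon:
        count += 1
        elapsed += rng.expovariate(rate)
    return count


def simulate_period_loss(rng, p):
    """Simulate the total loss for one period (for example one year)."""
    total = 0.0
    # Probability of event: outages arrive at the constant rate 1 / MTBF.
    outages = poisson_count(rng, 1.0 / p["mtbf_hours"], p["period_hours"])
    for _ in range(outages):
        # Assumed: outage duration is exponential with mean MTTR.
        duration = rng.expovariate(1.0 / p["mttr_hours"])
        exposed = duration - p["threshold_hours"]
        if exposed <= 0:
            continue  # outage absorbed within the tolerated downtime
        # Frequency: units that queue long enough to convert to losses.
        units = poisson_count(rng, p["units_per_hour"], exposed)
        if units:
            # Magnitude: the sum of `units` independent normal losses is
            # itself normal, so one draw stands in for the whole batch.
            loss = rng.gauss(units * p["loss_mean"],
                             p["loss_stdev"] * units ** 0.5)
            total += max(0.0, loss)  # guard against a negative draw
    return total


def run_simulation(p, iterations=2000, seed=1):
    """Return expected loss, unexpected loss and the ranked loss list."""
    rng = random.Random(seed)
    losses = sorted(simulate_period_loss(rng, p) for _ in range(iterations))
    expected = statistics.mean(losses)
    # Assumed convention: unexpected loss is the 99.9th percentile less the mean.
    tail = losses[min(len(losses) - 1, int(0.999 * len(losses)))]
    return expected, tail - expected, losses


# Hypothetical parameters for one business unit and one IT system.
params = {
    "mtbf_hours": 1000.0, "mttr_hours": 2.0, "period_hours": 8760.0,
    "threshold_hours": 0.5, "units_per_hour": 43.25,
    "loss_mean": 1946.25, "loss_stdev": 370.0,
}

expected_loss, unexpected_loss, losses = run_simulation(params)
print(f"Expected loss: {expected_loss:,.0f}")
print(f"Unexpected loss: {unexpected_loss:,.0f}")
```

Because the ranked list of simulated losses is returned, the analyst can inspect the worst iterations directly. Running the same function with a second parameter set would give the figures for the other business unit sharing the system.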
Wrapping it up
As shown, data is captured from different business units and then combined. While this holistic approach seems quite straightforward, the success of such a program will require careful, project-style management to prevent analysts from having to retrace their steps more than once. Large banks also have a tendency to silo departments by function, and unless the program has a group-wide sponsor, acquiring the specific data points outlined here could meet with some resistance.