Exchange Ideas

Causal Capital

RMB - Risk, Markets & Banking

 

August 11, 2006

The issues of External Data in models aren't to be expected

Banks reading the AMA section of the Basel accord will find that it mentions many parts of the methodology portion of a typical framework: scenarios, loss data, forward-looking measures (indicators), control assessment and the nebulous use of external data. Of all the quantitative measures that an operational risk practitioner uses to dimension potential exposure, external data seems to present one of the largest concerns. In this article we look at some of the problems that come from using external data and at some of the good applications it is suited to.

Basel Accord Paragraph 674

A bank’s operational risk measurement system must use relevant external data (either public data and/or pooled industry data), especially when there is reason to believe that the bank is exposed to infrequent, yet potentially severe, losses. These external data should include data on actual loss amounts, information on the scale of business operations where the event occurred, information on the causes and circumstances of the loss events, or other information that would help in assessing the relevance of the loss event for other banks.

Like many of the wonderful paragraphs of the Basel Accord, this statement gives some insight into what we should be focusing our attention on, and it leaves enough latitude for some banks to create their own elaborate machinations. Many of us in the industry, however, simply struggle to interpret the best use of a scarce resource.


The Providers
Firstly, we need to locate a good provider of such information before we do anything with it. A few companies promote this service, but there aren't many, as I am sure plenty of risk managers who have scanned the internet for hours trying to locate a good source of external data will confirm. Firms that provide external data include OpVantage (part of the Fitch group), OpRisk Analytics, Aon OpBase and, a popular one, the British Bankers Association Global Operational Loss Database, or GOLD. Some of these providers are member based and require all subscribers to contribute as a condition of sale. However, some banks seem reluctant to add to the database even though they wish to draw from it, which just highlights how immature the risk discipline is across the industry as a whole. Is the social good of all served by the disclosure of a few? In the world of data there isn't a place for utilitarianism, and all banks will place a certain level of importance on Pillar II and Pillar III of the accord, but that is another debate altogether.

The largest problem with external data (once we have found it) is applicability: are losses in Europe, for instance, representative of exposures in Asia? For this reason many banks have been looking at industry-specific focus centres for a good source of information.


FSA Capital Requirements Directive Implementation, March 2006

We observed that all AMA aspirants had access to at least one centrally-sourced external loss database, and we saw some convergence in the data providers used ... Although many firms are assessing the need to scale external data, the ‘reliability’ of such data did not appear to have been evaluated by many firms. One simple test a firm could carry out is to assess the quality of reporting of its own events in the external database(s) to which it subscribes and some business areas were making active use of specialist/niche databases, for example on fraud and IT incidents. One concern is that there was limited awareness or recognition of the alignment of loss data to internal events in the design of the OR framework and the AMA solution overall.
FSA Article


Applicability is a two-sided equation, of course. Banks that don't classify their internal losses by product, by Basel II mapped risk event classification and by other suitably transparent mechanisms are going to have real difficulty combining or mixing external data with internal data. As the good old saying goes, “one doesn’t want to compare apples with alphabet letters”, and to be bluntly honest I would concentrate on a homogeneous internal taxonomy before worrying about the specific problems of data scaling. It is scaling, though, that seems to be on the lips of most bankers when they talk about external data, so we are going to address it briefly.


Scaling
Is the Size of an Operational Loss Related to Firm Size? Great question, isn’t it? A group of risk analysts decided to put the theory to the test some six years ago, when the European Commission proposed that capital charges for operational risk might be based on the size and income of a firm.

What they discovered was that “While it seems intuitive that operational risk is to some degree a function of firm size, the nature of this relationship is not straightforward”, and really, if we think this through, it would be incredible to believe that it is. What these analysts found was that size only accounts for a very small portion (about 5%) of the variability in loss severity.

The result of the investigation can be found here on Gloriamundi

The size of the firm is related to the magnitude of the loss, but the association is not linear, and there is clear evidence of a diminishing relationship between size of firm and magnitude of loss. In the real world we see small firms suffering losses on a business line in a similar proportion to big firms. This is to be expected when one considers the context of a banking product and its faults, and if we contemplate the problem carefully it actually diminishes the scaling argument itself. That is, a poor correlation between firm size and loss size leaves a place for external data, largely resolves the scaling issue and takes us into one of the good uses of external data.


Stratification
We agree that to calculate the capital of a bank we take our historical internal loss data, fit it to a family of curves to create a hypothetical loss distribution and then measure a specific quantile of that curve (the confidence level) to give us an opVar number. The problem is that internal data is often insufficient to estimate the upper tail of the loss distribution accurately, because extreme losses rarely occur.
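The fit-then-read-a-quantile step can be sketched as follows. This is only an illustration under stated assumptions: the lognormal family is assumed here (the text does not name one), the losses are simulated stand-ins for real internal data, and the function names are hypothetical. It covers the severity of a single loss only; a full opVar figure would also need the frequency of losses.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def fit_lognormal(losses):
    """Estimate lognormal parameters (mu, sigma) from positive loss amounts."""
    logs = [math.log(x) for x in losses]
    return mean(logs), stdev(logs)

def severity_quantile(mu, sigma, confidence):
    """Loss amount at the given confidence level of the fitted lognormal."""
    z = NormalDist().inv_cdf(confidence)
    return math.exp(mu + sigma * z)

# Hypothetical internal losses, simulated purely for illustration.
rng = random.Random(42)
internal_losses = [rng.lognormvariate(10.0, 1.5) for _ in range(500)]

mu, sigma = fit_lognormal(internal_losses)
q999 = severity_quantile(mu, sigma, 0.999)
print(f"mu={mu:.2f} sigma={sigma:.2f} 99.9% severity quantile={q999:,.0f}")
```

With only a few hundred observations the fitted tail depends heavily on a handful of large losses, which is the weakness the paragraph above describes.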

By combining external data with internal data we are able to increase our sample size and thus improve the estimate of our capital, assuming that we are drawing losses from the same loss distribution. We also “hope” that internal loss data includes all losses that have occurred, while external data includes only losses exceeding a known peer group reporting threshold.

One could of course throw all the data together into a new loss dataset; however, that generally overestimates the likelihood of high losses and takes us back to the scaling debate. The canonical solution to this problem is to stratify the sample when combining internal and external data, which gives us many times our original sample of losses, and it works like this:

Suppose we have Y internal loss observations and Y + (Y × 1/2) external loss observations, and that the external losses are only recorded above a reporting threshold Z. The internal data therefore has fewer observations than the external data but is the only source of losses below Z, while above Z the two sets are treated as draws from the same distribution. A sample from the loss distribution then needs to contain all of the losses over Z from both the internal and the external data set, plus four copies of each loss below Z from the internal data. The replication factor (four here) should reflect how many times more exposure the pooled tail data represents than the internal data alone. Built this way, the new sample is not biased toward higher losses and incorporates all of the available information.
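A minimal sketch of that stratified sample, assuming losses are plain numbers and that the external set contains only losses at or above Z. The function name, the sample figures and the default of four copies are illustrative assumptions taken from the description above, not a reference implementation.

```python
def stratified_sample(internal, external, threshold, copies=4):
    """Pool all tail losses and replicate the internal body of the distribution.

    internal  -- all internal losses (assumed complete)
    external  -- external losses, assumed reported only at or above threshold
    threshold -- the reporting threshold Z
    copies    -- replication factor for internal losses below Z (assumed 4)
    """
    tail = [x for x in internal if x >= threshold]
    tail += [x for x in external if x >= threshold]
    body = [x for x in internal if x < threshold]
    return body * copies + tail

# Hypothetical figures for illustration only.
internal = [100, 250, 400, 1200, 3000]
external = [1500, 2000, 5000, 9000]
sample = stratified_sample(internal, external, threshold=1000)
print(len(sample), sorted(sample))
```

The replicated body keeps the proportion of small to large losses close to what a single bank would see, which is why the pooled tail does not drag the whole curve upward.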

An example of this can be found on the IDEAS NETWORK

Another approach is to estimate the loss distribution with a weighted average, calculating moments and quantiles of the loss distribution so that four times as much weight is given to any loss below the threshold as to any loss above it, across both internal and external data.

There are plenty of examples of stratification on the internet, and I perceive it to be a straightforward yet statistically proven method for mixing the two datasets. We have to remember, though, that our data must come from the same distribution (be homogeneous), otherwise the measure is totally inaccurate.

One can liken it to a voting poll where a reporter asks ten people in a room (a sample of the population) who they are going to vote for: we all know that the more people the reporter asks, the more accurate the assessment becomes.
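The weighted variant can be sketched like this. It is a rough illustration assuming a simple step-function quantile and a weight of four for losses below the threshold; the helper names and figures are hypothetical.

```python
def threshold_weights(losses, threshold, below_weight=4.0):
    """Weight losses below the threshold more heavily than those above it."""
    return [below_weight if x < threshold else 1.0 for x in losses]

def weighted_mean(losses, weights):
    """First moment of the weighted empirical distribution."""
    return sum(x * w for x, w in zip(losses, weights)) / sum(weights)

def weighted_quantile(losses, weights, q):
    """Smallest loss whose cumulative weight reaches the fraction q."""
    pairs = sorted(zip(losses, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for loss, w in pairs:
        cumulative += w
        if cumulative >= q * total:
            return loss
    return pairs[-1][0]

# Hypothetical pooled internal and external losses, threshold Z = 1000.
pooled = [100, 250, 400, 1200, 3000, 1500, 2000, 5000, 9000]
weights = threshold_weights(pooled, threshold=1000)
print(weighted_mean(pooled, weights))
print(weighted_quantile(pooled, weights, 0.5))
print(weighted_quantile(pooled, weights, 0.95))
```

Weighting and replication give the same empirical distribution; the weighted form simply avoids physically copying data points.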

Posted by CausalEvents at 06:04 AM | Comments (2)

April 23, 2006

Units of exposure from central facility failures

Recently I was asked the question: 'How can an analyst dimension or even estimate loss from a central facility failure?'

This is an interesting concern, because dependency on specific services of the bank presents a conundrum for capital allocation when the risk analyst attempts to measure the multiplicity of impacts from a single event. Unlike a bank's more common expected loss data, where events and outcomes can be attributed to a specific cause or department, interference with a central facility's ability to deliver a consistent service presents a special concern. Most importantly, many products, processing areas and diverse customer groups are affected by a single event, and many transactions cannot be completed until such facilities are restored. Then, of course, there can be knock-on exposures as departments scrape scarce resources together to complete their work in an effort to meet ever-moving service level agreements.

In this brief we are going to investigate one approach for estimating the fiscal outcome of a set of potential events that cripple a central service.

Understandably this is a single concept, and factors such as change management, which directly contributes to the cause of these events, as well as the management of losses inherent to a business function, are debates that have to be omitted from this material. To include them would turn what is a straightforward explanation into a convoluted paper, so let’s try and keep it a quick, easy read.

The first problem we are faced with is a geographic one. Most central departments that suffer a failure, such as an information technology function, bear no direct impact from the fault. So when a system fails it may be the technology department that is culpable, but it is certainly the processing area(s) that lose. This is one of the drivers for poor management of facilities, as many policies are designed with the technology in mind and not the service it supports. IT resources are also often unaware of the outcome of their actions and hence make decisions without considering the risk they pose. Quite simply, this is usually because they are unaware of the true exposure, rather than because of some political malice. Measuring exposure from such events is also difficult because the probability of event, the frequency of loss and the magnitude of loss are separated from each other.

In particular, where losses are written off and attributed to a product or processing area without capturing causality, accumulating all loss capital back to a source becomes a complex task, as many different business lines may be affected at different times and in disparate ways by a failed system.

___________________________
A Possible Solution
One solution to this problem is to accept that the three variables, (1) probability of event, (2) frequency of loss and (3) magnitude, are found and measured separately. Then set about putting structures in place to gather those variables where they are found, and finally formulate a method to combine them.

Hypothetically, say we have one IT system that is used by two separate business units, each of which sits on a different divisional line. The two business units process different products, each of which carries its own inherent risk of delivery failure when this system is unavailable.


[Variable 1] Deriving Magnitude
Describing the magnitude of loss from unavailability is actually the most complex variable to dimension, and acquiring such information needs to be tackled in two ways. The first and most obvious place to look is loss data. This would make the whole exercise statistically straightforward; however, it assumes that such losses have been recorded correctly, show the full extent of exposures and are representative of the event we are trying to describe. Whether such an event has occurred is one very important question: simply assuming that it can’t occur because it hasn’t, without recording that assertion, would leave gaps. Then, of course, how many events are there?

A second and more direct way of dimensioning loss potential is to describe what is at risk. The value can be found in the products or services themselves: the very products that are impacted by the unavailability of our system and by the actions staff take to meet specific service agreements. This information is readily available by looking at the normal mode of operation, which assumes full system availability. The analyst captures the value of products processed in that mode by questioning the business unit.

Grouping the occurrences of product values allows them to be fitted to a normal distribution, so that the variance of the potential extent of loss can be understood. However, we are all aware that the entire product value is unlikely to be the total exposure. It can go either way depending on the product, and a more accurate adjustment will require input from business unit management.

In this way the business unit is able to add its comments on why the delivery of these products would fail, what the cost is and what that expense is comprised of. Even where internal data is good, such knowledge is often concealed, as the loss amount only shows a consolidation of costs, fees, fines and rework efforts.


[Variable 2] Frequency of the Event
Just as with the magnitude dimension, the number of losses recorded from a single event in a business unit would be proportionate to the expected number of items processed in normal operation mode combined with the expected down time from the event. That is, in each hour of operation a random number of processing units arrive and queue up, and these counts can be fitted with the Poisson or negative binomial distribution to locate the lambda of the frequency of events. These “frequency of normal events” are the processing units at risk. Assuming no remedial action is taken, staff need to disclose the maximum outage time that can be endured before the first unit converts to a loss, and that becomes our threshold of downtime.

So, just like our magnitude of loss, the frequency in time is adjusted in a buffered manner. Some business units will suffer losses from outages of minutes, while others can sustain hours of down time before significant losses are experienced. Without accurate internal loss data correlated to system outages, a good way to understand the potential number of events from a single outage is to investigate this, again, at a business unit or EGM level.

By investigating risk in this manner we are able to establish a hypothetical 'definition of up time' before loss occurs, for each business unit and product in turn. While this seems blatantly obvious, it is an important point to note. By defining our up time we are able to tie potential events to the reliability factor of the IT system in question; we are also able to set the IT system's SLA properly and qualify our exposures. From a management perspective we may also choose to investigate best practice for introducing alternative solutions while resolving outages at a business unit level. Certainly all these activities need to be planned and recorded so that improved processes and controls can be used as methods for capital reduction at a later stage.
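As a rough sketch of the frequency variable, assume processing units arrive as a Poisson process and that only units arriving in the part of an outage beyond the tolerated downtime convert to losses. That conversion rule, the function names and the figures are assumptions made for illustration.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw a Poisson count using Knuth's method, in chunks to avoid underflow."""
    count = 0
    while lam > 0:
        step = min(lam, 500.0)
        limit, p = math.exp(-step), 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                break
            count += 1
        lam -= step
    return count

def units_at_risk(rate_per_hour, outage_hours, tolerance_hours, rng):
    """Units converting to losses in one outage (assumed rule, see lead-in)."""
    exposed_hours = max(0.0, outage_hours - tolerance_hours)
    return poisson_sample(rate_per_hour * exposed_hours, rng)

rng = random.Random(7)
# Hypothetical unit: 40 items per hour, tolerates 2 hours of downtime.
print(units_at_risk(40.0, 1.0, 2.0, rng))  # outage within tolerance
print(units_at_risk(40.0, 5.0, 2.0, rng))  # 3 exposed hours
```

A negative binomial could replace the Poisson draw where hourly volumes are more variable than a Poisson process allows.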


[Variable 3] Modeling Probability of Event
To follow this analytical process through, the probability of event should be treated as a closed system with its own set of causal factors, away from the frequency and magnitude of losses experienced in the business unit. I am not stating that such an event (there could be many) is endogenous, but it is easier to model in isolation from business unit activity, which has its own nature as we have shown. All that said, the investigative route would be similar to that for the business units, except entirely focused on Mean Time Between Failure, the causes of those failures and the Mean Time To Repair them.

For a system where the breakdown failure rate is constant with respect to time or over a random period, the calculation of reliability can be simply represented as follows:

$R = e^{-\lambda t} = e^{-t/\mathrm{MTBF}}$

Where
- $R$ is reliability
- $t$ is mission time in hours and can be taken over any period, for example an entire week
- $\lambda = 1/\mathrm{MTBF}$ is the (average) failure rate per hour
- MTBF is the mean time between breakdown failures in hours
- $e \approx 2.718282$ is the base of the natural logarithm

For example, if $\lambda = 0.01$ per hour (that is, a failure rate of 1 per 100 hours) and $t = 10$ hours, then $R = e^{-0.1} \approx 0.905$. That is, the system has roughly a 90% chance of operating continuously for a 10 hour period.
Where the mission time equals the MTBF, the reliability formula reduces to $R = e^{-1} \approx 0.368$, or close to 37%.

Example from R2A

Unreliability can be calculated as $1 - R = 1 - e^{-\lambda t}$.
For $t = 1$ and very small $\lambda$ (say $10^{-7}$): $1 - e^{-\lambda t} \approx \lambda$
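The reliability figures above can be checked with a few lines of Python. This is a small sketch assuming a constant failure rate, as the text does; the function names are illustrative.

```python
import math

def reliability(t_hours, mtbf_hours):
    """Probability of running t_hours without failure at a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)

def unreliability(t_hours, mtbf_hours):
    """Probability of at least one failure in t_hours (1 - R)."""
    # expm1 keeps precision when the failure rate is very small.
    return -math.expm1(-t_hours / mtbf_hours)

print(round(reliability(10, 100), 4))    # failure rate 0.01 per hour, 10 hours
print(round(reliability(100, 100), 4))   # mission time equal to MTBF
print(unreliability(1, 1e7))             # failure rate 1e-7 per hour, 1 hour
```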

This is the point at which the unreliability and the failure rate become equivalent to the failure frequency. Using the above formula for unreliability, a single-year period can be broken into different time intervals, and those can be investigated directly with the business unit. To truly understand the drivers for probability of event, the analyst might also want to track the contributing factors that cause system outage.

The model, though, should theoretically work if all cases are aggregated to give a total figure for MTBF + MTTR. If the analyst wants to be able to qualify or manage the operational event, however, this type of information will be critical as an input to planning maintenance that reduces failure.

[More information can be found by searching on “Reliability Centered Maintenance” and it has many authors over the last fifteen years]

This Excerpt was derived from R2A


[Variable 1,2,3] The Convolution Process
The final step in our piece is, of course, to combine all the components. That is done by connecting the three variables in a single model or, to be precise, by merging the probability of event function with the frequency and magnitude distributions. While this seems quite a complex problem, particularly as the variables belong to different families of curves, it can all be resolved using a simulation process such as Monte Carlo.

Monte Carlo also has the advantage of allowing the analyst to step through its iterations to see where specific limits are creating the worst outcomes. The report output of hypothetical losses can then be passed to the business units for their input and approval. This final qualitative step is an important part of allowing the business unit to sign off on its opVar number. It also encourages the business unit managers to understand why events occur and, eventually, what to look for as an indicator of fault.

Most importantly, Monte Carlo simulation creates a distribution of losses. This can be ranked and turned into a probability distribution function, an approach that has been accepted by the regulator as an appropriate method for estimating expected and unexpected loss.
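The convolution described in this section can be sketched as a small Monte Carlo simulation. Everything here is an assumption made for illustration: exponential times between failures and to repair, Poisson arrivals of processing units, normally distributed product values, a fixed fraction of value lost per affected unit, and all parameter values and function names.

```python
import math
import random

def poisson_sample(lam, rng):
    """Poisson draw by Knuth's method, chunked to avoid underflow."""
    count = 0
    while lam > 0:
        step = min(lam, 500.0)
        limit, p = math.exp(-step), 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                break
            count += 1
        lam -= step
    return count

def simulate_year(rng, mtbf_hours, mttr_hours, units_per_hour,
                  tolerance_hours, value_mean, value_sd, loss_fraction,
                  horizon_hours=8760.0):
    """Total loss for one simulated year for a single business unit."""
    total_loss = 0.0
    clock = rng.expovariate(1.0 / mtbf_hours)          # time of first failure
    while clock < horizon_hours:
        outage = rng.expovariate(1.0 / mttr_hours)     # repair time
        exposed = max(0.0, outage - tolerance_hours)   # downtime beyond tolerance
        for _ in range(poisson_sample(units_per_hour * exposed, rng)):
            value = max(0.0, rng.gauss(value_mean, value_sd))
            total_loss += loss_fraction * value
        clock += outage + rng.expovariate(1.0 / mtbf_hours)
    return total_loss

def simulate_losses(n_years, seed=1, **params):
    """Ranked (sorted) annual losses from n_years of simulation."""
    rng = random.Random(seed)
    return sorted(simulate_year(rng, **params) for _ in range(n_years))

def empirical_quantile(sorted_losses, q):
    """Loss at quantile q of the ranked simulation output."""
    index = min(len(sorted_losses) - 1, int(q * len(sorted_losses)))
    return sorted_losses[index]

# Hypothetical parameters for one business unit.
params = dict(mtbf_hours=500.0, mttr_hours=2.0, units_per_hour=40.0,
              tolerance_hours=1.0, value_mean=1000.0, value_sd=200.0,
              loss_fraction=0.05)
losses = simulate_losses(500, seed=1, **params)
expected_loss = sum(losses) / len(losses)
q99 = empirical_quantile(losses, 0.99)
print(f"expected loss {expected_loss:,.0f}, 99% quantile {q99:,.0f}")
```

Running the same simulation per business unit and summing the yearly totals would extend the sketch to the two-unit example, since both units share the same outage events.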


Wrapping it up
As shown, data is captured from different business units and then combined. While this holistic approach seems quite straightforward, the success of such a program will require a careful, project style of management to prevent analysts having to cover their tracks more than once. Large banks also have a tendency to silo departments by function, and unless the program has a group-wide sponsor, acquiring the specific data points outlined here could be met with some resistance.

Posted by CausalEvents at 07:46 AM | Comments (0)

March 12, 2006

A different kind of liquid risk analysis

When modeling operational risk, it is usual for the practitioner to divide the analysis into two parts. This results in:

1) The creation of a severity of loss probability model

and

2) The construction of a frequency of loss probability system

In this brief article we will quickly link these two measures and also look at one contribution to statistical theory.

A single operational risk event is part of a system of many events, each with a specific loss value of its own. However, the primary reason for this division of analysis is that the numerical properties of the two distributions (frequency and magnitude) operate under completely different dynamics, right down to how variance and means are measured. The severity loss model is a continuous distribution that may take on any value between a lower and upper limit, while the frequency model is likened to the random number of customers walking through the door: it is a discrete analysis used to understand the count of events within the number of combinations of possibilities.

The simple essence of Monte Carlo is a convolution process used to combine these two forms of analysis and results in a single picture of what might occur considering the current variables that have been measured.
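The link between the two measures can also be written down directly for the first two moments. As a sketch, assume the annual loss is $S = X_1 + \dots + X_N$, where the count $N$ comes from the frequency model, the $X_i$ are independent draws from the severity model, and $N$ is independent of the $X_i$ (the symbols $S$, $N$ and $X$ are introduced here for illustration):

```latex
\begin{aligned}
\mathbb{E}[S] &= \mathbb{E}[N]\,\mathbb{E}[X] \\
\operatorname{Var}[S] &= \mathbb{E}[N]\,\operatorname{Var}[X] + \operatorname{Var}[N]\,\bigl(\mathbb{E}[X]\bigr)^{2} \\
\text{and, for } N \sim \text{Poisson}(\lambda):\quad
\operatorname{Var}[S] &= \lambda\,\mathbb{E}\bigl[X^{2}\bigr]
\end{aligned}
```

These identities give the mean and variance of the combined loss, but not its tail quantiles, which is why simulation is still needed for a capital number.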

Interestingly though, the history of mathematical analysis is often derived from applications far from the field of science, banking or even economics.

That fundamental ideas in applied mathematics would be developed in a brewery sounds sufficiently improbable, but the story is true and intriguing. The statistical technique most often used to study events of low probability was discovered by a Polish mathematician and an employee of the Guinness brewery.

The scholars behind the stout - John Kay

Posted by CausalEvents at 03:01 PM | Comments (0)
