The interactions between risk management and technology
Focus both on technology developments that can help improve risk management, but also on other aspects such as operational and potentially strategic risks caused by advances in technology
January 25, 2011
Big Data and Risk Management Part 4: Applications
The past two postings on this thread focused on the two Big Data technologies that seem most mature today - Massively Parallel Processing (MPP) relational databases and MapReduce technology, best exemplified by the open-source Hadoop project. I discussed the technologies themselves, and talked about their benefits and drawbacks. In this, the final post of the series, I will address the application of these technologies to risk management.
Before getting into their applications, let's spend a moment considering the analytic requirements of risk management. The term "analytics" is much overused. A good place to start would be the definition of the word - according to dictionary.com, analysis means "this process as a method of studying the nature of something or of determining its essential features and their relations". This meaning certainly applies to risk management, since most of the heavy lifting involves trying to understand unanticipated adverse outcomes due either to portfolio positions being actively taken or to changing markets. But risk management is required to do more - it must predict unforeseen risks that potential strategies may entail, and it is increasingly being integrated into active decision-making in the front office through risk-based pricing processes.
I find it useful to classify analytics into the following levels of maturity:
- Reporting what happened: Traditional risk reporting (daily, weekly, monthly or yearly) falls into this category. This is pretty easy to do - collect the same data over and over again, perform some aggregations (like summing all exposures for a given product type across all business units or geographies) and report them. Many firms do this manually today, with teams of people collecting information via spreadsheets and even email, then plugging the results into other spreadsheets to produce the final report.
- Analyzing why it happened: This is an extension of traditional risk reporting. Analysis also means being able to issue a series of queries of increasing specificity, each query in response to results from the previous one. For example, to answer the question "why did Economic Capital for a product rise even though total exposure fell" one would need to drill down via a series of typically ad-hoc questions to determine that the cause was an adverse change in some factor (like an increase in Loss-Given-Default).
This is much harder to do manually. The manual process for reporting is one-way and inflexible. Asking these sorts of questions requires going back through the human chain and getting further details. While barely feasible (and in fact the way things are done in many firms!) it is also slow and error-prone.
- Predicting what will happen: This is the meat-and-potatoes of good risk management. Technically, predicting risks implies the ability to analyze historical information. Increasingly, this also implies the ability to stress-test which implies adding new data to existing repositories of historical information.
Predicting risk requires a variation of the test-and-learn process that is used for analysis, but it takes it a couple of steps further. First, there is the business of building risk models, which is a massive consumer of historical data, usually the more the merrier. The process involves identifying a subset of columns that, working together, can predict risk. Say one wants to predict a fraudulent transaction coming in among a stream of credit card authorizations. Building a risk model to catch fraud involves finding the appropriate combination of data elements (such as time of authorization request, request amount, location etc.) that together will indicate a possibly fraudulent request (for example, a common case is small charges being put on a stolen credit card at a gas station to check whether the card is operational before loading it up with $3000 worth of stereo equipment). After the model is built, it then needs to be put through its paces in the real world and its predictive power calibrated over time. Variable reduction of this type requires access to a wide range of data elements at will. In fact, in fast-moving areas of risk such as fraud (or indeed quickly changing markets), the speed-to-market of a model can be a significant factor in saving money.
- Describing what is happening now: This is becoming more important as risk metrics are integrated into business processes. For instance, performing pre-deal analysis implies an ability to analyze the effect of the deal on as recent a view of the existing portfolio as possible - the more real-time the better.
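The first two levels above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the records, the field names and the figures are all made up, and expected loss (exposure x probability of default x Loss-Given-Default) is used as a simple stand-in for Economic Capital, which is far more involved to compute in practice.

```python
from collections import defaultdict

# Hypothetical exposure records for two reporting dates; all figures are made up.
# ead = exposure at default, pd = probability of default, lgd = Loss-Given-Default.
EXPOSURES = [
    {"date": "2010-12", "product": "mortgage", "unit": "EU", "ead": 600.0, "pd": 0.02, "lgd": 0.20},
    {"date": "2010-12", "product": "mortgage", "unit": "US", "ead": 400.0, "pd": 0.02, "lgd": 0.20},
    {"date": "2011-01", "product": "mortgage", "unit": "EU", "ead": 580.0, "pd": 0.02, "lgd": 0.20},
    {"date": "2011-01", "product": "mortgage", "unit": "US", "ead": 380.0, "pd": 0.02, "lgd": 0.35},
]


def expected_loss(row):
    """Simplified stand-in for a capital measure (an assumption, not Economic Capital)."""
    return row["ead"] * row["pd"] * row["lgd"]


def report(rows, keys):
    """Sum exposure and expected loss for each combination of the given keys."""
    totals = defaultdict(lambda: {"ead": 0.0, "el": 0.0})
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group]["ead"] += row["ead"]
        totals[group]["el"] += expected_loss(row)
    return dict(totals)


# Level 1, reporting what happened: the standing report by date and product.
by_product = report(EXPOSURES, ["date", "product"])

# Level 2, analyzing why: a follow-up query that drills down into business units.
by_unit = report(EXPOSURES, ["date", "product", "unit"])

for group, totals in sorted(by_unit.items()):
    print(group, totals)
```

In this made-up data the product-level report shows total exposure falling while the loss measure rises, and the drill-down by unit points at the one unit whose Loss-Given-Default changed. Each further question is just another call with a different set of keys, which is exactly what the manual spreadsheet chain cannot do quickly.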
There are other ways to slice up the risk architecture pie, each with its own unique challenges (such as considering the various kinds of data - reference data, transactional data and market data - required to analyze risk), but for now let's just consider the problems caused by the necessity to process large data sizes.
Mapping Big Data technologies to Applications
To properly place the data technologies I've discussed in the past couple of postings in context, let's consider the high-level risk architecture below. This is admittedly very banking and credit-risk centric, but is illustrative nonetheless.
At a high level, data is taken from source systems, cleansed and transformed as appropriate and loaded into a repository of some kind. Risk applications manipulate the data in different ways (for example to calculate economic capital). Risk management users also use the information in their own processes in different ways. Finally, there is a reporting component as well where data - both raw and derived - is reported.
Let's look at the types of processing for each of these 3 components:
- Data Cleansing and Transformation: Data comes in from source systems most often as flat files, though real-time data may also be fed in from a data bus. While financial data is usually "structured" neatly into rows and columns, all manner of problems are found with such data, including data quality issues, the necessity to deal with snapshot data-sets instead of changes etc. Data files are usually textual in nature. The goal of this phase is to process large volumes of data quickly and efficiently, remove imperfections in data, and transform it into a target state suitable for analytics. Today, this sort of processing is usually done either by ETL software or by bespoke programs written in a scripting or programming language.
- Analytic Repository: This component needs to perform adequately to respond to the various stakeholders of the warehouse. The full range of functions needs to be addressed here - from simple reporting to complex ad-hoc queries from risk-model developers. Currently, this environment is not addressed as a single monolithic entity like the single database shown here. Rather, there is usually a confusing mish-mash of different systems moving data in a frenzy between themselves, with each system performing one of these specific functions. Simple reporting functions are usually performed in an RDBMS, often with a denormalized model. For ad-hoc analytics, a pattern that is often seen is to have quantitative analysts extract vast quantities of data to perform analytics using their favorite analytics package (SAS, R, SPSS etc.). A key characteristic of ad-hoc analysis is the necessity to create wide extracts of data from often independently maintained data-sets, which involves join operations. For example, fraud analysis often involves analyzing transactions a row at a time, but each row needs to contain details of the client and account (such as client risk scores), necessitating in this case a join between transactions and clients.
- Data Processing applications usually employ the same extract-and-process model as well, except that the process is "industrialized" as opposed to being ad-hoc. Most often, the derived data is used in a downstream process - for example calculated regulatory capital may be passed to a stand-alone reporting application - but is not usually brought back into the data repository and reused. Currently these applications are either vendor-supplied or internally coded.
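The "wide extract" described for fraud analysis above, where each transaction row is widened with client details, is a join that SQL can state declaratively. Here is a small sketch using Python's built-in sqlite3 module as a stand-in for a real warehouse. The table names, column names and values are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database standing in for the analytic repository (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (client_id INTEGER PRIMARY KEY, risk_score REAL);
    CREATE TABLE transactions (
        txn_id INTEGER PRIMARY KEY, client_id INTEGER, amount REAL, location TEXT
    );
    INSERT INTO clients VALUES (1, 0.10), (2, 0.85);
    INSERT INTO transactions VALUES
        (100, 1, 42.50, 'grocery'),
        (101, 2, 1.00, 'gas station'),
        (102, 2, 3000.00, 'electronics');
""")

# One row per transaction, widened with the client's risk score via a join.
wide_rows = conn.execute("""
    SELECT t.txn_id, t.amount, t.location, c.risk_score
    FROM transactions AS t
    JOIN clients AS c ON c.client_id = t.client_id
    ORDER BY t.txn_id
""").fetchall()
conn.close()

for row in wide_rows:
    print(row)
```

The query says what the extract should contain and leaves the how to the database engine. The extract-and-process pattern described above would instead pull both tables out and stitch them together inside the analytics package.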
So where do our Big Data technologies fit into this architecture?
- Relational Databases: I have predicted that RDBMS systems will largely migrate to MPP databases in the near future. Their strength is performing ad-hoc joins on large quantities of data with a simple, declarative syntax using SQL.
My prediction is that the adoption of these databases will result in a couple of trends. First, the fragmented data silos (2 in the picture above) that masquerade as risk data environments will slowly be integrated into a single risk data warehouse. Not all data will necessarily be integrated, but it will be integrated to the extent that it makes business sense, unconstrained by the technology barriers that exist today. Second, with the increasing scalability and performance of such systems, we will see dramatically less extraction of data to do ad-hoc analytics. Rather, we will see an increasing trend of moving analytics programs to the data.
- MapReduce technologies: MR technologies are good for a couple of things - they offer cheap access to massive computing power (note the italics on computing), and they are able to flexibly operate on semi-structured data (I'm emphatically not including "NoSQL" as an advantage in itself, other than as part of the previous point. Also, while High-Availability is a definite advantage of MR, in the context of this architecture it is somewhat irrelevant).
I see most use of MR technologies in 1 and 3 above. MR technologies allow programmers a lot of flexibility in writing code to manipulate data in a semi-raw form to become whatever they want it to become - this is also a key function of ETL technologies. Even today a lot of ETL is developed using bespoke programs rather than vendor ETL tools - and the programs do a lot of computation - a prime target for use of MapReduce technology. MR can also be used as a processing system, rather than a data analysis system. After all, each node of the MR system includes both CPUs and disk, and there is no reason that MR cannot be used as a system for massively parallel grid computations such as Monte Carlo simulations.
Where I don't see MR playing an important role is in ad-hoc risk analysis (2 in the diagram above). There are two reasons for this. First, MR is architecturally not suited to joining sets of data, which is by definition a requirement in risk analytics. Second, the one area where MR could be used, one-off analysis of raw datasets to identify risk factors, is most often co-opted by quants today using tools such as SAS or R. These are not folks that take kindly to changing languages and all of a sudden becoming Java fans. So while it's technically feasible to do some of this processing in MR, there are a lot of factors working against it, at least in the financial industry.
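To make the point about grid computation concrete, here is a minimal sketch of a Monte Carlo estimate split into a map step and a reduce step. It is plain single-process Python, not the Hadoop API. The function names, the task count and the quantity being estimated (the probability that a standard normal draw exceeds a threshold) are assumptions chosen only for illustration.

```python
import random
from functools import reduce


def map_task(task_id, n_paths, threshold=2.0):
    """One independent unit of work: simulate n_paths draws and count tail events.

    Each task seeds its own generator from task_id, so tasks share no state
    and could in principle run on separate nodes.
    """
    rng = random.Random(task_id)
    hits = sum(1 for _ in range(n_paths) if rng.gauss(0.0, 1.0) > threshold)
    return hits, n_paths


def reduce_task(left, right):
    """Combine two partial results by adding their hit and path counts."""
    return left[0] + right[0], left[1] + right[1]


# Map phase: here a simple loop, on a cluster this is the part that fans out.
partials = [map_task(task_id, 10_000) for task_id in range(8)]

# Reduce phase: fold the partial counts into one total.
hits, paths = reduce(reduce_task, partials)
tail_probability = hits / paths
print(f"estimated tail probability from {paths} paths: {tail_probability:.4f}")
```

The map tasks need no data from one another, which is why this kind of workload parallelizes so easily. Contrast that with a join, where rows from two datasets have to be brought together by key before any work can be done.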
So there you have it - my predictions for where RDBMS and MR technologies will fit into an overall risk architecture. What do you think? It'd be especially interesting to hear from any readers who are using MR in the risk space.
Posted by dkrishna at January 25, 2011 03:41 PM
Again an excellent job...the four part article should be made a recommended reading for finance, risk, accounting and technology professionals.
Posted by: Tony Awoga at February 12, 2011 11:21 PM