April 04, 2011
Advanced Risk Analytics
Some time ago a reader (thank you Tony!) asked me to talk about advanced analytic technology and its use in risk analytics. It has taken me longer to get to this than I would have liked, but the delay has turned out to be fortunate in a peculiar way. Recently and fortuitously, I was invited by Informatica to a discussion with Dr. Ralph Kimball (a prolific author and originator of the Kimball methodology for data warehousing) on big-data analytics (obviously a topic near to my heart - see Big Data and Risk Management). This led to insights into not only the nature of advanced analytic technology but also some interesting personnel and organizational issues, which I will discuss below.
BTW this work has led to a whitepaper by Dr. Kimball that you can find here.
Deep Analytics
Let's first talk about some use-cases that describe the topic of Big-Data Analytics in risk management. Since I feel this topic is more accurately named "Deep Analytics", I will use this term going forward.
- Retail Credit Risk Analytics: A key problem in retail banking credit risk is to understand the statistical properties of a pool of loans in order to answer questions like: how many mortgages are likely to default, and when they do, what is the likely recovery rate on those mortgages?
- Fraud Analysis: A similar use-case can be seen in the context of fraud analysis. It is useful to look at a pool of transactions (credit card transactions, for example) after the fact and identify those that are potentially fraudulent. A common pattern is for card thieves to test cards with small purchases at gas pumps before blowing out the card on large stereo equipment purchases.
- Anti-Money Laundering (AML) and Trade Surveillance: The same use-case as (2) can be applied to wire-transfers or trades, to detect patterns of non-compliance. A common money-laundering pattern is Structuring, where the perpetrator tries to get around bank reporting requirements for large cash transactions by making many smaller deposits at different branches and potentially into different accounts.
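As a rough illustration of the Structuring pattern described above, here is a minimal Python sketch. The 10,000 threshold, the three-day window, the grouping by customer and the sample deposits are all assumptions made up for the example. It is not how any particular AML system works.

```python
from collections import defaultdict
from datetime import date, timedelta

REPORTING_THRESHOLD = 10_000   # assumed reporting threshold for cash deposits
WINDOW = timedelta(days=3)     # assumed look-back window

def flag_structuring(deposits, threshold=REPORTING_THRESHOLD, window=WINDOW):
    """Return the customers whose sub-threshold cash deposits, made within
    `window` of each other, add up to at least `threshold`.

    `deposits` is an iterable of (customer_id, day, amount) tuples. Deposits
    are grouped by customer so that different branches or accounts do not
    hide the pattern.
    """
    by_customer = defaultdict(list)
    for customer_id, day, amount in deposits:
        if amount < threshold:  # larger deposits are reported anyway
            by_customer[customer_id].append((day, amount))

    flagged = set()
    for customer_id, rows in by_customer.items():
        rows.sort()
        start, running = 0, 0
        for day, amount in rows:
            running += amount
            # Drop deposits that are too old to fall inside the window.
            while day - rows[start][0] > window:
                running -= rows[start][1]
                start += 1
            if running >= threshold:
                flagged.add(customer_id)
                break
    return flagged

# Hypothetical sample data: (customer, date, cash amount).
deposits = [
    ("C1", date(2011, 3, 1), 4000),
    ("C1", date(2011, 3, 2), 3500),
    ("C1", date(2011, 3, 3), 3000),
    ("C2", date(2011, 3, 1), 2000),
    ("C2", date(2011, 3, 20), 9000),
    ("C3", date(2011, 3, 5), 15000),
]
suspects = flag_structuring(deposits)
```

Here only customer C1 is flagged: three deposits, each below the threshold, add up to 10,500 within three days. A real model would be developed on historical data rather than hand-written as a single rule.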
Note that (1) is qualitatively different from (2) and (3). In the credit risk case, one is trying to get an understanding of the statistical property of a group of detailed data. Such analysis can be useful in, say, understanding the riskiness of an asset-backed security backed by a pool of mortgages (something that has kept many fixed-income analysts busy these past few years!). This is also the kind of analysis that is required for Basel II retail risk capital calculations.
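To make the pool-level view concrete, here is a minimal Python sketch that computes a default rate, a recovery rate and a loss rate for a small pool. The `Loan` fields and the sample figures are hypothetical, and this is only descriptive arithmetic on a pool, not a Basel II capital calculation.

```python
from dataclasses import dataclass

@dataclass
class Loan:
    balance: float           # outstanding balance (exposure)
    defaulted: bool
    recovered: float = 0.0   # amount recovered after default

def pool_statistics(loans):
    """Summarise a pool of loans: default rate by count, recovery rate on
    the defaulted balance, and loss as a share of the total balance."""
    defaults = [loan for loan in loans if loan.defaulted]
    total_balance = sum(loan.balance for loan in loans)
    defaulted_balance = sum(loan.balance for loan in defaults)
    recovered = sum(loan.recovered for loan in defaults)
    return {
        "default_rate": len(defaults) / len(loans),
        "recovery_rate": recovered / defaulted_balance if defaulted_balance else None,
        "loss_rate": (defaulted_balance - recovered) / total_balance,
    }

# Hypothetical pool of four mortgages.
pool = [
    Loan(200_000, False),
    Loan(150_000, True, recovered=90_000),
    Loan(100_000, False),
    Loan(50_000, True, recovered=30_000),
]
stats = pool_statistics(pool)
```

The question in this use-case is about the pool as a whole, whereas (2) and (3) are after individual transactions.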
(2) and (3) address urgent and specific matters in that they are looking for particular transactions or trades that fit certain criteria. In these cases, there is a need to develop a model of behavior that will detect the event that is likely to happen - in the credit risk case, the goal is to detect default, while the goal of AML may be to find instances of terrorist financing. Note that while the above use-cases have been discussed in terms of after-the-fact analysis, in most cases the same models can be used to detect events before they happen. A common process is to develop these models on historical data and then apply them to transactions as they happen. For example, the fraud models are routinely applied to credit card transactions during authorizations, and potentially fraudulent transactions are refused before they even happen (a fact that I'm sure more than one reader may be aware of to their frustration).
It's clear that while the underlying theme of these three use-cases revolves around the use of statistics, the business processes they represent are in fact quite different from each other. These differences in business processes may lead to different requirements being placed on technology, which I will describe later.
Another point I want to note here is that these techniques are being used for much more than risk management. Do you want to know which of the many customers coming to your web-site is most susceptible to that great re-financing rate you have on just now? Deep analytics can tell you - in fact many banks use the very same technologies in their marketing departments as their risk management teams.
Technology Choices
Deep analytics presents an esoteric pocket of software choices. For many years vendor-driven solutions dominated this space - some examples being SAS, SPSS (now owned by IBM) and MATLAB. Significant communities of users grew up around these tools, with a lot of contributed software to address specific needs. Lately, the open-source R project has developed a legion of adherents. In contrast to its high-priced competition, R is available for free from a variety of download sites. A large number of packages have been developed to do anything from basic statistics to advanced time-series and graph analysis.
The common requirement in all these technologies is the need to use a programming language, supported by particular libraries, that is specific to the tool of choice. R, for example, is a language that allows users to do all manner of data analysis - including sophisticated graphical and statistical analysis - using extensions called packages.
The common theme with all these packages is that they typically go against relational database tables (or files organized as rows and columns) to create datasets from which they generate analyses. This is changing as well. One of the things I learned from my discussion with Dr. Kimball is that many Internet firms are using MapReduce technology to do the same kinds of statistical analysis.
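For readers who have not seen the MapReduce style, here is a minimal single-process Python sketch of the idea. The record layout and function names are hypothetical, and a real system such as Hadoop would spread the same map and reduce steps across many machines. It computes a default rate per region.

```python
from collections import defaultdict

def map_loan(record):
    """Map step: emit (key, value) pairs for one input record."""
    region, defaulted = record
    yield region, (1, 1 if defaulted else 0)

def reduce_default_rate(region, values):
    """Reduce step: combine every value emitted for one key."""
    loans = sum(count for count, _ in values)
    defaults = sum(flag for _, flag in values)
    return region, defaults / loans

def run_mapreduce(records, mapper, reducer):
    """Toy driver: map each record, group the output by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

# Hypothetical input: (region, defaulted?) for each loan.
records = [
    ("east", True), ("east", False), ("west", False),
    ("east", False), ("west", True),
]
rates = run_mapreduce(records, map_loan, reduce_default_rate)
```

Because the map step looks at one record at a time and the reduce step at one key at a time, the work can be split across machines without changing the logic.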
The two faces of Deep Analytics
One of the confusing things about Deep Analytics is that there are two very distinct phases to the process - with different requirements for data volume, timeliness and computation - which lead to very different technologies and solutions.
- First, there is a phase of Model Development, where the risk analyst uses historical data to determine the factors that can help detect the event. The main focus of this phase is on discerning patterns in the data that have predictive capability. Analysts come up with these patterns by studying records as groups of data. For example, a set of mortgage applications may be studied to build a fraud model - in this case each mortgage application will be studied along with all pertinent information, such as applicant data (FICO score, for example) and data relating to the property on which the mortgage is being requested. A key input is the historical information on whether each mortgage was fraudulent or not (known as "goods" and "bads"). Using this information, analysts can determine which data elements in the groups reliably identified a mortgage as fraudulent. Of equal importance to the model's ability to predict fraud is its ability to avoid predicting fraud for non-fraudulent mortgages - the so-called "false-positive problem".
This sort of analysis requires access to large amounts of historical data, and the ability to create "bags" of data elements that are related to each other. In relational databases, different elements (like the mortgage application and applicant information in the example above) may be held in different tables, which then require a join operation to create the "bag" of data that can be analyzed. Since relational databases have historically been inefficient at these sorts of operations (but not anymore - see here), statistical software packages have developed modules to create efficient data-sets on disk-based data stores outside the database. The recent advent of open-source, massively parallel MapReduce systems like Hadoop has yielded another alternative to these packages, in the form of commonly used languages like Java and Perl as well as new parallel-enabled data-processing languages like Pig and Hive.
Pig, in fact, has a concept called "Bag" that is explicitly designed to deliver the kind of functionality described above.
- Once a model has been developed it must be Deployed. Systems for model development share few of the characteristics required for model deployment. In development, timeliness is not critical (models may take minutes or even hours to run) since only the lone analyst is waiting on the results. In deployment, speed is of the essence, as these models are typically connected to operational systems that are waiting on the results of a model run (for example, a web page with a mortgage application would be linked to a model on the back-end to determine whether to move the application forward). Also, historical data is typically not required in deployed models except in exceptional cases. This means that the process of promoting developed models to deployment is rarely a seamless one. Often, models need to be re-programmed in a language that is better suited to a deployment system (for example, SAS or R models are re-coded in Java for deployment). This is an expensive and time-consuming process. Since the programs are being re-written, extensive testing is required to ensure that they faithfully replicate the logic of the originally developed model. Some progress has been made in this regard with automated translation programs (one that I know of is from Dulles Software). On MapReduce systems, this issue is significantly mitigated, especially for systems written directly in Java or another structured programming language.
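The model-development phase described above can be sketched in a few lines of Python: join applications to applicant records to form the "bags", then test a candidate rule against the known "goods" and "bads". The field names, the rule's cut-offs and the sample records are all hypothetical. A real model would be fitted statistically on far more data.

```python
# Hypothetical historical data, held in two separate "tables".
applications = [
    {"app_id": 1, "applicant_id": "A", "loan_to_value": 0.95, "fraud": True},
    {"app_id": 2, "applicant_id": "B", "loan_to_value": 0.80, "fraud": False},
    {"app_id": 3, "applicant_id": "C", "loan_to_value": 0.92, "fraud": False},
    {"app_id": 4, "applicant_id": "D", "loan_to_value": 0.97, "fraud": True},
    {"app_id": 5, "applicant_id": "E", "loan_to_value": 0.60, "fraud": False},
    {"app_id": 6, "applicant_id": "F", "loan_to_value": 0.85, "fraud": False},
]
applicants = {
    "A": {"fico": 580}, "B": {"fico": 720}, "C": {"fico": 600},
    "D": {"fico": 700}, "E": {"fico": 610}, "F": {"fico": 750},
}

def build_bags(applications, applicants):
    """Join each application to its applicant record to form one 'bag'."""
    return [{**app, **applicants[app["applicant_id"]]} for app in applications]

def candidate_rule(bag):
    """Hypothetical rule: low credit score combined with a high loan-to-value."""
    return bag["fico"] < 620 and bag["loan_to_value"] > 0.9

def evaluate(bags, predict):
    """Compare predictions with the known outcome of each historical record."""
    tp = sum(1 for b in bags if predict(b) and b["fraud"])
    fp = sum(1 for b in bags if predict(b) and not b["fraud"])
    fn = sum(1 for b in bags if not predict(b) and b["fraud"])
    tn = sum(1 for b in bags if not predict(b) and not b["fraud"])
    return {
        "detection_rate": tp / (tp + fn) if tp + fn else None,
        "false_positive_rate": fp / (fp + tn) if fp + tn else None,
    }

bags = build_bags(applications, applicants)
result = evaluate(bags, candidate_rule)
```

On this toy data the rule catches one of the two fraudulent applications and wrongly flags one of the four good ones, which is exactly the trade-off behind the false-positive problem.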
Data Requirements and Architectural Constructs
The specific data requirements of model development force a certain style of architectural construct. Model development requires massive amounts of historical data as well as the ability to flexibly create data-sets of bags of data. These data-sets can grow very large, and over time they typically come to hold highly redundant data, as analysts experiment on different models with largely, but not entirely, the same data. The problem is exacerbated when different analysts work on the same underlying data but don't know quite what each other is working on - this often leads to redundant data-sets of exactly the same information.
In addition, the need to join disparate sets of data and create a desired data-set usually puts immense pressure on traditional relational databases. It is not uncommon to see these large queries bring a database to a standstill and prevent it from responding to urgent operational requests. Therefore, a common practice is to create a "sand-box" copy of data into another relational database in order to serve the exclusive needs of deep analytics, creating further data redundancy. This problem of redundant data is severe in large analytic shops. Often, the disk-based data stores will be several times the size of the underlying data in the database (I've often seen SAS data-stores with 5-10X of the data in the database). Aside from the wasted disk-space and processing power required to move data from the data-base to these data-stores, there is the problem of data consistency to deal with.
The need to address the data redundancy problem has led to the development of a technique called "in-database analytics", where models are run inside database engines without the need to create external data-sets. This depends critically on the ability of databases to perform adequately, which means that this technique is only really viable for Massively-Parallel relational databases (as an example see here). And given the scalability of MPP Relational Databases, the need to duplicate data to another "sandbox" is significantly mitigated if not entirely eliminated.
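A minimal Python sketch of the in-database idea follows, assuming a hypothetical scoring rule and table layout. SQLite is used only because it ships with Python. It is not a massively parallel database, so this shows the shape of the technique and nothing about its performance.

```python
import sqlite3

def score_in_database():
    """Join and score inside the database engine, so that only the results
    leave it and no external data-set is created."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE applicants (applicant_id TEXT PRIMARY KEY, fico INTEGER);
        CREATE TABLE applications (
            app_id INTEGER PRIMARY KEY,
            applicant_id TEXT,
            loan_to_value REAL
        );
        INSERT INTO applicants VALUES ('A', 580), ('B', 720), ('C', 600);
        INSERT INTO applications VALUES (1, 'A', 0.95), (2, 'B', 0.80), (3, 'C', 0.92);
    """)
    # The (hypothetical) rule is expressed in SQL and evaluated by the engine.
    rows = conn.execute("""
        SELECT a.app_id,
               CASE WHEN p.fico < 620 AND a.loan_to_value > 0.9
                    THEN 1 ELSE 0 END AS flagged
        FROM applications AS a
        JOIN applicants AS p ON p.applicant_id = a.applicant_id
        ORDER BY a.app_id
    """).fetchall()
    conn.close()
    return rows

flags = score_in_database()
```

The join and the scoring both happen where the data lives, which is what removes the need for a separate copy.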
Note that this redundancy problem does apply to MR oriented data-stores as well, though in a somewhat indirect form. MR systems execute programs directly against data on disk, and are massively scalable, mitigating some of the need for data duplication. But the high-redundancy requirements of these systems mean that they need to persist data created in intermediate steps (such as the collections of data "bags"). I would imagine that the explosion of data with its concomitant consistency issues would be similar to the "data forests" seen in traditional analytic shops.
Conclusion
It's clear that deep analytics has a prominent place in risk management. While this trend is accelerating due to business factors, the technology landscape for handling these issues is changing in exciting ways as well. One would expect to see much more in this area, both from technology innovators - vendors and open-source - and from financial firms deploying these technologies in exciting new risk management solutions.
Posted by dkrishna at 06:28 PM