Exchange Ideas

The interactions between risk management and technology

Focus both on technology developments that can help improve risk management, but also on other aspects such as operational and potentially strategic risks caused by advances in technology

 

« Big Data and Risk Management Part 3: Map-Reduce Technology | Main | Big Data and Risk Management Part 4: Applications »

December 27, 2010

Big Data and Risk Management Part 2: Relational Databases

In a previous posting, I discussed the dual trends of big data technology and risk management requirements that are rapidly converging. The two most popular types of big-data technology vying for hearts-and-minds today are the so-called "legacy" Relational Database Management Systems (RDBMS) technology and the new-kid-on-the-block - Map-Reduce (MR) technology first developed at Google and embodied in the open-source Hadoop project. This posting will examine RDBMS systems with subsequent posts addressing MR technology.

Note: Some of the descriptions here are purposely kept simple - perhaps too simple for anyone well schooled in database concepts - in the interest of ensuring that the basics are clear to everyone. This is important when discussing the differences between RDBMS' and MR technology.

Relational Database systems have been around for some time. First described by E.F.Codd in a seminal paper in 1970, they have seen rapid progress in the past 30 years. Numerous vendors have developed RDBMS solutions to the point where they are almost ubiquitous as data storage platforms. Some of the popular database platforms include IBM DB2, Oracle, Sybase and Microsoft SQL Server. The open-source community has not been idle either, having developed MySQL and PostgreSQL to a point where this platform has achieved tremendous popularity (and ironically, themselves have achieved the un-cool "legacy" status).

Relational databases rely on a fairly simplistic way of representing data in a tabular form with data in rows and columns. Data elements in a given row is assumed to have an implicit relationship to each other, while each row contains information that is not explicitly related to any other row. This representation is called a table. Typical databases have many tables with data about different subjects. For example, a credit risk database may have a table representing counter-party data and another representing positional data. Tables can be related to each other via keys. The process of connecting data between tables with a common key is referred to as joining tables. The position table could hold a column with a counter-party ID in it, which could allow the database to join it to the customer table. In this way one can quickly extract information about counter-parties and the positions they hold in one operation. All this should be familiar to anyone who's built an Excel spreadsheet with multiple tabs - each tab is like a table in the database.

The process of asking questions of a relational database is called querying. Querying a relational database requires the use of a stylized language called Structured Query Language (shortened to SQL and pronounced "sequel"). SQL is a largely standardized language which allows users to query databases of various vendors in a consistent way (though vendors may add special-purpose extensions). The other characteristic about SQL is that it is a declarative language. To understand this, consider the previous example of counter-parties and positions. One way of getting a combined view of counter-parties and positions is to prepare a sequence of instructions thus so:

  1. First, read the counter-party table
  2. For each row in the counter-party table, find and read the corresponding row(s) in the position table by using the counter-party ID column
  3. Concatenate the positional rows to the counter-party rows
  4. Repeat till the end of the counter-party table
SQL takes a different approach. Rather than have a user tell the database how to fetch the data, the user is asked only to specify what data she wants. In SQL the same query as above would read simply "I want to see all my counter-parties, with their positions which can be found by joining with the counter-party ID". In SQL, the database does the heavy lifting of interpreting the user's request and deciding how to accomplish the task (indeed, the database may actually perform the task in the sequence described above, but this is transparent to the user asking the question). Beyond the obvious simplicity of this approach, there is another subtle but powerful benefit. Since the database software is now in charge of deciding how to go about retrieving the data, it can use any knowledge it has about how the data is organized to determine the most efficient path to achieving its goal. This process is called optimization. While performance may not be a consideration in small data sets, this becomes critical when retrieving data from the very large databases required for enterprise risk management.  One way in which databases can accelerate the query process is to use indexing technology. As the name indicates, database indexes are analogous to the ones found at the end of a book - they make searching for the right information immeasurably faster than reading through the entire book to find something you're looking for.

Traditional RDBMS technology works very well when queries are well-understood, and the result-sets are small, such as a bank branch application where tellers repetitively look-up customer and account information. On the other hand, analytically demanding applications like risk management usually push traditional RDBMS' to (and beyond) their limits. These applications usually perform complex computations across a wide range of data elements and time-periods. The queries are usually not repetitive, and often the result-sets are large.

A specialized class of RDBMS technology - Massively Parallel Processing (MPP) databases - is rapidly gaining favor to deal with these kinds of analytic queries. While MPP databases have been around for a while (pioneered by Teradata 30 years ago) they have recently received a lot of attention due to the emergence of a number of new vendors (Netezza, Oracle and Microsoft among others) in this space. The secret sauce in MPP technology is data distribution. Where traditional RDBMS technology is forced to work on a query one-step at a time because all the data is in one place, MPP technology tries to distribute data on multiple disks so it can retrieve data in parallel when required in a query (I'm simplifying somewhat but this is the gist of it). MPP databases can be hundreds of times faster (or more) than traditional databases - expect to see them more of them in your risk management future. Not only this, but MPP databases can be easily scaled to meet growing needs. By adding more units of parallelism to an MPP system, one can expect the system to keep up with growing volumes of data and queries without missing a beat.

RDBMS' in Risk Management

So where do RDBMS' feature in risk management? Pretty much everywhere, it turns out. Relational databases are at the heart of all large data-warehouses that are used for such purposes as Basel II reporting, market and liquidity risk analysis. Traditional RDBMS technologies have largely been used for this purpose. But driven by factors such as regulatory reform and a heightened need for risk management, risk databases have been challenged to become bigger and faster to a point where such databases cannot function effectively. As a result, larger financial firms are turning to MPP databases to fill their needs. These databases are required to simultaneously address several requirements:
  • Traditional regulatory and firm reporting
  • Ad-hoc data and statistical analysis for purposes such as stress-testing and cohort analysis
  • Dash-boarding
  • Serving up data to functions such as risk-based pricing and performance measurement
Tomorrow's risk databases will be required to do much more than has been asked of them so far. Given the maturity and scalability of MPP technology, it is all but inevitable that MPP RDBMS' will supplant traditional RDBMS' in the risk technology field in the years to come.

Next, I examine the implications of MapReduce technology and what it means to risk management.


Posted by dkrishna at December 27, 2010 03:54 PM

Comments

Excellent article......I would have loved if you addressed the advantages of using analytical packages such as SAS, SPSS and R over using the querying functionality (SQL) of RDBMS applications. Specifically, where do you think these analytical packages fit in data and risk management?

Posted by: Tony Awoga at January 1, 2011 03:46 AM

Tony, thanks for the interest. Great topic that I'll address in a posting in the near future.

Posted by: Dilip Krishna at January 11, 2011 11:22 AM

Post a comment




Remember Me?

(you may use HTML tags for style)