December 27, 2010
Big Data and Risk Management Part 2: Relational Databases
In a previous posting, I discussed the dual trends of big data technology and risk management requirements that are rapidly converging. The two most popular types of big-data technology vying for hearts-and-minds today are the so-called "legacy" Relational Database Management Systems (RDBMS) technology and the new-kid-on-the-block - Map-Reduce (MR) technology first developed at Google and embodied in the open-source Hadoop project. This posting will examine RDBMS systems with subsequent posts addressing MR technology.
Note: Some of the descriptions here are purposely kept simple - perhaps too simple for anyone well schooled in database concepts - in the interest of ensuring that the basics are clear to everyone. This is important when discussing the differences between RDBMS' and MR technology.
Relational Database systems have been around for some time. First described by E.F.Codd in a seminal paper in 1970, they have seen rapid progress in the past 30 years. Numerous vendors have developed RDBMS solutions to the point where they are almost ubiquitous as data storage platforms. Some of the popular database platforms include IBM DB2, Oracle, Sybase and Microsoft SQL Server. The open-source community has not been idle either, having developed MySQL and PostgreSQL to a point where this platform has achieved tremendous popularity (and ironically, themselves have achieved the un-cool "legacy" status).
Relational databases rely on a fairly simplistic way of representing data in a tabular form with data in rows and columns. Data elements in a given row is assumed to have an implicit relationship to each other, while each row contains information that is not explicitly related to any other row. This representation is called a table. Typical databases have many tables with data about different subjects. For example, a credit risk database may have a table representing counter-party data and another representing positional data. Tables can be related to each other via keys. The process of connecting data between tables with a common key is referred to as joining tables. The position table could hold a column with a counter-party ID in it, which could allow the database to join it to the customer table. In this way one can quickly extract information about counter-parties and the positions they hold in one operation. All this should be familiar to anyone who's built an Excel spreadsheet with multiple tabs - each tab is like a table in the database.
The process of asking questions of a relational database is called querying. Querying a relational database requires the use of a stylized language called Structured Query Language (shortened to SQL and pronounced "sequel"). SQL is a largely standardized language which allows users to query databases of various vendors in a consistent way (though vendors may add special-purpose extensions). The other characteristic about SQL is that it is a declarative language. To understand this, consider the previous example of counter-parties and positions. One way of getting a combined view of counter-parties and positions is to prepare a sequence of instructions thus so:
- First, read the counter-party table
- For each row in the counter-party table, find and read the corresponding row(s) in the position table by using the counter-party ID column
- Concatenate the positional rows to the counter-party rows
- Repeat till the end of the counter-party table
SQL takes a different approach. Rather than have a user tell the database
how to fetch the data, the user is asked only to specify
what data she wants. In SQL the same query as above would read simply "I want to see all my counter-parties, with their positions which can be found by joining with the counter-party ID". In SQL, the database does the heavy lifting of interpreting the user's request and deciding how to accomplish the task (indeed, the database may actually perform the task in the sequence described above, but this is transparent to the user asking the question). Beyond the obvious simplicity of this approach, there is another subtle but powerful benefit. Since the database software is now in charge of deciding
how to go about retrieving the data, it can use any knowledge it has about how the data is organized to determine the most efficient path to achieving its goal. This process is called
optimization. While performance may not be a consideration in small data sets, this becomes critical when retrieving data from the very large databases required for enterprise risk management. One way in which databases can accelerate the query process is to use indexing technology. As the name indicates, database indexes are analogous to the ones found at the end of a book - they make searching for the right information immeasurably faster than reading through the entire book to find something you're looking for.
Traditional RDBMS technology works very well when queries are well-understood, and the result-sets are small, such as a bank branch application where tellers repetitively look-up customer and account information. On the other hand, analytically demanding applications like risk management usually push traditional RDBMS' to (and beyond) their limits. These applications usually perform complex computations across a wide range of data elements and time-periods. The queries are usually not repetitive, and often the result-sets are large.
A specialized class of RDBMS technology - Massively Parallel Processing (MPP) databases - is rapidly gaining favor to deal with these kinds of analytic queries. While MPP databases have been around for a while (pioneered by Teradata 30 years ago) they have recently received a lot of attention due to the emergence of a number of new vendors (Netezza, Oracle and Microsoft among others) in this space. The secret sauce in MPP technology is
data distribution. Where traditional RDBMS technology is forced to work on a query one-step at a time because all the data is in one place, MPP technology tries to distribute data on multiple disks so it can
retrieve data in parallel when required in a query (I'm simplifying somewhat but this is the gist of it). MPP databases can be hundreds of times faster (or more) than traditional databases - expect to see them more of them in your risk management future. Not only this, but MPP databases can be easily scaled to meet growing needs. By adding more units of parallelism to an MPP system, one can expect the system to keep up with growing volumes of data and queries without missing a beat.
RDBMS' in Risk ManagementSo where do RDBMS' feature in risk management? Pretty much everywhere, it turns out. Relational databases are at the heart of all large data-warehouses that are used for such purposes as Basel II reporting, market and liquidity risk analysis. Traditional RDBMS technologies have largely been used for this purpose. But driven by factors such as regulatory reform and a heightened need for risk management, risk databases have been challenged to become bigger and faster to a point where such databases cannot function effectively. As a result, larger financial firms are turning to MPP databases to fill their needs. These databases are required to simultaneously address several requirements:
- Traditional regulatory and firm reporting
- Ad-hoc data and statistical analysis for purposes such as stress-testing and cohort analysis
- Dash-boarding
- Serving up data to functions such as risk-based pricing and performance measurement
Tomorrow's risk databases will be required to do much more than has been asked of them so far. Given the maturity and scalability of MPP technology, it is all but inevitable that MPP RDBMS' will supplant traditional RDBMS' in the risk technology field in the years to come.
Next, I examine the implications of MapReduce technology and what it means to risk management.
Posted by dkrishna at 03:54 PM
| Comments (2)
Big Data and Risk Management Part 3: Map-Reduce Technology
In Part 2 of this series, I discussed the Relational Database systems as a mature technology to store analytic data for risk management. While RDBMS are ubiquitous in this space, there is a technology paradigm shift that is now occurring to Massively Parallel Processing (MPP) databases which is going to offer enormously improvements in scalability and performance. In fact, I predict that MPP databases will largely supplant traditional RDBMS' in the field of risk management data in the coming years.
The past few years have seen the rise of what a new technology that promises to challenge this assertion, however. MapReduce technology - originally developed at Google and now developed in the open-source project Hadoop - has had breathless adherents all-but pronounce the death-knell of RDBMS technology. In this posting, I will decipher what this technology means, and how it both competes and cooperates with RDBMS technology.
MapReduce was first publicly introduced in a much-quoted paper by Dean and Ghemawat at Google in 2004. The technology had been used at Google to index very large quantities of data that form the basis of the famously fast Google search mechanism. In the next couple of years, Doug Cutting and team at Yahoo evolved these concepts into an open-source implementation dubbed Hadoop (named after Cutting's son's stuffed elephant). The technology - a top-level Apache project - has caught the imagination of geeks everywhere - the two years since 2008 have seen the adoption of Hadoop by a large and growing number of companies (see here for a long list).
How does MapReduce Work
MapReduce (or MR for short) is first and foremost a programming paradigm. MapReduce programs run in two parts. The first, or Map function, reads a set of “records”, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). The second - Reduce - function reads the key-value pairs and performs aggregation tasks. The key to MR's scalability is that there can simultaneously be many Map or Reduce jobs running. However, the MapReduce paradigm guarantees that all map outputs for the same key will be passed to the same Reduce job, no matter which Map job produced them.
To take a trivial example, consider the problem of aggregating exposures by counter-parties. Assume that the data is in multiple files, say from different systems based on region and product. One can setup a Map function to scan the file and produce output in the form of <counter-party ID, exposure>. This output will then be passed by the MR system to a reduce job, which only has to aggregate exposures for each counter-party via a simple summation function. Since a single counter-party's exposures will be sent to one and only one Reduce function, combining the outputs of all Reduce jobs will result in a exposure report by counter-party.
The power of MR is in the fact that Map jobs can all run in parallel, as can Reduce jobs. This means that large tasks that can be easily reduced into smaller concurrent jobs can be run on massively parallel arrays of processors (think 1000s of concurrent processing units) at a fraction of the time it would take to run the job in a single-threaded manner. This approach can aggressively reduce the cost of big jobs. First, given the plummeting cost of hardware (CPU, Disk, Memory) one can put together huge arrays of computing power at comparatively low cost. Second, open-source MR software (i.e. Hadoop) comes with Zero Acquisition Cost (more on the italics later). This is seductively interesting because it allows companies to play with MR technology at very low entry cost - just cobble together a bunch of servers that are lying around and install Hadoop on it to get going. No budget cycles and business cases to worry about.
While MR technology has proven to be cheap to work with, it does have some drawbacks. Zero Acquisition Cost is only a component of the cost of running a data plant. There is data center space to consider, as well as energy and resource costs of managing a large number of servers. These can be quite significant at the scale of servers that large MR plants require. This challenge is magnified due to the apparent lower performance of MR technology as compared to MPP relational databases, as discussed in a paper by Dewitt, Stonebraker et al. While Moore's law and the inevitable march of software technology will doubtless narrow the rather large gaps in performance between the two approaches, the fact remains that currently it requires upto 50 times the number of servers (especially for complex query operations like joins) to achieve like performance, with all the attendant data center and energy costs. So it does appear that reports of RDBMS' demise are very much premature, if even accurate.
The performance considerations of MR technology are very important to consider, especially in the context of the problem they were developed to solve. In the main, MR and associated technologies have been designed at Web-oriented companies to deal with the problem of indexing and searching through large volumes of textual data (such as web-server log files). The problems of risk management are qualitatively different, and this fact must be kept in mind when deciding how to use MR and related technologies.
MapReduce, NoSQL and No-Schema
A key "feature" of Hadoop is that the Map and Reduce functions are coded in Java. On the one hand, this brings great flexibility to a programmer in the logic that can be implemented. Anyone who's spent time developing analytics for any length of time has wrestled with the peculiarities of Structured Query Language (SQL). SQL is a declarative language, which is a fancy way of saying that an analyst only has to (and only can!) tell the system what data she wants, not how to produce it (for an example see here). This approach irks programmers (such as yours truly used to be, many moons ago) - often SQL produces a feeling of utter helplessness in users who can't get the data to come out exactly right. Programming a set of sequential instructions can be much more flexible, leading programmers to deride the use of SQL.
Hadoop's use of Java to perform data manipulation and retrieval tasks is viewed as a boon by Java programmers who now have a high-degree of control over what they need to the base datasets to get the information they need. No more monkeying around with SQL - any data extraction and transformation can be written simply by virtue of the fact that a low-level programming interface is being used.
Integrating data into an RDBMS requires a Schema. The Schema of a database refers to the organization of data within it. SQL depends strongly on the existence of a well-understood schema that can be queried by users who don't have to worry about the idiosyncrasies of way data is organized. The big issue with schema in an RDBMS is that data has to be first organized into the schema before it can be accessed by SQL. This process (technically known as extract-transform-load or ETL) is challenging for many reasons - most often because data from source systems is never in a neat form that can fit into the schema. This means long cycles of reviewing requirements and developing the right transformations into the appropriate schema. And even before loading data into the schema, there is a long and arduous effort required to create the schema in the first place.
For all the animated discussion sorrounding MR technology, NoSQL and "NoSchema", the fact is that these concepts have come together more by happenstance than anything else. Nothing in MR precludes the use of a particular well-understood data organization scheme (aka Schema). If the data has to be queried by a wide group of users a schema is a must, and technology to create schema (called HBase) is in fact being developed under the Hadoop project. And as for SQL, a project called Hive is actively being developed to put an SQL interface on top of Hadoop.
An important point of clarification must be made here. While the hitherto approach of totally unstructured data is being rapidly backtracked from, the "database" schemes that are being developed in systems like HBase are different from relational schema in a very important way. Relational schema cannot easily hold hierarchical data due to their flattened structure. For example, in a relational table of counter-parties, the relationship between legal entity structures of a holding company with it's subsidiaries is usually contained in a complicated series of index columns, which requires a complex set of queries to unravel. Database systems like HBase are attempts at challenging this sometimes severe limitation in SQL. Given these late-breaking developments, some industry observers have begun to re-label "NoSQL" as NOSQL as in Not-Only-SQL. The development of hierarchical databases is a very interesting trend that bears watching, though at this point it's not clear to me how such a complex schema can be effectively optimized for fast query-response.
Storm in a teacup?
My view on this is that much of the animated discussion and exalted claims of traditional NoSQL's superiority due to lack of schemas come from programmers who have not had to deal with real problems of delivering consolidated information to a wide range of business users, very few (if any) of whom have any facility in programming. These users nonetheless need tangible interaction with data and cannot suffer to have a Java programmer be a gateway to information. Like it or not, data does need to be transformed - different business lines in a large firm have divergent descriptions that need to be conformed for reporting and analysis. This transformation can either be done in the ETL layer (which takes time to do thoughtfully to ensure that the transformation applies to all users of the data) or it can be done in an MR program, where the programmer needs to only worry about his own requirements and therefore can accomplish the transformation quickly. The MR approach is seductive but false - each query will need it's own transformation. So while the first query will be done much faster than the equivalent query in a RBDMS (think hours vs. months), the second and subsequent query will take exactly the same time as the first in a NoSQL system, while it an RDBMS can churn out responses to queries in great volumes once it's built. If this reminds you have comparisons between a cottage industry and an industrial plant, you are on the right track. Artisans (aka Java programmers) can produce exquisite works of customized art without the overhead and lead time to build a factory, but once an industrial plant is completed its sheer ability to churn out large quantities of cheap, well-made goods will far outstrip the ability of even an army of artisans.
When it comes to risk management, the requirements strongly favor a factory-based approach. Risk information is primarily about two things. First, the ability to produce reports that are consistent with each other and consistent across time - the last thing a portfolio manager wants to see is reports that don't agree with each other or show unexplained divergencies from one quarter to the next. Second, risk information requires flexibility - the ability to perform all manner of ad-hoc queries to tease out potential risks or to understand the impact of imagined risk events. The consistency requirement demands a well-understood schema and strongly enforced rules on transformations so that all reports can be produced with confidence. The ad-hoc query requirement implies the ability to throw a large number of queries rapid-fire at data. Both cases, therefore favor the SQL and Schema approach. Note, however, that this does not necessarily invalidate MR technology itself, nor indeed new "NOSQL" developments like hierarchical databases.
In the last (and final) part of this series, I will lay out my ruminations on the various players in big data technology could be effectively used to address the rather significant risk management challenges that lie ahead.

Posted by dkrishna at 03:53 PM
| Comments (1)