Big Data In HPC: Back To The Future
By Guest Blogger, Steve Conway, IDC Research Vice President, High Performance Computing
Data-intensive applications for things like cryptography and weather forecasting have been part of high performance computing (HPC) since the 1950s. But the power of today’s HPC systems (large clusters and purpose-built supercomputers) has made it feasible to tackle bigger versions of familiar tasks and a host of previously intractable problems.
The challenges cover a broad spectrum, including fraud detection, anti-terrorist analysis, social and biological network analysis, semantic analysis, financial and economic modeling, drug discovery and epidemiology, weather and climate modeling, oil exploration, power grid management, and many other areas. The common denominator is that the problems are large and complex enough to require modeling/simulation using HPC resources.
Where HPC is concerned, IDC defines data-intensive (“big data”) problems broadly to include tasks involving sufficient data volumes and complexity to require HPC-based modeling/simulation. The problems can employ structured data, unstructured data, or both. They can come from traditional HPC domains in government, industry and academia, or they can be upward extensions of commercial problems that have grown large and complex enough at the high end to require HPC. In addition, “big data” can accumulate from the multiple results of iterative problem-solving methods in sectors such as manufacturing (parametric modeling) and financial services (stochastic modeling). So, small and medium-size enterprises (SMEs) are also encountering “big data” challenges.
Some problems involve “finding a needle in a haystack,” that is, locating a discrete item that already exists in a database. This style of problem-solving usually employs relational databases (RDBMS) and traditional search methods.
Other problems are more complex and involve “finding patterns in shifting sand.” Problems of this kind tend to involve unstructured data, often held in NoSQL data stores, along with newer methodologies and special software frameworks such as MapReduce and Hadoop. They involve similar tasks: pattern matching, scenario development, behavioral prediction, anomaly identification, and analysis of relationships using graphs. They’re for things like catching terrorists before they leave the airport, or catching bank fraud before the criminal gets the money, or protecting the US power grid before it crashes. Some of the powerful algorithms in this domain originated in classified government work.
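To make the MapReduce idea a little more concrete, here is a minimal sketch in plain Python of the map, group and reduce steps applied to a toy anomaly-identification task. Everything in it is an assumption made for illustration: the transaction records, the account names, the function names and the "five times the median" rule are all hypothetical, and a real deployment would run on a framework such as Hadoop across many nodes rather than in a single process.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy records: (account_id, transaction_amount).
transactions = [
    ("acct-1", 20.0), ("acct-1", 25.0), ("acct-1", 22.0), ("acct-1", 900.0),
    ("acct-2", 310.0), ("acct-2", 295.0), ("acct-2", 305.0),
]

def map_phase(records):
    """Emit (key, value) pairs; here the key is the account."""
    for account, amount in records:
        yield account, amount

def shuffle(pairs):
    """Group values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(account, amounts, factor=5.0):
    """Flag amounts more than `factor` times the account's median amount."""
    typical = median(amounts)
    return account, [a for a in amounts if a > factor * typical]

def find_anomalies(records):
    """Run map, shuffle and reduce, keeping only accounts with flagged amounts."""
    groups = shuffle(map_phase(records))
    flagged = dict(reduce_phase(k, v) for k, v in groups.items())
    return {k: v for k, v in flagged.items() if v}

if __name__ == "__main__":
    print(find_anomalies(transactions))
```

Because each account is reduced independently, the per-key work can be spread across many machines, which is what makes this style of processing a fit for very large data sets.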
The stakes can be high in relation to economic value, competitiveness or national security. Take fraud detection as an example. Business fraud detection could save millions of dollars, and government fraud detection could save billions. Recently, eBay bought supercomputers to combat fraud in the PayPal system. Italy’s big government agency, INPS, acquired a supercomputer to attack health care fraud on a national basis.
The U.S. may be heading in the same direction. The FBI estimates that 10% of transactions in federal health care programs – Medicare, Medicaid, Veterans Affairs and so forth – are fraudulent, costing about $150 billion a year. PricewaterhouseCoopers thinks it’s three times that amount. Today, the health care data is spread across five gigantic databases. As a result, no one can see all the data at once, fraud is detected after the fact, and the government recovers only about $1 billion a year.
Oak Ridge National Lab has submitted a proposal to unify all these databases and perform fraud detection using a Cray supercomputer nicknamed “Jaguar” that features 224,000 AMD Opteron™ processor cores. This solution could save $50 billion a year by analyzing the data in near-real time. The same methods could be applied to other criminal behavior, terrorist activities and many of the other applications I mentioned.
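For readers who want to see how the figures quoted above relate to one another, the arithmetic can be sketched in a few lines of Python. The numbers are taken from the text at face value and are not independently verified. The implied total also rests on an assumption not stated in the estimate itself: that the 10% share of transactions translates into a 10% share of dollars. The function and variable names are made up for this illustration.

```python
# Figures as quoted in the post, in billions of US dollars per year.
FRAUD_SHARE = 0.10          # share of transactions estimated as fraudulent
FRAUD_COST = 150.0          # estimated annual cost of that fraud
HIGHER_MULTIPLE = 3         # the second estimate is "three times that amount"
RECOVERED = 1.0             # amount currently recovered per year
PROJECTED_SAVINGS = 50.0    # savings suggested for the proposed solution

def implied_total(fraud_cost, fraud_share):
    """Total spending implied if the fraud share applies to dollars (assumption)."""
    return fraud_cost / fraud_share

def share_of(part, whole):
    """Fraction of `whole` represented by `part`."""
    return part / whole

if __name__ == "__main__":
    print("Implied total spending:", implied_total(FRAUD_COST, FRAUD_SHARE))
    print("Higher fraud estimate:", HIGHER_MULTIPLE * FRAUD_COST)
    print("Share of fraud recovered today:", share_of(RECOVERED, FRAUD_COST))
    print("Share the projected savings would cover:", share_of(PROJECTED_SAVINGS, FRAUD_COST))
```

On these inputs, current recoveries amount to well under 1% of the lower fraud estimate, while the projected savings would cover about a third of it.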
Most “big data” problems requiring HPC-level solutions will be run on large clusters, but the most daunting data-intensive problems are already going to more purpose-built HPC systems (e.g., the eBay/PayPal, INPS and U.S. federal health care examples). IDC expects this trend to continue.
In sum, “big data” has long been an important part of the HPC market, but recent technology advances have given data-intensive computing much higher potential as a horizontal market. It’s back to the future, with a new twist.
Steve Conway is Research Vice President, High Performance Computing for IDC. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.


Let’s see… 10% is $150Bn… so the combined budget for the Federal health care programs is… OMG! Preventing the $150Bn in fraudulent transactions could pay off the national deficit in about a decade!
Perhaps someone could check the source link on the $150Bn figure and verify it?