
Exploring Big Data Analytics

IDA organises industry sharing session to discuss how Big Data technologies can be harnessed to address business challenges.

A restaurant in New York was running up US$2 million to US$3 million worth of receipts each month with a credit card company. Strangely, there was no mention of the restaurant on any social media network. It turned out that no such restaurant existed, and the entire setup was a front for money laundering. Mr Martin Darling from MapR Technologies cited this fraud detection example to illustrate Big Data analytics in action at an industry sharing session on Big Data organised by the Infocomm Development Authority of Singapore (IDA) on 2 July.

Big Data is defined as any kind of data that exceeds the capacity or capabilities of traditional data storage and processing tools. For example, the data may be too large to be analysed efficiently; it may not be structured as tables (i.e. rows and columns); it may combine data from multiple sources in different formats; or it may be generated too quickly for efficient transmission and analysis. The event gave industry experts an opportunity to share how Big Data technologies provide tools to address all these challenges and enable new approaches to data collection, processing and analysis.
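To make the "multiple sources in different formats" point concrete, here is a small illustrative sketch in Python. The records below are invented for illustration; they simply show how the same business activity can arrive as structured, semi-structured and unstructured data at once.

```python
import json
import csv
import io

# Hypothetical records describing the same customer activity,
# arriving from three different sources in three different formats.

# 1. Semi-structured JSON from a web application log
json_event = json.loads('{"customer_id": 42, "action": "purchase", "amount": 18.50}')

# 2. A row from a CSV export of a point-of-sale system
csv_row = next(csv.DictReader(io.StringIO("customer_id,store,amount\n42,Orchard,18.50")))

# 3. Free text from a social media post (completely unstructured)
social_post = "Great lunch at the Orchard outlet today!"

# Traditional relational tools expect one fixed table layout;
# Big Data tooling is designed to ingest and combine all three.
print(json_event["amount"], csv_row["store"], len(social_post.split()))
```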

One of the early innovations in Big Data was the ability to manage and process large volumes of data from different sources and in different formats on a cluster of commodity hardware, that is, servers and networks that can be bought from any credible vendor. This technology, known as Hadoop, was first developed at Yahoo! in 2006, based on earlier work by Google, and was designed to handle the massive volumes of data that these Internet search giants possessed. Today, multiple commercial distributions of Hadoop are available, including MapR's, which give businesses in any industry the ability to process large volumes of data.

Mr Martin Darling: With Hadoop, you do not need to know what questions to ask or to define a fixed data structure beforehand.

The magic of Hadoop is its ability to run large data processing jobs in parallel across hundreds or thousands of servers. In traditional data analytics, data had to be extracted from a data warehouse onto an analyst's own workstation before analysis could be done. This is costly and time-consuming, and the amount of data that can be analysed is limited by the storage and processing capacity of the workstation. In Hadoop, the data is first distributed across hundreds or thousands of servers; the data processing tasks are then sent to the servers where the data resides and executed there. Instead of moving data to the analyst, the analyst moves processing instructions to the data. After the required processing or analysis is completed, only the results are sent back to the analyst. This approach allows very large data analytics jobs to be processed efficiently.
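A minimal sketch of this model is the classic MapReduce word count, expressed here as two Hadoop Streaming scripts in Python. The file names and the choice of counting words are illustrative only; the point is that each script runs next to a local slice of the data and only small summaries travel over the network.

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming runs this script on each server, against the
# block of input data stored locally on that server, and it emits
# intermediate key/value pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        # Emit "word<TAB>1" for every word seen in the local data block.
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- receives the sorted intermediate pairs and aggregates them;
# only these small summary results are sent back to the analyst.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

Hadoop Streaming ships such scripts to every node holding a block of the input and runs them there, which is exactly the "move the code to the data" pattern described above.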

Comparing Hadoop with traditional approaches to analytics, Mr Darling highlighted that, normally, only samples of data could be taken for analysis due to limitations in computing power and storage capacity. Hadoop's ability to process large data sets in parallel enables all available data to be used, and diverse data sets to be queried together to uncover new correlations and insights. In addition, Mr Darling said, “You don’t need to know what questions to ask, or define a fixed data structure beforehand. You can change your queries any time, and load in new data from different sources and formats.”
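A hedged sketch of that schema-on-read idea (the event types and field names below are invented): raw records are stored exactly as they arrive, and structure is imposed only when a question is asked, so a new question simply means a new script over the same data.

```python
import json

# Hypothetical raw event log, stored exactly as it arrived -- no schema was
# defined up front, and different events carry different fields.
raw_lines = [
    '{"type": "payment", "merchant": "cafe-12", "amount": 35.0}',
    '{"type": "review", "merchant": "cafe-12", "stars": 4}',
    '{"type": "payment", "merchant": "bar-7", "amount": 120.0}',
]

# Today's question: total payment volume per merchant.
totals = {}
for line in raw_lines:
    event = json.loads(line)  # structure is applied only at read time
    if event["type"] == "payment":
        totals[event["merchant"]] = totals.get(event["merchant"], 0) + event["amount"]

print(totals)  # tomorrow a different question can reuse the same raw data
```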

Big Data technologists have also developed other tools to address the challenges of working with different types of data. Unlike a traditional relational database that stores data in rows and columns, a new class of “Not Only SQL” or “NoSQL” databases has emerged to efficiently store and process structured and unstructured data in different formats. NoSQL databases generally fall into four categories: document databases, columnar databases, key-value databases and graph databases.
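As a rough sketch of the distinction (the data below is invented), the same shipment record might be modelled differently in each of the four categories:

```python
# Hypothetical shapes of one shipment record in the four NoSQL families.

# Document database: one self-contained, possibly nested document per record.
document = {"container": "C-001", "port": "Singapore", "items": ["rice", "tea"]}

# Key-value database: an opaque value looked up by a single key.
key_value = {"container:C-001": '{"port": "Singapore"}'}

# Columnar database: values grouped by column rather than by row, which makes
# scanning one attribute across millions of records cheap.
columnar = {"container": ["C-001", "C-002"], "port": ["Singapore", "Rotterdam"]}

# Graph database: explicit nodes and edges (relationships).
nodes = [("C-001", "Container"), ("Singapore", "Port")]
edges = [("C-001", "DEPARTED_FROM", "Singapore")]
```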

Mr Andreas Kolleger, Product and Experience Designer with Neo Technology, introduced graph databases, which are well suited to storing relationships in data. A traditional relational database stores data in multiple tables of rows and columns, which subsequently have to be joined in order to uncover relationships; joining many tables in this way can be very inefficient. A graph database, built on the mathematical notion of a graph, stores data as nodes (individual data records, such as a shipping container or a port) and edges (the relationships between records, such as which port a shipping container came from). Mr Kolleger explained, “In a logistics example, a relational database is very good at telling you how many shipping containers are at a port; a graph database is optimised to tell you very quickly where the containers have come from and where they are going to.”
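Neo Technology's graph database, Neo4j, answers such questions by traversing relationships directly rather than joining tables. Below is a rough sketch of the kind of logistics query described above, written with the Neo4j Python driver; the connection details, node labels and relationship types are assumptions made for illustration, not details from the session.

```python
from neo4j import GraphDatabase

# Hypothetical connection details and data model, for illustration only.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (origin:Port)<-[:DEPARTED_FROM]-(c:Container {id: $container_id})-[:BOUND_FOR]->(destination:Port)
RETURN origin.name AS origin, destination.name AS destination
"""

with driver.session() as session:
    for record in session.run(query, container_id="C-001"):
        # The traversal follows relationships directly, with no table joins.
        print(record["origin"], "->", record["destination"])

driver.close()
```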

Graph databases are optimised for querying relationships between large numbers of data records. According to Mr Kolleger, one of the world’s largest logistics carriers used a graph database for parcel tracking and the real-time routing of five million parcels daily.

The two presentations by Mr Darling and Mr Kolleger were well received by the participants, who gained an appreciation of how Big Data technologies can be used to address data storage, processing and analysis challenges in their organisations. By enabling new ways of storing and analysing data, Big Data technologies open new possibilities for organisations in any sector to derive deep and comprehensive insights from large, diverse and unstructured data sets.