The connected world may be shrinking by the day, but the digital universe is expanding at a mind-boggling rate. Organizations now handle data in the range of terabytes and petabytes, and it looks nothing like what relational databases (RDBMS) traditionally dealt with. New distributed databases, known by the umbrella term NoSQL, help handle this unstructured, fast-growing data efficiently.
In Part 1 of this series, we learned how to set up a Hadoop cluster on Azure HDInsight and run a Spark job to process huge volumes of data. In most practical scenarios, however, such jobs run as part of an orchestrated process or workflow, unless only one-time processing is needed. In our specific use case, we had to derive different metrics related to error patterns and usage scenarios from the log data and report them on a daily basis.
Before we delve into the interesting part, let me set the context. The problem at hand was to crunch the log data for one of our client applications, and to analyze and report on various client-defined metrics from the application logs. The application under consideration had a user base of more than 100K users, which meant millions of rows of data to process daily. Clearly, we were dealing with “big data.” Given the volume involved, we chose Spark running on an Azure HDInsight cluster to benefit from the increased performance offered by Spark’s in-memory RDDs (Resilient Distributed Datasets).
In the face of growing healthcare challenges such as an aging population, chronic diseases, and the high cost of hospitalization, wearable patient monitoring (WPM) systems create new opportunities for improving patient care.
From modish wearables that track general fitness, these systems have matured into medical-grade devices that can monitor chronic diseases and other medical conditions. Wearables fitted with advanced biosensors and integrated with a robust IoT platform for analysis and communication offer a potential solution for early detection of clinical deterioration, timely response by medical staff, and appropriate medical intervention.
Hierarchical data visualized as a collapsible tree can grow rapidly and fill the screen with nodes, compromising readability, particularly when a node has hundreds of child nodes. This article proposes a pagination mechanism in which a fixed number of nodes is displayed at a time and the user can move between previous and subsequent nodes, and describes its implementation in D3.js.
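The paging logic itself is independent of D3: given a node's children, a page index, and a page size, compute the slice to display and whether previous/next controls are needed. A minimal sketch of that idea (the function and variable names here are illustrative, not part of the D3 API):

```javascript
// Paginate a node's children: show at most pageSize children at a time,
// with flags telling the UI whether to render prev/next controls.
function paginateChildren(children, page, pageSize) {
  const start = page * pageSize;
  return {
    visible: children.slice(start, start + pageSize),
    hasPrev: page > 0,
    hasNext: start + pageSize < children.length
  };
}

// Example: a node with 7 children, shown 3 per page, viewing page 1.
const kids = ["a", "b", "c", "d", "e", "f", "g"];
const page1 = paginateChildren(kids, 1, 3);
console.log(page1.visible);            // ["d", "e", "f"]
console.log(page1.hasPrev, page1.hasNext); // true true
```

In a D3 collapsible tree, the `visible` slice would be assigned to the node's `children` property while the remaining nodes are stashed aside (in the spirit of the `_children` convention used for collapsed nodes), and clicking a prev/next control re-renders the layout with the page index updated.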
The Internet of Things is about the things around us connecting and communicating to make our lives simpler and more efficient. You can control your home from your mobile phone even at the other end of the world. This infographic depicts some of the cool things that happen in a smart home.
Insurers are turning to big data analytics to differentiate themselves in a highly commoditized insurance market and to improve risk management amid growing regulation. This move is helping the industry reap rich benefits across the value chain. In this white paper, we examine some of the emerging practices in the Property and Casualty insurance sector, particularly in relation to product pricing, underwriting, claims handling, customer relationship management, and reinsurance.
This is the last of my three-part series on MongoDB, and here we take a look at sharding.
Sharding is a form of horizontal scaling in which a data set is divided among multiple servers, or shards. Each shard is an independent subset, but together the shards form the complete logical collection defined for the application.
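To make the idea concrete, here is a toy sketch of hashed shard-key routing. The hash function and in-memory "shards" are purely illustrative assumptions (real MongoDB uses its own hashed index and chunk ranges managed by config servers), but the sketch shows the core property: each document lands on exactly one shard, and the shards together hold the whole collection.

```javascript
// Toy hashed routing: map a shard-key value to one of N shards.
// NOT MongoDB's actual hash; a stand-in for illustration only.
function pickShard(shardKeyValue, numShards) {
  let h = 0;
  for (const ch of String(shardKeyValue)) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0; // simple unsigned string hash
  }
  return h % numShards;
}

// Route documents by a hypothetical shard key, userId.
const shards = [[], [], []]; // three shards
const docs = [{ userId: "u1" }, { userId: "u2" }, { userId: "u3" }, { userId: "u4" }];
for (const doc of docs) {
  shards[pickShard(doc.userId, shards.length)].push(doc);
}

// Every document is on exactly one shard; nothing is lost or duplicated.
console.log(shards.flat().length); // 4
```

Because the routing is deterministic, any query that includes the shard key can be sent straight to the one shard holding that document, which is what makes sharded reads and writes scale horizontally.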
Data mining does not stop with data warehousing, big data analytics tools, and visualization. IT decision-makers are equally concerned with business metrics such as machine requirements, performance, cost, human resources, and ROI. Predictably, reducing IT cost figures as a recurring motive among companies doing big data. Intel’s research highlights that 20 percent of firms aim to lower IT costs, making it the third most important driving factor in big data investment.
Hadoop and assorted Apache open source projects have taken the big data space by storm as a cost-effective alternative. As Doug Cutting of Hadoop fame puts it: “Competing against open source is a tough game—everybody else is collaborating on it; the cost is zero. It’s easier to join than to fight.”