Created 03/05/2018 at 2018:10AM

This is part 1 of implementing Spark.

Big Data, and specifically the need for real-time analytics: in the beginning there was Hadoop/MapReduce, and all was well with the world, because analyzing data over some “X” timeframe was all that anyone needed… for a while. Analyzing data in real time, however, has become extremely important, and Spark fills the bill. Here’s a breakdown of how much data social media generates on its own…

As technology has scaled and grown, it has become pretty obvious that the way we process this data needs to change. My own project at Comcast revolved around Video-on-Demand and the challenge of an Oracle database growing at a rate of 7 million records per day, or about 4,861 records per minute (in 7-record blocks). Social media is the most obvious example of overwhelming data, but industries such as banking, government, healthcare, and the stock market generate massive amounts of data at any given minute.

Fundamentally, we are talking about batch processing versus real-time processing. Hadoop is based on batch processing of big data: data is stored over a period of time and then processed. Spark allows analysis of data in real time and, if that isn’t enough, operates up to 100 times faster. This primarily boils down to the fact that, while Hadoop operates on HDFS (disk-based), Spark operates in memory. A note about the “100 times faster” stat: it compares Spark running in memory against Hadoop running on HDFS, so it’s more of an apples-to-bowling-balls comparison. A true equivalency, with both running on disk (Spark on disk versus Hadoop’s HDFS), yields Spark a speed advantage of only about 10 times ( “Only,” he says. ;-) )
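To make the batch-versus-real-time distinction above concrete, here is a minimal sketch in plain Python (no Spark installation required, and the function and class names are just illustrative): a batch job waits until the whole dataset has been collected before computing a result, while a streaming job keeps an up-to-date result as each record arrives.

```python
def batch_total(records):
    """Hadoop-style batch processing: compute only after all data is collected."""
    return sum(records)

class StreamingTotal:
    """Spark-Streaming-style processing: maintain a running aggregate per record."""
    def __init__(self):
        self.total = 0

    def on_record(self, value):
        self.total += value
        return self.total  # result is available immediately, not at batch end

records = [3, 1, 4, 1, 5]

# Batch: one answer, only once every record is in.
print(batch_total(records))  # 14

# Streaming: an up-to-date answer after every record.
stream = StreamingTotal()
intermediate = [stream.on_record(v) for v in records]
print(intermediate)  # [3, 4, 8, 9, 14]
```

Both approaches arrive at the same final answer; the difference is *when* you can see it, which is exactly the latency gap that makes real-time analytics valuable.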

Other advantages of Spark include…

The next tutorial will cover Sentiment Analysis using Spark.