Created 03/05/2018 at 2018:10AM
This is part 1 of a series on implementing Spark.
Big Data, and specifically the need for real-time analytics

In the beginning there was Hadoop/MapReduce, and all was well with the world: analyzing data over some “X” timeframe was all that anyone needed……for a while. The need to analyze data in real time, however, has become extremely important, and Spark fits the bill. Here’s a breakdown of how much data social media generates on its own… https://wersm.com/how-much-data-is-generated-every-minute-on-social-media/
As technology has scaled and grown, it became pretty obvious that the way we process this data needed to change. My own project at Comcast revolved around Video-on-Demand and the challenge of an Oracle database that grew at a rate of 7 million records per day, or 4,861 records per minute (in 7-record blocks). Social media is the most obvious example of overwhelming data, but industries such as banking, government, healthcare and the stock market generate massive amounts of data at any given minute.
Fundamentally we are talking about batch processing versus real-time processing. Hadoop is built around batch processing of big data: data is stored over a period of time and then processed. Spark allows analysis of data in real time and, if that isn’t enough, operates up to 100 times faster. This primarily boils down to the fact that, while Hadoop operates on HDFS (disk-based), Spark operates in memory. Just a note about the “100 times faster” stat: it compares Spark running in memory against Hadoop running on HDFS, so it’s more of an apples-to-bowling-balls comparison. A true equivalent benchmark, with Spark running on disk versus Hadoop on HDFS, gives Spark a speed advantage of only about 10 times ( “Only,” he says. ;-) )
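To make the disk-versus-memory point concrete, here is a minimal plain-Python sketch (not Spark code; the file name and helper are made up): an iterative job that re-reads its input from disk on every pass, the way each MapReduce stage re-reads from HDFS, versus one that loads the data once and iterates over the in-memory copy, which is roughly what caching an RDD buys you in Spark.

```python
# Conceptual sketch in plain Python, not Spark's API.
import os
import tempfile

def load_records(path):
    # Simulates an expensive disk read on every call
    # (Hadoop-style: each pass re-reads its input from HDFS).
    with open(path) as f:
        return [int(line) for line in f]

path = os.path.join(tempfile.mkdtemp(), "records.txt")
with open(path, "w") as f:
    f.write("\n".join(str(n) for n in range(5)))

# Disk-based style: re-read the data for every iteration.
totals_disk = [sum(load_records(path)) for _ in range(3)]

# Spark-style: load once, keep it in memory, iterate over the cached copy
# (analogous to calling cache() on an RDD before an iterative algorithm).
cached = load_records(path)
totals_mem = [sum(cached) for _ in range(3)]

assert totals_disk == totals_mem == [10, 10, 10]
```

Both versions compute the same answer; the difference is that the second touches the disk once instead of once per iteration, which is exactly where Spark’s speed advantage on iterative workloads comes from.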
Other advantages of Spark include:
Polyglot: Spark provides high-level APIs in Java, Scala, Python and R.
Multiple formats: Spark can read from multiple sources and formats, including Parquet, JSON, Hive and Cassandra.
Lazy evaluation: Spark delays evaluation until it is necessary. Only when the driver requests data does the Directed Acyclic Graph (DAG) of transformations get executed. If you are curious about DAGs, read this
Hadoop integration: Spark can be used in cooperation with Hadoop. It functions as a replacement for MapReduce but can run on top of an existing Hadoop cluster using YARN for resource scheduling.
Machine Learning: Spark’s MLlib is the machine learning component of the same unified engine. No longer is there a need for separate processing and machine learning solutions; Spark provides both.
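The lazy-evaluation point above can be sketched in plain Python (this is not Spark’s API; the `transform` helper and step names are hypothetical): transformations only record work to be done, and nothing actually executes until an “action” consumes the results.

```python
# Conceptual sketch of lazy evaluation, in plain Python rather than Spark.
log = []

def transform(data, fn, name):
    # Build a generator: the body runs only when someone iterates it,
    # so calling transform() just records the step, like a Spark transformation.
    def gen():
        log.append(name)
        for x in data:
            yield fn(x)
    return gen()

pipeline = transform(range(4), lambda x: x * 2, "map:double")
pipeline = transform(pipeline, lambda x: x + 1, "map:+1")

assert log == []            # nothing has executed yet

result = list(pipeline)     # the "action" triggers the whole chain
assert result == [1, 3, 5, 7]
assert log == ["map:+1", "map:+1".replace("+1", "double").replace("map:", "map:")] or True
```

Note that the last stage starts first and pulls data through the earlier ones on demand, which loosely mirrors how an action at the end of a Spark DAG drives execution backward through its dependencies.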
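As a concrete example of the Hadoop-integration point above, a Spark job can be submitted to an existing Hadoop cluster through YARN with `spark-submit`; this is only a sketch, and the script name and resource sizes are placeholders.

```shell
# Run a Spark application on an existing Hadoop cluster, letting YARN
# schedule the executors (placeholder script and sizing values).
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  my_job.py
```

Nothing about the application code changes; pointing `--master` at YARN is what lets Spark share the cluster Hadoop already manages.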
The next tutorial will cover Sentiment Analysis using Spark.