According to statistics published by Databricks, the top three applications of Apache Spark are the following.
Business Intelligence (BI) is the practice of deriving presentable, actionable information that helps corporate executives, business managers, and other stakeholders make informed business decisions. Its benefits include faster decisions, better decisions, and the discovery of new opportunities.
Customer Intelligence (CI) is the information derived from customer data that an organization can use to understand customer needs and serve customers better.
Before Spark, accessing a few days of data took 24 hours. With Spark, a year’s data gets processed in a 10-minute coffee break.
Because Spark makes BI and CI this fast, businesses gain a real competitive edge over their rivals. The simplest example: understanding your customers before your rivals do is an unparalleled advantage.
Traditional data warehouses are great for structured data. But today's data is characterized by the 4 V’s (Volume, Velocity, Variety, and Veracity). It comes from various sources such as smartphones, sensors, social media, logs, and transactions. Your competitive edge lies in processing it faster. Data warehouses built on Spark SQL can address the 4 V’s and give you an edge over competitors.
Organizations receive data in real time from various sources such as sensors, mobile and IoT devices, Twitter, and online transactions. All of this data needs to be monitored and processed, so the need of the hour is a large-scale, real-time data processing capability.
Streaming ETL: data is continuously cleaned and aggregated before it is pushed to data stores. Spark Streaming is used by companies like Pinterest to provide live insight into how users are engaging with Pins across the world. Based on this, Pinterest’s recommendation engine shows more related Pins.
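To make the streaming ETL idea concrete, here is a minimal sketch in plain Python. It is not Spark code and not how Pinterest or any other company actually implements it; it only mimics the shape of the pipeline: cut the incoming stream into small batches, clean each batch, aggregate it, and push the result to a store. The event fields (`pin_id`, `country`), the function names, and the in-memory `store` dictionary are all assumptions made up for illustration.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Optional


def clean(event: dict) -> Optional[dict]:
    """Normalize one event; return None if it is malformed."""
    pin_id = event.get("pin_id")
    country = str(event.get("country") or "").strip().upper()
    if pin_id is None or not country:
        return None  # incomplete record: dropped in the clean step
    return {"pin_id": str(pin_id), "country": country}


def micro_batches(events: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Cut a (possibly unbounded) event stream into fixed-size batches."""
    it = iter(events)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def run_etl(events: Iterable[dict], store: Dict[str, int], batch_size: int = 3) -> None:
    """Clean and aggregate each batch, then push the counts to the store."""
    for batch in micro_batches(events, batch_size):
        cleaned = [e for e in map(clean, batch) if e is not None]
        counts = Counter(e["country"] for e in cleaned)
        for country, n in counts.items():
            store[country] = store.get(country, 0) + n  # load step


# Hypothetical engagement events; the field names are illustrative only.
sample_events = [
    {"pin_id": 1, "country": "us"},
    {"pin_id": 2, "country": " in "},
    {"pin_id": None, "country": "us"},  # malformed: no pin id
    {"pin_id": 3, "country": "US"},
    {"country": "de"},                  # malformed: no pin id
    {"pin_id": 4, "country": "in"},
]

store: Dict[str, int] = {}
run_etl(sample_events, store)
print(store)
```

With the sample events above, the two malformed records are discarded and the store ends up holding per-country engagement counts. In a real Spark deployment the batching, fault tolerance, and distribution across machines would be handled by the streaming engine rather than by a hand-written loop like this one.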
Other applications that use Apache Spark include recommendation engines, log processing, user-facing services, and fraud detection.