Big Data Processing with PySpark Training

Big Data Processing with PySpark Course:
PySpark is the Python API for Apache Spark, used to write Spark applications in Python. It allows data scientists to perform rapid distributed transformations on large data sets. Apache Spark is open source and uses in-memory computation. Compared with traditional MapReduce tasks, it can run tasks up to 100 times faster when computing in memory and up to 10 times faster when using disk.

Big Data Processing with PySpark Course Curriculum

1. Understanding Spark

What is Apache Spark?

Difference between Hadoop & Spark

Spark Jobs and APIs

Execution process

Resilient Distributed Dataset

DataFrames

Datasets

Catalyst Optimizer

Project Tungsten

Spark 2.0 architecture

Unifying Datasets and DataFrames

Introducing SparkSession

Tungsten phase 2

Structured Streaming

Continuous applications

2. Resilient Distributed Datasets

Internal workings of an RDD

Creating RDDs

Schema

Reading from files

Lambda expressions

Global versus local scope

Transformations

Actions and Transformations

Transformations

General transformations

Math/Statistical transformations

Set theory/relational transformations

Data structure-based transformations

map function

flatMap function

filter function

coalesce

repartition

Actions - reduce, count, collect; caching

Loading and saving data

Loading data

textFile

wholeTextFiles

Load from a JDBC Datasource

Saving RDD

Aggregations - groupByKey, reduceByKey, combineByKey, and aggregateByKey

3. DataFrames

Python to RDD communications

Catalyst Optimizer refresh

Speeding up PySpark with DataFrames

Creating DataFrames

Generating our own JSON data

Creating a DataFrame

Creating a temporary table

Simple DataFrame queries

DataFrame API query

SQL query

Interoperating with RDDs

Inferring the schema using reflection

Programmatically specifying the schema

Querying with the DataFrame API

Number of rows

Running filter statements

Querying with SQL

Number of rows

Running filter statements using the where clause

DataFrame scenario – on-time flight performance

Preparing the source datasets

Joining flight performance and airports

Visualizing our flight-performance data

Spark Dataset API

4. Prepare Data for Modeling

Checking for duplicates, missing observations, and outliers

Duplicates

Missing observations

Outliers

Getting familiar with your data

Descriptive statistics

Correlations

Visualization

Histograms

Interactions between features

Solving cases

5. Introducing MLlib

Overview of the package

Loading and transforming the data

Getting to know your data

Descriptive statistics

Correlations

Statistical testing

Creating the final dataset

Creating an RDD of LabeledPoints

Splitting into training and testing

Predicting infant survival

Logistic regression in MLlib

Selecting only the most predictable features

Random forest in MLlib

6. Introducing the ML Package

Overview of the package

Transformer

Estimators

Classification

Regression

Clustering

Pipeline

Predicting the chances of infant survival with ML

Loading the data

Creating transformers

Creating an estimator

Creating a pipeline

Fitting the model

Evaluating the performance of the model

Saving the model

Parameter hyper-tuning

Grid search

Train-validation splitting

Other features of PySpark ML in action

Feature extraction

Discretizing continuous variables

Standardizing continuous variables

Classification

Clustering

Finding clusters in the births dataset

Regression

7. Structured Streaming

What is Spark Streaming?

Why do we need Spark Streaming?

What is the Spark Streaming application data flow?

A quick primer on global aggregations

Introducing Structured Streaming

Analysis of click data

8. Packaging & Deployment of Spark Applications

The spark-submit command

Command line parameters

Deploying the app programmatically

Configuring your SparkSession

Creating SparkSession

Modularizing code

Structure of the module

Building an egg

User defined functions in Spark

Submitting a job

Monitoring execution

Frequently Asked Questions

What are the modes of training for the "Big Data Processing with PySpark" course?

This "Big Data Processing with PySpark" course is an instructor-led training (ILT). The trainer travels to your office and delivers the training on your premises. If you need a training space, we can provide a fully equipped lab with all the required facilities. Online instructor-led training is also available if required. Online training is live: the instructor's screen is visible and their voice is audible. Participants' screens are also visible, and participants can ask questions during the live session.

Will I be provided with any study material during the "Big Data Processing with PySpark" training?

Participants will be provided with study material specific to "Big Data Processing with PySpark". They will have lifetime access to all the code and resources needed for the course. Our public GitHub repository and the study material will also be shared with the participants.

What is the pedagogy of zekeLabs?

All zekeLabs courses are hands-on. The code and documents used in class are provided to the participants. A cloud lab and virtual machines are provided to every participant during the "Big Data Processing with PySpark" training.

What is the duration of this course?

The duration of the "Big Data Processing with PySpark" training varies with several factors, including the team's prior knowledge of the subject, the team's learning objectives, and whether the course needs to be customized. Contact us to learn more about the "Big Data Processing with PySpark" course duration.

What would be the venue for the "Big Data Processing with PySpark" training?

The "Big Data Processing with PySpark" training is organised at the client's premises. We have delivered and continue to deliver "Big Data Processing with PySpark" training in India, USA, Singapore, Hong Kong, and Indonesia. We also have state-of-the-art training facilities, available on client request.

Who is the trainer for "Big Data Processing with PySpark" training?

Our subject matter experts (SMEs) have more than ten years of industry experience, which helps make the learning program a well-rounded experience. The course program has been designed in close collaboration with experts working in esteemed organizations such as Google, Microsoft, Amazon, and similar companies.

Can we customize this course based on our requirements?

Yes, absolutely. For every training, we conduct a technical call between our subject matter expert (SME) and the technical lead of the team undergoing the training. The course is tailored to the current expertise of the participants, the objectives of the team undergoing the training program, and the short-term and long-term objectives of the organisation.

How can I reach out to you if I have any other queries regarding the "Big Data Processing with PySpark" course?

Email us at [email protected] or call us at +91 8041690175, and we will get back to you as soon as possible about your queries on the "Big Data Processing with PySpark" course.
