Big Data?
PySpark
A term that refers to the study and application of datasets too large or complex for traditional data-processing software.
The three V's of Big Data: Volume, Variety, and Velocity.
Clustered computing: pooling the resources of multiple machines.
Parallel computing: performing many computations simultaneously.
Distributed computing: nodes (networked computers) running jobs in parallel.
Batch processing: breaking a job into small pieces and running them on individual machines.
Real-time processing: processing data immediately as it arrives.
Hadoop/MapReduce: scalable, fault-tolerant framework for batch processing.
Apache Spark
Distributed cluster computing framework.
Efficient in-memory computations for large datasets.
Lightning-fast data processing framework.
Spark SQL.
MLlib (machine learning).
GraphX (graph processing).
Spark Streaming.
Apache Spark Core (RDD API).
Local mode: a single machine, such as your laptop; convenient for testing and debugging.
Cluster mode: a set of pre-defined machines; used in production.
Workflow: develop locally, then scale out to a cluster.
Spark itself is written in Scala; PySpark is its Python API.
PySpark's API is similar in style to pandas and scikit-learn.
Spark Shell:
spark-shell for Scala.
PySpark shell for Python.
SparkR shell for R.
SparkContext:
An entry point into the world of Spark.
A way of connecting to a Spark cluster.
Like a key to the house.
Python: in the PySpark shell, a SparkContext is created automatically and is available as the variable sc; in a standalone script it is created manually (see the sketch below).
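A minimal sketch of creating the context yourself outside the shell; the app name "myApp" and the local master URL are illustrative choices:
from pyspark import SparkContext
sc = SparkContext(master="local[*]", appName="myApp") # connect to a local Spark instance
# ... use sc here ...
sc.stop() # release the connection when finished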
Inspecting SparkContext
sc.version
sc.pythonVer
sc.master
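For example, these attributes can simply be printed from the PySpark shell (the values shown are illustrative):
print(sc.version)   # Spark version, e.g. "3.4.1"
print(sc.pythonVer) # Python version used by the driver, e.g. "3.10"
print(sc.master)    # URL of the cluster, or "local[*]" in local mode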
Loading data in PySpark
rdd = sc.parallelize([1,2,3,4,5])
rdd2 = sc.textFile("test.txt")
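A short follow-up sketch, assuming "test.txt" exists in the working directory; collect() and count() are standard RDD actions:
print(rdd.collect()) # [1, 2, 3, 4, 5]
print(rdd2.count())  # number of lines in test.txt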
Anonymous functions:
Quite efficient with map() and filter().
Like def, a lambda creates a function to be called later, but it returns the function without binding it to a name.
Lambda function syntax:
lambda arguments: expression
double = lambda x: x*2
print(double(3)) #6
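A second illustrative example: a lambda may also take several arguments:
add = lambda x, y: x + y
print(add(2, 3)) # 5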
map() applies a function to all items in the input list.
map(function, list)
items = [1,2,3,4]
list(map(lambda x:x+2, items)) # [3,4,5,6]
filter() takes a function and a list and returns a new list containing the items for which the function evaluates to true.
filter(function, list)
items = [1,2,3,4]
list(filter(lambda x: x % 2 != 0, items)) # [1,3]
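In PySpark the same lambdas are passed to the RDD transformations map() and filter(); a minimal sketch, assuming the shell's sc is available:
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x + 2).collect())         # [3, 4, 5, 6]
print(rdd.filter(lambda x: x % 2 != 0).collect()) # [1, 3]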