Big Data?
PySpark
A term that refers to the study and application of datasets too large or complex for traditional data-processing software.
The three V's of Big Data: Volume, Variety, and Velocity.
Clustered computing: pooling the resources of multiple machines.
Parallel computing: performing many computations simultaneously.
Distributed computing: nodes (networked computers) running jobs in parallel.
Batch processing: breaking a job into small pieces and running them on individual machines.
Real-time processing: processing data immediately as it arrives.
Hadoop/MapReduce: scalable, fault-tolerant framework for batch processing.
Apache Spark
Distributed cluster computing framework.
Efficient in-memory computations for large datasets.
Lightning-fast data processing framework.
Spark SQL.
MLlib (machine learning).
GraphX (graph processing).
Spark Streaming.
Apache Spark Core (RDD API).
Local mode: a single machine, such as your laptop; convenient for testing and debugging.
Cluster mode: a set of pre-defined machines; used in production.
Workflow: develop locally, then scale out to a cluster.
Spark itself is written in Scala; PySpark is its Python API.
PySpark's API is similar in style to pandas and scikit-learn.
Spark Shell:
spark-shell for Scala.
PySpark shell for Python.
SparkR shell for R.
SparkContext:
An entry point into the world of Spark.
A way of connecting to a Spark cluster.
Like a key to the house.
Python: in the PySpark shell, a SparkContext is created automatically and is available as the variable sc; in a standalone script it is created manually (see the sketch below).
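A minimal sketch of creating the context yourself outside the shell; the app name "myApp" and the local master URL are illustrative choices:
from pyspark import SparkContext
sc = SparkContext(master="local[*]", appName="myApp") # connect to a local Spark instance
# ... use sc here ...
sc.stop() # release the connection when finished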
Inspecting SparkContext
sc.version
sc.pythonVer
sc.master
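For example, these attributes can simply be printed from the PySpark shell (the values shown are illustrative):
print(sc.version)   # Spark version, e.g. "3.4.1"
print(sc.pythonVer) # Python version used by the driver, e.g. "3.10"
print(sc.master)    # URL of the cluster, or "local[*]" in local mode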
Loading data in PySpark
rdd = sc.parallelize([1,2,3,4,5])
rdd2 = sc.textFile("test.txt")
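A short follow-up sketch, assuming "test.txt" exists in the working directory; collect() and count() are standard RDD actions:
print(rdd.collect()) # [1, 2, 3, 4, 5]
print(rdd2.count())  # number of lines in test.txt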
Anonymous functions:
Quite efficient with map() and filter().
Like def, a lambda creates a function to be called later, but it returns the function without binding it to a name.
Lambda function syntax:
lambda arguments: expression
double = lambda x: x*2
print(double(3)) #6
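A second illustrative example: a lambda may also take several arguments:
add = lambda x, y: x + y
print(add(2, 3)) # 5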
map() applies a function to all items in the input list.
map(function, list)
items = [1,2,3,4]
list(map(lambda x:x+2, items)) # [3,4,5,6]
filter() takes a function and a list and returns a new list containing the items for which the function evaluates to true.
filter(function, list)
items = [1,2,3,4]
list(filter(lambda x: x % 2 != 0, items)) # [1,3]
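In PySpark the same lambdas are passed to the RDD transformations map() and filter(); a minimal sketch, assuming the shell's sc is available:
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x + 2).collect())         # [3, 4, 5, 6]
print(rdd.filter(lambda x: x % 2 != 0).collect()) # [1, 3]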