What is Spark?
Using Spark in Python.
Using DataFrames.
Spark is a platform for cluster computing.
Spread data and computations over clusters with multiple nodes (each acting as a separate computer).
Makes it easier to work with very large datasets, because each node only works with a small amount of the data.
Data processing and computation are performed in parallel over the nodes in the cluster, which makes certain types of programming tasks much faster, but also adds complexity.
Deciding whether or not to use Spark:
Is my data too big to work with on a single machine?
Can my calculations be easily parallelized?
The first step in using Spark is to connect to a remote cluster of computers, where one computer serves as the master and the others as workers.
The master manages data distribution and tasks among the workers.
To establish this connection, you create an instance of the SparkContext class, which can accept optional arguments to define cluster attributes.
These attributes can be configured using the SparkConf() constructor.
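A minimal sketch of this setup, assuming a local test cluster; the app name and master URL below are placeholder assumptions, not values from the course.
# Import SparkConf and SparkContext from pyspark
from pyspark import SparkConf, SparkContext
# Configure the cluster attributes (placeholder values)
conf = SparkConf().setAppName("example_app").setMaster("local[*]")
# Create the connection to the cluster
sc = SparkContext(conf=conf)
# Verify the connection
print(sc.version)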
Resilient Distributed Dataset (RDD).
A low-level object that lets Spark work its magic by splitting the data across multiple nodes in the cluster.
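For illustration, a minimal sketch of working with an RDD directly, assuming a SparkContext sc like the one above.
# Split a small list across the nodes of the cluster
rdd = sc.parallelize([1, 2, 3, 4, 5])
# Transform each element in parallel
squared = rdd.map(lambda x: x ** 2)
# Collect the results back to the driver
print(squared.collect())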
Spark DataFrame.
Designed to behave like a SQL table (variables in columns and observations in rows).
Create a SparkSession object from SparkContext.
SparkContext as the connection to the cluster.
SparkSession as your interface with that connection.
# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession
# Create spark (getOrCreate returns the existing SparkSession if one is already running)
spark = SparkSession.builder.getOrCreate()
# Print spark
print(spark)
# Print the tables in the catalog
print(spark.catalog.listTables())
# Query to get the first 10 rows of flights
query = "SELECT * FROM flights LIMIT 10"
# Run the query
flights10 = spark.sql(query)
# Show the results
flights10.show()
# Don't change this query
query = "SELECT origin, dest, COUNT(*) as N FROM flights GROUP BY origin, dest"
# Run the query
flight_counts = spark.sql(query)
# Convert the results to a pandas DataFrame
pd_counts = flight_counts.toPandas()
# Print the head of pd_counts
print(pd_counts.head())
# Import pandas and numpy
import pandas as pd
import numpy as np
# Create pd_temp
pd_temp = pd.DataFrame(np.random.random(10))
# Create spark_temp from pd_temp
spark_temp = spark.createDataFrame(pd_temp)
# Examine the tables in the catalog
print(spark.catalog.listTables())
# Add spark_temp to the catalog
spark_temp.createOrReplaceTempView("temp")
# Examine the tables in the catalog again
print(spark.catalog.listTables())
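Once registered, the temporary view can be queried with spark.sql like any other table in the catalog.
# Query the newly registered temp view
spark.sql("SELECT * FROM temp LIMIT 3").show()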
# Don't change this file path
file_path = "/usr/local/share/datasets/airports.csv"
# Read in the airports data
airports = spark.read.csv(file_path, header=True)
# Show the data
airports.show()
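By default every column is read in as a string; as an optional variant, you can ask Spark to guess the column types with schema inference.
# Re-read the data, letting Spark infer column types
airports_typed = spark.read.csv(file_path, header=True, inferSchema=True)
# Print the inferred schema
airports_typed.printSchema()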