What is Spark?
Using Spark in Python.
Using DataFrames.
Spark is a platform for cluster computing.
Spread data and computations over clusters with multiple nodes (each acting as a separate computer).
Makes it easier to work with very large datasets, because each node only works with a small amount of the data.
Data processing and computation are performed in parallel over the nodes in the cluster, which makes certain types of programming tasks much faster, but also adds complexity.
Deciding whether or not to use Spark:
Is my data too big to work with on a single machine?
Can my calculations be easily parallelized?
The first step in using Spark is to connect to a remote cluster of computers, where one computer serves as the master and the others as workers.
The master manages data distribution and tasks among the workers.
To establish this connection, you create an instance of the SparkContext class, which can accept optional arguments to define cluster attributes.
These attributes can be configured using the SparkConf() constructor.
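A minimal sketch of this setup, assuming a local test cluster; the app name and master URL below are placeholder assumptions, not values from the course.
# Import SparkConf and SparkContext from pyspark
from pyspark import SparkConf, SparkContext
# Configure the cluster attributes (placeholder values)
conf = SparkConf().setAppName("example_app").setMaster("local[*]")
# Create the connection to the cluster
sc = SparkContext(conf=conf)
# Verify the connection
print(sc.version)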
Resilient Distributed Dataset (RDD).
A low-level object that lets Spark work its magic by splitting the data across multiple nodes in the cluster.
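For illustration, a minimal sketch of working with an RDD directly, assuming a SparkContext sc like the one above.
# Split a small list across the nodes of the cluster
rdd = sc.parallelize([1, 2, 3, 4, 5])
# Transform each element in parallel
squared = rdd.map(lambda x: x ** 2)
# Collect the results back to the driver
print(squared.collect())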
Spark DataFrame.
Designed to behave like a SQL table (variables in columns and observations in rows).
Create a SparkSession object from SparkContext.
SparkContext as the connection to the cluster.
SparkSession as your interface with that connection.
# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession
# Create spark (getOrCreate returns the existing SparkSession if one is already running)
spark = SparkSession.builder.getOrCreate()
# Print spark
print(spark)
# Print the tables in the catalog
print(spark.catalog.listTables())
# Query to get the first 10 rows of flights
query = "SELECT * FROM flights LIMIT 10"
# Run the query
flights10 = spark.sql(query)
# Show the results
flights10.show()
# Don't change this query
query = "SELECT origin, dest, COUNT(*) as N FROM flights GROUP BY origin, dest"
# Run the query
flight_counts = spark.sql(query)
# Convert the results to a pandas DataFrame
pd_counts = flight_counts.toPandas()
# Print the head of pd_counts
print(pd_counts.head())
# Import pandas and numpy
import pandas as pd
import numpy as np
# Create pd_temp
pd_temp = pd.DataFrame(np.random.random(10))
# Create spark_temp from pd_temp
spark_temp = spark.createDataFrame(pd_temp)
# Examine the tables in the catalog
print(spark.catalog.listTables())
# Add spark_temp to the catalog
spark_temp.createOrReplaceTempView("temp")
# Examine the tables in the catalog again
print(spark.catalog.listTables())
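Once registered, the temporary view can be queried with spark.sql like any other table in the catalog.
# Query the newly registered temp view
spark.sql("SELECT * FROM temp LIMIT 3").show()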
# Don't change this file path
file_path = "/usr/local/share/datasets/airports.csv"
# Read in the airports data
airports = spark.read.csv(file_path, header=True)
# Show the data
airports.show()
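By default every column is read in as a string; as an optional variant, you can ask Spark to guess the column types with schema inference.
# Re-read the data, letting Spark infer column types
airports_typed = spark.read.csv(file_path, header=True, inferSchema=True)
# Print the inferred schema
airports_typed.printSchema()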