Getting Started with Machine Learning Pipelines

{pyspark.ml}

Overview

Machine Learning Pipelines.
Data Types.
Strings and Factors.
Test vs. Train

Details

Machine Learning Pipelines

At the core of the pyspark.ml module are the Transformer and Estimator classes.
1. Transformer classes have a .transform() method that takes a DataFrame and returns a new DataFrame; usually the original one with a new column appended. For example, you might use the class Bucketizer to create discrete bins from a continuous feature or the class PCA to reduce the dimensionality of your dataset using principal component analysis.
2. Estimator classes all implement a .fit() method. These methods also take a DataFrame, but instead of returning another DataFrame they return a model object. This can be something like a StringIndexerModel for including categorical data saved as strings in your models, or a RandomForestModel that uses the random forest algorithm for classification or regression.

Join the DataFrames

# Rename year column

planes = planes.withColumnRenamed("year","plane_year")

# Join the DataFrames

model_data = flights.join(planes, on="tailnum", how="leftouter")

Data Types

Spark only handles numeric data.
All of the columns in your DataFrame must be either integers or decimals (called 'doubles' in Spark).
.cast(): argument to pass variables to the kind of value you want to create. ("integer", "doubles"). Used inside .withColumn().

# Cast the columns to integers

model_data = model_data.withColumn("arr_delay", model_data.arr_delay.cast("integer"))

model_data = model_data.withColumn("air_time", model_data.air_time.cast("integer"))

model_data = model_data.withColumn("month", model_data.month.cast("integer"))

model_data = model_data.withColumn("plane_year", model_data.plane_year.cast("integer"))

Data Types

pyspark.ml.features: create one-hot vectors.
Encoding categorical feature:
1. Create a StringIndexer. Members of this class are Estimators that take a DataFrame with a column of strings and map each unique string to a number. Then, the Estimator returns a Transformer that takes a DataFrame, attaches the mapping to it as metadata, and returns a new DataFrame with a numeric column corresponding to the string column.
2. Encode this numeric column as a one-hot vector using a OneHotEncoder. This works exactly the same way as the StringIndexer by creating an Estimator and then a Transformer. The end result is a column that encodes your categorical feature as a vector that's suitable for machine learning routines!

# Create a StringIndexer

carr_indexer = StringIndexer(inputCol = "carrier", outputCol="carrier_index")

# Create a OneHotEncoder

carr_encoder = OneHotEncoder(inputCol="carrier_index", outputCol="carrier_fact")

Import Pipeline

# Import Pipeline

from pyspark.ml import Pipeline

# Make the pipeline

flights_pipe = Pipeline(stages=[dest_indexer, dest_encoder, carr_indexer, carr_encoder, vec_assembler])

Transform the Data

# Fit and transform the data

piped_data = flights_pipe.fit(model_data).transform(model_data)

Split the Data

# Split the data into training and test sets

training, test = piped_data.randomSplit([.6, .4])

References

https://campus.datacamp.com/courses/introduction-to-pyspark/getting-started-with-machine-learning-pipelines?ex=1

Page updated

Google Sites

Report abuse

Getting Started with Machine Learning Pipelines

Overview

Details

Machine Learning Pipelines

Join the DataFrames

Data Types

Data Types

Import Pipeline

Transform the Data

Split the Data

References

About Me: