1) Overview:
Clustering for Dataset Exploration
Efficient Iterating
Summary Statistics
Regression
Classification
Introduction to Data Preprocessing
Standardizing Data
Classification and Regression Trees
Fine-tuning Your Model
Visualization with Hierarchical Clustering and t-SNE
Selecting Features for Modeling
Interpreting Unsupervised Learning Models
Correlations and Experimental Design
Putting It All Together
Feature Engineering
Data Pre-processing and Visualization
2) Details:
Unsupervised Learning
Unsupervised Learning finds patterns in data without a specific prediction task in mind.
Ex: clustering customers by their purchases.
Ex: compressing the data using purchase patterns (dimension reduction).
K-means Clustering
Find clusters of samples.
Number of clusters must be specified.
Implemented in sklearn ("scikit-learn")
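A minimal sketch of the scikit-learn API, assuming a small illustrative 2-D feature array:

Python Code:
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])   # illustrative samples

model = KMeans(n_clusters=2, n_init=10, random_state=42)  # the number of clusters must be specified
model.fit(X)

labels = model.predict(X)          # cluster assignment for each sample
print(labels)
print(model.cluster_centers_)      # coordinates of the learned centroids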
Regression
Python Code:
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# X is the feature matrix and y the target vector, assumed already loaded.
reg = LinearRegression()
reg.fit(X, y)
predictions = reg.predict(X)

plt.scatter(X, y)            # observed data
plt.plot(X, predictions)     # fitted regression line
plt.show()
How Does It Work?
y = ax + b
y is the target and x is the feature.
a (the slope) and b (the intercept) define a candidate line; an error (loss) function measures how far that line is from the data.
Choose the line that minimizes the error function (for OLS, the sum of squared residuals), as in the sketch below.
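To make the error function concrete, here is a small illustrative computation of the OLS error (residual sum of squares) for one candidate line; the data points and the candidate a, b are made up:

Python Code:
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])     # made-up observations

a, b = 2.0, 0.0                        # one candidate slope and intercept
residuals = y - (a * x + b)            # vertical distances from the line
rss = np.sum(residuals ** 2)           # OLS picks the a, b that minimize this
print(rss)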
Regularized Regression:
Technique to avoid overfitting.
Linear regression chooses a coefficient, a, for each feature variable, plus an intercept, b.
Large coefficients can lead to overfitting.
Regularization: Penalize large coefficients.
Ridge regression: penalizes large positive or negative coefficients (an L2 penalty on the squared coefficients); see the sketch below.
α = 0 reduces to OLS (can lead to overfitting).
High α: can lead to underfitting.
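A minimal ridge sketch, using scikit-learn's built-in diabetes data as a stand-in and an illustrative alpha grid:

Python Code:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    ridge = Ridge(alpha=alpha)                   # alpha scales the penalty on large coefficients
    ridge.fit(X_train, y_train)
    print(alpha, ridge.score(X_test, y_test))    # R^2 on held-out data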
Lasso regression: penalizes large positive or negative coefficients (an L1 penalty on their absolute values).
Lasso Regression for Feature Selection:
Lasso can select important features of a dataset.
Shrinks the coefficients of less important features to zero.
Features not shrunk to zero are selected by lasso.
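A sketch of lasso-based feature selection on synthetic data; the feature names and alpha value are illustrative:

Python Code:
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))                    # three candidate features
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)  # the third feature is pure noise

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Coefficients shrunk exactly to zero mark features lasso has discarded.
for name, coef in zip(["f0", "f1", "f2"], lasso.coef_):
    print(name, round(coef, 3))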
Box-Cox Transformations:
The boxcox function performs power transformations.
It raises each value x in the dataset to a power λ (lambda): (x^λ - 1)/λ for λ ≠ 0, and ln(x) for λ = 0.
scipy.stats.boxcox(data, lmbda=...)   # note: SciPy spells the keyword lmbda, since lambda is reserved in Python
sns.pairplot(df)   # plot a matrix of distributions and pairwise scatterplots
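A minimal Box-Cox sketch; the positive-valued data array is illustrative:

Python Code:
import numpy as np
from scipy import stats

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # Box-Cox requires strictly positive values

transformed, fitted_lambda = stats.boxcox(data)    # with lmbda omitted, SciPy also fits and returns lambda
print(fitted_lambda)

transformed = stats.boxcox(data, lmbda=0.5)        # or fix the power yourself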
Standardization:
Transform continuous data to appear normally distributed.
Using non-normal training data can introduce bias.
Log normalization and feature scaling are two standardization techniques (a scaling sketch follows the list below).
When:
Features are on different scales.
The model operates in a linear space (e.g., clustering, KNN, linear regression).
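A minimal feature-scaling sketch with scikit-learn's StandardScaler; the DataFrame and its columns are illustrative:

Python Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"height_cm": [150.0, 160.0, 170.0, 180.0],
                   "weight_kg": [50.0, 60.0, 70.0, 80.0]})   # features on different scales

scaler = StandardScaler()
scaled = scaler.fit_transform(df)                 # each column now has mean 0 and std 1
print(scaled.mean(axis=0), scaled.std(axis=0))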
Log Normalization:
Useful for features with high variance.
Applies a logarithm transformation.
Natural log, using the constant e ≈ 2.718.
df["norm"] = np.log(df["unnorm"])   # assumes numpy is imported as np
Decision Tree
A data structure consisting of a hierarchy of nodes.
Node: question or prediction.
Root: no parent node, two children nodes.
Internal node: one parent node, two children nodes.
Leaf: one parent node, no children nodes => prediction.
Information Gain (IG): the reduction in impurity (e.g., Gini or entropy) achieved by a split; at each node the tree picks the split that maximizes IG.
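A minimal classification-tree sketch on scikit-learn's built-in iris data; the depth limit is illustrative:

Python Code:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)            # each split is chosen to maximize information gain
print(tree.score(X_test, y_test))     # accuracy on held-out data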
Logistic Regression: a linear model for binary classification that outputs the probability of the positive class.
Plotting The ROC Curve and ROC AUC
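A sketch covering both headings above: fit a logistic regression, then plot the ROC curve and compute ROC AUC. Scikit-learn's breast-cancer data stands in for any binary classification task:

Python Code:
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

logreg = LogisticRegression(max_iter=5000)     # higher max_iter helps convergence on unscaled data
logreg.fit(X_train, y_train)

y_prob = logreg.predict_proba(X_test)[:, 1]    # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

plt.plot([0, 1], [0, 1], "k--")                # chance line
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()

print(roc_auc_score(y_test, y_prob))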
Hyperparameter Tuning:
Ridge/lasso regression: choose alpha.
KNN: Choose n_neighbors.
GridSearchCV: exhaustively evaluates every combination in a parameter grid using cross-validation (see the sketch below).
RandomizedSearchCV: samples a fixed number of parameter settings instead; cheaper when the grid is large.
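A minimal GridSearchCV sketch tuning ridge's alpha on the diabetes data; the grid values are illustrative:

Python Code:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(Ridge(), param_grid, cv=5)   # 5-fold cross-validation for every candidate
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)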
Hierarchical Clustering with SciPy
Given samples (the array of scores) and country_names (one label per sample), build a linkage matrix and plot a dendrogram, as in the sketch below.
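A sketch of the SciPy workflow; samples and country_names follow the notes' example but the values here are illustrative:

Python Code:
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

samples = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 8.0], [5.2, 7.8]])
country_names = ["A", "B", "C", "D"]

mergings = linkage(samples, method="complete")   # agglomerative clustering, complete linkage
dendrogram(mergings, labels=country_names, leaf_rotation=90, leaf_font_size=6)
plt.show()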
Feature Selection
Select features to be used for modeling.
When:
To reduce noise.
Features are strongly statistically correlated.
To reduce overall variance.
Removing Redundant Features
Remove noisy features.
Remove correlated features.
Statistically correlated: features move together directionally.
Linear models assume feature independence, so strongly correlated features can distort them.
Check with Pearson's correlation coefficient: df.corr()
Remove duplicated features.
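A sketch of dropping strongly correlated features; the DataFrame and the 0.9 cutoff are illustrative:

Python Code:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                   "b": [2.0, 4.1, 6.0, 8.2],    # nearly a linear copy of "a"
                   "c": [4.0, 1.0, 3.0, 2.0]})

corr = df.corr().abs()                                             # Pearson correlations
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

print(to_drop)                        # columns flagged as redundant
df_reduced = df.drop(columns=to_drop)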