1) Overview:
Clustering for Dataset Exploration
Efficient Iterating
Summary Statistics
Regression
Classification
Introduction to Data Preprocessing
Standardizing Data
Classification and Regression Trees
Fine-tuning Your Model
Visualization with Hierarchical Clustering and t-SNE
Selecting Features for Modeling
Interpreting Unsupervised Learning Models
Correlations and Experimental Design
Putting It All Together
Feature Engineering
Data Pre-processing and Visualization
2) Details:
Unsupervised Learning
Unsupervised Learning finds patterns in data without a specific prediction task in mind.
Ex: clustering customers by their purchases.
Ex: compressing the data using purchase patterns (dimension reduction).
K-means Clustering
Find clusters of samples.
Number of clusters must be specified.
Implemented in sklearn ("scikit-learn")
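A minimal sketch of the scikit-learn API, assuming a small illustrative 2-D feature array:

Python Code:
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])   # illustrative samples

model = KMeans(n_clusters=2, n_init=10, random_state=42)  # the number of clusters must be specified
model.fit(X)

labels = model.predict(X)          # cluster assignment for each sample
print(labels)
print(model.cluster_centers_)      # coordinates of the learned centroids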
Regression
Python Code:
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# X is the feature matrix and y the target vector, assumed already loaded.
reg = LinearRegression()
reg.fit(X, y)
predictions = reg.predict(X)

plt.scatter(X, y)            # observed data
plt.plot(X, predictions)     # fitted regression line
plt.show()
How Does It Work?
y = ax + b
y is the target and x is the feature.
a (the slope) and b (the intercept) define a candidate line; an error (loss) function measures how far that line is from the data.
Choose the line that minimizes the error function (for OLS, the sum of squared residuals), as in the sketch below.
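To make the error function concrete, here is a small illustrative computation of the OLS error (residual sum of squares) for one candidate line; the data points and the candidate a, b are made up:

Python Code:
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])     # made-up observations

a, b = 2.0, 0.0                        # one candidate slope and intercept
residuals = y - (a * x + b)            # vertical distances from the line
rss = np.sum(residuals ** 2)           # OLS picks the a, b that minimize this
print(rss)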
Regularized Regression:
Technique to avoid overfitting.
Linear regression chooses a coefficient, a, for each feature variable, plus an intercept, b.
Large coefficients can lead to overfitting.
Regularization: Penalize large coefficients.
Ridge regression: penalizes large positive or negative coefficients (an L2 penalty on the squared coefficients); see the sketch below.
α = 0 reduces to OLS (can lead to overfitting).
High α: can lead to underfitting.
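A minimal ridge sketch, using scikit-learn's built-in diabetes data as a stand-in and an illustrative alpha grid:

Python Code:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    ridge = Ridge(alpha=alpha)                   # alpha scales the penalty on large coefficients
    ridge.fit(X_train, y_train)
    print(alpha, ridge.score(X_test, y_test))    # R^2 on held-out data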
Lasso regression: penalizes large positive or negative coefficients (an L1 penalty on their absolute values).
Lasso Regression for Feature Selection:
Lasso can select important features of a dataset.
Shrinks the coefficients of less important features to zero.
Features not shrunk to zero are selected by lasso.
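A sketch of lasso-based feature selection on synthetic data; the feature names and alpha value are illustrative:

Python Code:
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))                    # three candidate features
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)  # the third feature is pure noise

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Coefficients shrunk exactly to zero mark features lasso has discarded.
for name, coef in zip(["f0", "f1", "f2"], lasso.coef_):
    print(name, round(coef, 3))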
Box-Cox Transformations:
The boxcox function performs power transformations.
It raises each value x in the dataset to a power λ (lambda): (x^λ - 1)/λ for λ ≠ 0, and ln(x) for λ = 0.
scipy.stats.boxcox(data, lmbda=...)   # note: SciPy spells the keyword lmbda, since lambda is reserved in Python
sns.pairplot(df)   # plot a matrix of distributions and pairwise scatterplots
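A minimal Box-Cox sketch; the positive-valued data array is illustrative:

Python Code:
import numpy as np
from scipy import stats

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # Box-Cox requires strictly positive values

transformed, fitted_lambda = stats.boxcox(data)    # with lmbda omitted, SciPy also fits and returns lambda
print(fitted_lambda)

transformed = stats.boxcox(data, lmbda=0.5)        # or fix the power yourself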
Standardization:
Transform continuous data to appear normally distributed.
Using non-normal training data can introduce bias.
Log normalization and feature scaling are two standardization techniques (a scaling sketch follows the list below).
When:
Features are on different scales.
The model operates in a linear space (e.g., clustering, KNN, linear regression).
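A minimal feature-scaling sketch with scikit-learn's StandardScaler; the DataFrame and its columns are illustrative:

Python Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"height_cm": [150.0, 160.0, 170.0, 180.0],
                   "weight_kg": [50.0, 60.0, 70.0, 80.0]})   # features on different scales

scaler = StandardScaler()
scaled = scaler.fit_transform(df)                 # each column now has mean 0 and std 1
print(scaled.mean(axis=0), scaled.std(axis=0))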
Log Normalization:
Useful for features with high variance.
Applies a logarithm transformation.
Natural log, using the constant e ≈ 2.718.
df["norm"] = np.log(df["unnorm"])   # assumes numpy is imported as np
Decision Tree
A data structure consisting of a hierarchy of nodes.
Node: question or prediction.
Root: no parent node, two children nodes.
Internal node: one parent node, two children nodes.
Leaf: one parent node, no children nodes => prediction.
Information Gain (IG): the reduction in impurity (e.g., Gini or entropy) achieved by a split; at each node the tree picks the split that maximizes IG.
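A minimal classification-tree sketch on scikit-learn's built-in iris data; the depth limit is illustrative:

Python Code:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)            # each split is chosen to maximize information gain
print(tree.score(X_test, y_test))     # accuracy on held-out data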
Logistic Regression: a linear model for binary classification that outputs the probability of the positive class.
Plotting The ROC Curve and ROC AUC
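A sketch covering both headings above: fit a logistic regression, then plot the ROC curve and compute ROC AUC. Scikit-learn's breast-cancer data stands in for any binary classification task:

Python Code:
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

logreg = LogisticRegression(max_iter=5000)     # higher max_iter helps convergence on unscaled data
logreg.fit(X_train, y_train)

y_prob = logreg.predict_proba(X_test)[:, 1]    # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

plt.plot([0, 1], [0, 1], "k--")                # chance line
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()

print(roc_auc_score(y_test, y_prob))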
Hyperparameter Tuning:
Ridge/lasso regression: choose alpha.
KNN: Choose n_neighbors.
GridSearchCV: exhaustively evaluates every combination in a parameter grid using cross-validation (see the sketch below).
RandomizedSearchCV: samples a fixed number of parameter settings instead; cheaper when the grid is large.
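A minimal GridSearchCV sketch tuning ridge's alpha on the diabetes data; the grid values are illustrative:

Python Code:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(Ridge(), param_grid, cv=5)   # 5-fold cross-validation for every candidate
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)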
Hierarchical Clustering with SciPy
Given samples (the array of scores) and country_names (one label per sample), build a linkage matrix and plot a dendrogram, as in the sketch below.
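A sketch of the SciPy workflow; samples and country_names follow the notes' example but the values here are illustrative:

Python Code:
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

samples = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 8.0], [5.2, 7.8]])
country_names = ["A", "B", "C", "D"]

mergings = linkage(samples, method="complete")   # agglomerative clustering, complete linkage
dendrogram(mergings, labels=country_names, leaf_rotation=90, leaf_font_size=6)
plt.show()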
Feature Selection
Select features to be used for modeling.
When:
To reduce noise.
Features are strongly statistically correlated.
To reduce overall variance.
Removing Redundant Features
Remove noisy features.
Remove correlated features.
Statistically correlated: features move together directionally.
Linear models assume feature independence, so strongly correlated features can distort them.
Check with Pearson's correlation coefficient: df.corr()
Remove duplicated features.
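A sketch of dropping strongly correlated features; the DataFrame and the 0.9 cutoff are illustrative:

Python Code:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                   "b": [2.0, 4.1, 6.0, 8.2],    # nearly a linear copy of "a"
                   "c": [4.0, 1.0, 3.0, 2.0]})

corr = df.corr().abs()                                             # Pearson correlations
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

print(to_drop)                        # columns flagged as redundant
df_reduced = df.drop(columns=to_drop)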