1) Overview:
Understand the components that make up the ML pipeline.
Get a high-level overview of the workflow.
Understand the basic steps for implementation.
2) Details:
Problem Formulation
Start with the problem that the company believes can benefit from ML.
Ask some critical questions:
How is the task done today?
How will the business measure success?
How will the solution be used?
Do similar solutions exist that you might learn from?
What assumptions have been made?
Who are the domain experts?
Framing the Problem
Does ML or a traditional approach make more sense?
Is the problem supervised or unsupervised?
Is labeled data available for training?
Validate the use of machine learning and confirm that you have access to the right people and data.
Data Sources
Private data.
Commercial data.
Open-source data.
Data Considerations
Understand the data.
Consult a domain expert.
Evaluate data quality.
Identify features and labels.
Identify labeled data needs.
Feature Engineering
Working with your data to make it usable for training.
The process of selecting or creating the features that you will use to train your model.
Feature extraction: building up valuable information from raw data by reformatting, combining, and transforming primary features into new ones.
Feature selection: selecting the features that are most relevant and discarding the rest.
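Example (feature extraction and selection):
A minimal sketch of both steps on made-up toy data. The column names (length_m, width_m, door_color_code, area_m2), the price values, and the choice of absolute Pearson correlation as the relevance score are all assumptions for illustration; real projects often use other scoring methods.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical raw (primary) features for six houses.
raw = {
    "length_m": [10.0, 12.0, 8.0, 15.0, 9.0, 11.0],
    "width_m": [5.0, 4.0, 9.0, 6.0, 7.0, 3.0],
    "door_color_code": [1.0, 3.0, 2.0, 1.0, 3.0, 2.0],
}
# Hypothetical label: price in thousands.
price_k = [51.0, 47.0, 73.0, 91.0, 62.0, 34.0]

# Feature extraction: combine two primary features into a new one.
features = dict(raw)
features["area_m2"] = [l * w for l, w in zip(raw["length_m"], raw["width_m"])]

# Feature selection: score each feature against the label, keep the top 2.
scores = {name: abs(pearson(vals, price_k)) for name, vals in features.items()}
selected = sorted(scores, key=scores.get, reverse=True)[:2]

for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name:16s} |corr with price| = {scores[name]:.3f}")
print("selected:", selected)
```

On this toy data the extracted area feature scores far higher than either of the primary features it was built from, which is the point of feature extraction.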
Preparing Data
Encoding data.
Cleaning data.
Finding missing data.
Handling outliers.
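Example (preparing data):
A minimal sketch of the four steps above on a tiny made-up table. The rows, the field names (city, age), the outlier rule (more than 5 median absolute deviations from the median), and the choice to impute with the median are all illustrative assumptions, not a recommended default.

```python
from statistics import median

# Hypothetical raw records.
rows = [
    {"city": "Paris", "age": 34.0},
    {"city": "Lima", "age": None},      # missing value
    {"city": " paris ", "age": 29.0},   # inconsistent formatting
    {"city": "Oslo", "age": 41.0},
    {"city": "Lima", "age": 38.0},
    {"city": "Oslo", "age": 400.0},     # implausible value
]

# Cleaning data: normalize whitespace and capitalization.
cities = [r["city"].strip().title() for r in rows]
ages = [r["age"] for r in rows]

# Finding missing data: record which rows have no age.
missing_idx = [i for i, a in enumerate(ages) if a is None]

# Handling outliers: flag values far from the median, measured in
# median absolute deviations (the factor 5 is an arbitrary choice).
known = [a for a in ages if a is not None]
med = median(known)
mad = median([abs(a - med) for a in known])
outlier_idx = [
    i for i, a in enumerate(ages)
    if a is not None and abs(a - med) > 5 * mad
]

# Treat outliers like missing values and fill both with the median
# of the remaining ages.
to_fill = set(missing_idx) | set(outlier_idx)
fill_value = median([a for i, a in enumerate(ages) if i not in to_fill])
clean_ages = [fill_value if i in to_fill else a for i, a in enumerate(ages)]

# Encoding data: one-hot encode the categorical city column.
categories = sorted(set(cities))
encoded = [
    [age] + [1.0 if city == cat else 0.0 for cat in categories]
    for age, city in zip(clean_ages, cities)
]

print("columns:", ["age"] + categories)
for row in encoded:
    print(row)
```

Dropping the affected rows instead of imputing is an equally valid choice; which one fits depends on how much data you can afford to lose.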
Overfitting and Underfitting
Overfitting:
The model performs well on the training data but poorly on the evaluation data.
It essentially memorizes the training data instead of actually learning the relationship between features and labels.
Underfitting:
The model performs poorly on the training data.
It cannot capture the relationship between the input examples (often called X) and the target values (often called Y).
Balanced:
The model achieves a good trade-off between the error on the training data and the error on the evaluation data.
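Example (underfit, balanced, overfit):
A minimal sketch comparing three toy models on the same data. Everything here is an assumption for illustration: the data follows y = 2x with a fixed +1/-1 pattern standing in for random noise (so the run is deterministic), and the three models (predict the mean, least-squares line, 1-nearest-neighbor lookup) were picked only because they show the three behaviors clearly.

```python
def mse(model, xs, ys):
    """Mean squared error of a model (a function x -> prediction)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Training data: y = 2x plus an alternating +1/-1 stand-in for noise.
train_x = [float(i) for i in range(20)]
train_y = [2.0 * x + (1.0 if i % 2 == 0 else -1.0)
           for i, x in enumerate(train_x)]

# Evaluation data: same relationship, different points, and a different
# +1/-1 pattern so it is unrelated to the training "noise".
eval_x = [i + 0.25 for i in range(20)]
eval_y = [2.0 * x + (1.0 if i % 4 < 2 else -1.0)
          for i, x in enumerate(eval_x)]

def fit_mean(xs, ys):
    # Too simple: ignores X and always predicts the average Y.
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    # Least-squares straight line: matches the true relationship.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_nearest(xs, ys):
    # Memorizes the training set: returns the Y of the closest training X.
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

results = {}
for name, fit in [("underfit", fit_mean),
                  ("balanced", fit_line),
                  ("overfit", fit_nearest)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y),
                     mse(model, eval_x, eval_y))
    print(f"{name:9s} train MSE={results[name][0]:8.3f} "
          f"eval MSE={results[name][1]:8.3f}")
```

The mean predictor has a large error on both sets, the lookup model has zero training error but a clearly higher evaluation error, and the line has similar, low error on both.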
References: