StatQuest Machine Learning Foundations Skill
Apply Josh Starmer's StatQuest methodology to evaluate, compare, and explain machine learning models for any prediction or classification problem using training data, testing data, and the bias-variance tradeoff.
// TL;DR
The StatQuest Machine Learning Foundations Skill is Josh Starmer's systematic methodology for building, evaluating, and comparing machine learning models. It teaches you to split data into training and testing sets, measure model error using the sum of distances, and navigate the bias-variance tradeoff to select the model that actually performs best on unseen data. Use it whenever you need to choose between competing ML approaches, validate a model's real-world performance, or explain why a simpler model can beat a complex one. It works for both prediction (continuous) and classification (categorical) problems.
// When should I use the StatQuest Machine Learning Foundations Skill?
Use this skill whenever you need to build, compare, or explain a machine learning approach for a prediction or classification task. Especially useful when deciding between model complexity options or when someone asks how to validate a model's real-world performance.
// What inputs do I need to apply the StatQuest ML methodology?
- Problem Type (required)
Is the goal to make a Prediction (continuous output, e.g. how fast?) or a Classification (categorical output, e.g. will they like it or not)?
- Available Data (required)
The raw dataset or description of what observations and features are available.
- Candidate Methods
One or more machine learning methods to compare (e.g. a simple linear fit, a decision tree, a neural network).
// What are the core principles of the StatQuest Machine Learning approach?
Predictions and Classifications
Machine learning is fundamentally about two things: making Predictions (estimating a continuous value for new data) and making Classifications (assigning new data to a category). Every method should be evaluated on how well it does one of these two things.
Training Data vs. Testing Data
The original data used to build or fit a model is called Training Data. A separate, held-out set is called Testing Data. A model must always be evaluated on Testing Data — not the Training Data it was built on — to know if it will work in the real world.
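A minimal sketch of holding out Testing Data, using only the Python standard library. The function name `train_test_split`, the 25% test fraction, and the fixed seed are assumptions made for illustration; the source does not prescribe a particular split ratio.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and hold out a fraction as Testing Data."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                      # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)      # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical observations: 20 numbered data points
observations = list(range(20))
training_data, testing_data = train_test_split(observations)

print(len(training_data), "training points,", len(testing_data), "testing points")
```

Every observation ends up in exactly one of the two sets, so no Testing Data point can leak into model fitting.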
Sum of Distances (Error Measurement)
To compare model performance, measure the distance between each actual (real) value and the predicted value for every point in the Testing Data, then sum those distances. The model with the smaller total sum of distances wins — regardless of how fancy it is.
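As a rough illustration, the Sum of Distances can be computed in a few lines. The numbers and the model names (`line_predictions`, `squiggle_predictions`) are made up for this sketch, and each distance is taken as an absolute difference.

```python
def sum_of_distances(actual, predicted):
    """Total absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted))

# Hypothetical Testing Data values and the predictions of two candidate models
actual = [3.0, 5.0, 7.0]
line_predictions = [2.5, 5.5, 7.0]
squiggle_predictions = [1.0, 5.0, 9.5]

line_error = sum_of_distances(actual, line_predictions)          # 0.5 + 0.5 + 0.0
squiggle_error = sum_of_distances(actual, squiggle_predictions)  # 2.0 + 0.0 + 2.5

print("line:", line_error, "squiggle:", squiggle_error)
```

With these numbers the straight line totals 1.0 and the squiggle totals 4.5, so the line wins on Testing Data.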
Bias-Variance Tradeoff
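A model that fits Training Data very well but makes poor predictions on Testing Data is overfit. The Bias-Variance Tradeoff is the tension behind this: a very simple model can miss the real pattern (high bias), while a very flexible model can chase noise in the Training Data (high variance). Fitting Training Data well is not the goal; predicting Testing Data well is.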
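The gap between Training Data fit and Testing Data performance can be shown with a small sketch. Everything here is an assumption for illustration: the data points are invented, the simple model is a least-squares straight line, and the flexible model is a hypothetical "memorizer" that repeats the value of the closest training point.

```python
def fit_line(xs, ys):
    """Least-squares straight line; returns a function x -> predicted y."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_memorizer(xs, ys):
    """Flexible stand-in model: repeats the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]

def sum_of_distances(model, xs, ys):
    return sum(abs(y - model(x)) for x, y in zip(xs, ys))

# Invented data: the underlying pattern is y = x, with noise in the training set
train_x, train_y = [1, 2, 3, 4, 5], [1.2, 1.9, 3.4, 3.8, 5.3]
test_x, test_y = [1.4, 2.6, 3.4, 4.6], [1.4, 2.6, 3.4, 4.6]

results = {}
for name, fit in [("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    results[name] = (sum_of_distances(model, train_x, train_y),
                     sum_of_distances(model, test_x, test_y))
    print(name, "training:", round(results[name][0], 2),
          "testing:", round(results[name][1], 2))
```

On this data the memorizer fits the Training Data perfectly (a total of 0) but scores about 1.7 on the Testing Data, while the line scores about 1.08 on training and about 0.48 on testing. The line is the better choice despite its worse training fit.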
Fancy Name ≠ Better Performance
Regardless of what method you use, the most important thing is not how impressive or trendy it sounds, but how well it performs on Testing Data. Always let the data decide which method wins.
Decision Trees as Classification
A Decision Tree is a simple, interpretable machine learning method that classifies new data by routing it through a series of yes/no questions. It is built from Training Data and evaluated by how accurately it classifies points in Testing Data.
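The routing idea can be sketched with a tiny tree stored as nested dictionaries. This tree is written by hand purely for illustration; a real Decision Tree would be built from Training Data. The feature names (`likes_silly_songs`, `interested_in_ml`) and the four test people are hypothetical.

```python
# Hand-written tree for illustration only; not learned from data.
tree = {
    "question": "likes_silly_songs",
    "yes": "likes channel",
    "no": {
        "question": "interested_in_ml",
        "yes": "likes channel",
        "no": "does not like channel",
    },
}

def classify(node, person):
    """Follow yes/no answers until a leaf (a plain label string) is reached."""
    while isinstance(node, dict):
        node = node["yes"] if person[node["question"]] else node["no"]
    return node

# Hypothetical Testing Data: (features, actual label)
testing_data = [
    ({"likes_silly_songs": True, "interested_in_ml": False}, "likes channel"),
    ({"likes_silly_songs": False, "interested_in_ml": True}, "likes channel"),
    ({"likes_silly_songs": False, "interested_in_ml": False}, "does not like channel"),
    ({"likes_silly_songs": True, "interested_in_ml": True}, "does not like channel"),
]

misclassified = sum(classify(tree, person) != label
                    for person, label in testing_data)
print("misclassified:", misclassified, "of", len(testing_data))
```

Here the tree gets three of the four test people right; only the last person is routed to the wrong label.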
// How do you apply the StatQuest Machine Learning Foundations Skill step by step?
- 1
Define the problem type
Decide whether the task requires a Prediction (continuous output) or a Classification (categorical label). This determines which methods and error metrics are appropriate. State this explicitly before touching any data.
- 2
Identify and label your Training Data
Take the available raw dataset and designate a portion as Training Data. This is what you will use to build or fit every candidate model. Do not use Testing Data at this stage.
- 3
Identify and label your Testing Data
Hold out a separate portion of the data as Testing Data. These points must not be used during model fitting. If no split method is specified, note that principled sampling methods exist to do this fairly — do not choose arbitrarily without acknowledging the choice.
- 4
Fit each candidate method to the Training Data
Apply each candidate model (e.g. a simple line, a decision tree, a complex curve) to the Training Data. Record how well each method fits the Training Data, but do not use this as your final judgment — fitting Training Data well is not the goal.
- 5
Generate predictions on the Testing Data for each method
For every data point in the Testing Data, use each fitted model to produce a predicted value or predicted class label. Do this independently for each candidate method.
- 6
Calculate the Sum of Distances for each method
For each method, measure the distance (error) between the actual value and the predicted value for every Testing Data point. Sum all distances. For classification problems, count misclassifications. Record the total for each candidate method.
- 7
Select the method with the smallest Sum of Distances
The model with the lowest total error on Testing Data is the winner, regardless of how simple or complex it appears. Watch for overfitting: a method with a much worse Training Data fit but better Testing Data performance is still the better choice.
- 8
Explain and communicate the result plainly
Describe the winning method in plain language. Use the StatQuest approach: start with a simple, relatable example (even a silly one), explain the structure (e.g. a Decision Tree routes yes/no questions), and emphasize that the method was chosen by data, not by reputation.
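The steps above can be sketched end to end for a Prediction problem. This is one possible arrangement under stated assumptions, not a definitive implementation: the dataset is invented (roughly y = 2x + 1 with small noise), the two candidates are a "mean baseline" and a least-squares straight line, and the 3-point hold-out and seed are arbitrary choices.

```python
import random

def fit_mean(xs, ys):
    """Baseline candidate: always predicts the average training value."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Least-squares straight line candidate."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return lambda x: mean_y + slope * (x - mean_x)

# Step 1: the problem type is Prediction (continuous output).
# Invented dataset: y is roughly 2x + 1 with a little noise.
data = [(x, 2 * x + 1 + (0.2 if x % 2 == 0 else -0.2)) for x in range(12)]

# Steps 2-3: shuffle with a fixed seed, hold out 3 points as Testing Data.
shuffled = data[:]
random.Random(0).shuffle(shuffled)
testing, training = shuffled[:3], shuffled[3:]
train_x, train_y = zip(*training)

# Step 4: fit every candidate to the Training Data only.
candidates = {"mean baseline": fit_mean, "straight line": fit_line}
fitted = {name: fit(train_x, train_y) for name, fit in candidates.items()}

# Steps 5-6: predict each Testing Data point and sum the distances.
errors = {name: sum(abs(y - model(x)) for x, y in testing)
          for name, model in fitted.items()}

# Step 7: the smallest Sum of Distances wins.
winner = min(errors, key=errors.get)
print(errors, "->", winner)
```

Step 8, explaining the result in plain language, stays a human task: the printed errors give the numbers to point at when saying the method was chosen by the data.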
// What are real-world examples of the StatQuest ML methodology in action?
A team wants to predict customer churn (will a user cancel their subscription?) using account activity data.
Define the problem as Classification (churns / does not churn). Split the dataset into Training Data and Testing Data. Fit a simple Decision Tree and a more complex model to the Training Data. For each Testing Data customer, generate a churn/no-churn prediction from each model. Count the misclassifications per model; this is the classification version of the Sum of Distances. Choose the model with fewer Testing Data errors, even if the complex model fit the Training Data better, so that you do not fall into the overfitting trap.
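A small sketch of the churn comparison, with invented customers described by (logins last month, support tickets, churned). Both models are stand-ins chosen for illustration: the simple one is a single yes/no question with a hand-picked threshold, and the complex one copies the label of the most similar Training Data customer.

```python
# Invented data: (logins_last_month, support_tickets, churned)
training = [(1, 0, True), (2, 3, True), (3, 1, True), (4, 2, False),
            (6, 5, True), (8, 0, False), (10, 1, False), (12, 4, False)]
testing = [(2, 1, True), (3, 3, True), (4, 3, True),
           (7, 5, False), (9, 2, False), (11, 0, False)]

def simple_rule(customer):
    """One yes/no question; the threshold of 5 logins is hand-picked."""
    logins, tickets = customer
    return logins < 5

def make_memorizer(rows):
    """Complex stand-in: copy the label of the closest training customer."""
    def predict(customer):
        nearest = min(rows, key=lambda row: (row[0] - customer[0]) ** 2
                                            + (row[1] - customer[1]) ** 2)
        return nearest[2]
    return predict

def misclassifications(model, rows):
    """Count customers whose predicted label differs from the actual one."""
    return sum(model((logins, tickets)) != churned
               for logins, tickets, churned in rows)

models = {"simple rule": simple_rule, "memorizer": make_memorizer(training)}
report = {name: (misclassifications(model, training),
                 misclassifications(model, testing))
          for name, model in models.items()}

for name, (train_errors, test_errors) in report.items():
    print(name, "training errors:", train_errors, "testing errors:", test_errors)
```

With this data the memorizer makes 0 Training Data errors but 2 Testing Data errors, while the simple rule makes 2 Training Data errors and 0 Testing Data errors, so the simple rule is the one to choose.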
A researcher wants to predict how much a plant will grow based on the amount of fertiliser applied.
Define the problem as Prediction (continuous growth value). Designate measured plant-fertiliser pairs as Training Data; hold out additional measurements as Testing Data. Fit a simple straight line and a more complex curve to the Training Data. For each Testing Data plant, compute predicted growth from both models. Calculate the Sum of Distances for both: take the absolute difference between actual and predicted growth for each plant, then add them up. Select the model with the smaller sum, prioritising Testing Data performance over Training Data fit.
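One detail in the fertiliser example is worth checking numerically: each distance has to be counted as a positive amount, otherwise over-predictions and under-predictions cancel out. The growth figures and the two sets of predictions below are made up for this sketch.

```python
# Invented Testing Data: actual plant growth (cm) and two models' predictions
actual_growth = [10.0, 12.0, 14.0]
model_a = [13.0, 12.0, 11.0]   # off by 3 cm in opposite directions
model_b = [10.5, 11.5, 14.5]   # off by 0.5 cm each time

def signed_sum(actual, predicted):
    """Plain 'actual minus predicted' total; errors can cancel out."""
    return sum(a - p for a, p in zip(actual, predicted))

def sum_of_distances(actual, predicted):
    """Total of absolute differences; every miss counts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))

for name, predictions in [("model A", model_a), ("model B", model_b)]:
    print(name,
          "signed:", signed_sum(actual_growth, predictions),
          "absolute:", sum_of_distances(actual_growth, predictions))
```

The signed total makes model A look perfect (0.0) even though it misses two plants by 3 cm each; the absolute total gives 6.0 for model A and 1.5 for model B, which correctly favours model B.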
// What mistakes should I avoid when applying the StatQuest ML methodology?
- Judging a model by how well it fits Training Data instead of Testing Data. This is the overfitting trap that the Bias-Variance Tradeoff warns about.
- Choosing a model because it has an impressive or trendy name (e.g. 'deep learning', 'neural network') rather than because it performs best on Testing Data.
- Arbitrarily deciding which data points go into Training Data vs. Testing Data without using a principled sampling method.
- Skipping the Testing Data step entirely and assuming Training Data performance predicts real-world performance.
- Ignoring simple methods (like a straight line or a basic Decision Tree) in favour of complexity — simpler models often win on Testing Data.
// What key terms do I need to know for the StatQuest Machine Learning Foundations Skill?
- Training Data
- The original raw data used to build or fit a machine learning model. All models are trained on this data, but it must not be used as the final measure of model quality.
- Testing Data
- A held-out dataset, separate from Training Data, used exclusively to evaluate how well a fitted model makes predictions or classifications on new, unseen data.
- Prediction
- One of the two core outputs of machine learning — estimating a continuous value for a new data point (e.g. how fast will this person run?).
- Classification
- One of the two core outputs of machine learning — assigning a new data point to a discrete category or label (e.g. will this person like the channel or not?).
- Decision Tree
- A simple machine learning method that classifies or predicts by routing data through a series of yes/no questions. Built from Training Data and evaluated on Testing Data.
- Sum of Distances
- The total error of a model on Testing Data, calculated by summing the distances between each actual value and its predicted value. The primary metric for comparing candidate models.
- Bias-Variance Tradeoff
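- The tension between a model that is too simple to capture the real pattern (high bias) and one that is so flexible it fits noise in the Training Data (high variance). A model that fits Training Data very well but performs poorly on Testing Data is overfit, a critical failure mode to watch for when selecting a model.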
// FREQUENTLY ASKED QUESTIONS
What is the StatQuest Machine Learning Foundations Skill?
It is a structured methodology based on Josh Starmer's StatQuest teaching approach for evaluating, comparing, and explaining machine learning models. The skill focuses on splitting data into training and testing sets, measuring prediction error via the sum of distances, and using the bias-variance tradeoff to select the model that performs best on unseen data — not the one that looks most impressive.
What is the bias-variance tradeoff in machine learning?
The bias-variance tradeoff is the tension between two sources of error. A high-bias model is too simple and misses patterns; a high-variance model is too complex and memorizes noise, so it fits training data very well but performs poorly on testing data (overfitting). The goal is to find the sweet spot where testing data error is minimized, even if training data fit isn't perfect.
How do I split data into training and testing sets?
Designate a portion of your raw dataset as training data to build your models, and hold out a separate portion as testing data to evaluate them. The testing data must never be used during model fitting. Use a principled sampling method, such as random sampling or stratified sampling (which keeps class proportions similar in both sets), rather than arbitrary selection, so your evaluation reflects real-world performance fairly.
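For a classification dataset, one way to sample fairly is a stratified split, which shuffles each class separately so both sets keep similar class proportions. The sketch below uses only the standard library; the function name `stratified_split`, the 25% test fraction, and the invented customer list are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, test_fraction=0.25, seed=0):
    """Hold out test_fraction of each class so proportions are preserved."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)                         # random within each class
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Invented customers: 8 who churned and 32 who stayed
customers = [("c%d" % i, "churn" if i < 8 else "stay") for i in range(40)]
train_rows, test_rows = stratified_split(customers, label_of=lambda row: row[1])

churn_in_test = sum(1 for _, label in test_rows if label == "churn")
print(len(train_rows), "training rows,", len(test_rows), "testing rows,",
      churn_in_test, "churners in testing")
```

With 8 churners out of 40 customers, the testing set receives 2 churners and 8 stayers, matching the 20% churn rate of the full dataset.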
How do I compare two machine learning models fairly?
Fit both models to the same training data, then generate predictions on the same held-out testing data. For each model, calculate the sum of distances — the total error between actual and predicted values for every testing data point. The model with the smaller total error wins, regardless of complexity. Never compare models based solely on how well they fit training data.
How does the StatQuest approach compare to just picking the most popular algorithm?
The StatQuest approach lets testing data performance decide the winner, while picking the most popular algorithm relies on reputation or trends. A simple linear model can outperform a neural network on a given dataset. The StatQuest methodology forces you to measure actual error on unseen data, preventing the common mistake of choosing a method because it has a fancy name rather than proven results.
When should I use the StatQuest Machine Learning Foundations Skill?
Use it whenever you need to build, compare, or explain a machine learning approach for any prediction or classification task. It is especially useful when deciding between models of different complexity, validating that a model will work in the real world, or explaining your model choice to non-technical stakeholders in plain language.
What results can I expect after applying the StatQuest methodology?
You can expect a clearly justified model selection backed by testing data performance, not guesswork. You will have quantified error metrics for each candidate model, a transparent rationale for why the winning model was chosen, and a plain-language explanation of how the model works. This reduces the risk of deploying an overfit model that fails on real-world data.
What is the sum of distances in machine learning model evaluation?
The sum of distances is the total error of a model on testing data, calculated by summing the absolute differences between each actual value and each predicted value across all testing data points. For classification problems, it's the count of misclassifications. It serves as the primary metric for comparing candidate models — the lower the sum, the better the model.
What's the difference between prediction and classification in machine learning?
Prediction estimates a continuous numerical value for new data, such as how fast someone will run or how much a plant will grow. Classification assigns new data to a discrete category, such as whether a customer will churn or not. In this methodology every problem is treated as one of these two types, and correctly identifying which one you're solving determines your methods and error metrics.
Why do simple models sometimes beat complex models in machine learning?
Simple models often beat complex models because complex models are more prone to overfitting — they memorize noise in the training data rather than learning the true underlying pattern. When evaluated on testing data, the overfit complex model performs worse. The StatQuest methodology explicitly guards against this by always judging models on testing data performance, not training data fit.