Wednesday, January 22, 2025

Enhance Stability and Reduce Overfitting

Bagging is an ensemble machine learning (ML) technique that improves the consistency of predictive models. This guide describes how bagging works, discusses its advantages, challenges, and applications, and compares it to related techniques like boosting.

What is bagging?

Bagging (or, more formally, bootstrap aggregating) is an ensemble learning technique that improves output accuracy by using multiple similar ML models. At its core, ensemble learning combines several models to achieve better performance than any individual model.

The approach involves drawing random samples from the training data, with replacement, and training a separate model on each sample. For new inputs, predictions from all models are aggregated to produce a final output. Because each model sees a different random sample, the models make different errors, and those errors tend to cancel out when the predictions are combined, resulting in more consistent predictions.

Bagging is particularly effective at improving consistency by minimizing the variance of the ML system.

Variance vs. bias

Reducing bias and variance are fundamental goals of any ML model or system.

Bias describes the errors an ML system makes because of its assumptions about the data it sees. It’s usually determined by calculating how wrong the model is on average. Variance measures how sensitive the model is to the particular data it was trained on. It’s estimated by checking how much the model’s outputs change when it is trained on different samples of the data.

High bias

As an example, let’s consider the problem of predicting a house’s sale price from its features (such as square footage and number of bedrooms). A simple model may make a lot of simplifying assumptions and only look at square footage, causing it to have a high bias. It will consistently get things wrong, even on the training data, because reality is more complicated than its assumptions. So it’s just unable to pick up on the real price predictors (such as location, school quality, and number of bedrooms).

High variance

A more complex model may pick up on every trend in the training data and have high variance. For example, this model may find a tiny correlation between house number (essentially the numeric part of a street address) and price in the training data and use it, even though it’s not an actual predictor. It will do well on the training data but poorly on real-world data.

The variance-bias tradeoff

An ideal model would have low bias and low variance, generating the correct outputs consistently across similar inputs. High bias usually results from the model being too simple to capture the patterns in the training data—underfitting. High variance usually results from the model capturing spurious patterns in the training data—overfitting.

Increasing a model’s sophistication can allow it to capture more patterns, leading to lower bias. However, this more sophisticated model will tend to overfit the training data, leading to higher variance. Conversely, simplifying a model tends to lower its variance but raise its bias. In practice, a well-balanced bias-variance trade-off is hard to attain.
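The trade-off can be written down with the standard decomposition of expected squared error. The symbols are not from this article and are assumed for illustration: y = f(x) + ε is the true target with zero-mean noise of variance σ², f̂(x) is the model’s prediction, and the expectations are taken over randomly drawn training sets.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The first term shrinks as a model becomes more flexible, and the second term tends to grow, which is why lowering one often raises the other.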

Bagging focuses on reducing variance. Each model in the ensemble may have high variance because it overfits its dataset. But since each model gets a randomized dataset, they’ll discover different spurious patterns. In the house price example, one model might overvalue houses with even numbers, another might undervalue them, and most might ignore house numbers entirely.

These arbitrary patterns tend to average out when we average their predictions, leaving us with the true underlying relationships. The ensemble thus achieves lower variance and reduced overfitting compared to any individual model.
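A small simulation can make the averaging effect concrete. This is a minimal sketch under strong assumptions that are not from the article: each "model" is a stand-in that predicts the true price plus independent random noise, and the price, noise level, and ensemble size are made-up values. Real bagged models are correlated with one another, so the reduction in practice is smaller than what this idealized setup shows.

```python
import random
import statistics

TRUE_PRICE = 300_000.0   # hypothetical true sale price
NOISE_SD = 30_000.0      # hypothetical spread of one overfit model
N_MODELS = 25            # hypothetical ensemble size

def single_model_prediction(rng):
    """Stand-in for one overfit model: right on average, but noisy."""
    return TRUE_PRICE + rng.gauss(0.0, NOISE_SD)

def ensemble_prediction(rng, n_models=N_MODELS):
    """Average the predictions of n_models independent noisy models."""
    predictions = [single_model_prediction(rng) for _ in range(n_models)]
    return statistics.fmean(predictions)

rng = random.Random(0)
single = [single_model_prediction(rng) for _ in range(2000)]
bagged = [ensemble_prediction(rng) for _ in range(2000)]

single_sd = statistics.stdev(single)
bagged_sd = statistics.stdev(bagged)
print(f"spread of one model:        {single_sd:,.0f}")
print(f"spread of 25-model average: {bagged_sd:,.0f}")
```

With fully independent models, the standard deviation of the average shrinks by a factor of about the square root of the number of models, so roughly 5 here.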

Bagging vs. boosting

You may hear bagging talked about in the same context as boosting. These are the most common ensemble learning techniques and underpin many popular ML models. Boosting is a technique where models are trained on the errors of previous models. Then this group of models is used to respond to any inputs. Let’s discuss the differences between the two techniques further.

Aspect | Bagging | Boosting
Model training | Models are trained in parallel on different subsets of data | Models are trained sequentially, with each model focusing on the errors of the previous model
Error reduction focus | Reduces variance | Reduces bias
Common algorithms | Random forest, bagged decision trees | AdaBoost, gradient boosting, XGBoost
Overfitting risk | Lower risk of overfitting due to random sampling | Higher risk of overfitting
Computational complexity | Lower | Higher

Both techniques are common, though boosting is more popular. Boosting mainly targets bias but can reduce variance as well, while bagging usually only affects variance.

How bagging works

Let’s consider how bagging actually works. The gist is to resample the training data randomly, train models in parallel on the resampled data, and use all the models to respond to inputs. We’ll tackle each in turn.

Data splitting

Assume we have a training dataset with n data points and want to make a bagged ensemble of m models. Then, we need to create m datasets (one for each model), each with n points. Keeping every dataset the same size as the original means each model trains on a comparable amount of data.

To create a single new random dataset, we randomly choose n points from the original training dataset. Importantly, we return each point to the original dataset after selecting it (sampling with replacement). As a result, the new random dataset will have more than one copy of some of the original data points while having zero copies of others. On average, for large n, about 63% of the original data points appear at least once in the new dataset; the other 37% are left out, and their places are taken by duplicates.

We then repeat this process to create all m datasets. The variation in data point representation helps create diversity among the ensemble models, which is one key to reducing variance overall.
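The resampling step can be sketched in a few lines of standard-library Python. The function names, the seed, and the integer stand-in data are illustrative assumptions, not part of any particular library.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) points from data, with replacement."""
    return [rng.choice(data) for _ in data]

def make_bootstrap_datasets(data, m, seed=0):
    """Create m bootstrap datasets, each the same size as data."""
    rng = random.Random(seed)
    return [bootstrap_sample(data, rng) for _ in range(m)]

n, m = 10_000, 5
original = list(range(n))   # stand-in for n distinct training points
datasets = make_bootstrap_datasets(original, m)

for i, dataset in enumerate(datasets):
    unique_fraction = len(set(dataset)) / n
    print(f"dataset {i}: {unique_fraction:.3f} of the original points present")

# Probability that a given original point is drawn at least once.
expected = 1 - (1 - 1 / n) ** n   # approaches 1 - 1/e as n grows
print(f"expected fraction: {expected:.3f}")
```

Each dataset has exactly n entries, but only about 63% of the original points show up in it, which matches the limit 1 - 1/e.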

Model training

With our m randomized datasets, we simply train m models, one per dataset. We should use the same kind of model throughout to ensure similar biases. Because no model depends on another, we can train them in parallel, allowing for much quicker iteration.
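Here is a rough sketch of independent training, assuming a deliberately tiny model: a one-split regression tree (a "stump") on a single feature. The toy data, the helper name fit_stump, and the worker count are all made up for illustration. A thread pool is used only to show that the fits do not depend on each other.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor

def fit_stump(dataset):
    """Fit a one-split regression tree to (x, y) pairs; return a predictor."""
    points = sorted(dataset)
    ys = [y for _, y in points]
    best = None  # (sse, threshold, left_mean, right_mean)
    for i in range(1, len(points)):
        if points[i][0] == points[i - 1][0]:
            continue  # cannot split between identical x values
        left, right = ys[:i], ys[i:]
        left_mean, right_mean = statistics.fmean(left), statistics.fmean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (points[i - 1][0] + points[i][0]) / 2
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x identical: fall back to the overall mean
        overall = statistics.fmean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

rng = random.Random(0)
# Toy data: price (in thousands) jumps once square footage passes 1,500.
xs = [rng.uniform(500, 2500) for _ in range(200)]
data = [(x, (200.0 if x < 1500 else 400.0) + rng.gauss(0, 20)) for x in xs]

m = 10
bootstrap_sets = [[rng.choice(data) for _ in data] for _ in range(m)]

# Each fit is independent, so the m fits can be handed to a worker pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    models = list(pool.map(fit_stump, bootstrap_sets))

print([round(model(1000)) for model in models])
```

In standard CPython, threads do not speed up CPU-bound work like this. A process pool or a library with native parallelism is the usual choice for real training runs.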

Aggregating models

Now that we have m trained models, we can use them as an ensemble to respond to any input. Each input data point is fed in parallel to each of the models, and each model responds with its output. Then we aggregate the outputs of the models to arrive at a final answer. If it’s a classification problem, we take the mode of the outputs (the most common output). If it’s a regression problem, we take the average of the outputs.

The key to reducing variance here is that each model is better at some kinds of inputs and worse at others due to differences in training data. However, overall, the errors of any one model should be canceled out by the other models, leading to lower variance.
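The two aggregation rules described above can be sketched as follows. The function names and the sample predictions are hypothetical.

```python
from collections import Counter
import statistics

def aggregate_classification(predictions):
    """Return the most common label (the mode) among the model outputs."""
    return Counter(predictions).most_common(1)[0][0]

def aggregate_regression(predictions):
    """Return the mean of the model outputs."""
    return statistics.fmean(predictions)

# Hypothetical outputs from a five-model classifier ensemble.
label = aggregate_classification(["cat", "dog", "cat", "cat", "dog"])

# Hypothetical outputs from a three-model regression ensemble.
price = aggregate_regression([310_000, 295_000, 305_000])

print(label, round(price))
```

If two labels tie, this sketch returns whichever was seen first. A real system might prefer an odd number of models or a tie-breaking rule based on predicted probabilities.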

Types of bagging algorithms

Bagging as an algorithm can be applied to any type of model. In practice, there are two bagged models that are very common: random forests and bagged decision trees. Let’s briefly explore both.

Random forests

A random forest is an ensemble of decision trees, each trained on randomized datasets. A decision tree is a model that makes predictions by answering yes/no questions about input data until it finds a suitable label.

In a random forest, each decision tree has the same hyperparameters (preset configurations like the maximum depth of the tree or the minimum samples per split), but it considers a different, randomly chosen subset of features from the training dataset, typically redrawn at each split. Without feature randomization, each decision tree may converge to similar answers despite differences in training data. Random forests are an extremely popular choice for ML and are often a good starting point for solving ML tasks.
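Feature randomization can be sketched as drawing a fresh random subset of features each time a tree considers a split. The feature names are made up. The subset size, the square root of the number of features, is a commonly used heuristic and an assumption here, not something this article specifies.

```python
import math
import random

def features_for_split(feature_names, rng):
    """Pick a random subset of features to consider at one split.

    The subset size sqrt(d) is an assumed heuristic, not taken from the text.
    """
    k = max(1, round(math.sqrt(len(feature_names))))
    return rng.sample(feature_names, k)

# Hypothetical house-price features.
features = ["square_footage", "bedrooms", "location_score", "school_rating",
            "lot_size", "year_built", "garage_spaces", "bathrooms",
            "house_number"]

rng = random.Random(0)
subsets = [features_for_split(features, rng) for _ in range(3)]
for subset in subsets:
    print(subset)
```

Because each split sees only a few candidate features, a single dominant feature cannot appear at the top of every tree, which keeps the trees from all looking alike.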

Bagged decision trees

Bagged decision trees are very similar to random forests except that every tree uses the same features from the training dataset. This reduces the diversity of outputs from the trees, which has pros and cons. On the plus side, the trees are more stable and will likely give similar answers; this can be used to determine which features are important. The downside is that variance won’t be reduced as much. For this reason, random forests are used much more than bagged decision trees.

Applications of bagging

Bagging can be used in any ML problem where the variance is higher than desired. As long as there is an ML model, it can be bagged. To make this more concrete, we’ll review a few examples.

Classification and regression

Classification and regression are two of the core ML problems. A user may want to label the subject of an image as a cat or as a dog—classification. Or a user may want to predict the selling price of a house from its features—regression. Bagging can help reduce variance for both of those, as we saw.

In classification, the mode of the ensemble models is used. In regression, the average is used.

Feature selection

Feature selection is about finding the most important features in a dataset—the ones that best predict the correct output. By removing irrelevant feature data, a model developer can reduce the possibility of overfitting.

Knowing the most important features can also make models more interpretable. Additionally, model developers can use this knowledge to reduce the number of features in the training data, leading to faster training. Bagged decision trees work well to uncover important features. The features that are heavily weighted within them will likely be the important ones.
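One crude way to see this is to fit a one-split tree to many bootstrap samples and count which feature each tree chooses for its split. This is only a proxy for the importance scores that tree libraries report. The data is synthetic and assumed for illustration: price depends on square footage, while house number is pure noise.

```python
import random
import statistics
from collections import Counter

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = statistics.fmean(values)
    return sum((v - mean) ** 2 for v in values)

def best_split_feature(rows, targets, n_features):
    """Return the index of the feature whose best single split has lowest SSE."""
    best_feature, best_error = None, float("inf")
    for f in range(n_features):
        for threshold in sorted({row[f] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[f] < threshold]
            right = [t for row, t in zip(rows, targets) if row[f] >= threshold]
            if not left or not right:
                continue  # a split must put points on both sides
            error = sse(left) + sse(right)
            if error < best_error:
                best_feature, best_error = f, error
    return best_feature

rng = random.Random(0)
names = ["square_footage", "house_number"]
rows = [(rng.uniform(500, 2500), rng.randint(1, 999)) for _ in range(100)]
# Synthetic prices: driven by square footage only, plus noise.
targets = [100 * sqft + rng.gauss(0, 10_000) for sqft, _ in rows]

votes = Counter()
for _ in range(30):
    idx = [rng.randrange(len(rows)) for _ in rows]   # bootstrap indices
    feature = best_split_feature([rows[i] for i in idx],
                                 [targets[i] for i in idx], len(names))
    votes[names[feature]] += 1

print(votes.most_common())
```

A feature that wins the split across most bootstrap samples is a strong candidate to keep. One that rarely wins is a candidate for removal.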

Bagging in e-commerce

Bagging in e-commerce is particularly valuable for predicting customer churn. ML models trained on churn data often have high variance due to complex, noisy customer behavior patterns; they may overfit their training dataset. They might also infer spurious relationships, such as assuming the number of vowels in a customer’s name affects their likelihood of churn.

The training dataset may contain only a few examples that cause this overfitting. Using bagged models, the ensemble can better identify genuine churn indicators while ignoring spurious correlations, leading to more reliable churn predictions.

Advantages of bagging

Bagging reduces model variance and overfitting and can help with data problems. It’s also one of the most parallelizable and efficient ensemble techniques.

Reduced variance

High model variance indicates that a model isn’t learning the true, meaningful patterns in data. Instead, it’s picking up on random correlations that don’t mean much and are a symptom of imperfect training data.

Bagging reduces the variance of the models; the ensemble as a whole focuses on the meaningful relationships between input and output.

Generalize well to new data

Since bagged models are more likely to pick up on meaningful relationships, they can generalize to new or unseen data. Good generalization is the ultimate goal of machine learning, so bagging is often a useful technique for many models.

In almost every ML problem, the training dataset is not fully representative of the actual data, so good generalization is key. In other cases, the true data distribution might change over time, so an adaptable model is necessary. Bagging helps with both cases.

Highly parallelizable

In contrast to boosting, creating bagged models is highly parallelizable. Each model can be trained independently and simultaneously, allowing for rapid experimentation and easier hyperparameter tuning (provided, of course, that you have enough compute resources to train in parallel).

Additionally, since each model is independent of the others, it can be swapped in or out. For example, a weak model can be retrained on a different random subset to improve its performance without touching the other models.

Challenges and limitations of bagging

Unfortunately, adding more models adds more complexity. Bagged models require a lot more compute resources, are harder to interpret and understand, and require more hyperparameter tuning.

More computational resources needed

More models require more resources to run them, and often, bagged ensembles have 50+ models. This may work well for smaller models, but with larger ones, it can become intractable.

Response times for the ensemble can also suffer as it grows. The resources also have an opportunity cost: They may be better used to train a larger, better model.

Harder to interpret

ML models, as a whole, are hard to interpret. Individual decision trees are a bit easier since they show which feature they base decisions on. But when you group a bunch of them together, as in a random forest, the conflicting answers from each tree can be confusing.

Taking the mode or average of predictions doesn’t itself explain why that’s the correct prediction. The wisdom of the crowd, while often right, is hard to understand.

More hyperparameter tuning

With more models, the effects of hyperparameters are magnified. One slight error in the hyperparameters can now affect dozens or hundreds of models. Tuning the same set of hyperparameters requires more time, which can place an even greater burden on limited resources.
