Regularization is a core technique in machine learning used to reduce overfitting and improve model generalization. Among the most popular forms are L1 and L2 regularization. In this post, we focus solely on L1 regularization (also known as Lasso) and aim to build a strong mental model of how and why it leads to sparse models.
The Setup: Polynomial Regression as an Analogy
To keep things intuitive, we’ll use polynomial regression as an analogy. A polynomial model of degree \(n\) looks like this:
$$
f(x; w) = w_0 + w_1x + w_2x^2 + \dots + w_nx^n
$$
Each term \(x^i\) can be thought of as a “feature”, and the corresponding \(w_i\) as its weight. The weights are the parameters we aim to learn.
We explored three cases:
- Overfitting: Data generated by a low-degree polynomial (e.g., degree 2), but we fit a higher-degree model (e.g., degree 5). This captures noise.
- Exact Fit: Data generated by a polynomial of the same degree as the model. Ideal scenario.
- Underfitting: Data generated by a higher-degree polynomial than the model can represent. The model cannot capture all patterns.

Overfitting
- Fitted model (red): a degree-15 polynomial, way too flexible
- True function (green dashed): a smooth parabola (degree 2)
- Training data (blue): the same function plus strong noise

Underfitting
- Blue dots: the training data clearly follows the complex shape
- Green dashed line: the true function (degree 20)
- Red line: a degree-3 model trying to fit it
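If you want to reproduce this kind of experiment, here is a minimal NumPy sketch; the degrees, sample size, noise level, and random seed are illustrative choices, not the exact settings behind the figures above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground truth: a smooth parabola (degree 2) plus strong noise.
x = np.linspace(-1, 1, 100)
y_true = 1.0 - 2.0 * x + 3.0 * x**2
y_noisy = y_true + rng.normal(scale=0.5, size=x.shape)

# Fit polynomials of different degrees with plain least squares.
for degree in (2, 15):
    coeffs = np.polyfit(x, y_noisy, deg=degree)
    y_fit = np.polyval(coeffs, x)
    train_mse = np.mean((y_fit - y_noisy) ** 2)
    true_mse = np.mean((y_fit - y_true) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, "
          f"MSE vs. true function = {true_mse:.3f}")

# The degree-15 fit typically reaches a lower training MSE but tracks the
# noise, so its error against the true parabola tends to be larger.
```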
What Happens Without Regularization?
Without regularization, the optimizer is free to assign nonzero values to any weight as long as it slightly reduces the loss. Even unimportant or noisy features get small non-zero weights.
So without any penalty, weights almost never end up exactly zero unless a feature is perfectly uncorrelated with the target, which is rare in real-world noisy data.
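As a quick illustration (the data-generating polynomial, noise level, and feature degree below are made up for this sketch), plain least squares hands every feature some nonzero weight:

```python
import numpy as np

rng = np.random.default_rng(0)

# Degree-2 ground truth, but the model is offered monomial features up to x^8.
x = np.linspace(-1, 1, 40)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.3, size=x.shape)
X = np.vander(x, N=9, increasing=True)  # columns: x^0, x^1, ..., x^8

# Unregularized least-squares fit.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w, 3))
print("exact zeros:", np.sum(w == 0.0))  # typically 0: every weight is nonzero
```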
Enter L1 Regularization
L1 regularization adds a penalty proportional to the absolute value of each weight:
\( J(w) = \text{Loss}(w) + \lambda \sum |w_i| \)
This equation is the heart of L1 regularization. It encourages weights to shrink, but more importantly:
It gives the optimizer a reason to set some weights exactly to zero.
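To see the sparsity directly, here is a minimal sketch using scikit-learn's Lasso on the same kind of data as above; the penalty strength `alpha` (scikit-learn's name for \(\lambda\)) is an arbitrary illustrative value:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Degree-2 ground truth, degree-8 monomial features.
x = np.linspace(-1, 1, 40)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.3, size=x.shape)
X = np.vander(x, N=9, increasing=True)[:, 1:]  # drop x^0; Lasso fits the intercept itself

lasso = Lasso(alpha=0.05, max_iter=100_000).fit(X, y)
print(np.round(lasso.coef_, 3))
print("exact zeros:", np.sum(lasso.coef_ == 0.0))
# With the L1 penalty, most of the high-degree weights land exactly at zero.
```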
Why L1 Leads to Sparsity: The Intuition
1. Non-smoothness and Subgradients
The L1 term \(|w|\) is not differentiable at 0. Its subgradient is:
\( \partial |w| = \begin{cases} 1 & w > 0 \\ -1 & w < 0 \\ [-1, 1] & w = 0 \end{cases} \)
That interval \([-1, 1]\) means the gradient can “pause” at zero. If the gain from moving away from zero is not strong enough, the optimizer sticks to \(w = 0\).
This allows the optimal solution to stay at zero if the data doesn’t strongly support that feature.
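To make this concrete for a single weight: minimizing \( \tfrac{1}{2}(w - a)^2 + \lambda |w| \) has the closed-form “soft-thresholding” solution, which is exactly zero whenever \(|a| \le \lambda\). A minimal sketch (the constant \(a\) stands in for the unregularized optimum):

```python
import numpy as np

def soft_threshold(a, lam):
    """Minimizer of 0.5 * (w - a)**2 + lam * |w| over w.

    Zero lies in the subgradient of the objective at w = 0 exactly when
    |a| <= lam, so the solution 'sticks' at zero in that case.
    """
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

for a in (0.3, 1.5, -2.0):
    for lam in (0.1, 0.5, 2.0):
        print(f"a = {a:+.1f}, lambda = {lam:.1f} -> w* = {soft_threshold(a, lam):+.2f}")

# For |a| <= lambda the optimum is exactly 0; otherwise the weight is
# shrunk toward zero by lambda but keeps its sign.
```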
Visualization: The Rubber Band Metaphor
Think of each weight like a marble on a smooth bowl-shaped curve (the loss), while L1 adds a rubber band pulling the marble to the center (zero).
If the slope of the loss is shallow around zero, the rubber band wins and the marble stays at the center. The weight is zero.

Plot Explanation:
- Dashed Line: The original loss function \((w - 0.3)^2\) with its minimum at \(w=0.3\)
- Dotted Line: The L1 penalty \(\lambda |w|\), which pulls all weights toward zero (with slope \(\pm\lambda\))
- Red Line: The total objective:
$$
(w - 0.3)^2 + \lambda |w|
$$
This is what the optimizer actually minimizes.
What this shows:
- Without regularization, the optimizer would settle at \(w = 0.3\)
- With L1 regularization, the minimum of the red curve is pulled toward 0
- If \(\lambda\) were even larger (here, \(\lambda \ge 0.6\)), the global minimum would shift to exactly zero, because the gain from a nonzero \(w\) isn’t worth the added L1 penalty
This is how L1 prefers zero when it can. If the original loss has a shallow slope around zero, L1 pushes the total cost’s minimum right to the origin. We could say it “sticks at zero.”
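You can check this numerically with a brute-force grid search over \(w\); the grid and the \(\lambda\) values below are arbitrary:

```python
import numpy as np

w = np.linspace(-1.0, 1.0, 100_001)  # fine grid around zero

for lam in (0.0, 0.3, 0.8):
    objective = (w - 0.3) ** 2 + lam * np.abs(w)
    w_star = w[np.argmin(objective)]
    print(f"lambda = {lam:.1f} -> argmin = {w_star:+.3f}")

# Expected: argmin near 0.300 with no penalty, pulled toward zero for a
# moderate lambda, and (numerically) zero once lambda >= 0.6, i.e. twice
# the unregularized optimum 0.3.
```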
Summary
- Without regularization: all weights can become non-zero, even for unimportant features.
- L1 regularization adds a sharp penalty that promotes exact zeros.
- The reason it works: the math of subgradients, and the geometry of the optimization problem.
This makes L1 regularization a natural tool for feature selection, which is why it’s widely used wherever sparse models are desirable, such as with high-dimensional tabular data or in signal processing.