The classical theory of generalization in statistical learning has largely revolved around complexity measures such as the Vapnik-Chervonenkis (VC) Dimension and Rademacher Complexity. While these tools have been instrumental in providing theoretical guarantees, they often rely on worst-case analyses and may not fully capture the intrinsic structure of real-world learning problems. In this paper, we revisit generalization theory through the lens of Kolmogorov Complexity, a measure of algorithmic information that quantifies the minimal description length of a hypothesis.
Introduction: The Limits of Classical Generalization Theory
Machine learning’s core problem is not training; it’s generalization.
Generalization is the ability of a machine learning model to perform well on unseen data. This is the central challenge of statistical learning. While classical theory provides tools like VC dimension, Rademacher complexity, and norm- or margin-based bounds, these frameworks struggle to explain the behavior of modern deep learning models.
The Deep Learning Paradox
Overparameterized neural networks often generalize well despite having sufficient capacity to memorize their training data. This contradicts classical wisdom, which suggests that models with high complexity (e.g., large VC dimension) should overfit. Empirical phenomena such as double descent, flat minima, and implicit bias of SGD indicate that traditional complexity measures may be missing a fundamental aspect of learning.
A New Lens: Algorithmic Simplicity
What if generalization is not about the size of the hypothesis class, but about the algorithmic simplicity of the learned function?
Kolmogorov Complexity, a measure of the shortest description of an object, offers a fresh perspective:
- Low-K(f) functions are simple in an information-theoretic sense.
- SGD tends to find such functions, even in vast hypothesis spaces.
- Compression techniques (pruning, quantization) often improve generalization, aligning with the idea of minimizing description length.
Kolmogorov Complexity: A Primer
The Kolmogorov complexity K(f) of a function f is the length of the shortest program (on a universal Turing machine) that computes f. It captures the algorithmic information content of f, independent of statistical assumptions. In effect, it quantifies the intrinsic algorithmic simplicity of an object.
Key concepts:
- Simplicity Bias: If two or more functions fit the data, the one with lower K(f) is ‘simpler’ and preferred.
- Solomonoff Induction: A universal prior assigns each hypothesis a probability proportional to 2^{−K(h)}, favoring simpler ones.
- Minimum Description Length (MDL): A computable approximation of Kolmogorov Complexity, used for model selection.
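As a toy illustration of the Solomonoff-style prior (the hypothesis set and its bit lengths are hypothetical, with program length standing in for the uncomputable K(h)):

```python
# Toy Solomonoff-style induction over a small, hypothetical hypothesis set.
# Program length in bits stands in for the uncomputable K(h).
hypotheses = {
    "constant_zero": {"bits": 4,  "predicts": [0, 0, 0, 0]},
    "alternating":   {"bits": 8,  "predicts": [0, 1, 0, 1]},
    "lookup_table":  {"bits": 32, "predicts": [0, 1, 0, 1]},
}

def posterior(observed):
    """Weight each hypothesis consistent with the data by 2^-K, then normalize."""
    weights = {
        name: 2.0 ** -h["bits"]
        for name, h in hypotheses.items()
        if h["predicts"][: len(observed)] == observed
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# After observing [0, 1, 0], "alternating" and "lookup_table" both fit the data,
# but almost all posterior mass goes to the shorter program.
post = posterior([0, 1, 0])
```

The example makes the simplicity bias concrete: among data-consistent hypotheses, posterior mass concentrates exponentially on the shortest description.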
In practice, K(f) is non-computable, but can be approximated by compression metrics or MDL proxies. This connects naturally with deep learning: we can measure how compressible a model is and relate it to its generalization behavior.
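A minimal sketch of such a compression proxy, using a general-purpose codec (zlib here) on raw weight bytes; the "weights" are synthetic arrays for illustration:

```python
import zlib
import numpy as np

def compressed_size(weights: np.ndarray) -> int:
    """Byte length of the zlib-compressed weights: a crude upper bound on K."""
    return len(zlib.compress(weights.astype(np.float32).tobytes(), 9))

rng = np.random.default_rng(0)
structured = np.tile(np.linspace(-1.0, 1.0, 64), 64)  # highly regular "weights"
noise = rng.standard_normal(64 * 64)                  # incompressible by design

# The regular array admits a much shorter description than the noise,
# even though both contain the same number of parameters.
```

Any general-purpose compressor only upper-bounds K(f), but relative comparisons like this one are often informative.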
Why Kolmogorov Complexity Matters in Deep Learning
The Overparameterization Paradox
Classical theory predicts that large models should overfit. Yet, deep networks generalize well. Kolmogorov complexity resolves this:
- SGD acts as an implicit complexity minimizer, converging to simple functions.
- Empirical studies (e.g., Mingard et al., 2020) show trained networks have low effective complexity, even if their architecture is large.
Unlike VC or Rademacher bounds, which depend on hypothesis class size, Kolmogorov complexity speaks to the specific function learned, not the class it belongs to.
SGD as an Approximate MDL Optimizer
Stochastic Gradient Descent (SGD) exhibits a bias toward simple functions:
- Prefers flat minima, which are more compressible.
- Behaves like an approximate MDL selector, even without explicit regularization.
Empirically, SGD tends to find simple, low-Kolmogorov-complexity functions even in large hypothesis spaces. Compression and pruning techniques further reduce a model's description length without harming generalization, and sometimes even improve it.
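The flat-minima claim can be illustrated with a toy sharpness estimate (all loss functions and scales here are hypothetical): measure the average loss increase under small random weight perturbations; flatter minima change less, and hence need fewer bits to specify the weights precisely.

```python
import numpy as np

def sharpness(loss_fn, w, eps=0.1, trials=200, seed=0):
    """Mean loss increase under random weight perturbations of scale eps."""
    rng = np.random.default_rng(seed)
    base = loss_fn(w)
    return float(np.mean([
        loss_fn(w + eps * rng.standard_normal(w.shape)) - base
        for _ in range(trials)
    ]))

w = np.zeros(10)                               # both losses share a minimum at 0
flat_loss  = lambda w: 0.5 * np.sum(w ** 2)    # wide basin
sharp_loss = lambda w: 50.0 * np.sum(w ** 2)   # narrow basin

# The sharp minimum is two orders of magnitude more sensitive to perturbation.
```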
Model Compression as a Proxy for Kolmogorov Complexity
While true K(f) is non-computable, model compression offers a practical surrogate:
- Weight Pruning, Quantization, and Huffman Encoding reduce the model size.
- Compression ratios reflect the redundancy and structure within learned weights.
- Models with better compression often generalize better.
This supports the MDL view: Generalization ∼ min {Model Size + Error Encoding}
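The two-part MDL criterion above can be made concrete: score each candidate model by the compressed size of its parameters plus the bits needed to encode its residual errors. A rough sketch, with the error term encoded naively at an assumed flat cost per misclassified example:

```python
import zlib
import numpy as np

BITS_PER_ERROR = 16  # assumed flat cost to encode one misclassified example

def mdl_score(weights: np.ndarray, n_errors: int) -> float:
    """Two-part code length in bits: model description + error encoding."""
    model_bits = 8 * len(zlib.compress(weights.astype(np.float32).tobytes(), 9))
    return model_bits + BITS_PER_ERROR * n_errors

# A small, slightly imperfect model can beat a huge model that fits perfectly:
small = mdl_score(np.zeros(100), n_errors=5)
huge  = mdl_score(np.random.default_rng(0).standard_normal(100_000), n_errors=0)
```

Under this scoring, memorization shows up as a bloated model term, while a compact model "pays" only a modest error term.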
Challenges & Future Directions
1. The Non-Computability of K(f), and Practical Workarounds
Kolmogorov Complexity K(f) is fundamentally uncomputable: no algorithm can determine the shortest possible description of an arbitrary function f. This is a consequence of the undecidability of the halting problem (and, more generally, Rice's Theorem), making exact calculation impossible.
This implies two things for machine learning:
- We cannot directly optimize K(f) during training.
- Classical learning theory relies on computable measures (e.g., VC dimension), but these may not reflect true algorithmic simplicity.
So, how do we tackle this? While we cannot compute K(f) exactly, we can approximate it using:
- Compression-based proxies: Model pruning, quantization, and architectural minimalism.
- Minimum Description Length (MDL): Implemented via Huffman Coding, Bayesian model averaging, or variational compression.
- Algorithmic probability: Leverage Solomonoff induction to assign higher priors to shorter programs, or approximate Bayesian inference with simplicity-biased priors.
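The sensitivity question raised below can be probed directly: the same weights yield different complexity estimates under different codecs. A simple check, using stdlib compressors as the encoding schemes:

```python
import bz2, gzip, lzma, zlib
import numpy as np

def proxy_sizes(weights: np.ndarray) -> dict:
    """Compressed size (bytes) of the same weights under several codecs."""
    raw = weights.astype(np.float32).tobytes()
    return {
        "zlib": len(zlib.compress(raw, 9)),
        "gzip": len(gzip.compress(raw, 9)),
        "bz2":  len(bz2.compress(raw, 9)),
        "lzma": len(lzma.compress(raw)),
    }

sizes = proxy_sizes(np.tile(np.linspace(-1.0, 1.0, 128), 128))
# The estimates disagree, so any K(f) proxy is defined only up to the codec.
```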
Open Challenges:
- How sensitive are these proxies to the choice of encoding scheme?
- Can we develop universal compression metrics for neural networks?
2. The Lack of Tight Generalization Bounds
Classical bounds (VC, Rademacher, PAC-Bayes) depend on hypothesis class complexity, not the specific learned function. This leads to overly pessimistic guarantees for overparameterized models. Kolmogorov Complexity could help because:
- A function’s description length may better predict its generalization than its parameter count.
- Non-uniform learning theory: Instead of uniform convergence, we could derive bounds conditioned on K(f).
This leads to two potential research directions:
- Kolmogorov-PAC Bounds: Can we prove statements like:
$$
\mathrm{Generalization\ Gap} \leq O\left( \frac{K(f) + \log\left( \frac{1}{\delta} \right)}{n} \right)
$$
- Hybrid Approaches: Combining K(f) with Rademacher complexity for data-dependent bounds, or integrating with SGLD (Stochastic Gradient Langevin Dynamics) to measure the simplicity of SGD trajectories.
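To get a feel for the hypothesized Kolmogorov-PAC bound, one can plug in numbers, treating the O(·) constant as 1 and using a compression-based estimate for K(f) (both assumptions for illustration):

```python
import math

def kpac_gap(k_bits: float, n: int, delta: float = 0.05) -> float:
    """Hypothesized Kolmogorov-PAC generalization gap, O(.) constant taken as 1."""
    return (k_bits + math.log(1.0 / delta)) / n

# A model describable in ~10^4 bits, trained on 10^6 examples,
# would get a gap estimate near 0.01 under these assumptions.
gap = kpac_gap(1e4, 10**6)
```

Note how the bound depends on the description length of the learned function, not on the parameter count of the architecture.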
Key Questions:
- Can we derive computable upper bounds for K(f) in neural networks?
- How does implicit regularization (e.g., SGD, dropout) relate to description length minimization?
3. Toward Kolmogorov-Aware Optimization
Standard SGD minimizes empirical risk but does not explicitly favor simple functions. Yet, empirically, it often finds low-K(f) solutions.
Explicit complexity-aware optimization could be pursued through:
- MDL-Regularized Loss Functions: Augment the loss with a complexity penalty – e.g., use LZMA or GZIP on model weights to estimate description length.
- Compression-Aware Optimizers: Modify SGD to prefer compressible updates through gradient masking (e.g., penalizing updates that increase entropy) or iterative pruning + fine-tuning.
- Neural Network Compression as Training Objective: Directly optimize for compressibility using differentiable compression (e.g., straight-through gradient estimators for quantization) or variational Bayesian compression (learning weight distributions that minimize coding cost).
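A sketch of the first idea above (hypothetical: quantize the weights, compress with LZMA, and add the resulting code length to the loss). The compression term is non-differentiable, so in this sketch it only scores candidate solutions rather than receiving gradients:

```python
import lzma
import numpy as np

def description_bits(weights: np.ndarray, n_levels: int = 256) -> int:
    """Approximate description length: uniform quantization + LZMA byte count."""
    lo, hi = weights.min(), weights.max()
    q = np.round((weights - lo) / (hi - lo + 1e-12) * (n_levels - 1)).astype(np.uint8)
    return 8 * len(lzma.compress(q.tobytes()))

def mdl_loss(empirical_risk: float, weights: np.ndarray, lam: float = 1e-6) -> float:
    """Empirical risk plus a complexity penalty on the model's code length."""
    return empirical_risk + lam * description_bits(weights)
```

At equal empirical risk, this objective prefers the candidate whose quantized weights compress to fewer bits.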
Open Challenges:
- Differentiable compression: How to backpropagate through compression steps?
- Global versus local simplicity: Should we optimize for layer-wise or whole-model K(f)?
- Connection to flat minima: Are low-K(f) solutions also robust to perturbations?
Conclusion: A New Perspective on Generalization
Kolmogorov complexity reframes generalization through the lens of algorithmic information. Instead of asking how large the hypothesis class is, we ask: how complex is the function the model has chosen? Promising directions include:
- Developing better approximations of K(f) for neural networks.
- Deriving non-vacuous generalization bounds using algorithmic information.
- Designing optimization methods that explicitly trade off accuracy and simplicity.
By bridging computability theory, statistical learning, and deep learning practice, we may finally resolve the mystery of why overparameterized models generalize.
References
Information-theoretic foundations of deep learning (Mingard et al., 2021).
Algorithmic Information Theory (Chaitin, 1997).
Modeling by Shortest Data Description (Rissanen, 1978).
Input–output maps are strongly biased towards simple outputs (Dingle et al., 2018).
Deep learning generalizes because the parameter-function map is biased towards simple functions (Valle-Pérez et al., 2018).
PS: 20–30% of this paper was written with the help of generative AI.