Introduction to Nested Learning: A New Paradigm for Continual Learning and Long-Context Understanding

Posted on Dec 19, 2025
Last Updated on Dec 20, 2025

Original Paper Link Here

Introduction

This is an attempt by an AI to simplify and digest this paper. This post alone should not be your final reference; treat it as one resource among several, and go back to the original paper to confirm the details. I will continue to update this post as I find and correct mistakes. So far, it holds up.

This post provides a structured analysis of the Nested Learning framework, which proposes that machine learning systems should optimize learning across multiple timescales simultaneously rather than treating parameters as updating uniformly. The paper presents both theoretical justification, connecting optimizers, architectures, and learning dynamics through information theory, and practical implementations (the CMS and Hope architectures). We’ll walk through the core problems this addresses, the conceptual origins, the technical contributions, and the current limitations. The goal is to help researchers understand what this framework proposes, why it matters, and where it remains speculative or incomplete.

1. The Problems We Have Now

Standard Deep Learning Pipeline:

Architecture (fixed) + Optimizer (separate) → Train → Freeze → Deploy

Key Limitations:

  1. Artificial Train/Test Divide: Models freeze after deployment and cannot learn from new interactions. In-context learning provides temporary adaptation: you can show examples in the prompt, but the model doesn’t retain this knowledge beyond a single conversation. This is a fundamental asymmetry: humans learn continuously, but today’s models have a hard boundary between training and deployment.

  2. Catastrophic Forgetting: Fine-tuning on new tasks overwrites old knowledge because all parameters update uniformly through the same gradient flow. The model has no mechanism for gradual, staged consolidation; it’s all-or-nothing. Additionally, momentum terms that compress training dynamics are discarded after training ends, so even the “memory” of how to learn is lost. The common workaround is training on both old and new data together, but this doesn’t scale: memory is limited, computational cost doubles, and if you omit old task data, performance collapses.

  3. Limited Computational Depth: Stacking additional layers doesn’t always increase effective computation; Transformers operate as essentially constant-depth circuits for many task classes. This is counterintuitive: we usually assume depth enables more complex computations, but Transformers process all tokens in parallel. Adding layers doesn’t create more sequential reasoning steps; it adds more parallel branches. This fundamental architectural constraint explains why scaling depth has diminishing returns.

  4. Architecture-Optimizer Mismatch: We pick optimizers (Adam, SGD) independently of architectures, ignoring that different architectures generate different gradient distributions.

  5. Binary Memory System: Models have a “context window” (short-term) and “parameters” (long-term), with nothing in between. Biological brains have memory at multiple timescales.

Recent Attempts:

  • In-context learning: Helps but limited to context window
  • Retrieval-augmented generation: External memory, but not integrated into learning
  • Continual learning methods (EWC, etc.): Band-aids that try to prevent forgetting rather than enabling graceful consolidation

2. Where This Came From

The paper draws from three converging insights:

Neuroscience Motivation:

  • Brain oscillations (delta, theta, beta, gamma waves) suggest different neural populations operate at different frequencies, each handling different timescales of processing and learning
  • Memory consolidation happens through distinct stages: the hippocampus rapidly encodes new information, then over hours and days, relevant patterns are slowly transferred to cortex for stable storage
  • Neuroplasticity reveals that brains use largely uniform, reusable circuits adapted through learning, not fixed specialized regions, suggesting that timescale-based organization might be more fundamental than functional specialization

Mathematical Observation:

The authors noticed that if you write gradient descent as:

$$W_{t+1} = \underset{W}{\text{argmin}} \left[ \langle W \cdot x_t, \nabla L \rangle + \frac{1}{2\eta}\|W - W_t\|^2 \right]$$

This is exactly the same form as how attention/RNNs update their memory:

$$M_{t+1} = \underset{M}{\text{argmin}} \left[ \langle M \cdot k_t, v_t \rangle + \frac{1}{2\eta}\|M - M_t\|^2 \right]$$

Both are associative memories compressing their context via gradient descent, just on different contexts (gradients vs tokens).
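
As a sanity check on this equivalence (a minimal derivation using the objectives exactly as written above; the paper may use slightly different sign or normalization conventions), set the gradient of the bracketed objective to zero:

$$\nabla L \, x_t^{\top} + \frac{1}{\eta}\left(W_{t+1} - W_t\right) = 0 \quad \Longrightarrow \quad W_{t+1} = W_t - \eta \, \nabla L \, x_t^{\top}$$

Here $\nabla L$ is the gradient with respect to the output $W x_t$, so $\nabla L \, x_t^{\top}$ is just the usual weight gradient and this is a standard gradient step. Running the identical algebra on the memory objective gives $M_{t+1} = M_t - \eta \, v_t k_t^{\top}$ (up to the sign convention chosen for the inner-product term): a rank-one, outer-product write into the memory matrix. Same operation, different context.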

Empirical Gap:

Transformers excel at short-context but fail at continual learning. RNNs handle sequences but can’t match Transformer quality. Maybe we need both frequencies simultaneously?

3. The Core Idea

Nested Learning reframes machine learning as a hierarchy of optimization problems at different timescales, rather than a single optimization of a static architecture.

Core Philosophy: Instead of asking “what’s the best architecture?” and “what’s the best optimizer?” separately, NL asks: “How do we design systems where components at different update frequencies naturally compress information from their input flows?”

The key insight: learning operates at multiple timescales simultaneously. Attention parameters update per token, weights update per batch or epoch, and momentum buffers aggregate gradients over longer training windows. Traditional deep learning treats these as independent problems; NL proposes optimizing them jointly as an interdependent system.

Think of it as moving from imperative architecture (fixed computational graphs with learned parameters) to adaptive architecture (systems where the learning mechanism itself is learned and adapts).

3.5 Simple Analogy

Traditional Deep Learning = A library where:

  • Books (weights) are written once during “training hours”
  • Readers (inference) can only look up what was written
  • New information requires closing the library and rewriting books

Nested Learning = A living ecosystem where:

  • Archives store permanent knowledge (low-frequency parameters)
  • Active memory holds recent context (high-frequency parameters)
  • Working memory processes current input (highest-frequency)
  • Knowledge continuously flows between levels: what’s learned in working memory gradually consolidates into the archives
  • Like how your brain doesn’t have a “training” vs. “testing” mode; it’s always learning at multiple timescales

The paper uses anterograde amnesia as a useful analogy: current LLMs cannot form new persistent memories after pre-training. They can access immediate context (the current conversation) or distant pre-training knowledge (facts learned during training), but nothing in between. Critically, they cannot consolidate new experiences into stable, retrievable knowledge that persists across conversations. This mirrors neurological anterograde amnesia, where patients retain immediate working memory and distant past memories but cannot form new long-term memories.

4. Technical Intuition

Frequency as a First Principle

At its core, Nested Learning observes that different parameters update at fundamentally different frequencies, and this is both natural and necessary:

$$\text{Update Frequency} = \frac{1}{\text{Data Points Between Updates}}$$

Examples:

  • Attention logits (highest frequency): Updated per token, $f_{\text{attention}} = \infty$ (never freeze)
  • Transformer weights (low frequency): Updated per batch/epoch, $f_{\text{weights}} \approx 10^{-5}$ (via SGD)
  • Momentum buffers (implicit frequency): Remember gradient direction over $\sim$1000 steps
  • Biological neurons: Some dendrites update per spike (ms), others via protein synthesis (hours)
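
To make the frequency hierarchy concrete, here is a minimal sketch of a bookkeeping loop over parameter groups with different update periods (the group names and periods are illustrative assumptions, not values from the paper):

```python
# Illustrative update periods (steps between parameter changes); names and numbers
# are assumptions for illustration only.
LEVELS = {
    "attention_state": 1,     # highest frequency: refreshed every token
    "fast_weights": 64,       # mid frequency: committed every 64 steps
    "slow_weights": 4096,     # low frequency: committed every 4096 steps
}

def update_frequency(period_in_steps: int) -> float:
    """Update frequency = 1 / (data points between updates)."""
    return 1.0 / period_in_steps

for name, period in LEVELS.items():
    print(f"{name:>16}: period = {period:>5} steps, frequency = {update_frequency(period):.2e}")
```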

Compression as Information Flow

A key insight: Each frequency level acts as an information bottleneck for the level below it.

Consider gradient descent on level $i$ with data at frequency $f_i$:

$$W^{(i)}_{t+1} = W^{(i)}_t - \eta_i \nabla_W L(W^{(i)}_t; x_{\text{context}_i})$$

The context $x_{\text{context}_i}$ for level $i$ is the aggregated outputs from level $i-1$ (the faster level below it):

$$x_{\text{context}_i} = \text{Aggregate}\left( \{x_{i-1, t} \mid t \in [t_0, t_{\text{now}})\} \right)$$

For a Transformer:

  • Level 0 (token): Individual token representation $x_t \in \mathbb{R}^d$
  • Level 1 (attention): Weighted aggregate of tokens via attention (lossy: many tokens → one attention output)
  • Level 2 (weights): Gradient signals (even more lossy: multi-token patterns → parameter updates)

Math form: If level $i$ compresses a context of size $|C_{i-1}|$ to size $|C_i|$, then:

$$\text{Compression Ratio} = \frac{|C_{i-1}|}{|C_i|} = \frac{\text{Seq Len} \cdot d}{d_{\text{hidden}}} \approx 10^2 \text{ to } 10^4$$
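
As a quick worked example with illustrative numbers (not taken from the paper): a 4,096-token context with token dimension $d = 1024$, compressed into a single hidden state of size $d_{\text{hidden}} = 4096$, gives

$$\text{Compression Ratio} = \frac{4096 \times 1024}{4096} = 1024 \approx 10^3,$$

comfortably inside the quoted range.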

Bottleneck Theory Perspective

Each level is solving a constrained optimization problem:

$$\min_{W^{(i)}} \mathbb{E}_{x \sim p_{\text{data}}} [L(W^{(i)}, x)] \quad \text{subject to} \quad W^{(i)} \text{ must compress the context from level } i-1$$

This is equivalent to a bottleneck problem (information-theoretic):

$$\min_{W^{(i)}} I(W^{(i)}; x) + \beta \cdot L(W^{(i)}, x)$$

where $I(\cdot;\cdot)$ is mutual information and $\beta$ is the Lagrange multiplier (trade-off between compression and task performance).

Why this matters: The paper argues that different $\beta$ values (corresponding to different update frequencies) naturally emerge when you optimize the following trade-off: “Retain just enough information at this timescale to minimize future task loss; discard the rest.” This isn’t assumed; it’s derived from the information-theoretic trade-off above.

Adam Unmasked: Optimizer as Memory

Traditional framing: Adam is an optimizer used during training.

Nested Learning reinterprets it: Adam is actually a learned associative memory compressing gradient statistics.

Formally, Adam maintains:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \quad \text{(first moment/mean)}$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \quad \text{(second moment/variance)}$$

This is exactly the same form as attention memory updates:

$$\text{Memory Update} = (1 - \alpha) \times \text{Old Memory} + \alpha \times \text{New Signal}$$

More precisely, both gradient descent and attention solve the same optimization problem, just operating on different “contexts”:

$$W_{t+1} = \arg\min_W \left[ \langle W \cdot x_t, \nabla L \rangle + \frac{1}{2\eta} ||W - W_t||^2 \right]$$

This is the same convex problem as:

$$M_{t+1} = \arg\min_M \left[ \langle M \cdot k_t, v_t \rangle + \frac{1}{2\eta} ||M - M_t||^2 \right]$$

where $M$ is attention memory, $k_t$ is key, $v_t$ is value. The core operation is identical: compress input (gradient or tokens) into a learned representation (weight or attention output).
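
The reinterpretation is easier to see in code. Below is a minimal sketch (my illustration, not the paper’s or any library’s implementation; bias correction is omitted for brevity) in which Adam’s two moment buffers are written as the same “old memory + new signal” EMA update shown above:

```python
import numpy as np

class EMAMemory:
    """Exponential moving average: memory = (1 - alpha) * memory + alpha * signal."""
    def __init__(self, shape, alpha):
        self.alpha = alpha
        self.state = np.zeros(shape)

    def write(self, signal):
        self.state = (1 - self.alpha) * self.state + self.alpha * signal
        return self.state

def adam_like_step(w, grad, m, v, lr=1e-3, eps=1e-8):
    """One Adam-style step: both moment buffers are EMA memories over gradient statistics."""
    m_hat = m.write(grad)        # first moment: memory of the gradient mean
    v_hat = v.write(grad ** 2)   # second moment: memory of the gradient variance
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Usage: the optimizer's "state" is literally two memories compressing the gradient stream.
w = np.zeros(4)
m = EMAMemory(w.shape, alpha=0.1)    # alpha = 1 - beta_1
v = EMAMemory(w.shape, alpha=0.001)  # alpha = 1 - beta_2
for _ in range(10):
    w = adam_like_step(w, np.random.randn(4), m, v)
```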

Protecting Old Knowledge: The Low-Pass Filter Effect

In standard SGD with weight decay:

$$W_{t+1} = (1 - \lambda \eta) W_t - \eta \nabla L$$

All parameters decay/update uniformly. New task gradients $\nabla L_{\text{new}}$ directly modify old knowledge encoded in $W_t$.

In Nested Learning with $N$ frequency levels:

$$W^{(1)}_{t+1} = W^{(1)}_t - \eta_1 \nabla L_{\text{new}} \quad \text{(high-freq: fast update)}$$

$$W^{(2)}_{t+1} = W^{(2)}_t - \eta_2 \frac{1}{T} \sum_{s=0}^{T-1} \nabla L_{\text{new}}(W^{(1)}_s) \quad \text{(low-freq: filtered update)}$$

The low-frequency level only sees smoothed, averaged gradients collected over many high-frequency steps. This creates a low-pass filter effect:

  • Old task knowledge (encoded at low frequencies) remains stable because it only updates on dominant trends
  • New task details learn at high frequencies first
  • Catastrophic interference reduced because new information must accumulate evidence across many steps before reaching low-frequency parameters

Cutoff frequency for forgetting: If $T = 1000$ steps between low-freq updates:

$$f_{\text{cutoff}} \approx \frac{\text{learning rate}}{T} \sim 10^{-6} \text{ to } 10^{-7}$$

Frequencies below this cutoff are naturally protected.
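
A minimal two-level sketch of this filtering effect (illustrative; the gradient function, T, and learning rates are arbitrary stand-ins):

```python
import numpy as np

T = 1000          # high-frequency steps between low-frequency updates
eta_fast = 1e-2   # high-frequency learning rate
eta_slow = 1e-3   # low-frequency learning rate

w_fast = np.zeros(8)   # W^(1): updates every step
w_slow = np.zeros(8)   # W^(2): updates every T steps, on averaged gradients
grad_buffer = []

def grad_new_task(w):
    # Stand-in for the new-task gradient: noisy samples around a fixed direction.
    return np.ones_like(w) + 0.5 * np.random.randn(*w.shape)

for step in range(5 * T):
    g = grad_new_task(w_fast)
    w_fast -= eta_fast * g          # fast level sees every noisy gradient
    grad_buffer.append(g)

    if (step + 1) % T == 0:
        g_avg = np.mean(grad_buffer, axis=0)  # low-pass filter: average over T steps
        w_slow -= eta_slow * g_avg            # slow level only sees the dominant trend
        grad_buffer.clear()
```

The slow parameters never see individual gradient spikes, only their T-step average, which is exactly the insulation described above.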

5. The Contributions Explained

A. Theoretical Unification (Sections 3-4):

Contribution 1: Formalized Nested Systems of Associative Memories (NSAM)

  • Any machine learning model = collection of optimization problems at different frequencies
  • Each “level” has its own context, objective, and update rule
  • Provides new lens to view existing methods

Contribution 2: Proved optimizers are associative memories

  • Adam = optimal memory for mapping gradients to their variance (L² regression)
  • Momentum = value-less memory compressing past gradients
  • AdaGrad, RMSProp, etc. all fit this framework
  • Implication: We should design optimizers based on the gradient distribution the architecture generates

Contribution 3: Showed architectures are also nested optimizations

  • Transformers: non-parametric solution (∞ frequency) + static weights (0 frequency)
  • Linear attention: parametric solution optimized via Hebbian learning
  • MLPs with learned initialization = linear attention with in-context learning
  • Implication: “Hybrid architectures” are just adding more frequency levels to MLPs

B. Novel Algorithms (Sections 4-5):

Contribution 4: Delta Gradient Descent (DGD)

$$W_{t+1} = W_t(I - \eta \cdot x_t x_t^T) - \eta \cdot \nabla L \cdot x_t$$

Unlike standard GD, the update depends on both current input and current state. Captures non-i.i.d. dependencies (crucial for sequences).
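
A minimal sketch of one DGD-style update, reading $\nabla L \cdot x_t$ as an outer product so the matrix shapes work out (my reading of the formula, not the paper’s code):

```python
import numpy as np

def dgd_step(W, x, grad_out, eta=0.01):
    """Delta-style update: the new W depends on both the current input x and the current W."""
    d_in = x.shape[0]
    decay = np.eye(d_in) - eta * np.outer(x, x)     # state-dependent erasure along the x direction
    return W @ decay - eta * np.outer(grad_out, x)  # write the new association along x

# Usage with arbitrary shapes.
W = np.random.randn(4, 3)      # maps d_in = 3 -> d_out = 4
x = np.random.randn(3)
grad_out = np.random.randn(4)  # gradient of the loss w.r.t. the layer output
W = dgd_step(W, x, grad_out)
```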

Contribution 5: Generalized momentum families

  • Deep Momentum: Use MLP instead of linear layer to compress gradients (higher capacity)
  • Delta Momentum: Better memory management via delta-rule update
  • Multi-scale Momentum (M3): Multiple momentum terms at different frequencies

C. Architectural Innovations (Sections 7-8):

Contribution 6: Continuum Memory System (CMS)

Replaces single MLP block with chain of blocks at different frequencies:

$$y_t = \text{MLP}^{(f_k)}\left( \text{MLP}^{(f_{k-1})}( \dots \text{MLP}^{(f_1)}(x_t) \dots ) \right)$$

where $\text{MLP}^{(f_i)}$ updates every $C_i$ steps.

Key insight: If you forget something from high-frequency memory, it’s still in low-frequency memory and can be recovered (loop through time).
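
A minimal sketch of the CMS structure, assuming each block is a tiny linear “MLP” stand-in that only commits parameter updates every $C_i$ steps (block sizes, periods, and the dummy gradient signal are illustrative; the real CMS uses learned update rules and proper learning signals, not this plain accumulator):

```python
import numpy as np

class FrequencyBlock:
    """A linear stand-in for an MLP block that commits updates only every `period` steps."""
    def __init__(self, dim, period, lr=1e-3):
        self.W = np.eye(dim) + 0.01 * np.random.randn(dim, dim)
        self.period, self.lr = period, lr
        self.grad_acc, self.steps = np.zeros_like(self.W), 0

    def forward(self, x, grad_signal):
        self.grad_acc += np.outer(grad_signal, x)   # accumulate evidence between commits
        self.steps += 1
        if self.steps % self.period == 0:           # commit at this block's own frequency
            self.W -= self.lr * self.grad_acc / self.period
            self.grad_acc[:] = 0.0
        return self.W @ x

# Chain of blocks: high-frequency first, low-frequency last.
dim = 16
cms = [FrequencyBlock(dim, period=1),
       FrequencyBlock(dim, period=64),
       FrequencyBlock(dim, period=4096)]

y = np.random.randn(dim)
for block in cms:
    y = block.forward(y, grad_signal=np.random.randn(dim))  # dummy learning signal
```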

Contribution 7: Self-Modifying Titans (Hope architecture)

Most sequence models have static projections: $k_t = W_k \cdot x_t$

Hope makes projections adaptive:

$$k_t = M_k(x_t) \quad \text{where } M_k \text{ updates in-context}$$

$$v_t = M_v(x_t) \quad \text{via learned rules}$$

$$\eta_t = M_{\eta}(x_t) \quad \text{(self-modifying learning rate)}$$

Plus, it generates its own values: $\hat{v}_t = M(v_t)$ (self-referential).
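
A heavily simplified sketch of the self-modifying idea (my illustration, not Hope’s actual rules: here only the learning rate is input-dependent and the write is a plain delta rule, whereas Hope also makes the key/value projections themselves in-context learners):

```python
import numpy as np

d = 8
M_k = np.random.randn(d, d)   # key projection (static in this sketch; adaptive in Hope)
M_v = np.random.randn(d, d)   # value projection (static in this sketch; adaptive in Hope)
w_eta = np.random.randn(d)    # produces a per-token learning rate
M = np.zeros((d, d))          # the associative memory written in-context

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def self_modifying_step(x):
    """One in-context step: projections produce key, value, and learning rate, then write M."""
    global M
    k, v = M_k @ x, M_v @ x
    eta = sigmoid(w_eta @ x)              # self-modulated learning rate for this token
    M = M + eta * np.outer(v - M @ k, k)  # delta-rule write: move M's read-out for k toward v
    return M @ k                          # read after write

for _ in range(16):
    _ = self_modifying_step(np.random.randn(d))
```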

6. Practical Applications

What It Solves:

  1. Continual Learning Without Catastrophic Forgetting

    • Tested: Learning new languages sequentially, class-incremental learning
    • Result: Hope-3 (3 memory levels) nearly recovers single-task performance
  2. Long-Context Understanding (up to 10M tokens)

    • BABILong benchmark: Hope maintains performance where Transformers/GPT-4 fail at 128K-256K
    • Needle-in-haystack: Outperforms all attention-free models
  3. Continual Memory Consolidation

    • Knowledge in high-frequency layers gradually transfers to low-frequency
    • No need to discard model after “end of pre-training”

7. The Limitations

Acknowledged by Authors:

  1. Not a Complete Solution to Forgetting

    • Catastrophic forgetting is a natural consequence of limited capacity
    • CMS reduces but doesn’t eliminate it
    • Compression must discard information
  2. Computational Overhead

    • Continual parameter updates have cost
    • M3 optimizer slower than Adam/Muon (though they show mitigation strategies)
    • Tradeoff: Better solutions vs. training speed
  3. Hyperparameter Complexity

    • How many levels? What frequencies?
    • Chunk sizes for each level
    • Knowledge transfer mechanisms between levels
    • More design choices = harder to optimize

Unacknowledged/Unclear:

  1. Scalability Questions

    • Experiments limited to 1.3B parameters
    • Will nested optimization scale to 100B+ models?
    • Memory overhead of maintaining multiple frequency states
  2. Theory-Practice Gap

    • Elegant theory, but practical design still requires empirical tuning
    • No clear principles for “how many levels for task X?”
    • Unified framework, but implementation details matter hugely
  3. Initialization Sensitivity

    • Meta-learning initial states crucial for performance
    • Bad initialization could make levels interfere destructively
    • How robust is this to cold-start scenarios?
  4. Comparison Fairness

    • Hope adds parameters (CMS levels) vs. baselines
    • Some improvements might be from capacity, not frequency design
    • Need iso-parameter comparisons
  5. Biological Plausibility

    • Backpropagation through levels not biologically realistic
    • Local learning rules would be more neuroscience-aligned
    • Delta/Hebbian rules are biologically motivated, but full system isn’t

8. What Comes Next

Immediate Extensions:

  1. Architecture-Specific Optimizers

    • Since architectures generate gradient distributions, co-design them
    • Example: Muon works well for Transformers; State-Space Models might need a different approach
  2. Higher-Order In-Context Learning

    • Stack more levels: learn to learn to learn…
    • Could enable models to discover new learning algorithms during deployment
  3. Offline Consolidation (mentioned but not implemented)

    • Current work: online consolidation (during wakefulness)
    • Future: offline replay during “sleep” (analogous to hippocampal replay)
    • Could use sharp-wave ripples simulation

Branches This Inspires:

A. Neuroscience-Inspired AI:

  • Brain oscillations → frequency-based architectures
  • Memory consolidation → sleep-based training
  • Neuroplasticity → truly adaptive networks

B. Continual Lifelong Learning:

  • Models that never stop learning
  • Personal AI assistants that adapt to individual users
  • Robots that learn from experience like animals

C. Self-Modifying AI Systems:

  • Schmidhuber’s Gödel machines (provably optimal self-improvement)
  • Meta-learning at deployment time
  • AI that improves its own learning algorithms

D. Theoretical ML:

  • New analysis of optimization landscapes at multiple scales
  • Sample complexity bounds for nested optimization
  • Connections to online learning theory, regret bounds

E. Efficient LLMs:

  • Replace costly retraining with continual adaptation
  • Knowledge editing without fine-tuning
  • Personalization without full model copies

9. Timeline & Current State

Historical Context:

2016-2020: Proto-Ideas

  • Fast weights (Ba, Hinton, et al. 2016), building on Schmidhuber’s Fast Weight Programmers (1991): Matrix-valued hidden states
  • Meta-learning (MAML 2017): Outer loop learns initialization
  • But: Still binary train/test split

2020-2023: Emergent Capabilities

  • GPT-3 ICL: Models can adapt from context alone
  • But: Knowledge doesn’t persist, static after training

2023-2024: Test-Time Training

  • TTT (Sun et al. 2024): Update parameters during inference
  • Titans, Mamba, State-Space Models: Alternatives to attention
  • But: Isolated solutions, no unifying theory

2025: Nested Learning (This Paper)

  • First unified framework connecting all above
  • Shows they’re instances of same principle at different frequencies

Current State (Dec 2025):

Empirical Validation:

  • Hope outperforms baselines on continual learning (CLINC, Banking, DBpedia datasets)
  • Handles 10M tokens in BABILong (Transformers/GPT-4 struggle at 128K-256K)
  • Language modeling matches Transformers at 1.3B scale
  • Experiments limited to ~1.3B parameters (hasn’t been tested on 10B+ scale yet)

Adoption:

  • Academic: Likely to influence continual learning research (provides unified theoretical framework)
  • Industry: Premature, needs validation on production-scale models and real-world continual learning tasks
  • Unanswered: Will multi-timescale architectures become standard like attention is?
  • DeepSeek-V3 (Liu et al. 2024): Mixture-of-Experts with auxiliary losses at different depths (partially related)
  • Gemini 2.5 (Comanici et al. 2025): Multi-timescale processing (similar motivation but different approach)
  • RWKV-7 (Peng et al. 2025): Dynamic recurrence with state updates (related ideas, predates paper)

Trend: Industry labs independently exploring multi-timescale and test-time adaptation designs. This suggests the core ideas have merit even if implementations differ.

What’s Missing for Maturity:

  1. Scaling Laws: Need “Chinchilla” equivalent for nested architectures

    • Optimal number of levels vs. compute budget?
    • Frequency allocation strategies?
  2. Benchmarks: Current evals designed for static models

    • Need continual learning benchmarks at scale
    • Lifelong evaluation protocols
  3. Engineering:

    • Efficient implementations (current code research-grade)
    • Framework integration (PyTorch/JAX native support)
    • Distributed training for nested systems
  4. Theory:

    • Convergence guarantees for nested optimization?
    • Sample complexity bounds?
    • When do more levels help vs. hurt?

Glossary: Key Terms for Beginners

Foundational Concepts

Optimization Problem: A mathematical formulation of “find the best solution to minimize/maximize a function.” In ML, we minimize loss (error). Example: “minimize prediction error on training data.”

Gradient Descent: Algorithm that iteratively improves parameters by moving in the direction opposite to the gradient (direction of steepest increase in loss). Like walking downhill to minimize height.

Update Frequency: How often parameters change. High-frequency = change often (per token). Low-frequency = change rarely (per epoch). This is the key concept in Nested Learning.

Parameter: A learned value in a neural network. Weights and biases are parameters. They’re what gets updated during training.

Associative Memory (niche concept): A memory system that retrieves stored information based on similarity to query inputs, rather than explicit memory addresses. In biological brains, smells trigger memories of people or places; retrieval is content-based, not address-based. In neural networks, attention mechanisms are a form of associative memory: given a query, retrieve the values most similar to that query. Compared to standard memory access (retrieve the value at location X), associative memory enables flexible, semantic retrieval.
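
A minimal content-based retrieval sketch (illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def associative_read(query, keys, values):
    """Retrieve by similarity to the query, not by address: a soft weighted sum of values."""
    weights = softmax(keys @ query)   # similarity of the query to every stored key
    return weights @ values           # blend of stored values, weighted by similarity

keys = np.random.randn(5, 8)     # 5 stored items with 8-dim keys
values = np.random.randn(5, 4)   # their associated 4-dim values
out = associative_read(np.random.randn(8), keys, values)
```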

Specific to Deep Learning

Attention: A mechanism in Transformers that computes weighted sums of values based on query-key similarity. Why? Because it lets the model focus on relevant parts of the input. (Beginner confusion: think of it as “smart averaging”; not all tokens are equally important.)

In-Context Learning (ICL): Using the prompt/context window to teach the model behavior without updating weights. For instance, showing the model three labeled examples in the prompt teaches it a classification task for that single inference pass. The model adapts its behavior based on context, but this adaptation is temporary: once the conversation ends, the learned behavior doesn’t persist, and the model returns to its pre-training behavior on the next unrelated query. This distinction between temporary context-based adaptation and permanent weight-based learning is fundamental to understanding why ICL has limitations.

Catastrophic Forgetting (key problem): When a model learns task B, it forgets task A because both tasks use overlapping parameters, and new task gradients overwrite old task knowledge. This happens because gradient descent doesn’t distinguish “important for old task” from “important for new task”; it just optimizes the overall loss. The term “catastrophic” emphasizes that performance on task A doesn’t degrade gradually; it collapses suddenly once learning shifts to task B. This is not a minor inconvenience but a fundamental blocker to continual learning in current architectures.

Momentum (optimizer term): An optimizer buffer that maintains an exponential moving average of recent gradients instead of updating based only on the current gradient. Intuitively, momentum “remembers” the direction of recent updates, which smooths optimization and accelerates convergence on well-conditioned objectives. From the Nested Learning perspective, momentum is reframed as a low-frequency memory that encodes gradient statistics, treating the optimizer not as a hyperparameter choice but as a learned memory system.

Fine-tuning: Taking a pre-trained model and adjusting its weights on a new task. Usually causes catastrophic forgetting if you’re not careful.

Information-Theoretic Concepts (niche, explains Section 2.5)

Information Bottleneck: A mathematical principle stating that an intermediate representation should compress input information while preserving task-relevant information. More precisely: compress as much as possible while maintaining predictive power. The intuition is similar to summarizing a 300-page book into one paragraph: you lose many details but ideally preserve the core plot. Information theory formalizes this trade-off: how much information can you discard while still solving the task? One view holds that neural networks implicitly perform information bottleneck optimization at different layers, which is the lens the framework adopts.

Mutual Information ($I(X, Y)$): Measures how much knowing X tells you about Y; equivalently, the reduction in uncertainty about Y when you observe X. High mutual information means X and Y are strongly dependent; low MI means they’re nearly independent. In machine learning, mutual information quantifies signal (relevant shared structure) versus noise (independent variation). The Nested Learning framework uses MI to formalize which information each frequency level should retain about previous levels.

Lagrange Multiplier ($\beta$): A weight in optimization that trades off two competing goals. In NL: “compress info” vs. “minimize loss.” High $\beta$ = prioritize compression. Low $\beta$ = prioritize task performance.

Neuroscience Terms Used in NL

Memory Consolidation: Process where short-term memories (hippocampus) gradually transfer to long-term storage (cortex). In NL: this is the conceptual inspiration for information flowing from high-frequency to low-frequency levels.

Brain Oscillations: Rhythmic electrical activity in the brain (delta, theta, beta, gamma waves). Different frequencies are associated with different neural processes. (In NL: Inspires the multi-frequency architecture.)

Hippocampus → Cortex Pathway: Brain’s memory system. Hippocampus learns quickly (fast, detailed), cortex learns slowly (consolidated, generalizable). (In NL: Analogy for high-freq to low-freq information flow.)

Anterograde Amnesia: Inability to form new long-term memories after injury, though short-term recall works. (In NL: Used to describe LLMs, they have immediate context and distant pre-training knowledge, but can’t form new persistent memories.)

Paper-Specific Terminology

Nested Systems of Associative Memories (NSAM): The theoretical framework. Says any ML model is a collection of memories at different timescales all compressing information via the same mathematical operation.

Delta Gradient Descent (DGD): A variant where weight updates depend on current state, not just gradients. Why? Captures dependencies in sequences where past data matters. (Beginner note: Different from standard SGD, which treats each data point independently.)

Continuum Memory System (CMS): Architecture with multiple MLP blocks, each updating at different frequencies. High-freq blocks update every token, low-freq blocks every 1000s of tokens.

Hope Architecture: The main architecture implementation. Makes attention projections (keys, values, learning rates) adaptive instead of static. Can update them in-context.

Hebbian Learning: A biologically-plausible learning rule inspired by neuroscience: “neurons that fire together, wire together.” Mathematically, strengthen synapses between neurons whose activities are correlated. This contrasts with backpropagation, which requires error signals and isn’t easily implemented in biological systems. In the Nested Learning framework, Hebbian-like rules appear naturally as the learned update rules at intermediate frequency levels, suggesting that multi-timescale learning might be more biologically realizable than end-to-end backpropagation.
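
In code, the core rule is a single outer-product update (a minimal sketch):

```python
import numpy as np

def hebbian_update(W, pre, post, eta=0.01):
    """'Fire together, wire together': strengthen weights between co-active units."""
    return W + eta * np.outer(post, pre)

W = np.zeros((4, 3))
W = hebbian_update(W,
                   pre=np.array([1.0, 0.0, 1.0]),        # presynaptic activity
                   post=np.array([0.5, 0.0, 1.0, 0.0]))  # postsynaptic activity
```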

Common Confusions (read if stuck)

Q: Isn’t Adam already solving the multi-frequency problem? A: Adam operates at a single frequency (it updates weights per batch). Nested Learning argues for multiple simultaneous frequencies, some parameters updating per token, others per 1000 tokens, with information flowing between them. Adam handles gradient variance across the whole model uniformly; NL treats different parameter groups as operating on different timescales with different learning dynamics.

Q: If catastrophic forgetting is natural, doesn’t the brain suffer from it too? A: The brain experiences some interference, but less severely. The key is the consolidation process: new memories form in the hippocampus (fast, detailed), then gradually transfer to cortex (slow, consolidated). During consolidation, old and new information are separated into different physical regions. NL’s low-pass filter mimics this separation: low-frequency parameters encoding old knowledge are insulated from new task gradients because they only update on aggregate trends, not every noisy gradient spike.

Q: Why not just use a much larger model to avoid forgetting? A: Capacity helps but doesn’t fundamentally solve the problem. If all parameters update at the same frequency from task-specific gradients, new task information interferes with old task knowledge regardless of model size. You’re not solving the mechanism of interference; you’re hoping to have enough capacity that the interference is unnoticeable. This works in practice for some scenarios but breaks down in true continual learning, where you face dozens or hundreds of sequential tasks.

Q: Is this just fine-tuning with regularization methods like EWC or replay buffers? A: No. Those approaches try to prevent forgetting by keeping old tasks alive during new training (through regularization weights or rehearsal). NL takes a different approach: use architectural, timescale-based separation so old and new learning naturally decouple. It’s not about preventing forgetting; it’s about making the system’s design inherently support continual learning at multiple timescales.

Further Reading (By Topic)

If you want to understand Attention:

If you want to understand Catastrophic Forgetting:

  • “Continual Lifelong Learning with Dynamic Synaptic Plasticity” (Fernando et al. 2017)
  • “Elastic Weight Consolidation” (Kirkpatrick et al. 2017)

If you want Information Theory Background:

  • “Information Bottleneck Method” (Tishby et al. 2000)
  • “Deep Learning and the Renormalization Group” (Mehta et al. 2019)

If you want Neuroscience Context:

  • “Memory Consolidation” (Frankland & Bontempi 2005)
  • “Waves, Rhythms, and Cognition” (Buzsáki & Draguhn 2004)

For Meta-Learning Context:

  • “Model-Agnostic Meta-Learning (MAML)” (Finn et al. 2017)
  • “Learning to Learn by Gradient Descent by Gradient Descent” (Andrychowicz et al. 2016)