A Unified Cognitive Architecture
Multimodal Embodied Learning Interactive Neural Operant Engine
Temporal Hierarchies of Learning & Memory for Embodied Artificial Agents
Contemporary AI systems couple powerful short-horizon inference to weak or absent mechanisms for learning, selecting, and consolidating information across timescales.
A child who encounters a new object can remember it later, integrate it into play, and build on it across days and months. By contrast, most deployed AI systems either remain largely unchanged by their experiences, or change in ways that are unstable and non-selective.
Current systems behave as if they have abundant semantic knowledge but impoverished ongoing learning. Their competence is real, but it is "front-loaded" into pretraining. After training, behavior is dominated by short-horizon inference over a limited context window.
The imbalance in current architectures: fast processing is powerful, but encoding and consolidation are underdeveloped.
"The function of memory is to carry information forward in time."
— Gallistel & King, 2009
The architecture that revolutionized AI—and its fundamental limitation.
Transformers process input through self-attention: every token (a word or sub-word unit) computes a similarity score with every other token to determine what is relevant. Think of it as each word in a sentence asking every other word, "How important are you to my meaning?"
The mechanism works through three learned projections of the input x: queries Q = xW_Q, keys K = xW_K, and values V = xW_V.
Queries represent what each token is looking for. Keys represent what each token offers. Values carry the actual information to be passed along. Attention computes a weighted combination—scores from query-key similarity are converted to probabilities (via softmax, a function that normalizes scores so they sum to 1), then used to mix the values:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where d_k is the key dimension; dividing by √d_k keeps the similarity scores in a range where softmax remains well-behaved.
This gives Transformers their expressive power—any token can attend to any other—but at a cost that grows quadratically with sequence length.
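To ground the formula, here is a minimal sketch of single-head scaled dot-product attention in PyTorch; the dimensions and names are illustrative, not taken from any particular model.

```python
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over a sequence x.

    x: (seq_len, d_model) input token embeddings
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices
    """
    Q, K, V = x @ W_q, x @ W_k, x @ W_v      # project input into query/key/value spaces
    scores = Q @ K.T / K.shape[-1] ** 0.5    # every token scores every other token
    weights = F.softmax(scores, dim=-1)      # normalize scores into attention weights
    return weights @ V                       # mix values by attention weight

# Example: 10 tokens, 64-dim embeddings, 32-dim head
x = torch.randn(10, 64)
W_q, W_k, W_v = (torch.randn(64, 32) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)       # (10, 32)
```

The quadratic cost is visible in the `scores` matrix: it has one entry per pair of tokens, so doubling the sequence length quadruples the work.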
After attention, each token passes through an MLP (multi-layer perceptron)—a small neural network of stacked layers that transform data through simple mathematical operations. From the Nested Learning perspective, these layers act as persistent memory: patterns learned during training that never change once the model is deployed.
Attention serves as short-term memory: precise but fleeting, limited to what fits in the current context window. The MLP layers serve as long-term memory: stable but completely frozen after training. There is no mechanism between these two extremes—nothing for ongoing learning, selective encoding, or gradual consolidation of new experiences.
Linear transformations map input tokens into query, key, and value spaces. Multi-head attention uses several parallel sets for diverse attention patterns—like reading a sentence for grammar, meaning, and context simultaneously.
Softmax converts raw similarity scores into a probability distribution. Each token effectively "chooses" how much to attend to each other token, enabling flexible long-range dependencies.
Per-token neural network layers act as static knowledge stores. They encode everything the model learned during training—but cannot learn anything new once deployed.
A neural long-term memory module that selectively encodes information during deployment using surprise-gated updates.
Titans introduces a neural memory module M that updates during inference—that is, while the model is actually being used, not just during training. The core idea draws on a familiar psychological principle: events that violate expectations—that are surprising—are more memorable.
Surprise is measured by an error signal: how much did the memory's prediction differ from what actually arrived? Formally, this is the gradient (the direction and magnitude of the prediction error) of an associative memory loss:

ℓ(M; x_t) = ‖M(k_t) − v_t‖², with k_t = x_t W_K and v_t = x_t W_V
In plain terms: the memory tries to predict the value associated with the current input. When it fails badly, the error is large—the event is surprising—and the memory writes it in.
The memory update incorporates both momentary surprise (the current error signal) and past surprise (a running average called momentum that tracks recent trends), along with a built-in forgetting mechanism (weight decay):

S_t = η_t S_{t−1} − θ_t ∇ℓ(M_{t−1}; x_t)
M_t = (1 − α_t) M_{t−1} + S_t

Here S_t is the accumulated surprise, η_t sets how quickly past surprise fades, θ_t scales the momentary surprise, and α_t is the forgetting (weight-decay) rate.
This is equivalent to an optimization algorithm with momentum and weight decay—the memory literally learns by optimizing an associative objective in real time.
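A minimal sketch of this update, assuming a two-layer MLP memory and fixed scalar hyperparameters η, θ, α (the paper allows these to be input-dependent):

```python
import torch

class TitanMemory(torch.nn.Module):
    """Toy surprise-gated associative memory, updated at inference time."""
    def __init__(self, dim, eta=0.9, theta=0.1, alpha=0.01):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.SiLU(), torch.nn.Linear(dim, dim))
        self.eta, self.theta, self.alpha = eta, theta, alpha
        # Running "past surprise" (momentum), one buffer per parameter
        self.momentum = [torch.zeros_like(p) for p in self.mlp.parameters()]

    def update(self, k, v):
        """Write (k, v) into memory in proportion to how surprising v is."""
        loss = ((self.mlp(k) - v) ** 2).sum()          # associative loss: prediction error
        grads = torch.autograd.grad(loss, self.mlp.parameters())
        with torch.no_grad():
            for p, m, g in zip(self.mlp.parameters(), self.momentum, grads):
                m.mul_(self.eta).sub_(self.theta * g)  # S_t = η S_{t-1} − θ ∇ℓ
                p.mul_(1 - self.alpha).add_(m)         # M_t = (1 − α) M_{t-1} + S_t
```

When the memory already predicts v from k, the gradient is near zero and almost nothing is written; a badly wrong prediction produces a large write.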
Unlike simpler models that compress history into a single matrix, Titans uses an MLP (a small neural network) with two or more layers as its memory. This "deep" memory can capture nonlinear relationships—complex patterns that depend on combinations of features—making it strictly more expressive than shallow alternatives.
Memory as Context (MAC): Memory output is concatenated with the current input, then processed by standard attention. Attention decides what from long-term memory is relevant to the current moment.
Memory as Gate (MAG): Sliding-window attention and neural memory operate in parallel, combined through a learned gate that controls how much each contributes (a sketch of this gating follows below). Attention provides short-term precision; memory handles long-range context.
Memory as Layer (MAL): Neural memory processes input before attention sees it, compressing past context into a form attention can use more effectively. The most common hybrid design.
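As an illustration of the gated variant, here is a sketch in which a learned gate mixes the attention and memory streams; the class name and wiring are assumptions for illustration:

```python
import torch

class GatedHybrid(torch.nn.Module):
    """Combine sliding-window attention output with neural memory output via a learned gate."""
    def __init__(self, dim):
        super().__init__()
        self.gate = torch.nn.Linear(2 * dim, dim)

    def forward(self, attn_out, mem_out):
        g = torch.sigmoid(self.gate(torch.cat([attn_out, mem_out], dim=-1)))
        return g * attn_out + (1 - g) * mem_out  # gate sets short- vs long-term contribution
```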
All components of a neural network—attention, the training algorithm, and model parameters—are associative memories operating at different update frequencies.
Nested Learning reveals a unifying principle: every component of a neural network is an associative memory—a system that maps inputs (keys) to outputs (values), compressing context into its parameters. This mirrors the concept of associative memory studied in psychology, but applied to every level of the architecture. What differentiates components is primarily their update frequency: attention state changes with every token, optimizer state changes with every training step, and pretrained weights, once deployed, do not change at all.
The training process itself is an associative memory. When a neural network learns through gradient descent—repeatedly adjusting parameters in the direction that reduces prediction errors—the momentum term (a running average of recent error signals) acts as a memory that compresses gradient history. With a small modification, this "optimizer memory" can be made deeper and more expressive, using its own small neural network to track patterns in how the model should be adjusting.
HOPE replaces the single frozen MLP of a standard transformer with a chain of MLP blocks, each updating at a different frequency. This creates a continuum of memory timescales: some layers adapt rapidly (absorbing recent patterns), while others change slowly (encoding stable regularities).
Information persists across a spectrum rather than collapsing into a binary of "instant" versus "permanent"—each layer encodes knowledge at its characteristic timescale.
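A minimal sketch of such a continuum, assuming a chain of residual MLP blocks where block l is only permitted to apply its accumulated updates every periods[l] steps; the periods and sizes shown are illustrative:

```python
import torch

class ContinuumMemory(torch.nn.Module):
    """Chain of MLP blocks, each updated at its own frequency during ongoing learning."""
    def __init__(self, dim, periods=(1, 16, 256)):
        super().__init__()
        self.blocks = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.SiLU(),
                                torch.nn.Linear(dim, dim))
            for _ in periods)
        self.periods = periods   # block 0 adapts every step; block 2 every 256 steps

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)     # residual chain of memory blocks
        return x

    def step(self, optimizers, t):
        """Apply accumulated gradients only to blocks whose period divides step t."""
        for period, opt in zip(self.periods, optimizers):
            if t % period == 0:
                opt.step()
                opt.zero_grad()
```

Fast blocks track the current session; slow blocks only move once evidence has accumulated over many steps.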
Replacing the simple running average with a small neural network yields "Deep Momentum"—the training algorithm itself gains a learned, nonlinear memory for tracking how the model should adjust, enabling richer optimization dynamics.
Using error-corrective updates (reminiscent of the Rescorla-Wagner learning rule from psychology) for the memory allows the system to selectively erase outdated associations before writing new ones, managing limited capacity more effectively.
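Both ideas can be sketched briefly. The first block below is a toy "deep momentum" in which a small network replaces the scalar running average (in the full method that network is itself meta-learned; here it only shows the data flow). The second is a delta-rule write for a linear memory, the error-corrective update just described. Names and dimensions are illustrative:

```python
import torch

class DeepMomentum(torch.nn.Module):
    """Toy 'deep momentum': an MLP stands in for the scalar momentum average."""
    def __init__(self, dim):
        super().__init__()
        # In the full method this network is learned in an outer loop;
        # here it is random, purely to show where it sits in the update.
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.Tanh(), torch.nn.Linear(dim, dim))

    def update(self, param, grad, lr=1e-3):
        """Apply a nonlinear transform of the gradient instead of the raw gradient.

        param, grad: flat (dim,) tensors for a single parameter block.
        """
        with torch.no_grad():
            param.sub_(lr * self.net(grad))

def delta_rule_write(M, k, v, lr=0.5):
    """Error-corrective write: move the memory's prediction for k toward v.

    M: (dim, dim) linear memory; k, v: (dim,) vectors; k is assumed unit-norm.
    """
    error = v - M @ k                        # Rescorla-Wagner style prediction error
    return M + lr * torch.outer(error, k)    # write only the unpredicted part of v
```

Because the delta rule writes only the prediction error, a key whose value is already stored triggers almost no update, which is what frees capacity for genuinely new associations.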
HOPE combines a self-referential Titans-based sequence model with the Continuum Memory System, producing an architecture that can learn to modify its own learning process at test time—a form of meta-learning.
Learning behaviors by practicing inside a learned simulation of the world—training entirely in imagination.
Dreamer learns a world model: an internal simulation of how the environment behaves and changes in response to actions. Rather than learning purely from trial and error in the real world, the agent can "imagine" what would happen and learn from those imagined experiences. The model operates on latent representations—compact internal codes that capture the essential features of the environment: an encoder compresses observations into latent states, a dynamics model predicts how those states evolve under actions, and prediction heads estimate rewards and whether the episode continues.
The actor (decision-maker) and critic (evaluator) learn entirely from imagined experience. Starting from real states, the agent "dreams" forward using the dynamics predictor, collecting predicted rewards and refining its policy without any actual environment interaction. DreamerV3 was the first algorithm to collect diamonds in Minecraft from scratch, without human demonstrations—purely through imagination.
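A hedged sketch of that imagination loop, with every component name a placeholder for whatever world-model implementation is used:

```python
def imagine_rollout(dynamics, reward_head, actor, start_latent, horizon=15):
    """Roll the world model forward in latent space, with no environment interaction."""
    latent, trajectory = start_latent, []
    for _ in range(horizon):
        action = actor(latent)                    # policy acts on the imagined state
        latent = dynamics(latent, action)         # world model predicts the next latent
        trajectory.append((latent, reward_head(latent)))
    return trajectory                             # used to compute returns for actor-critic
```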
Dreamer employs mathematical techniques for stable learning: predictions are made in a compressed scale that handles extreme values, returns are normalized relative to recent experience, and the world model is prevented from collapsing into trivially simple predictions. A single set of settings works across over 150 diverse tasks—from Atari games to robotics.
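As one concrete instance, the compressed prediction scale in DreamerV3 is the symlog transform, paired with its inverse symexp; it squashes large magnitudes while staying roughly linear near zero:

```python
import torch

def symlog(x):
    """Compress extreme values: sign(x) * log(1 + |x|)."""
    return torch.sign(x) * torch.log1p(torch.abs(x))

def symexp(x):
    """Inverse transform: sign(x) * (exp(|x|) - 1)."""
    return torch.sign(x) * torch.expm1(torch.abs(x))
```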
DreamerV4 replaces the recurrent world model with an efficient transformer architecture, enabling world models with billions of parameters that can learn entirely from offline data.
DreamerV4 is the first agent to obtain diamonds in Minecraft purely from pre-recorded data, without any environment interaction.
Four frameworks developed independently converge on a shared architectural principle: cognition is the coordinated operation of memory processes distributed across timescales.
Conceptual Vocabulary
Buckner's Domain General Modular Architecture draws on faculty psychology—the philosophical tradition that the mind consists of distinct capacities or "faculties"—to argue that cognition is organized into functional subsystems operating at different retention horizons. Provides the interpretive lens connecting mechanisms to a coherent story about faculties and timescales.
Offline Consolidation
The world model provides counterfactual rollouts and a natural locus for "sleep" phases. Imagination training is the computational analogue of an empiricist "mental laboratory"—recombining learned regularities to evaluate possible futures and consolidate experience into policy.
Selective Encoding
Surprise-gated memory provides a principled write signal and capacity management through momentum and forgetting. Determines what experiences are promoted into longer-lived stores—the missing selection mechanism in current "long context" approaches.
Multi-Frequency Substrate
The Continuum Memory System provides a spectrum of update rates so information can persist and stabilize without requiring full retraining or remaining trapped in fleeting attention. The temporal substrate that connects fast and frozen memory.
Systems with only fast contextual state (attention) and frozen knowledge (pretrained weights) will systematically struggle with experience accumulation. Robust experiential learning requires intermediate memory horizons.
Effective memory relies on well-implemented selection (what is written), stabilization (what persists), transformation (what becomes reusable), and controlled reuse (how stored structure guides action). These must interact and modulate one another.
The most successful proposals do not hard-code domain concepts; they improve the general machinery of retaining and consolidating experience. This aligns with an empiricist commitment to domain-general architectural support for learning.
A constructive test of whether coordinating memory operations across timescales yields stable post-deployment learning in an embodied agent.
Melinoe assigns each framework a clear role within an integrated architecture that coordinates three temporal operations:
Sliding-window attention handles immediate perception and action selection within the current context. This is the agent's "working memory"—precise but ephemeral.
Titans' surprise-gated neural memory writes only prediction-violating experiences into persistent storage. Momentum tracks significance over time; weight decay manages capacity.
Periodic sleep phases use Dreamer's world model to replay and recombine experiences. The Continuum Memory System stabilizes information across multiple frequency bands, from session-level to task-level knowledge.
The architecture is evaluated on a Rock-a-Stack toy in a virtual reality environment—recreating toddler-like learning conditions with no pretraining, only hands-on play. A human proctor specifies success criteria, then the agent learns episodically: interacting, retaining salient experience, and periodically entering sleep-like consolidation phases.
The key prediction is transfer after learning: consolidated representations should support faster, lower-error performance on novel configurations—genuine consolidation, not mere context extension.
Agent interacts with the environment. Attention processes immediate context. Surprising events are written to neural memory via surprise signals.
Agent disconnects from environment. World model replays and imagines trajectories. CMS layers absorb stabilized knowledge. Policy improves from dreamed experience.
Consolidated memory enables generalization to novel configurations. The agent performs better on new tasks without catastrophic drift or full retraining.
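Putting the three phases together, a high-level sketch of the intended loop; every method name here is a placeholder, not an implemented API:

```python
def melinoe_episode(env, agent, steps=1000, sleep_every=200):
    """Alternate wake-phase interaction with periodic sleep-phase consolidation."""
    obs = env.reset()
    for t in range(steps):
        # Wake: attention handles the current context; act and observe
        action = agent.act(obs)
        next_obs, reward = env.step(action)
        # Selective encoding: only prediction-violating events are written
        surprise = agent.memory.prediction_error(obs, action, next_obs)
        if surprise > agent.memory.threshold:
            agent.memory.write(obs, action, next_obs)
        agent.world_model.observe(obs, action, next_obs, reward)
        obs = next_obs
        # Sleep: replay and imagine; CMS layers absorb stabilized knowledge
        if (t + 1) % sleep_every == 0:
            agent.sleep()
```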
Without the compute to train a full model from scratch, a hybrid approach tests the core mechanisms by surgically inserting HOPE modules into an existing pretrained model.
Training a Transformer from scratch requires enormous computational resources. A full implementation of the Melinoe architecture would replace the standard MLP and attention blocks entirely with HOPE modules, training everything end-to-end.
Without access to that scale of compute, a different strategy is needed: take an existing pretrained model and modify it to test whether the proposed memory mechanisms actually work.
The proof of concept uses TinyLlama (a 1.1 billion parameter language model) as a frozen backbone. Its original weights are completely locked—they never change. Instead, small HOPE adapter modules are inserted at three strategic layers (5, 11, and 17 out of 22 total), positioned to intercept information at early, middle, and late processing stages.
Each adapter contains two components working together: a Titans-style surprise-gated memory that updates at test time, and a Continuum Memory System block whose layers update at different frequencies.
The adapter output is added to the backbone's hidden state with a learned scaling factor, initially set very small (0.01) so the adapters must gradually earn influence over the model's predictions.
When the model makes a prediction, the error signal (how wrong was the prediction?) flows backward through the network. At each adapter location, this error signal is captured and used as a teach signal for the Titan memory—telling it what information should be stored or updated. This mirrors how surprise drives memory encoding in the Titans framework.
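A sketch of this insertion pattern, assuming a loaded Llama-style model object named backbone; the HopeAdapter class, the hook wiring, and the hidden size are illustrations of the approach rather than the exact implementation:

```python
import torch

class HopeAdapter(torch.nn.Module):
    """Adapter beside a frozen layer; a small learned scale lets it earn influence."""
    def __init__(self, dim):
        super().__init__()
        self.memory = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.SiLU(), torch.nn.Linear(dim, dim))
        self.scale = torch.nn.Parameter(torch.tensor(0.01))  # starts nearly silent

    def forward(self, hidden):
        return hidden + self.scale * self.memory(hidden)      # residual correction

# Freeze the backbone entirely; only adapter parameters receive gradients
for p in backbone.parameters():
    p.requires_grad_(False)

adapters = {i: HopeAdapter(2048) for i in (5, 11, 17)}  # TinyLlama hidden size is 2048

def make_hook(adapter):
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = adapter(h)
        return (h,) + output[1:] if isinstance(output, tuple) else h
    return hook

for i, adapter in adapters.items():
    backbone.model.layers[i].register_forward_hook(make_hook(adapter))  # Llama-style layout
```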
The needle-in-a-haystack (NIAH) test buries a specific piece of information (the "needle") at a controlled depth within a long stretch of irrelevant text (the "haystack"). The model must retrieve the needle when asked.
This tests selective retrieval—the same cognitive capacity Titans' surprise-gated memory is designed to support. By varying context length (how much hay?) and depth (where is the needle?), the test maps exactly where memory begins to fail.
In the passkey retrieval test, the model is shown a random 6-digit passkey within a long context, then asked to reproduce it exactly. Unlike NIAH, there is no semantic content to rely on—the passkey is arbitrary digits.
This tests pure memorization capacity—whether the memory system can faithfully store and retrieve precise information that has no prior association. A strict test of the memory's fidelity, not its ability to pattern-match.
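For concreteness, a minimal passkey prompt builder along these lines; the filler text and phrasing are invented for illustration:

```python
import random

def build_passkey_prompt(n_sentences=256, depth=0.5):
    """Bury a random 6-digit passkey at a relative depth inside filler text."""
    passkey = f"{random.randrange(10**6):06d}"
    sentences = ["The grass is green", "The sky is blue"] * n_sentences
    insert_at = int(len(sentences) * depth)          # 0.0 = start, 1.0 = end
    sentences.insert(insert_at, f"The passkey is {passkey}. Remember it")
    return ". ".join(sentences) + ". What is the passkey?", passkey

prompt, answer = build_passkey_prompt(depth=0.0)     # needle at the very beginning
```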
Two iterations of the hybrid adapter demonstrate measurable improvements in memory-dependent tasks, validating the core mechanisms even under severe compute constraints.
The first hybrid adapter was evaluated on NIAH across three context lengths (512, 1024, 2048 tokens) and three insertion depths (beginning, middle, end). Four conditions were compared: the unmodified base model, the frozen backbone with CMS-only adapters (no Titan memory), HOPE without test-time memorization, and HOPE with active memory updates.
At 2048 tokens with the needle at the midpoint, HOPE achieves 100% retrieval accuracy where the unmodified base model manages only 60%. CMS alone reaches 80%—the Titan memory component accounts for the remaining gap.
At 2048 tokens with the needle at the very beginning (depth 0.0), all models struggle. The base model drops to 0%, and even HOPE only reaches 20%. This suggests the adapter cannot yet recover information that has been entirely displaced from the backbone's attention window.
At 512 and 1024 tokens, all conditions—including the base model—achieve near-perfect accuracy. These lengths fall within TinyLlama's trained context window, so the backbone handles them natively.
Version 1 demonstrated that the memory mechanism works within the backbone's native context window, but collapsed at longer sequences. Version 2 targets this limitation through several changes:
The backbone's positional encoding (RoPE) was designed for sequences up to 2048 tokens. By applying scaling and interpolation to these encodings, the backbone can process longer sequences without the position-dependent collapse seen in v1. This is the single most impactful change—without it, the backbone's attention mechanism produces meaningless outputs beyond its trained window.
In v1, the surprise gate parameters (α, θ, η) were fixed constants. In v2, the surprise magnitude itself serves as an adaptive gate—when the memory's prediction error is large, the update is proportionally stronger. This is more faithful to the Titans paper's intent: surprise intensity naturally modulates how aggressively the memory writes, without requiring a separate learned gate network.
Learnable prefix tokens are added to stabilize the model's initial attention behavior and provide a fixed reference point for retrieval. These function like a "scratch pad" that the model learns to use for storing recurring patterns.
The teach signal is now projected through learned key and value mappings before updating the Titan memory, better matching the associative memory formulation from the Titans paper. This ensures the memory stores information in a format optimized for later retrieval.
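Two of these changes are straightforward to sketch. The first applies position interpolation to rotary embeddings by compressing positions back into the trained range; the second replaces the fixed gate constants with a write strength proportional to surprise. Both are illustrative sketches under assumed names, not the exact implementation:

```python
import torch

def rope_frequencies(positions, dim, base=10000.0, trained_len=2048):
    """Rotary embedding angles with position interpolation beyond the trained window."""
    scale = max(1.0, (positions.max().item() + 1) / trained_len)  # e.g. 2.0 at 4096 tokens
    positions = positions / scale                    # compress positions into trained range
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2) / dim))
    angles = positions[:, None] * inv_freq[None, :]  # (seq_len, dim/2)
    return torch.cos(angles), torch.sin(angles)

def adaptive_write_strength(pred, target, max_strength=1.0):
    """v2's adaptive gate: scale memory writes by the surprise magnitude."""
    surprise = torch.norm(target - pred)             # prediction-error magnitude
    return max_strength * torch.tanh(surprise)       # larger errors drive stronger writes
```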
At 3072 tokens, where v1 collapses to 15% (worse than the base model's 20%), v2 achieves 82% accuracy. At 4096 tokens—double the backbone's trained context window—v2 reaches 64% where v1 scored 0%. The position extrapolation fix unlocked the adapter's ability to operate beyond the backbone's native range.
Passkey retrieval at 4096 tokens shows the same pattern: v2 achieves 62% exact reproduction where both the base model and v1 score 0%. At 2048 tokens, v2 matches CMS performance at 98%—near-perfect memorization within the trained window.
The hybrid adapter demonstrates that surprise-gated memory and multi-frequency update schedules produce measurable improvements in retrieval tasks, even when grafted onto a frozen backbone that was never designed for them. But this approach faces fundamental constraints: the backbone's representations were never shaped to exploit an external memory, the adapters touch only three of its twenty-two layers, and the attention mechanism itself remains unmodified.
These results are a lower bound. Every improvement the hybrid achieves despite its constraints suggests a larger effect from first-principles integration.