A Unified Cognitive Architecture
Multimodal Embodied Learning Interactive Neural Operant Engine
Temporal Hierarchies of Learning & Memory for Embodied Artificial Agents
Contemporary AI systems couple powerful short-horizon inference to weak or absent mechanisms for learning, selecting, and consolidating information across timescales.
A child who encounters a new object can remember it later, integrate it into play, and build on it across days and months. By contrast, most deployed AI systems either remain largely unchanged by their experiences, or change in ways that are unstable and non-selective.
Current systems behave as if they have abundant semantic knowledge but impoverished ongoing learning. Their competence is real, but it is "front-loaded" into pretraining. After training, behavior is dominated by short-horizon inference over a limited context window.
The imbalance in current architectures: fast processing is powerful, but encoding and consolidation are underdeveloped.
"The function of memory is to carry information forward in time."
— Gallistel & King, 2009
The architecture that revolutionized AI—and its fundamental limitation.
Transformers process input through self-attention: every token (a word or sub-word unit) computes a similarity score with every other token to determine what is relevant. Think of it as each word in a sentence asking every other word, "How important are you to my meaning?"
The mechanism works through three learned projections of the input x: queries Q = xW_Q, keys K = xW_K, and values V = xW_V.
Queries represent what each token is looking for. Keys represent what each token offers. Values carry the actual information to be passed along. Attention computes a weighted combination—scores from query-key similarity are converted to probabilities (via softmax, a function that normalizes scores so they sum to 1), then used to mix the values:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where d_k is the key dimension; dividing by √d_k keeps the similarity scores in a range where softmax remains well-behaved.
This gives Transformers their expressive power—any token can attend to any other—but at a cost that grows quadratically with sequence length.
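To ground the formula, here is a minimal sketch of single-head scaled dot-product attention in PyTorch; the dimensions and names are illustrative, not taken from any particular model.

```python
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over a sequence x.

    x: (seq_len, d_model) input token embeddings
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices
    """
    Q, K, V = x @ W_q, x @ W_k, x @ W_v      # project input into query/key/value spaces
    scores = Q @ K.T / K.shape[-1] ** 0.5    # every token scores every other token
    weights = F.softmax(scores, dim=-1)      # normalize scores into attention weights
    return weights @ V                       # mix values by attention weight

# Example: 10 tokens, 64-dim embeddings, 32-dim head
x = torch.randn(10, 64)
W_q, W_k, W_v = (torch.randn(64, 32) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)       # (10, 32)
```

The quadratic cost is visible in the `scores` matrix: it has one entry per pair of tokens, so doubling the sequence length quadruples the work.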
After attention, each token passes through an MLP (multi-layer perceptron)—a small neural network of stacked layers that transform data through simple mathematical operations. From the Nested Learning perspective, these layers act as persistent memory: patterns learned during training that never change once the model is deployed.
Attention serves as short-term memory: precise but fleeting, limited to what fits in the current context window. The MLP layers serve as long-term memory: stable but completely frozen after training. There is no mechanism between these two extremes—nothing for ongoing learning, selective encoding, or gradual consolidation of new experiences.
Linear transformations map input tokens into query, key, and value spaces. Multi-head attention uses several parallel sets for diverse attention patterns—like reading a sentence for grammar, meaning, and context simultaneously.
Softmax converts raw similarity scores into a probability distribution. Each token effectively "chooses" how much to attend to each other token, enabling flexible long-range dependencies.
Per-token neural network layers act as static knowledge stores. They encode everything the model learned during training—but cannot learn anything new once deployed.
A neural long-term memory module that selectively encodes information during deployment using surprise-gated updates.
Titans introduces a neural memory module M that updates during inference—that is, while the model is actually being used, not just during training. The core idea draws on a familiar psychological principle: events that violate expectations—that are surprising—are more memorable.
Surprise is measured by an error signal: how much did the memory's prediction differ from what actually arrived? Formally, this is the gradient (the direction and magnitude of the prediction error) of an associative memory loss:

ℓ(M; x_t) = ‖M(k_t) − v_t‖², with k_t = x_t W_K and v_t = x_t W_V
In plain terms: the memory tries to predict the value associated with the current input. When it fails badly, the error is large—the event is surprising—and the memory writes it in.
The memory update incorporates both momentary surprise (the current error signal) and past surprise (a running average called momentum that tracks recent trends), along with a built-in forgetting mechanism (weight decay):

S_t = η_t S_{t−1} − θ_t ∇ℓ(M_{t−1}; x_t)
M_t = (1 − α_t) M_{t−1} + S_t

Here S_t is the accumulated surprise, η_t sets how quickly past surprise fades, θ_t scales the momentary surprise, and α_t is the forgetting (weight-decay) rate.
This is equivalent to an optimization algorithm with momentum and weight decay—the memory literally learns by optimizing an associative objective in real time.
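A minimal sketch of this update, assuming a two-layer MLP memory and fixed scalar hyperparameters η, θ, α (the paper allows these to be input-dependent):

```python
import torch

class TitanMemory(torch.nn.Module):
    """Toy surprise-gated associative memory, updated at inference time."""
    def __init__(self, dim, eta=0.9, theta=0.1, alpha=0.01):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.SiLU(), torch.nn.Linear(dim, dim))
        self.eta, self.theta, self.alpha = eta, theta, alpha
        # Running "past surprise" (momentum), one buffer per parameter
        self.momentum = [torch.zeros_like(p) for p in self.mlp.parameters()]

    def update(self, k, v):
        """Write (k, v) into memory in proportion to how surprising v is."""
        loss = ((self.mlp(k) - v) ** 2).sum()          # associative loss: prediction error
        grads = torch.autograd.grad(loss, self.mlp.parameters())
        with torch.no_grad():
            for p, m, g in zip(self.mlp.parameters(), self.momentum, grads):
                m.mul_(self.eta).sub_(self.theta * g)  # S_t = η S_{t-1} − θ ∇ℓ
                p.mul_(1 - self.alpha).add_(m)         # M_t = (1 − α) M_{t-1} + S_t
```

When the memory already predicts v from k, the gradient is near zero and almost nothing is written; a badly wrong prediction produces a large write.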
Unlike simpler models that compress history into a single matrix, Titans uses an MLP (a small neural network) with two or more layers as its memory. This "deep" memory can capture nonlinear relationships—complex patterns that depend on combinations of features—making it strictly more expressive than shallow alternatives.
Memory as Context (MAC): Memory output is concatenated with the current input, then processed by standard attention. Attention decides what from long-term memory is relevant to the current moment.
Memory as Gate (MAG): Sliding-window attention and neural memory operate in parallel, combined through a learned gate that controls how much each contributes (a sketch of this gating follows below). Attention provides short-term precision; memory handles long-range context.
Memory as Layer (MAL): Neural memory processes input before attention sees it, compressing past context into a form attention can use more effectively. The most common hybrid design.
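As an illustration of the gated variant, here is a sketch in which a learned gate mixes the attention and memory streams; the class name and wiring are assumptions for illustration:

```python
import torch

class GatedHybrid(torch.nn.Module):
    """Combine sliding-window attention output with neural memory output via a learned gate."""
    def __init__(self, dim):
        super().__init__()
        self.gate = torch.nn.Linear(2 * dim, dim)

    def forward(self, attn_out, mem_out):
        g = torch.sigmoid(self.gate(torch.cat([attn_out, mem_out], dim=-1)))
        return g * attn_out + (1 - g) * mem_out  # gate sets short- vs long-term contribution
```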
All components of a neural network—attention, the training algorithm, and model parameters—are associative memories operating at different update frequencies.
Nested Learning reveals a unifying principle: every component of a neural network is an associative memory—a system that maps inputs (keys) to outputs (values), compressing context into its parameters. This mirrors the concept of associative memory studied in psychology, but applied to every level of the architecture. What differentiates components is primarily their update frequency: attention state changes with every token, optimizer state changes with every training step, and pretrained weights, once deployed, do not change at all.
The training process itself is an associative memory. When a neural network learns through gradient descent—repeatedly adjusting parameters in the direction that reduces prediction errors—the momentum term (a running average of recent error signals) acts as a memory that compresses gradient history. With a small modification, this "optimizer memory" can be made deeper and more expressive, using its own small neural network to track patterns in how the model should be adjusting.
HOPE replaces the single frozen MLP of a standard transformer with a chain of MLP blocks, each updating at a different frequency. This creates a continuum of memory timescales: some layers adapt rapidly (absorbing recent patterns), while others change slowly (encoding stable regularities).
Information persists across a spectrum rather than collapsing into a binary of "instant" versus "permanent"—each layer encodes knowledge at its characteristic timescale.
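A minimal sketch of such a continuum, assuming a chain of residual MLP blocks where block l is only permitted to apply its accumulated updates every periods[l] steps; the periods and sizes shown are illustrative:

```python
import torch

class ContinuumMemory(torch.nn.Module):
    """Chain of MLP blocks, each updated at its own frequency during ongoing learning."""
    def __init__(self, dim, periods=(1, 16, 256)):
        super().__init__()
        self.blocks = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.SiLU(),
                                torch.nn.Linear(dim, dim))
            for _ in periods)
        self.periods = periods   # block 0 adapts every step; block 2 every 256 steps

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)     # residual chain of memory blocks
        return x

    def step(self, optimizers, t):
        """Apply accumulated gradients only to blocks whose period divides step t."""
        for period, opt in zip(self.periods, optimizers):
            if t % period == 0:
                opt.step()
                opt.zero_grad()
```

Fast blocks track the current session; slow blocks only move once evidence has accumulated over many steps.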
Replacing the simple running average with a small neural network yields "Deep Momentum"—the training algorithm itself gains a learned, nonlinear memory for tracking how the model should adjust, enabling richer optimization dynamics.
Using error-corrective updates (reminiscent of the Rescorla-Wagner learning rule from psychology) for the memory allows the system to selectively erase outdated associations before writing new ones, managing limited capacity more effectively.
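Both ideas can be sketched briefly. The first block below is a toy "deep momentum" in which a small network replaces the scalar running average (in the full method that network is itself meta-learned; here it only shows the data flow). The second is a delta-rule write for a linear memory, the error-corrective update just described. Names and dimensions are illustrative:

```python
import torch

class DeepMomentum(torch.nn.Module):
    """Toy 'deep momentum': an MLP stands in for the scalar momentum average."""
    def __init__(self, dim):
        super().__init__()
        # In the full method this network is learned in an outer loop;
        # here it is random, purely to show where it sits in the update.
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.Tanh(), torch.nn.Linear(dim, dim))

    def update(self, param, grad, lr=1e-3):
        """Apply a nonlinear transform of the gradient instead of the raw gradient.

        param, grad: flat (dim,) tensors for a single parameter block.
        """
        with torch.no_grad():
            param.sub_(lr * self.net(grad))

def delta_rule_write(M, k, v, lr=0.5):
    """Error-corrective write: move the memory's prediction for k toward v.

    M: (dim, dim) linear memory; k, v: (dim,) vectors; k is assumed unit-norm.
    """
    error = v - M @ k                        # Rescorla-Wagner style prediction error
    return M + lr * torch.outer(error, k)    # write only the unpredicted part of v
```

Because the delta rule writes only the prediction error, a key whose value is already stored triggers almost no update, which is what frees capacity for genuinely new associations.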
HOPE combines a self-referential Titans-based sequence model with the Continuum Memory System, producing an architecture that can learn to modify its own learning process at test time—a form of meta-learning.
Learning behaviors by practicing inside a learned simulation of the world—training entirely in imagination.
Dreamer learns a world model: an internal simulation of how the environment behaves and changes in response to actions. Rather than learning purely from trial and error in the real world, the agent can "imagine" what would happen and learn from those imagined experiences. The model operates on latent representations—compact internal codes that capture the essential features of the environment: an encoder compresses observations into latent states, a dynamics model predicts how those states evolve under actions, and prediction heads estimate rewards and whether the episode continues.
The actor (decision-maker) and critic (evaluator) learn entirely from imagined experience. Starting from real states, the agent "dreams" forward using the dynamics predictor, collecting predicted rewards and refining its policy without any actual environment interaction. DreamerV3 was the first algorithm to collect diamonds in Minecraft from scratch, without human demonstrations—purely through imagination.
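A hedged sketch of that imagination loop, with every component name a placeholder for whatever world-model implementation is used:

```python
def imagine_rollout(dynamics, reward_head, actor, start_latent, horizon=15):
    """Roll the world model forward in latent space, with no environment interaction."""
    latent, trajectory = start_latent, []
    for _ in range(horizon):
        action = actor(latent)                    # policy acts on the imagined state
        latent = dynamics(latent, action)         # world model predicts the next latent
        trajectory.append((latent, reward_head(latent)))
    return trajectory                             # used to compute returns for actor-critic
```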
Dreamer employs mathematical techniques for stable learning: predictions are made in a compressed scale that handles extreme values, returns are normalized relative to recent experience, and the world model is prevented from collapsing into trivially simple predictions. A single set of settings works across over 150 diverse tasks—from Atari games to robotics.
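As one concrete instance, the compressed prediction scale in DreamerV3 is the symlog transform, paired with its inverse symexp; it squashes large magnitudes while staying roughly linear near zero:

```python
import torch

def symlog(x):
    """Compress extreme values: sign(x) * log(1 + |x|)."""
    return torch.sign(x) * torch.log1p(torch.abs(x))

def symexp(x):
    """Inverse transform: sign(x) * (exp(|x|) - 1)."""
    return torch.sign(x) * torch.expm1(torch.abs(x))
```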
DreamerV4 replaces the recurrent world model with an efficient transformer architecture, enabling world models with billions of parameters that can learn entirely from offline data.
DreamerV4 is the first agent to obtain diamonds in Minecraft purely from pre-recorded data, without any environment interaction.
Four frameworks developed independently converge on a shared architectural principle: cognition is the coordinated operation of memory processes distributed across timescales.
Conceptual Vocabulary
Buckner's Domain General Modular Architecture draws on faculty psychology—the philosophical tradition that the mind consists of distinct capacities or "faculties"—to argue that cognition is organized into functional subsystems operating at different retention horizons. Provides the interpretive lens connecting mechanisms to a coherent story about faculties and timescales.
Offline Consolidation
The world model provides counterfactual rollouts and a natural locus for "sleep" phases. Imagination training is the computational analogue of an empiricist "mental laboratory"—recombining learned regularities to evaluate possible futures and consolidate experience into policy.
Selective Encoding
Surprise-gated memory provides a principled write signal and capacity management through momentum and forgetting. Determines what experiences are promoted into longer-lived stores—the missing selection mechanism in current "long context" approaches.
Multi-Frequency Substrate
The Continuum Memory System provides a spectrum of update rates so information can persist and stabilize without requiring full retraining or remaining trapped in fleeting attention. The temporal substrate that connects fast and frozen memory.
Systems with only fast contextual state (attention) and frozen knowledge (pretrained weights) will systematically struggle with experience accumulation. Robust experiential learning requires intermediate memory horizons.
Effective memory relies on well-implemented selection (what is written), stabilization (what persists), transformation (what becomes reusable), and controlled reuse (how stored structure guides action). These must interact and modulate one another.
The most successful proposals do not hard-code domain concepts; they improve the general machinery of retaining and consolidating experience. This aligns with an empiricist commitment to domain-general architectural support for learning.
A constructive test of whether coordinating memory operations across timescales yields stable post-deployment learning in an embodied agent.
Melinoe assigns each framework a clear role within an integrated architecture that coordinates three temporal operations:
Sliding-window attention handles immediate perception and action selection within the current context. This is the agent's "working memory"—precise but ephemeral.
Titans' surprise-gated neural memory writes only prediction-violating experiences into persistent storage. Momentum tracks significance over time; weight decay manages capacity.
Periodic sleep phases use Dreamer's world model to replay and recombine experiences. The Continuum Memory System stabilizes information across multiple frequency bands, from session-level to task-level knowledge.
The architecture is evaluated on a Rock-a-Stack toy in a virtual reality environment—recreating toddler-like learning conditions with no pretraining, only hands-on play. A human proctor specifies success criteria, then the agent learns episodically: interacting, retaining salient experience, and periodically entering sleep-like consolidation phases.
The key prediction is transfer after learning: consolidated representations should support faster, lower-error performance on novel configurations—genuine consolidation, not mere context extension.
Agent interacts with the environment. Attention processes immediate context. Surprising events are written to neural memory via surprise signals.
Agent disconnects from environment. World model replays and imagines trajectories. CMS layers absorb stabilized knowledge. Policy improves from dreamed experience.
Consolidated memory enables generalization to novel configurations. The agent performs better on new tasks without catastrophic drift or full retraining.
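Putting the three phases together, a high-level sketch of the intended loop; every method name here is a placeholder, not an implemented API:

```python
def melinoe_episode(env, agent, steps=1000, sleep_every=200):
    """Alternate wake-phase interaction with periodic sleep-phase consolidation."""
    obs = env.reset()
    for t in range(steps):
        # Wake: attention handles the current context; act and observe
        action = agent.act(obs)
        next_obs, reward = env.step(action)
        # Selective encoding: only prediction-violating events are written
        surprise = agent.memory.prediction_error(obs, action, next_obs)
        if surprise > agent.memory.threshold:
            agent.memory.write(obs, action, next_obs)
        agent.world_model.observe(obs, action, next_obs, reward)
        obs = next_obs
        # Sleep: replay and imagine; CMS layers absorb stabilized knowledge
        if (t + 1) % sleep_every == 0:
            agent.sleep()
```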
Without the compute to train a full model from scratch, a hybrid approach tests the core mechanisms by surgically inserting HOPE modules into an existing pretrained model.
Training a Transformer from scratch requires enormous computational resources. A full implementation of the Melinoe architecture would replace the standard MLP and attention blocks entirely with HOPE modules, training everything end-to-end.
Without access to that scale of compute, a different strategy is needed: take an existing pretrained model and modify it to test whether the proposed memory mechanisms actually work.
The proof of concept uses TinyLlama (a 1.1 billion parameter language model) as a frozen backbone. Its original weights are completely locked—they never change. Instead, small HOPE adapter modules are inserted at three strategic layers (5, 11, and 17 out of 22 total), positioned to intercept information at early, middle, and late processing stages.
Each adapter contains two components working together: a Titans-style surprise-gated memory that updates at test time, and a Continuum Memory System block whose layers update at different frequencies.
The adapter output is added to the backbone's hidden state with a learned scaling factor, initially set very small (0.01) so the adapters must gradually earn influence over the model's predictions.
When the model makes a prediction, the error signal (how wrong was the prediction?) flows backward through the network. At each adapter location, this error signal is captured and used as a teach signal for the Titan memory—telling it what information should be stored or updated. This mirrors how surprise drives memory encoding in the Titans framework.
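A sketch of this insertion pattern, assuming a loaded Llama-style model object named backbone; the HopeAdapter class, the hook wiring, and the hidden size are illustrations of the approach rather than the exact implementation:

```python
import torch

class HopeAdapter(torch.nn.Module):
    """Adapter beside a frozen layer; a small learned scale lets it earn influence."""
    def __init__(self, dim):
        super().__init__()
        self.memory = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.SiLU(), torch.nn.Linear(dim, dim))
        self.scale = torch.nn.Parameter(torch.tensor(0.01))  # starts nearly silent

    def forward(self, hidden):
        return hidden + self.scale * self.memory(hidden)      # residual correction

# Freeze the backbone entirely; only adapter parameters receive gradients
for p in backbone.parameters():
    p.requires_grad_(False)

adapters = {i: HopeAdapter(2048) for i in (5, 11, 17)}  # TinyLlama hidden size is 2048

def make_hook(adapter):
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = adapter(h)
        return (h,) + output[1:] if isinstance(output, tuple) else h
    return hook

for i, adapter in adapters.items():
    backbone.model.layers[i].register_forward_hook(make_hook(adapter))  # Llama-style layout
```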
The needle-in-a-haystack (NIAH) test buries a specific piece of information (the "needle") at a controlled depth within a long stretch of irrelevant text (the "haystack"). The model must retrieve the needle when asked.
This tests selective retrieval—the same cognitive capacity Titans' surprise-gated memory is designed to support. By varying context length (how much hay?) and depth (where is the needle?), the test maps exactly where memory begins to fail.
In the passkey retrieval test, the model is shown a random 6-digit passkey within a long context, then asked to reproduce it exactly. Unlike NIAH, there is no semantic content to rely on—the passkey is arbitrary digits.
This tests pure memorization capacity—whether the memory system can faithfully store and retrieve precise information that has no prior association. A strict test of the memory's fidelity, not its ability to pattern-match.
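For concreteness, a minimal passkey prompt builder along these lines; the filler text and phrasing are invented for illustration:

```python
import random

def build_passkey_prompt(n_sentences=256, depth=0.5):
    """Bury a random 6-digit passkey at a relative depth inside filler text."""
    passkey = f"{random.randrange(10**6):06d}"
    sentences = ["The grass is green", "The sky is blue"] * n_sentences
    insert_at = int(len(sentences) * depth)          # 0.0 = start, 1.0 = end
    sentences.insert(insert_at, f"The passkey is {passkey}. Remember it")
    return ". ".join(sentences) + ". What is the passkey?", passkey

prompt, answer = build_passkey_prompt(depth=0.0)     # needle at the very beginning
```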
Two iterations of the hybrid adapter demonstrate measurable improvements in memory-dependent tasks, validating the core mechanisms even under severe compute constraints.
The first hybrid adapter was evaluated on NIAH across three context lengths (512, 1024, 2048 tokens) and three insertion depths (beginning, middle, end). Four conditions were compared: the unmodified base model, the frozen backbone with CMS-only adapters (no Titan memory), HOPE without test-time memorization, and HOPE with active memory updates.
At 2048 tokens with the needle at the midpoint, HOPE achieves 100% retrieval accuracy where the unmodified base model manages only 60%. CMS alone reaches 80%—the Titan memory component accounts for the remaining gap.
At 2048 tokens with the needle at the very beginning (depth 0.0), all models struggle. The base model drops to 0%, and even HOPE only reaches 20%. This suggests the adapter cannot yet recover information that has been entirely displaced from the backbone's attention window.
At 512 and 1024 tokens, all conditions—including the base model—achieve near-perfect accuracy. These lengths fall within TinyLlama's trained context window, so the backbone handles them natively.
Version 1 demonstrated that the memory mechanism works within the backbone's native context window, but collapsed at longer sequences. Version 2 targets this limitation through several changes:
The backbone's positional encoding (RoPE) was designed for sequences up to 2048 tokens. By applying scaling and interpolation to these encodings, the backbone can process longer sequences without the position-dependent collapse seen in v1. This is the single most impactful change—without it, the backbone's attention mechanism produces meaningless outputs beyond its trained window.
In v1, the surprise gate parameters (α, θ, η) were fixed constants. In v2, the surprise magnitude itself serves as an adaptive gate—when the memory's prediction error is large, the update is proportionally stronger. This is more faithful to the Titans paper's intent: surprise intensity naturally modulates how aggressively the memory writes, without requiring a separate learned gate network.
Learnable prefix tokens are added to stabilize the model's initial attention behavior and provide a fixed reference point for retrieval. These function like a "scratch pad" that the model learns to use for storing recurring patterns.
The teach signal is now projected through learned key and value mappings before updating the Titan memory, better matching the associative memory formulation from the Titans paper. This ensures the memory stores information in a format optimized for later retrieval.
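Two of these changes are straightforward to sketch. The first applies position interpolation to rotary embeddings by compressing positions back into the trained range; the second replaces the fixed gate constants with a write strength proportional to surprise. Both are illustrative sketches under assumed names, not the exact implementation:

```python
import torch

def rope_frequencies(positions, dim, base=10000.0, trained_len=2048):
    """Rotary embedding angles with position interpolation beyond the trained window."""
    scale = max(1.0, (positions.max().item() + 1) / trained_len)  # e.g. 2.0 at 4096 tokens
    positions = positions / scale                    # compress positions into trained range
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2) / dim))
    angles = positions[:, None] * inv_freq[None, :]  # (seq_len, dim/2)
    return torch.cos(angles), torch.sin(angles)

def adaptive_write_strength(pred, target, max_strength=1.0):
    """v2's adaptive gate: scale memory writes by the surprise magnitude."""
    surprise = torch.norm(target - pred)             # prediction-error magnitude
    return max_strength * torch.tanh(surprise)       # larger errors drive stronger writes
```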
At 3072 tokens, where v1 collapses to 15% (worse than the base model's 20%), v2 achieves 82% accuracy. At 4096 tokens—double the backbone's trained context window—v2 reaches 64% where v1 scored 0%. The position extrapolation fix unlocked the adapter's ability to operate beyond the backbone's native range.
Passkey retrieval at 4096 tokens shows the same pattern: v2 achieves 62% exact reproduction where both the base model and v1 score 0%. At 2048 tokens, v2 matches CMS performance at 98%—near-perfect memorization within the trained window.
The hybrid adapter demonstrates that surprise-gated memory and multi-frequency update schedules produce measurable improvements in retrieval tasks, even when grafted onto a frozen backbone that was never designed for them. But this approach faces fundamental constraints: the backbone's representations were never shaped to exploit an external memory, the adapters touch only three of its twenty-two layers, and the attention mechanism itself remains unmodified.
These results are a lower bound. Every improvement the hybrid achieves despite its constraints suggests a larger effect from first-principles integration.