MUXI Memory Systems
Comprehensive guidelines for implementing memory systems in AI applications, covering buffer memory (short-term), long-term storage, user-aware memory partitioning, vector operations, and retrieval strategies.
What This Skill Does
This skill provides detailed best practices for building memory systems that enable AI agents to store, retrieve, and utilize information across conversations. It covers three memory layers (buffer, long-term, and user-aware), vector operations, retrieval strategies, and performance optimization.
Instructions
1. Buffer Memory (Short-Term Memory)
Implement efficient in-memory storage for active context:
- **Use FAISS for similarity search** — The Facebook AI Similarity Search library provides efficient nearest-neighbor retrieval
- **Normalize vectors properly** — Apply L2 normalization when using cosine similarity
- **Chunk documents appropriately** — Split long documents into semantically coherent chunks (e.g., 512-1024 tokens)
- **Implement memory pruning** — Remove old or irrelevant memories to stay within context limits (see the sketch after this list)
- **Serialize vectors for persistence** — Save FAISS indexes to disk for recovery
- **Choose appropriate similarity metrics** — Cosine similarity for normalized vectors, dot product for raw embeddings
- **Optimize index parameters** — Use IndexFlatL2 for small datasets, IndexIVFFlat for larger collections
- **Handle retrieval failures gracefully** — Provide fallbacks when similarity search fails
- **Store metadata alongside vectors** — Track timestamps, sources, and user IDs for filtering
- **Implement memory summarization** — Condense old memories when approaching context limits
- **Support memory updates** — Handle scenarios where existing memories need modification
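A minimal sketch of buffer pruning under a token budget; `count_tokens` is a hypothetical stand-in for the target model's tokenizer:

```python
from collections import deque

def count_tokens(text: str) -> int:
    """Hypothetical token counter; swap in the target model's tokenizer."""
    return len(text.split())  # crude word-count approximation

class BufferMemory:
    """FIFO short-term memory that prunes the oldest entries past a token budget."""

    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.entries = deque()  # (text, token_count) pairs, oldest first
        self.total_tokens = 0

    def add(self, text: str) -> None:
        tokens = count_tokens(text)
        self.entries.append((text, tokens))
        self.total_tokens += tokens
        # Prune oldest memories until we are back under the budget
        while self.total_tokens > self.max_tokens and len(self.entries) > 1:
            _, dropped = self.entries.popleft()
            self.total_tokens -= dropped

    def context(self) -> str:
        return "\n".join(text for text, _ in self.entries)
```

A production version might summarize pruned entries into long-term storage instead of discarding them, per the summarization guideline above.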
2. Long-Term Memory (Persistent Storage)
Use PostgreSQL with the pgvector extension for durable memory storage:
- **Design a proper schema** — Create tables with vector columns, metadata fields, and proper constraints
- **Add vector indexes** — Use HNSW or IVFFlat indexes on vector columns for fast retrieval
- **Optimize vector queries** — Use the cosine distance operator (`<=>`) or L2 distance (`<->`)
- **Handle large data volumes** — Implement partitioning for tables exceeding millions of rows
- **Implement cleanup/archiving** — Periodically remove or archive old memories based on retention policies
- **Support metadata filtering** — Combine vector similarity with WHERE clauses (e.g., filter by user_id or timestamp)
- **Use transactions** — Wrap related operations (insert + update + delete) in atomic transactions
- **Implement connection pooling** — Use pgBouncer or application-level pooling for concurrency (see the sketch after this list)
- **Handle concurrent access** — Use row-level locking for memory updates
- **Support import/export** — Provide utilities to back up and restore memory collections
- **Monitor database performance** — Track query times, index usage, and disk space
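A minimal sketch of pooled, transactional inserts using psycopg 3's `psycopg_pool`; the connection string is a placeholder, the table matches the pgvector schema under Example Usage, and passing the embedding as a text literal assumes pgvector's string format (the pgvector-python adapter is the cleaner alternative):

```python
import uuid

from psycopg_pool import ConnectionPool

# Application-level pool; the connection string is a placeholder
pool = ConnectionPool("postgresql://app:secret@localhost/muxi", min_size=1, max_size=10)

def store_memory(user_id: str, content: str, embedding: list[float]) -> str:
    """Insert one memory atomically; the pooled connection commits on clean exit."""
    memory_id = str(uuid.uuid4())
    with pool.connection() as conn:
        conn.execute(
            "INSERT INTO memories (id, user_id, content, embedding) "
            "VALUES (%s, %s, %s, %s::vector)",
            # str(list) yields '[0.1, 0.2, ...]', which pgvector parses
            (memory_id, user_id, content, str(embedding)),
        )
    return memory_id
```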
3. Memobase (User-Aware Memory)
Partition and isolate memories by user for multi-tenant systems:
- **Partition by user_id** — Always include user_id in queries to isolate user memories (see the sketch after this list)
- **Implement access control** — Enforce permissions at both the database and application layers
- **Support memory sharing** — Allow users to share specific memories with collaborators
- **Handle user deletion** — Cascade-delete or anonymize memories when users leave
- **Support memory migration** — Provide tools to transfer memories between accounts
- **Track memory usage** — Implement analytics to monitor per-user memory consumption
- **Apply retention policies** — Allow different expiration rules per user or plan tier
- **Prioritize memories** — Rank memories by relevance, recency, or user-defined importance
- **Resolve conflicts** — Handle cases where similar memories exist with different content
- **Support admin search** — Allow administrators to search across all users for support/debugging
- **Back up per user** — Implement user-scoped backup and restore mechanisms
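A minimal sketch of user-partitioned retrieval against the same schema; making `user_id` a required argument means cross-tenant reads cannot be expressed through this interface:

```python
SEARCH_SQL = """
    SELECT id, content, 1 - (embedding <=> %s::vector) AS similarity
    FROM memories
    WHERE user_id = %s              -- hard partition: only this user's rows
    ORDER BY embedding <=> %s::vector
    LIMIT %s
"""

def search_user_memories(conn, user_id: str, query_embedding: list[float], k: int = 5):
    """user_id is required, never optional, so callers cannot query across tenants."""
    params = (str(query_embedding), user_id, str(query_embedding), k)
    return conn.execute(SEARCH_SQL, params).fetchall()
```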
4. Vector Operations
Optimize embedding generation and vector handling:
- **Choose appropriate dimensions** — Match the embedding model's output (e.g., 384 for MiniLM, 1536 for text-embedding-ada-002)
- **Normalize vectors** — Apply L2 normalization before cosine similarity comparisons (see the sketch after this list)
- **Use dimensionality reduction** — Apply PCA or UMAP to very high-dimensional vectors if needed
- **Benchmark similarity approaches** — Test cosine vs. dot product vs. Euclidean distance for your use case
- **Optimize storage format** — Use float16 instead of float32 when the precision loss is acceptable
- **Cache frequent operations** — Store precomputed vectors for frequently accessed items
- **Batch vector operations** — Process multiple embeddings in one call to reduce overhead
- **Preprocess inputs** — Tokenize, truncate, and normalize text before embedding
- **Handle OOV tokens** — Gracefully handle unknown words in embedding models
- **Support multiple models** — Allow swapping between embedding providers (OpenAI, Cohere, local models)
- **Implement fallbacks** — Retry or use alternative models when embedding generation fails
- **Document vector format** — Clearly specify the dimensionality, normalization, and model used
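A minimal numpy sketch of batched L2 normalization and float16 downcasting; the zero-vector guard is a defensive assumption worth keeping in a real pipeline:

```python
import numpy as np

def normalize_batch(vectors: np.ndarray) -> np.ndarray:
    """L2-normalize an (n, dim) batch so dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard: leave all-zero vectors unchanged
    return vectors / norms

embeddings = np.random.random((1000, 384)).astype(np.float32)
normalized = normalize_batch(embeddings)

# Halve storage when the precision loss is acceptable; upcast before math
compact = normalized.astype(np.float16)
restored = compact.astype(np.float32)
```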
5. Memory Retrieval
Implement intelligent search strategies:
- **Score relevance** — Use similarity scores to rank retrieved memories
- **Hybrid search** — Combine vector similarity with keyword (BM25) or metadata filters
- **Rank results** — Apply ranking algorithms (e.g., reciprocal rank fusion for hybrid search)
- **Context-aware retrieval** — Use conversation history to refine memory queries
- **Filter by metadata** — Support filtering by timestamp, user_id, source, or custom tags
- **Optimize top-k retrieval** — Use approximate nearest neighbor (ANN) search for large collections
- **Deduplicate memories** — Remove or merge similar memories in results
- **Time-weighted retrieval** — Boost recent memories when appropriate (see the sketch after this list)
- **Implement recency bias** — Decay scores for older memories in time-sensitive applications
- **Handle errors gracefully** — Return empty results or cached fallbacks on retrieval failures
- **Paginate large results** — Support cursor-based or offset pagination for large memory sets
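A minimal sketch of time-weighted re-ranking using exponential decay; the one-week half-life and 0.3 recency weight are illustrative assumptions to tune per application:

```python
import math
import time

HALF_LIFE_SECONDS = 7 * 24 * 3600  # assumed one-week half-life

def time_weighted_score(similarity: float, created_at: float,
                        recency_weight: float = 0.3) -> float:
    """Blend similarity with exponential decay on memory age."""
    age = max(0.0, time.time() - created_at)
    recency = math.exp(-math.log(2) * age / HALF_LIFE_SECONDS)
    return (1 - recency_weight) * similarity + recency_weight * recency

def rerank(candidates):
    """Re-rank (similarity, created_at, memory) tuples from a first-pass search."""
    return sorted(candidates,
                  key=lambda c: time_weighted_score(c[0], c[1]),
                  reverse=True)
```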
6. Memory Integration (Using Memories in Prompts)
Incorporate retrieved memories into AI prompts:
- **Prioritize memories** — Rank by relevance when context limits prevent using all memories (see the sketch after this list)
- **Summarize when needed** — Condense old or lengthy memories to save tokens
- **Support integration strategies** — Use prepend, append, or interleaved memory injection
- **Attribute sources** — Cite memory sources in responses for transparency
- **Handle conflicts** — Resolve or flag contradictory information from different memories
- **Weight by relevance** — Apply importance scores to memories during prompt construction
- **Stream integration** — Incrementally add memories during generation where supported
- **Convert formats** — Adapt memory format for different LLM providers (OpenAI, Anthropic, etc.)
- **Collect during conversations** — Extract and store new memories from user inputs
- **Implement feedback loops** — Learn which memories are useful based on retrieval patterns
- **Document patterns** — Provide examples of effective memory integration
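A minimal sketch of relevance-ordered prompt injection under a token budget, assuming memories arrive as (score, text) pairs and reusing a crude word-count token estimate:

```python
def build_memory_block(memories, token_budget=1000):
    """Keep the highest-scoring memories that fit within the token budget."""
    lines, used = [], 0
    for score, text in sorted(memories, key=lambda m: -m[0]):
        cost = len(text.split())  # crude estimate; use a real tokenizer
        if used + cost > token_budget:
            continue  # skip entries that would overflow the budget
        lines.append(f"- {text}")
        used += cost
    return "Relevant memories:\n" + "\n".join(lines) if lines else ""

# Usage: inject the block ahead of the instructions (the prepend strategy above)
memory_block = build_memory_block([
    (0.92, "User prefers concise answers."),
    (0.71, "User is building a Python service."),
])
prompt = "\n\n".join(p for p in [memory_block, "You are a helpful assistant."] if p)
```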
7. Performance and Scalability
Optimize memory systems for production workloads:
- **Optimize vector operations** — Use SIMD instructions, GPU acceleration, or optimized libraries (FAISS, hnswlib)
- **Implement caching** — Cache frequent queries, embeddings, and retrieval results (see the sketch after this list)
- **Batch operations** — Group inserts, updates, and retrievals to reduce overhead
- **Monitor performance** — Track latency, throughput, and error rates for memory operations
- **Shard large stores** — Partition memory collections by user, time, or hash for horizontal scaling
- **Optimize database queries** — Use EXPLAIN ANALYZE to identify slow queries and add indexes
- **Use connection pooling** — Reuse database connections to reduce overhead
- **Index frequently queried fields** — Add indexes on user_id, timestamp, and metadata columns
- **Benchmark realistic workloads** — Test with production-like data volumes and query patterns
- **Implement circuit breakers** — Prevent cascading failures when memory systems are unavailable
- **Support horizontal scaling** — Design for distributed memory systems with multiple nodes
- **Document performance characteristics** — Specify expected latency, throughput, and scaling limits
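A minimal sketch of embedding caching behind a bounded LRU; `embed` is a hypothetical stand-in for the real provider call:

```python
from functools import lru_cache

def embed(text: str) -> tuple[float, ...]:
    """Hypothetical provider call; replace with your embedding client."""
    raise NotImplementedError

@lru_cache(maxsize=10_000)
def _embed_cached(normalized: str) -> tuple[float, ...]:
    return embed(normalized)

def get_embedding(text: str) -> tuple[float, ...]:
    # Collapse whitespace so trivially different inputs share one cache entry
    return _embed_cached(" ".join(text.split()))
```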
Example Usage
Buffer Memory with FAISS
```python
import faiss
import numpy as np
# Create a FAISS index (exact L2 search; suitable for small collections)
dimension = 384
index = faiss.IndexFlatL2(dimension)

# Add L2-normalized vectors; L2 distance is then monotonic with cosine similarity
vectors = np.random.random((100, dimension)).astype('float32')
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
index.add(vectors)

# Search for the top-k most similar vectors
query = np.random.random((1, dimension)).astype('float32')
query = query / np.linalg.norm(query, axis=1, keepdims=True)
distances, indices = index.search(query, k=5)
```
Long-Term Memory with PostgreSQL + pgvector
```sql
-- Enable the pgvector extension (provides the VECTOR type)
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table with a vector column
CREATE TABLE memories (
id UUID PRIMARY KEY,
user_id UUID NOT NULL,
content TEXT NOT NULL,
embedding VECTOR(384),
created_at TIMESTAMP DEFAULT NOW()
);
-- Add HNSW index for fast similarity search
CREATE INDEX ON memories USING hnsw (embedding vector_cosine_ops);
-- Query similar memories with metadata filter
SELECT id, content, 1 - (embedding <=> '[0.1, 0.2, ...]') AS similarity
FROM memories
WHERE user_id = 'user-123'
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 5;
```
Hybrid Search (Vector + Keyword)
```python
# Vector search: FAISS returns (distances, indices); take IDs from the first query row
_, vector_ids = faiss_index.search(query_embedding, k=20)
vector_results = vector_ids[0].tolist()

# Keyword search; bm25_search is assumed to return ranked document IDs
keyword_results = bm25_search(query_text, k=20)

# Combine using reciprocal rank fusion (60 is the conventional smoothing constant)
combined_scores = {}
for rank, doc_id in enumerate(vector_results):
    combined_scores[doc_id] = combined_scores.get(doc_id, 0) + 1 / (rank + 60)
for rank, doc_id in enumerate(keyword_results):
    combined_scores[doc_id] = combined_scores.get(doc_id, 0) + 1 / (rank + 60)

# Return the top 5 document IDs by fused score
final_results = sorted(combined_scores.items(), key=lambda x: -x[1])[:5]
```
Constraints
- **Context limits** — Memory systems must respect LLM token limits (prioritize, summarize, or truncate)
- **Latency requirements** — Retrieval should complete in <100ms for interactive applications
- **Data privacy** — User memories must be isolated and access-controlled
- **Cost considerations** — Balance embedding API costs, database size, and storage expenses
- **Consistency** — Ensure memory updates are reflected in subsequent retrievals
- **Error handling** — Gracefully degrade when memory systems are unavailable
- **Scalability** — Design for growth from thousands to millions of memories per user
Notes
- **FAISS** is ideal for in-memory short-term storage; use **pgvector** for persistent long-term storage
- **Hybrid search** (vector + keyword) often outperforms pure vector search for factual queries
- **Normalize vectors** when using cosine similarity; use raw vectors for dot product similarity
- **Connection pooling** is critical for multi-tenant systems to avoid database overload
- **Metadata filtering** significantly reduces retrieval scope — always partition by user_id
- **Memory summarization** is essential for staying within LLM context limits in long conversations
- Monitor **database query performance** with EXPLAIN ANALYZE and adjust indexes as needed