00 The whole system, on one page
Tessera is a tenant-first, Kubernetes-native Postgres. The thesis is a single reversal: instead of carving one big shared Postgres into multiplexed subsystems, we clone a tiny isolated Postgres per tenant — a Cell — and share only the durable plumbing underneath it. Each tenant gets a real, single-writer Postgres process (one OrioleDB-style storage engine, sandboxed in gVisor — a lightweight kernel-emulation sandbox that intercepts the program's system calls so it can't touch the host directly); everyone shares one routing plane (Gateway), one control plane (Conductor), and one durable substrate (Journal → Pagestore → S3). The Cell is stateless compute that can be killed and replayed; the truth lives below it.
The central reversal
You do not carve subsystems out of one Postgres. You run one small Postgres per tenant and share only the durable substrate. Per-tenant isolation, arbitrary native extensions, scale-to-zero, and per-tenant serializability each independently force "one Postgres per tenant" — so we lean into it rather than fight it.
[Diagram: system overview. Client (psql / app) → Gateway (route + auth + hold) → Tenant Cells (stateless compute): Cell A and Cell B (OrioleDB-PG in gVisor), Cell C (parked / cold). Cells send WAL to the Journal (quorum WAL per tenant); Journal → Pagestore (getPage@LSN) → S3 object store (bottomless truth); Pagestore returns page@LSN to the Cell. Dashed control edges: Conductor (operator + Raft lease) sends the route table to the Gateway and the epoch lease plus wake/park to the Cells.]
Read the diagram in two passes. The solid edges are the data path. A client connects to the Gateway, which authenticates it and routes it to that tenant's Cell. The Cell commits writes as a single ordered WAL stream (the write-ahead log: an append-only journal of every change, written before the change is applied, so a crash can always be replayed) into the Journal, where a write is durable only once a quorum acks (a majority of replicas must confirm it before it counts, like requiring 2 of 3 signatures). Any page the Cell does not have warm in its local cache it reads from the Pagestore via getPage@LSN (ask for the contents of a page as of a specific point in the WAL stream; the LSN, or log sequence number, is that point's position), all ultimately backed by S3. The dashed edges are control. The Conductor hands the Gateway its routing table and hands each live Cell a time-bounded epoch lease (a time-limited, numbered grant of "you are the writer right now", like a relay baton that expires, so a stale holder can't keep writing) that makes it the one and only writer for that tenant. It also wakes cells on demand and parks idle ones. Control never touches a byte of tenant data.
A few of those headline facts deserve a half-sentence of honesty up front. "$0 compute when parked" is real because a Cell holds no durable local state (kill it and nothing is lost), but waking a cold tenant costs a wake latency (pod schedule + WAL replay), which the Gateway turns into a slow first query rather than an error by holding the client connection open, Knative-activator style (the request waits at the door while the service spins up from zero, instead of failing). "No vacuum" comes from the OrioleDB-style storage engine, which combines undo-log MVCC (multi-version concurrency control: instead of overwriting a row, each writer keeps the old version aside, so readers never block writers) with 64-bit transaction IDs. There is no freeze, no autovacuum, and no wraparound to manage: the maintenance subsystem doesn't get relocated, it ceases to exist, which is exactly what makes cold-parking free. "Serializable per tenant" is nearly free precisely because each tenant is a single-writer Postgres; the cost is the firm rule that cross-tenant atomic transactions are forbidden in v1.
How to read this document
Each chapter zooms into one box above. Decomposition argues why the cell is forced. The Cell dissects one OrioleDB-PG process. The planes (Gateway / Conductor / Journal / Pagestore) get a chapter each. Query life traces a request end-to-end; scale-to-zero covers park/wake; storage covers WAL → page → S3; consistency covers the epoch lease and SSI (serializable snapshot isolation — Postgres's strongest correctness mode, where concurrent transactions are guaranteed to behave as if they ran one at a time); extensions and security cover gVisor + the microVM tier (a real but tiny hardware-virtualized VM — stronger isolation than a container, used for untrusted code); k8s covers the operator and CRDs; then tradeoffs, roadmap, and risks close it out honestly.
01 Architecture — the whole system
Read it top to bottom as the data path: a client hits the always-on Gateway, which routes into one tenant's isolated Cell; the cell appends to the quorum Journal, the journal ships committed WAL to Pagestore, and Pagestore serves versioned pages back to the cell while backing everything to S3. Off to the side, the Conductor control plane hands out the single-writer epoch lease, places and parks cells, and pushes the route table — never on the request hot path. Solid arrows are the data path; dashed arrows are control.
[Diagram: architecture by plane. Gateway plane: Gateway (TLS plus SCRAM auth, tenant router, holds connection while waking, always-on stateless). Conductor plane: Conductor (k8s operator plus etcd, epoch lease and fence, place/park/wake, extension Catalog and signing). Tenant Cells plane: Cell A in a gVisor sandbox (OrioleDB Postgres as stateless compute, with an in-process signed extension), Cell B (warm, microVM tier), Cell C (parked, awaiting wake). Journal plane: Journal (quorum WAL, 3 replicas, majority commit). Storage plane: Pagestore (versioned pages, NVMe cache, serves getPage@LSN) and S3 (bottomless source of truth). Solid data edges: Client → Gateway → Postgres; Postgres → Journal (append WAL) and Journal → Postgres (quorum ack); Journal → Pagestore (ship committed WAL); Pagestore → Postgres (getPage at LSN); Pagestore ↔ S3 (persist and fetch). Dashed control edges: Conductor → Gateway (route table); Gateway → Conductor (wake signal on cold hit); Conductor → Cell A (epoch lease and place/park/wake); Conductor → Journal (bump epoch, fence stale writer); Conductor → extension (load signed extension).]
| Flow | Path type | What happens |
|---|---|---|
| Client → Gateway → Cell | solid · data | TLS + SCRAM auth, then the tenant router pins the connection to the owning cell and holds it open while a parked cell wakes. |
| Cell ↔ Journal | solid · data | The cell appends WAL; the journal returns a quorum ack once a majority of its 3 replicas commit — that ack is the durability point. |
| Journal → Pagestore | solid · data | Committed WAL ships forward; Pagestore materializes versioned pages so cells stay stateless compute. |
| Pagestore → Cell · Pagestore ↔ S3 | solid · data | Pagestore serves getPage@LSN from its NVMe cache and persists/fetches against S3, the bottomless source of truth. |
| Conductor → Gateway, Cell, Journal | dashed · control | Pushes the route table, issues the single-writer epoch lease and fences stale writers, and places / parks / wakes cells. |
| Gateway → Conductor · Conductor → extension | dashed · control | A cold/parked hit raises a wake signal; the Conductor also loads the signed extension from its Catalog into the cell. |
02 The core reversal: clone Postgres, don’t carve it
The seductive instinct when you want a "cloud-native Postgres" is to take the monolith apart — peel off vacuum here, the buffer manager there, the lock table somewhere else — and reassemble the pieces as microservices that scale independently. That instinct is mostly wrong, and understanding exactly why is the intellectual foundation of everything Tessera does. Postgres has a handful of subsystems that genuinely cut clean — and a dense core that does not cut at all, because those parts don’t live inside the instance: they are the instance.
Start with the parts that separate cleanly. Four of them already exist as standalone services in the wild, so this is not speculation:
Connection / auth proxy
TLS termination, SCRAM-SHA-256 authn, authz, routing. Pure request-edge work — no MVCC snapshot needed (MVCC, multi-version concurrency control: each writer makes a new copy of a row rather than overwriting, so readers never block writers; a snapshot is the consistent view one transaction sees). Already its own tier: PgBouncer, Supavisor, Neon-proxy. This becomes Tessera’s Gateway.
WAL durability
The write-ahead log (an append-only journal of every change, written before the change is applied to the data, so a crash can always be replayed) is an ordered byte stream. Quorum-committing it (a majority of replicas must agree before a write counts — like requiring 2 of 3 signatures) needs only the stream + an epoch fence (a generation number that locks out a previously-live writer so only one can win), not the executor. Neon’s safekeepers and Aurora’s quorum prove it. Becomes the Journal.
Page materialization / storage
"Given committed WAL, produce versioned pages and serve getPage@LSN" (fetch a page as of a given LSN — the log sequence number, a monotonically increasing position in the WAL, so the page is returned at an exact point in history). A consumer of the log, decoupled from any backend. Neon’s pageserver + OrioleDB’s S3 layer. Becomes the Pagestore.
Logical replication
Decoding WAL into row-level change events is a downstream tap on the same ordered stream — a clean reader, not a participant in the live transaction.
Now the dense core. The transaction manager, the snapshot/visibility horizon (the ProcArray — Postgres’s in-memory list of every live transaction — plus the clog, the commit-status log), the buffer manager, the lock manager, the postmaster process model, and the extension/hook mechanism are effectively inseparable. Not because nobody has tried, but because they all read and mutate the same shared-memory state in the same address space, under the same spinlocks and buffer pins. A backend deciding whether row X is visible consults the live ProcArray snapshot; that decision is meaningless across a network hop because the horizon it depends on moves microsecond to microsecond. Pull any one of these onto its own pod and you don’t get a microservice — you get a distributed consensus problem wrapped around what used to be a pointer dereference, on the hot path of every single query.
The reversal, stated plainly
You cannot carve one Postgres into per-tenant subsystems, because its core is a single shared-memory MVCC machine — the coupled parts don’t sit inside the instance, they constitute it. So you do the opposite: clone the whole small instance per tenant and share only the durable substrate beneath it (Journal + Pagestore + S3). The unit of scaling is the entire cell, not a subsystem of it.
The "vacuum as a microservice" myth
This is the trap worth naming explicitly, because it’s the one everybody reaches for first. Vacuum looks like background maintenance — surely it can be a worker pool on its own pods, sweeping dead tuples on a schedule (a dead tuple is the old copy of a row left behind by an MVCC update or delete, dead weight once no transaction can still see it)? It cannot. Vacuum’s correctness depends on the same shared snapshot horizon every live backend depends on: it may only reclaim a tuple version once it proves no running transaction can still see it — which means reading the live ProcArray in this instance. And to physically remove a tuple it must take the per-buffer cleanup lock — an exclusive pin no other backend can hold — page by page, inside this instance’s buffer pool. Both of those are in-process, shared-memory operations. Lift vacuum onto its own pod and it can neither compute a correct horizon nor acquire the pins; it would have to coordinate every reclamation with every backend over the wire. That isn’t a microservice — it’s the coupling, relocated and made slower.
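The horizon rule in the paragraph above can be sketched in a few lines. This is a simplified illustration, not Postgres's actual code: the names `TupleVersion`, `visibility_horizon`, and `is_reclaimable` are hypothetical, transaction IDs are compared as plain integers (no wraparound handling), and the deleting transaction is assumed to have committed.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TupleVersion:
    xmin: int            # ID of the transaction that created this version
    xmax: Optional[int]  # ID of the committed transaction that deleted or replaced it, if any


def visibility_horizon(live_snapshot_xmins: Iterable[int], next_xid: int) -> int:
    """Oldest transaction ID that some live snapshot may still need.

    With nothing running, the horizon is simply the next ID to be assigned.
    """
    return min(live_snapshot_xmins, default=next_xid)


def is_reclaimable(version: TupleVersion, horizon: int) -> bool:
    """A version is dead weight only if its deleter precedes every live snapshot."""
    return version.xmax is not None and version.xmax < horizon


# Two transactions are running; the older one took its snapshot at XID 105.
horizon = visibility_horizon([105, 110], next_xid=120)
old_row = TupleVersion(xmin=90, xmax=100)  # deleted before anything running began
new_row = TupleVersion(xmin=90, xmax=107)  # deleted after the oldest snapshot was taken
```

The point of the sketch is the first argument of `visibility_horizon`: it is the set of live snapshots in this instance, which is exactly the state a vacuum worker on another pod would have to fetch over the network and could never trust to be current.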
There are only two honest ways out. (a) Dissolve the cleanup into the storage tier: let the page materializer garbage-collect old versions as it compacts, which is roughly what a log-structured Pagestore does. Or (b) eliminate the need for it: swap heap MVCC for an undo-log storage engine (OrioleDB-style, with 64-bit XIDs) so dead versions go to an undo log that’s trimmed inline, with no dead-tuple sweep, no freeze, no autovacuum, and no transaction-ID wraparound. (Postgres numbers transactions with a counter that, when 32-bit, eventually runs out and rolls over, risking old rows suddenly looking "in the future"; wider 64-bit IDs make that overflow a non-event.) Tessera takes route (b). The maintenance subsystem isn’t relocated; it ceases to exist (see ch.06). It is precisely this elimination that makes cold-parking a cell truly free, since there’s no freeze-before-park debt to pay.
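A quick back-of-the-envelope calculation shows why wider IDs turn wraparound into a non-event. The transaction rate below is a hypothetical figure chosen for illustration, and the sketch conservatively assumes only half of each ID space is usable before old and new IDs become ambiguous (which is how 32-bit Postgres XIDs behave).

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def seconds_until_exhausted(usable_ids: int, tx_per_second: float) -> float:
    """Time to consume `usable_ids` transaction IDs at a steady rate."""
    return usable_ids / tx_per_second


rate = 1_000  # hypothetical sustained transactions per second for one tenant

# 32-bit counter: about 2**31 IDs before comparisons become ambiguous.
days_32 = seconds_until_exhausted(2**31, rate) / SECONDS_PER_DAY

# 64-bit counter: same assumption, half the space.
years_64 = seconds_until_exhausted(2**63, rate) / SECONDS_PER_YEAR

print(f"32-bit IDs last about {days_32:.0f} days at {rate} tx/s")
print(f"64-bit IDs last about {years_64:.1e} years at {rate} tx/s")
```

At that rate the 32-bit window closes in under a month unless something freezes old rows, while the 64-bit window outlasts any plausible deployment by many orders of magnitude.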
| Subsystem | Separability | What Tessera does with it |
|---|---|---|
| Connection / auth proxy | Clean | Extracted as the always-on stateless Gateway (TLS, SCRAM, routing, cold-wake hold) |
| WAL durability | Clean | Extracted as the Journal — quorum commit of one ordered LSN stream per cell, epoch-fenced |
| Query / executor layer | Clean | This is the stateless Tenant Cell — runs warm, recovers by WAL replay |
| Logical replication (WAL decode) | Clean | Downstream reader of the Journal stream; an optional per-tenant tap |
| Page materialization | Feasible | Pushed below the cell into the Pagestore (getPage@LSN; S3 = source of truth, NVMe = cache) |
| Transaction manager | Inseparable | Stays whole inside the cell — single-writer per tenant gives SSI serializability free (SSI = the strongest isolation level, where concurrent transactions behave as if run one-at-a-time; trivially true when only one writer exists) |
| MVCC snapshot / ProcArray / clog | Inseparable | Stays in the cell’s shared memory; the visibility horizon is the per-tenant boundary |
| Buffer manager | Inseparable | Stays in the cell; its NVMe-backed buffers are a cache over the Pagestore |
| Lock manager | Inseparable | Stays in the cell — lock tables are per-tenant, never cross-tenant (no global locks) |
| Postmaster / process model | Inseparable | Stays whole — one postmaster per cell, the cell is the unit we schedule/park |
| Autovacuum / vacuum | Inseparable | Eliminated — undo-log engine removes the subsystem entirely (ch.06) |
| Extensions / shared_preload_libraries | Inseparable | Stays in-process (it hooks core) — isolated instead via gVisor sandbox per cell (gVisor is a lightweight sandbox that intercepts a program’s system calls in user space, walling off untrusted extension code from the host kernel) (ch.03) |
Read the table top to bottom and the shape of Tessera falls out of it: the clean cuts become the shared substrate, the inseparable core gets cloned per tenant, and the one subsystem nobody can cleanly cut or usefully relocate — vacuum — gets designed out of existence. The diagram below contrasts the two mental models.
[Diagram: two mental models. Left, the monolith: one Postgres with TxMgr, ProcArray, buffers, locks, and vacuum, all in one address space. Right, N cells plus one shared substrate: Gateway (route + auth) → Tenant Cells A and B (OrioleDB-PG); each Cell → Journal (WAL quorum) and → Pagestore + S3 (pages and truth); Journal → Pagestore. A dashed arrow from the monolith to Tessera is labeled "clone, don't carve".]
The honest edge of this choice
Cloning per tenant means you pay an instance’s baseline overhead per tenant instead of amortizing one shared buffer pool across all of them — so the cells must be small, and scale-to-zero (cold-parking) is not a nicety but a requirement to keep idle cost near zero. That requirement is exactly what makes the vacuum-elimination decision load-bearing rather than cosmetic: parking is only free if there’s no freeze debt to settle first.
03 Anatomy of a Tenant Cell
A Cell is the atom of Tessera: one tenant, one OrioleDB-Postgres process, running inside a gVisor-sandboxed pod (gVisor is a security layer that wraps the process so its low-level OS requests never touch the real host kernel). It is real Postgres — it parses SQL, plans queries, holds the catalog, and loads the tenant's extensions in-process — but it is stateless compute. It owns no durable local state. Kill it, reschedule it, park it cold; it comes back by replaying its WAL (the write-ahead log — an append-only journal of every change, written before the change is applied, so a crash can always be replayed forward). Everything that makes a database "stick to a machine" has been delegated away.
What lives inside, and what does not
A warm Cell holds exactly the state Postgres needs to answer queries fast: the tenant's system catalog and relcache (the in-memory map of tables, indexes, types, and functions), a local page cache on NVMe, and the tenant's extensions loaded as .so objects directly into the backend's address space. That is the whole footprint. None of it is the source of truth — it is all reconstructable. The local NVMe / PVC is a cache, never a home: if it vanishes, the Cell refills it from Pagestore on the next page miss.
The reason a Cell can be this thin is a strict division of labor. The Cell does the CPU-bound, tenant-specific work — parse, plan, execute, hold catalog, run extensions — and delegates everything stateful or shared to the other planes:
Durability → Journal
The Cell does not fsync WAL to a local disk it trusts (fsync is the call that forces buffered writes all the way down to physical disk). It appends its single ordered WAL stream to the Journal, which quorum-commits it (a quorum is a majority of replicas agreeing before a write counts — like needing 2 of 3 signatures). Commit acknowledgement comes from the quorum, not from local storage.
Page materialization → Pagestore
The Cell never owns the canonical heap (the heap is where Postgres physically stores table rows on disk). On a cache miss it asks Pagestore getPage@LSN — "give me page N as of this log position" — and Pagestore reconstructs it from ingested WAL over S3.
Routing & auth → Gateway
The Cell sees only already-authenticated, already-routed traffic. TLS termination, SCRAM-SHA-256 authn, authz, and the hold-while-waking trick all happen upstream in the Gateway.
Fencing & placement → Conductor
The Cell does not decide that it is the writer. The Conductor grants it a single-writer epoch lease — a time-bounded, numbered permission slip to be the one writer; a newer number fences out any stale Cell that thinks it still holds the slot — and schedules where it runs, when it parks, and when it wakes. The Cell just honors the lease.
That delegation is the entire trick. Because the Cell exports its durability, its pages, its routing, and its authority, the process itself carries no irreplaceable state — which is what makes it disposable.
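The fencing half of that delegation can be sketched as a toy model. Everything here is an assumption made for illustration: `ToyConductor`, `ToyJournal`, `StaleEpochError`, and their methods are hypothetical names, the lease's time bound and the quorum are left out, and the LSN is just a position in an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


class StaleEpochError(Exception):
    """An append carried an epoch older than the tenant's current one."""


@dataclass
class ToyConductor:
    epochs: Dict[str, int] = field(default_factory=dict)

    def grant_lease(self, tenant_id: str) -> int:
        """Bump and return the tenant's epoch; each grant outranks all earlier holders."""
        self.epochs[tenant_id] = self.epochs.get(tenant_id, 0) + 1
        return self.epochs[tenant_id]


@dataclass
class ToyJournal:
    current_epoch: Dict[str, int] = field(default_factory=dict)
    streams: Dict[str, List[Tuple[int, bytes]]] = field(default_factory=dict)

    def fence(self, tenant_id: str, epoch: int) -> None:
        """Record the newest epoch the Conductor has issued for this tenant."""
        known = self.current_epoch.get(tenant_id, 0)
        self.current_epoch[tenant_id] = max(known, epoch)

    def append(self, tenant_id: str, epoch: int, record: bytes) -> int:
        """Append one WAL record; refuse writers holding a stale epoch."""
        if epoch < self.current_epoch.get(tenant_id, 0):
            raise StaleEpochError(f"epoch {epoch} is fenced for tenant {tenant_id}")
        self.fence(tenant_id, epoch)
        stream = self.streams.setdefault(tenant_id, [])
        stream.append((epoch, record))
        return len(stream)  # toy LSN: 1-based position in the tenant's stream
```

Note where the check lives: the Cell never has to notice that it has been superseded. A zombie Cell can believe whatever it likes, because the Journal refuses its appends.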
One cell, drawn
[Diagram: one Cell. Inside the Cell: a backend (parse / plan / execute) feeds the OrioleDB engine (undo-log MVCC, no vacuum) and the tenant extensions (loaded in-process as .so); the engine sits over the local page cache (NVMe, cache only). Outside: the engine appends its WAL stream to the Journal (WAL append, quorum commit); the cache resolves page misses via getPage@LSN against the Pagestore; the Conductor grants the backend its single-writer epoch lease (fence) over a dashed control edge.]
The isolation onion
A Cell runs arbitrary native-C extension code, so it is wrapped in defense-in-depth. The dangerous innermost layer — a tenant-uploaded .so running in-process — sits behind a gVisor sentry that intercepts and reimplements syscalls in userspace, so the tenant's code never talks straight to the host kernel. Read the onion from the host inward:
Extensions that trip gVisor's syscall surface — or tenants who pay for maximum isolation — escalate to a Kata/Firecracker microVM tier, covered in chapter 09.
Stateless compute = disposable = parkable & relocatable
Because the Cell holds no durable truth — only reconstructable cache — it can be killed mid-flight, rescheduled to any node, or parked cold to zero with no data at risk. Recovery is just "replay WAL from the last durable LSN." Disposability is not a feature bolted on; it falls out directly from exporting durability to the Journal and pages to the Pagestore.
The honest cost: cold start
Disposability is not free — it shows up at wake time. A cold Cell's first byte is the sum of four serial steps, and you pay all of them on a scale-to-zero hit:
The Gateway hides this latency from the client by holding the connection open while the Cell wakes (the Knative-activator pattern), so the user sees a slow query, not a failed one. Driving each of these steps down — and deciding which tenants get to stay warm — is the subject of chapter 06.
04 The four shared planes
A Tenant Cell is the only thing in Tessera that is per-tenant. Everything a cell needs to exist — to be reached, to be scheduled, to commit durably, to read its pages — is rented from four shared, multi-tenant planes. The split is deliberate: per-tenant where isolation and consistency demand it; shared everywhere that pooling buys efficiency, elasticity, and scale-to-zero economics.
◆ The one per-tenant thing
One tenant = one OrioleDB-Postgres cell. The Cell is stateless compute — its local NVMe is only a cache. Below it sits a single, fungible substrate: Gateway (who can reach you), Conductor (where you run + who may write), Journal (your durable WAL — the write-ahead log, an append-only journal of every change recorded before it is applied, so a crash can always be replayed), Pagestore/S3 (your pages, bottomless). Kill a cell and it rebuilds itself by replaying WAL from this substrate.
[Diagram: tenant cells over the shared substrate. Tenant cells: Cell for tenant A and Cell for tenant B (OrioleDB-PG). Shared substrate (multi-tenant, always-on): Gateway (route + authn), Conductor (operator + Raft), Journal (WAL quorum), Pagestore + S3 (pages, bottomless). Edges: each Cell → Pagestore (getPage@LSN) and → Journal (append WAL); Gateway → Cell (holds conn, wakes cell); Conductor → Cell (leases, places, parks); Journal → Pagestore (ships committed WAL).]
What each plane owns, and why it is shared
Each plane is shared for a different reason. The Gateway must be always-on and addressable even while a tenant is parked: it is the front door that survives cold cells, so it cannot be per-tenant. The Conductor holds one global view of placement, leases, and the extension catalog; consensus only works if there is (logically) exactly one of it. The Journal and Pagestore amortize the cost of durable quorum storage (a write counts only once a majority of replicas agree, like needing 2 of 3 signatures) and of an NVMe page cache over S3 across thousands of mostly-idle tenants; running a private 3-node quorum per tenant would be ruinous. The shared planes are how scale-to-zero stays cheap: a parked tenant costs only the bytes its WAL and pages occupy in shared storage, not a running anything.
| Plane | Owns | Scaling unit | K8s object | Role in consistency |
|---|---|---|---|---|
| Gateway | TLS termination · SCRAM-SHA-256 authn · authz · per-tenant routing · holds client conn during wake | Connections / sec: add replicas behind one VIP | stateless Deployment + HPA + Service | None directly: never reorders or commits; just routes to the one live writer |
| Conductor | CRD reconcile · single-writer epoch lease · wake/park + placement · extension catalog & signing · orphan-claim reaper · size class | Cluster-wide singleton (leader-elected); Raft group of 3–5 (Raft is a consensus protocol: a cluster of nodes that vote to agree on one shared truth and survive minority failures) | StatefulSet (etcd/Raft) + operator Deployment | Issues the fence: exactly one epoch may write per tenant at a time |
| Journal | Quorum-commit of each cell's single ordered WAL stream, keyed by tenant_id · rejects stale-epoch writes | Per-tenant WAL stream; sharded across safekeeper sets | StatefulSet (safekeepers, 3+ replicas/quorum) | Durability + logical fencing: a commit is real only at quorum, stale epochs are refused |
| Pagestore / S3 | Ingests committed WAL · materializes versioned pages · serves getPage@LSN (an LSN, log sequence number, is a monotonic byte-offset into the WAL, think commit timestamp; the cell asks for a page as of that point) · S3 = bottomless truth, NVMe = cache | Bytes stored + getPage QPS; pageservers shard by tenant | StatefulSet (pageservers) + external S3 | MVCC visibility via versioned pages @ LSN (MVCC, multi-version concurrency control, keeps old copies of each row so readers see a stable snapshot and never block writers); never the commit authority, only a consistent read surface |
Four planes at a glance
Gateway
The front door. Stateless proxy fleet (lineage: Supavisor / Neon-proxy / PgCat). Terminates TLS, authenticates with SCRAM-SHA-256, authorizes, and routes by tenant. The signature trick: it holds the client's connection open while a cold cell wakes — the Knative-activator pattern — so a parked tenant looks like a slow first query, not a refused one.
Scales horizontally on connection load; pure Deployment, no durable state, lose a replica and reschedule freely.
Conductor
The brain. A Kubernetes operator plus a Raft/etcd consensus store. Reconciles tenant CRDs, schedules wake/park and cell placement, owns the signed extension catalog, tracks each tenant's size class, and runs the orphan-claim reaper. Its load-bearing job: mint the single-writer epoch lease — the fence that guarantees one writer per tenant.
Logically a singleton; physically a small leader-elected StatefulSet. Not on the data hot path — it decides, it does not relay queries.
Journal
Durability-as-a-service. WAL quorum (lineage: Neon safekeepers / Aurora quorum). Each cell has exactly one ordered LSN stream keyed by tenant_id; a write is committed only when a majority acks. It also enforces the fence at the storage layer — appends carrying a stale epoch are rejected, so a zombie cell physically cannot corrupt the log.
Stateful by definition (StatefulSet, 3+ replicas per quorum); shards tenant streams across safekeeper sets for capacity.
Pagestore + S3
Bottomless storage. Pageservers (lineage: Neon pageserver + OrioleDB-on-S3) ingest committed WAL and materialize versioned pages, answering getPage@LSN for any cell. S3 is the durable, infinite source of truth; local NVMe is only a hot cache. A cell carries no truth — it reconstructs everything it needs from here on wake.
StatefulSet pageservers fronting external object storage; scales on bytes stored and getPage QPS, sharded by tenant.
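To make getPage@LSN concrete, here is a minimal in-memory sketch of a versioned page store. It is an illustration under stated assumptions, not the Pagestore's design: `ToyPagestore` and its methods are hypothetical names, each ingested WAL record is assumed to carry a full page image (a real pageserver would replay deltas on top of a base image), and nothing is persisted to S3.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyPagestore:
    # Per page: parallel lists of LSNs (ascending) and the page image at each LSN.
    lsns: Dict[int, List[int]] = field(default_factory=dict)
    images: Dict[int, List[bytes]] = field(default_factory=dict)

    def ingest(self, lsn: int, page_id: int, image: bytes) -> None:
        """Store the page image produced by committed WAL at `lsn`."""
        page_lsns = self.lsns.setdefault(page_id, [])
        if page_lsns and lsn <= page_lsns[-1]:
            raise ValueError("WAL must be ingested in increasing LSN order")
        page_lsns.append(lsn)
        self.images.setdefault(page_id, []).append(image)

    def get_page_at_lsn(self, page_id: int, lsn: int) -> bytes:
        """Return the newest version of the page written at or before `lsn`."""
        page_lsns = self.lsns.get(page_id, [])
        idx = bisect_right(page_lsns, lsn)  # count of versions with LSN <= lsn
        if idx == 0:
            raise KeyError(f"page {page_id} has no version at or before LSN {lsn}")
        return self.images[page_id][idx - 1]


store = ToyPagestore()
store.ingest(lsn=10, page_id=1, image=b"v1")
store.ingest(lsn=20, page_id=1, image=b"v2")
page_as_of_15 = store.get_page_at_lsn(page_id=1, lsn=15)  # the version written at LSN 10
```

Because every read names an LSN, two cells (or one cell before and after a restart) asking for the same page at the same LSN get the same bytes, which is what lets the Cell treat its local NVMe as a cache it can lose.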
◆ Why this split pays
Because truth lives in Journal + Pagestore and identity lives in the Gateway + Conductor, the compute tier becomes disposable. Tens of thousands of tenants can sit parked at near-zero cost, the warm few share an elastic compute pool, and any cell can be killed, moved, or rescheduled with no data loss — it just replays WAL and re-warms its cache. Per-tenant where it must be; shared everywhere it can be.
05 Life of a query — warm and cold
A query in Tessera takes one of two paths depending on whether the tenant's Cell is already running. When the Cell is warm the path is an ordinary Postgres round-trip with one twist: durability is a quorum ack (a majority of copies must confirm the write before it counts — like needing 2 of 3 signatures) on the commit path. When the Cell is cold (parked to save money), the Gateway quietly holds the client socket while the Conductor wakes the Cell — and the Cell replays its WAL (the write-ahead log — an append-only journal of every change, written before the change is applied, so a crash can be replayed) up to the committed point before it answers, so a wake never serves stale data.
Warm path — the Cell is already running
The client connects to the always-on Gateway, which terminates TLS, authenticates with SCRAM-SHA-256, looks up which Cell owns this tenant_id, and forwards bytes to that Cell. The Cell parses, plans, and executes. Reads hit the local page cache on NVMe (a cache, never the source of truth); on a miss the Cell pulls the exact version it needs from the Pagestore with getPage@LSN (fetch a page as of a specific LSN, the log sequence number: the monotonic position in the WAL that names a point in time). On COMMIT the Cell ships the WAL record to the Journal and waits for a quorum (majority) ack before telling the client the commit succeeded. That wait is the latency we deliberately spend for durability: the record is on a majority of safekeepers before the client believes it.
Durability quorum is on the commit path on purpose
We do not ack a write until a majority of the Journal has the WAL record. That round-trip is latency we accept so a committed transaction survives a node loss. Speed may bend here; reliability may not.
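The rule "ack only after a majority has the record" reduces to a small calculation over what each replica has flushed. The sketch below is an assumption about how such a check could look, with hypothetical function names; it treats LSNs as positive integers and uses 0 to mean "nothing is quorum-committed yet".

```python
from typing import Sequence


def majority(replica_count: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def quorum_committed_lsn(flushed_lsns: Sequence[int], replica_count: int) -> int:
    """Highest LSN that at least a majority of replicas have flushed.

    `flushed_lsns` has one entry per replica that has reported; replicas that
    have not reported are treated as having flushed nothing.
    """
    needed = majority(replica_count)
    if len(flushed_lsns) < needed:
        return 0
    # The needed-th largest value is held by at least `needed` replicas.
    return sorted(flushed_lsns, reverse=True)[needed - 1]


def can_ack_client(commit_lsn: int, flushed_lsns: Sequence[int], replica_count: int) -> bool:
    """The client may be told 'committed' only once the quorum covers the commit record."""
    return quorum_committed_lsn(flushed_lsns, replica_count) >= commit_lsn


# Three replicas have flushed up to LSN 120, 100, and 90 respectively.
committed = quorum_committed_lsn([120, 100, 90], replica_count=3)
```

With three replicas, a commit record at LSN 110 is not yet safe to acknowledge in this example: only one replica has it, so losing that node would lose the write.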
Cold path — the Cell is parked
If the tenant has been idle, its Cell is scaled to zero — no pod, no compute cost (cold-parking is free because the OrioleDB engine has no vacuum/freeze backlog to settle first). At routing time the Gateway sees no live Cell. It does not error: it holds the client connection open (the Knative-activator trick) and signals the Conductor to wake the tenant. The Conductor schedules a pod and grants the single-writer lease — an epoch/term fence (each grant gets a higher number; writes stamped with a stale number are rejected, so a slow old Cell can't clobber the new one) so exactly one Cell may write this tenant's stream. The Cell starts, asks the Journal for the latest quorum-committed LSN, and replays WAL up to that LSN before accepting any query. Only then does the Gateway forward the connection it was holding. The client saw extra latency, never an error or stale read.
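The hold-while-waking behavior can be sketched with asyncio. This is a toy under loud assumptions: `FakeConductor` and `ToyGateway` are hypothetical, the wake is simulated with a sleep standing in for pod scheduling plus WAL replay, and "forwarding the connection" is reduced to returning the Cell's address. The one design point it tries to show is that concurrent cold hits for the same tenant share a single wake.

```python
import asyncio
from typing import Dict


class FakeConductor:
    """Hypothetical stand-in: waking a cell takes a while, then yields its address."""

    def __init__(self, wake_seconds: float = 0.05) -> None:
        self.wake_seconds = wake_seconds
        self.wake_calls = 0

    async def wake(self, tenant_id: str) -> str:
        self.wake_calls += 1
        await asyncio.sleep(self.wake_seconds)  # simulated schedule + replay
        return f"cell-{tenant_id}"


class ToyGateway:
    def __init__(self, conductor: FakeConductor) -> None:
        self.conductor = conductor
        self.live_cells: Dict[str, str] = {}
        self.pending: Dict[str, "asyncio.Task[str]"] = {}

    async def _wake(self, tenant_id: str) -> str:
        try:
            address = await self.conductor.wake(tenant_id)
            self.live_cells[tenant_id] = address
            return address
        finally:
            self.pending.pop(tenant_id, None)  # allow a retry if the wake failed

    async def route(self, tenant_id: str, timeout: float = 5.0) -> str:
        """Return the tenant's cell address, holding the caller while a cold cell wakes."""
        if tenant_id in self.live_cells:
            return self.live_cells[tenant_id]  # warm path: no waiting
        task = self.pending.get(tenant_id)
        if task is None:
            task = asyncio.create_task(self._wake(tenant_id))
            self.pending[tenant_id] = task
        # shield: one caller timing out must not cancel the wake others wait on.
        return await asyncio.wait_for(asyncio.shield(task), timeout)


async def demo() -> tuple:
    conductor = FakeConductor(wake_seconds=0.01)
    gateway = ToyGateway(conductor)
    first, second = await asyncio.gather(gateway.route("t1"), gateway.route("t1"))
    third = await gateway.route("t1")  # now warm
    return first, second, third, conductor.wake_calls
```

Running `asyncio.run(demo())` routes two simultaneous cold connections and one warm one. The timeout is the honest part of the pattern: the Gateway can hold a connection, but not forever.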
Where the latency goes
The warm path adds only a Gateway hop plus, on commit, one quorum round-trip. The cold path stacks pod scheduling and WAL replay on top of the first query of a parked tenant — a one-time tax paid by the request that triggers the wake, then amortized away while the Cell stays warm.
Warm — steady state
Dominated by Postgres execution. The only Tessera-specific cost is the synchronous quorum ack at COMMIT; reads are usually cache hits. Read-your-writes (a query always sees data you just committed) and serializability (concurrent transactions behave as if run one-at-a-time, in some order) come free from the single-writer Cell.
Cold — first query after park
Sharp edge: the first request after parking eats pod-start + replay latency (hundreds of ms to seconds, depending on WAL backlog). Honest tradeoff — the Gateway hides the error but not the wait. Keep-warm tiers exist for latency-sensitive tenants.
What never bends
- Commit = durable: quorum ack precedes the client ack.
- A wake never serves stale data: replay-to-committed-LSN gates serving.
- Exactly one writer: the epoch lease fences the parked-then-woken Cell.
What we spend
- One quorum round-trip of commit latency, always.
- Cold-start latency on the first query after parking.
- A Gateway hop on every connection.
06 Scale-to-zero & the cold-tenant problem
A reader's first instinct is to scale-to-zero the table they only touch once a month. But Postgres runs as one instance serving all of its tables, not as a process per table: you cannot freeze one relation while its siblings stay hot in the same instance. That is exactly why tenant = one isolated Postgres cell: the unit you can actually take to zero is the tenant's compute, not a table inside a shared instance. Park a rarely-used tenant and its cell pod simply disappears, leaving $0 compute while its storage sits, untouched and durable, in the Journal and S3.
Why the table instinct fails
"Scale my once-a-month table to zero" assumes a table is an independently schedulable resource. It is not — it shares a process, a shared-buffers pool, a WAL stream (write-ahead log — an append-only journal of every change, written before the change is applied, so a crash can be replayed), and a catalog (Postgres's internal directory of what tables, columns, and indexes exist) with every other table in the same Postgres. The smallest thing you can cleanly evict and recreate is the whole instance. Cloning Postgres per tenant (Decision 1) makes the tenant be that instance, so the cell becomes the natural scale-to-zero boundary.
The lifecycle state machine
A cell moves through four states. The Conductor (control plane) owns every transition: it grants the single-writer epoch lease (a time-limited permission to be the one writer, stamped with a version number — like a numbered hall pass that expires, so a stale holder is easy to reject) on wake, and revokes/expires it on park. Durable state never moves — only compute appears and disappears.
- Active — pod scheduled, lease held, catalog/relcache warm, serving queries.
- Idle — a grace period after the last query. The pod is still warm; a returning query skips cold-start entirely. This buffer prevents thrash for bursty tenants.
- Parked — grace expired. Conductor revokes the epoch lease (so the Journal will fence any zombie writer — fencing means actively blocking a stale process from writing once its lease is gone, instead of trusting it to notice and stop) and tears down the pod. Compute bill = $0; the tenant's bytes live on in S3.
- Waking — a connection arrived. Conductor schedules a fresh pod, grants a new epoch lease, and the cell replays WAL forward to its last committed LSN (log sequence number — a monotonic byte-offset address into the WAL, like a page number marking exactly how far the log was replayed) before accepting traffic.
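The four states and their transitions can be written down as a small table-driven state machine. This is a sketch only: the event names (`last_query_done`, `grace_expired`, and so on) and the `CellLifecycle` class are hypothetical, and the lease is reduced to an integer epoch plus a flag.

```python
from enum import Enum


class Phase(Enum):
    ACTIVE = "active"
    IDLE = "idle"
    PARKED = "parked"
    WAKING = "waking"


# (current phase, event) -> next phase. Event names are illustrative.
TRANSITIONS = {
    (Phase.ACTIVE, "last_query_done"): Phase.IDLE,
    (Phase.IDLE, "query_arrived"): Phase.ACTIVE,       # warm return, no cold start
    (Phase.IDLE, "grace_expired"): Phase.PARKED,
    (Phase.PARKED, "connection_arrived"): Phase.WAKING,
    (Phase.WAKING, "replay_complete"): Phase.ACTIVE,
}


class CellLifecycle:
    """Toy model of the transitions the Conductor owns for one tenant."""

    def __init__(self):
        self.phase = Phase.PARKED
        self.epoch = 0            # last epoch granted for this tenant
        self.holds_lease = False

    def on(self, event):
        key = (self.phase, event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition: {self.phase.value} on {event}")
        nxt = TRANSITIONS[key]
        if nxt is Phase.WAKING:
            self.epoch += 1           # every wake gets a fresh, higher epoch
            self.holds_lease = True
        elif nxt is Phase.PARKED:
            self.holds_lease = False  # lease revoked; pod torn down
        self.phase = nxt
        return nxt
```

Keeping the legal transitions in one table makes illegal ones (for example Parked straight to Active, skipping replay) fail loudly instead of silently.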
The Gateway hides the wake from the client using the Knative-activator trick: it holds the open client connection — TCP established, the client just waits — while the cell wakes, then splices traffic through once the cell reports ready. Below the hold-timeout this looks like a slow first query, not an error.
Why parking is genuinely free
Here is the load-bearing point, and it is about vacuum (Postgres's background cleanup that reclaims space and resets row bookkeeping), not about pods. On a stock Postgres heap, every transaction gets a 32-bit transaction id (XID), and that counter can run out and wrap around like a car odometer rolling over; old rows must be "frozen" before that happens or their data would look like it came from the future. The counter advances with commit volume across the whole instance, so the 32-bit XID space is consumed by activity anywhere in it. A parked tenant cannot opt out of time: as the rest of the instance commits, its un-frozen tuples drift toward the wraparound horizon. Eventually anti-wraparound autovacuum must run on those tables, which means a parked tenant would be forced awake (or the system forces a shutdown to protect data). The only stock defense is a VACUUM FREEZE ritual before every park, turning "free parking" into "scan-the-whole-tenant-first parking," which is exactly the cost that kills the economics for big, cold tenants.
Parking is free because vacuum is dead
Tessera's OrioleDB-style storage engine (Decision 2) uses undo-log MVCC (multi-version concurrency control — concurrent readers and writers don't block each other; old row versions are kept in a separate undo log instead of being left as dead rows in the table) and 64-bit XIDs (a counter so vast it never realistically wraps). There is no wraparound horizon to outrun, no freeze, no autovacuum — the maintenance subsystem doesn't exist (forward-ref ch07). A parked tenant ages not at all: its pages on S3 are valid forever, regardless of fleet-wide commit volume. Parking becomes a pure pod teardown with zero pre-park maintenance ritual, ever. This is the single fact that makes cold-parking economically real rather than aspirational.
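A back-of-the-envelope calculation shows why the counter width matters. The sustained rate below is a hypothetical figure picked for illustration. For the 32-bit case the sketch uses 2^31 as the usable window, since stock Postgres compares XIDs modulo 2^32 and only about two billion of them can be "in the past" at once. For the 64-bit case it simply uses the full 2^64 range as an upper bound.

```python
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def seconds_to_consume(xid_count, xids_per_second):
    """Time to burn through xid_count transaction ids at a steady rate."""
    return xid_count / xids_per_second


RATE = 10_000         # hypothetical sustained XIDs per second
USABLE_32 = 2 ** 31   # wraparound window with 32-bit XIDs
USABLE_64 = 2 ** 64   # full range of a 64-bit counter (upper bound)

days_32 = seconds_to_consume(USABLE_32, RATE) / SECONDS_PER_DAY
years_64 = seconds_to_consume(USABLE_64, RATE) / SECONDS_PER_YEAR

print(f"32-bit window lasts about {days_32:.1f} days at {RATE} XIDs/s")
print(f"64-bit range lasts about {years_64:.2e} years at {RATE} XIDs/s")
```

At that assumed rate the 32-bit window is a matter of days, while the 64-bit range works out to tens of millions of years, which is the sense in which the counter "never realistically wraps."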
The honest cost: cold-start
Free parking does not mean free waking. A cold wake pays a sequential budget, and big-schema tenants have a worse tail.
Cold-start has a fat tail — be honest about it
The dominant cost for a real tenant is not pod scheduling — it's catalog/relcache warming (a tenant with tens of thousands of relations, partitions, and indexes pays linearly) plus page-fault-then-fetch from the Pagestore for the working set. A tenant with a huge schema and a long WAL gap can blow past the Gateway hold-timeout, surfacing as a visible first-query stall. This is the sharp edge of scale-to-zero: idle is cheap, the first wake of a fat tenant is not.
Mitigations we keep warm
Keep the Journal and Pagestore always-on so a waking cell pulls pages and WAL from a hot quorum (a majority of replicas must agree before data counts as durable — like requiring 2 of 3 signatures), never from cold S3. Page prefetch primes the working set before the first query lands.
Mitigations on the cell
Warm catalog snapshots — persist a relcache/catalog image so wake skips full re-derivation. Partial / lazy WAL replay — accept connections after replaying to a consistent point, then replay-on-demand so the tail doesn't block first response.
07 Storage, MVCC, and the death of vacuum
This is the meaty floor of Tessera: how a row version is kept, and where bytes actually live. Two moves combine. First, an OrioleDB-style storage engine deletes vacuum outright rather than relocating it. Second, the cell is made stateless — durability and page materialization move outside the Postgres process into the Journal and Pagestore. The payoff is a clean fit: OrioleDB's S3 storage is single-writer-per-namespace, which is exactly Tessera's single-writer-per-tenant rule. The ambitious engine and the locked consistency model are the same shape.
Why vacuum exists — and why it can simply stop existing
Stock Postgres uses heap MVCC (multi-version concurrency control — each writer makes a new copy of a row instead of overwriting it, so readers never block writers): an UPDATE or DELETE does not overwrite or remove the old row, it leaves a dead tuple in place and writes a new version next to it. Dead tuples accumulate inside the table (bloat), so a background process — autovacuum — must crawl the heap to reclaim that space and to freeze old rows before the 32-bit transaction id (XID) counter wraps around. Wraparound is a hard correctness cliff: miss the freeze deadline and the database shuts down to avoid losing data. That entire machine — autovacuum workers, freeze, the wraparound clock — is maintenance that the heap design forces on you.
OrioleDB-style undo-log MVCC inverts it. The current row version lives in the table; old versions are written to a separate undo log, not parked in the heap. A transaction that needs an older snapshot walks the undo chain; once no snapshot can see a version, the undo space is reclaimed by simple log truncation, with no full-table scan to hunt dead tuples. Pair that with 64-bit XIDs and the wraparound clock effectively never rings. Net: no vacuum, no freeze, no autovacuum workers, no bloat-driven maintenance. The "is vacuum its own subsystem?" question is answered by deletion, not relocation, and that is precisely what makes cold-parking free: there is no freeze-before-park step (back-ref to ch06 scale-to-zero).
[Diagram: heap MVCC versus undo-log MVCC. Heap MVCC: an UPDATE leaves the old version in the table → bloat grows → autovacuum scan + freeze + wraparound watch. Undo-log MVCC: UPDATE row → old version to undo log → log truncates when unreferenced → no vacuum, no freeze, 64-bit XID.]
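The undo-chain idea can be modelled in memory. This is a toy under loud assumptions: it is not OrioleDB's on-disk structure, it keeps a per-key chain rather than one append-only log, and it treats a version as visible to a snapshot whenever its XID is less than or equal to the snapshot's XID. `UndoLogTable` and its methods are hypothetical names.

```python
class UndoLogTable:
    """Toy undo-log MVCC: current versions in the table, old ones in undo."""

    def __init__(self):
        self.rows = {}  # key -> (xid, value): the current version only
        self.undo = {}  # key -> [(xid, value), ...] older versions, newest first

    def write(self, key, value, xid):
        if key in self.rows:
            # The old version moves to undo; the table keeps no dead tuple.
            self.undo.setdefault(key, []).insert(0, self.rows[key])
        self.rows[key] = (xid, value)

    def read(self, key, snapshot_xid):
        # Start at the current version, then walk the undo chain backwards.
        for xid, value in [self.rows[key]] + self.undo.get(key, []):
            if xid <= snapshot_xid:
                return value
        return None  # the row did not exist yet for this snapshot

    def truncate(self, oldest_snapshot_xid):
        """Drop undo entries that no live snapshot can still see."""
        dropped = 0
        for key, chain in self.undo.items():
            # Is the oldest snapshot already served by a newer version?
            covered = self.rows[key][0] <= oldest_snapshot_xid
            kept = []
            for xid, value in chain:
                if covered:
                    dropped += 1
                else:
                    kept.append((xid, value))
                    covered = xid <= oldest_snapshot_xid
            self.undo[key] = kept
        return dropped
```

Reclamation here only looks at the undo chains and the oldest live snapshot; nothing scans the table hunting for dead tuples, which is the property the paragraph above relies on.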
Disaggregation: the cell holds no truth
A Tenant Cell is stateless compute. Its local NVMe is a cache, never the source of truth — truth is the WAL stream (the write-ahead log — an append-only journal of every change, written before the change is applied, so a crash can always be replayed) in the Journal plus materialized pages backed by S3. That separation is what lets a cell be killed, parked, or rescheduled and recover by replaying WAL. The write path is a single ordered hand-off; the read path is a page fetch keyed by log position (LSN).
On commit the cell only needs the Journal to durably accept the WAL by majority quorum. Page materialization is asynchronous and lives downstream in the Pagestore, which replays committed WAL into versioned pages and spills them to S3. A read is served from the cell's local NVMe cache when the page is there; on a miss the cell asks the Pagestore for a specific page at a specific LSN, and a page the Pagestore no longer holds is rehydrated from S3. The cell's RAM still holds the hot working set plus the tenant's catalog and relcache while warm, so a steady-state query rarely leaves the box.
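A read keyed by log position amounts to "give me the newest version of this page whose LSN is not after the one I ask for." The sketch below models that lookup with a sorted list per page. `ToyPagestore`, `apply_wal`, and `get_page` are hypothetical names, and page images are plain strings for illustration.

```python
import bisect


class ToyPagestore:
    """Toy versioned page store answering getPage@LSN lookups."""

    def __init__(self):
        self._lsns = {}    # page_id -> sorted list of LSNs with a new image
        self._images = {}  # page_id -> page images, aligned with _lsns

    def apply_wal(self, page_id, lsn, image):
        lsns = self._lsns.setdefault(page_id, [])
        images = self._images.setdefault(page_id, [])
        if lsns and lsn <= lsns[-1]:
            raise ValueError("WAL must be applied in increasing LSN order")
        lsns.append(lsn)
        images.append(image)

    def get_page(self, page_id, lsn):
        """Return the page as of `lsn`: newest image with version LSN <= lsn."""
        lsns = self._lsns.get(page_id, [])
        i = bisect.bisect_right(lsns, lsn)
        if i == 0:
            raise KeyError(f"page {page_id} has no version at or before LSN {lsn}")
        return self._images[page_id][i - 1]
```

For example, after `apply_wal(7, 100, "v1")` and `apply_wal(7, 250, "v2")`, a call to `get_page(7, 200)` returns `"v1"` and `get_page(7, 250)` returns `"v2"`.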
The alignment win: single-writer everywhere
OrioleDB's decoupled S3 storage is single-writer per namespace/bucket, which is usually cited as its big limitation, because a normal cluster wants many writers behind one dataset. Tessera is single-writer per tenant by design (Decision 4: the tenant is the hard transaction boundary, one writer cell holds the epoch lease). So the objection never bites us: one cell, one ordered WAL stream, one storage namespace. The ambitious engine and the consistency model are the same shape, not two constraints fighting.
The honest tension
OrioleDB is beta, and its branching story is weaker
OrioleDB changes on-disk semantics and is not yet a settled, decades-proven format — adopting it is a real bet, not a free win. Its copy-on-write / branching / read-replica story is also thinner than Neon's heap-COW pageserver, where cheap branches fall out of page versioning almost for free. So v1 ships single-writer-per-tenant with async read-replica followers (page-LSN streaming or logical) as a fast-follow; full git-style branching is future work, not a launch feature.
Crucially, the bet is hedged at the architecture seam, not the application surface. Everything above — disaggregated WAL into a quorum Journal, a Pagestore that materializes versioned pages, S3 as bottomless truth, single-writer fencing (a guarantee that a deposed or stale writer is hard-blocked from the storage so only one cell can ever write — like changing the locks so the old keyholder is locked out) — is engine-agnostic. If OrioleDB blocks (a beta sharp edge, a format break, a missing capability), Tessera keeps the same disaggregated shape on stock heap plus a Neon-style pageserver. The only thing lost in that fallback is the headline "vacuum is literally gone": you would carry autovacuum and freeze again, with the wraparound clock back on the wall. That is the failure mode tracked under risks in ch15 — a graceful degradation to a proven design, not a redesign.
Ship in v1
Undo-log engine (no vacuum), disaggregated WAL→Journal→Pagestore→S3, single-writer-per-tenant, async read-replica followers as fast-follow.
Deferred
Full copy-on-write branching / git-style forks. Weaker than Neon today; revisit once OrioleDB's COW story matures.
Fallback cost
If OrioleDB blocks: same architecture on stock heap + Neon-style pageserver — but autovacuum, freeze, and wraparound return.
08 Consistency, durability & no split-brain
This is the reliability heart of Tessera. The locked stance is blunt: throughput may bend, but consistency and durability may not. The tenant-first frame is what makes that affordable — because each tenant is a single isolated Postgres cell with exactly one writer, the hard parts of distributed databases (a global snapshot oracle, distributed timestamp ordering, cross-shard commit) simply never appear at the tenant boundary.
The invariant
Reliability and consistency never bend; throughput is what we spend. Every mechanism below trades latency or concurrency to protect correctness — never the reverse.
Per-tenant serializable, for free
The tenant is the hard transaction boundary, and a tenant has a single writer: its one warm cell. A single Postgres instance already ships Serializable Snapshot Isolation (SSI), the optimistic protocol that detects read/write dependency cycles and aborts the loser. With one writer there is exactly one logical timeline, so SSI gives true serializability and read-your-writes (a query always sees the writes you just made; no lagging replica can hand you a stale copy) with no coordination beyond what stock Postgres already does inside a single instance. We do not build a distributed snapshot oracle, a global clock, or an HLC (hybrid logical clock), because there is nothing to globally order. The cost of a conflict is a transaction abort the client retries; the cost we accept is that a tenant is capped at one writer's throughput (vertical, not horizontal). That cap is the deliberate price of "serializable for free."
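On the client side, "abort and retry" usually means re-running the whole transaction when the server reports a serialization failure (SQLSTATE 40001 in Postgres). The sketch below shows the shape of such a loop without depending on any driver: `SerializationFailure` is a stand-in exception, and `run_with_retry` and `max_attempts` are hypothetical names.

```python
class SerializationFailure(Exception):
    """Stand-in for a driver error carrying SQLSTATE 40001."""

    sqlstate = "40001"


def run_with_retry(txn, max_attempts=5):
    """Run txn() and retry it from the start if SSI aborts it.

    txn must be safe to re-run as a whole: it should begin, do its reads
    and writes, and commit inside the callable.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # give up and surface the conflict to the caller


# Usage: a transaction that loses the conflict twice, then succeeds.
attempts = {"n": 0}


def flaky_transfer():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure("could not serialize access")
    return "committed"


result = run_with_retry(flaky_transfer)
```

A production loop would normally add backoff between attempts; that is omitted here to keep the retry shape visible.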
Durability via Journal quorum
A commit is not acknowledged to the client until that transaction's WAL records (the write-ahead log — an append-only journal of every change, written before the change is applied so a crash can always be replayed) are majority-committed (more than half the replicas must persist the write before it counts — like requiring 2 of 3 signatures) across the Journal's replicas (the Neon-safekeeper / Aurora-quorum pattern: append to N, ack at ⌈(N+1)/2⌉). With three replicas, two durable copies must persist the LSN (log sequence number — a monotonic byte-offset that names a position in the WAL stream) before COMMIT returns. This survives the loss of any single Journal node and any single cell, because the truth of "what committed" lives in the quorum, not on the cell's local disk — the cell's NVMe/PVC is a cache, never the source of durability.
[Diagram: the epoch-N writer appends WAL @ LSN to Journal r1, r2, and r3. r1 and r2 ack; 2 of 3 is a majority, so COMMIT returns.]
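The quorum arithmetic is small enough to check directly. The helper names below are hypothetical; the formula is the ⌈(N+1)/2⌉ majority from the text, and the latency helper assumes all appends are sent in parallel so the commit waits only for the ack that completes the majority.

```python
import math


def quorum_size(n_replicas):
    """Smallest majority of n_replicas: ceil((N + 1) / 2)."""
    return math.ceil((n_replicas + 1) / 2)


def is_durable(acks, n_replicas):
    """A commit counts as durable once a majority has persisted it."""
    return acks >= quorum_size(n_replicas)


def commit_latency_ms(ack_latencies_ms):
    """Latency of the ack that completes the quorum.

    With parallel appends this is the q-th fastest replica, i.e. the
    slowest of the fastest q, where q is the quorum size.
    """
    q = quorum_size(len(ack_latencies_ms))
    return sorted(ack_latencies_ms)[q - 1]


print(quorum_size(3), quorum_size(5))        # majority of 3 and of 5
print(is_durable(acks=1, n_replicas=3))      # one ack is not enough
print(commit_latency_ms([4.0, 1.5, 9.0]))    # the 9.0 ms straggler is ignored
```

With three replicas the slowest replica never sits on the commit path, which is why losing or slowing one Journal node does not block commits.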
Single-writer invariant & logical fencing
The whole scheme rests on "exactly one writer per tenant." Conductor enforces it with a monotonic epoch (term) lease — an ever-increasing version number on "who is the writer right now," much like a numbered ticket where only the highest number is honored: when it schedules a cell to own a tenant, it stamps that cell with epoch N. The Journal records the highest epoch it has seen for that tenant's stream and rejects any append carrying a stale epoch. On suspected cell failure — missed heartbeat, network partition, slow pod — Conductor does not need to physically kill the maybe-dead pod (which it may be unable to reach). It simply bumps the epoch to N+1 and hands the lease to a fresh cell. The moment the new cell does its first quorum-accepted write at N+1, the Journal's high-water epoch is N+1, and the old cell's appends at N are refused.
Logical STONITH
A zombie cell can keep running, keep believing it is the writer, even keep accepting client TCP, but it cannot commit, because the Journal fences its epoch. We get "shoot the other node in the head" (STONITH) semantics without needing to physically shoot it. Split-brain is structurally impossible: two epochs cannot both be the high-water mark, so two cells cannot both append.
[Diagram: the Conductor bumps the epoch from N to N+1 and issues the lease to Cell B. The Journal, whose high-water epoch is now N+1, accepts Cell B's appends at N+1 and rejects the stale-epoch appends from the cell still at epoch N.]
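The fencing rule itself is a single comparison on the append path. The sketch below models one tenant's stream as a single in-memory object, which is an assumption for brevity (the real Journal is a quorum of replicas); `ToyJournal` and `StaleEpoch` are hypothetical names.

```python
class StaleEpoch(Exception):
    """Raised when an append carries an epoch below the high-water mark."""


class ToyJournal:
    """Toy per-tenant WAL stream that fences stale writers by epoch."""

    def __init__(self):
        self.high_water_epoch = 0
        self.log = []  # list of (epoch, record)

    def append(self, epoch, record):
        if epoch < self.high_water_epoch:
            # The old writer is fenced: it may still be running, but it cannot commit.
            raise StaleEpoch(f"epoch {epoch} < high-water {self.high_water_epoch}")
        self.high_water_epoch = epoch
        self.log.append((epoch, record))
        return len(self.log)  # toy LSN: position in the stream


journal = ToyJournal()
journal.append(1, "write from old cell")   # epoch N
journal.append(2, "write from new cell")   # epoch N+1 raises the high-water mark
try:
    journal.append(1, "late write from zombie cell")
    zombie_committed = True
except StaleEpoch:
    zombie_committed = False
```

After the epoch N+1 write lands, `zombie_committed` ends up `False` and the log holds only the two accepted records.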
Cold-start correctness
The fence prevents two cells from writing; one more rule prevents a freshly-woken cell from serving stale reads. A cell that wakes from cold-park (or takes over a fenced tenant) MUST replay WAL up to the Journal's quorum-committed LSN before it serves a single query. It pulls committed WAL from the Journal and pages from the Pagestore, advances its applied-LSN to the durable high-water mark, and only then opens to clients (the Gateway holds the connection meanwhile, as described in the routing chapter). This one rule closes both failure modes at once: a resumed cell can never serve a read older than the last acknowledged commit (no stale read), and because it adopts the latest epoch on takeover, it can never resurrect a fenced predecessor's timeline (no split-brain).
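The replay gate can be sketched as a cell that refuses queries until its applied LSN reaches the committed LSN. Everything here is illustrative: `WakingCell` is a hypothetical name, WAL records are assumed to be `(lsn, key, value)` tuples in increasing LSN order, and "applying" a record just sets a dictionary entry.

```python
class WakingCell:
    """Toy cell that serves nothing until it has replayed to the committed LSN."""

    def __init__(self):
        self.applied_lsn = 0
        self.data = {}
        self.serving = False

    def replay(self, wal, committed_lsn):
        # wal: iterable of (lsn, key, value) in increasing LSN order (assumed format).
        for lsn, key, value in wal:
            if lsn > committed_lsn:
                break  # never apply records beyond the quorum-committed point
            if lsn > self.applied_lsn:
                self.data[key] = value
                self.applied_lsn = lsn
        # Open to clients only once the durable high-water mark is reached.
        self.serving = self.applied_lsn >= committed_lsn

    def query(self, key):
        if not self.serving:
            raise RuntimeError("cell has not replayed to the committed LSN yet")
        return self.data.get(key)


cell = WakingCell()
wal = [(1, "balance", 10), (2, "balance", 25), (3, "balance", 40)]
cell.replay(wal, committed_lsn=2)  # LSN 3 was never quorum-committed
```

Here `cell.query("balance")` returns the value written at LSN 2, and the record at LSN 3 is ignored because it lies past the committed point.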
What we forbid in v1
No cross-tenant atomic transactions
A transaction touching two tenants would need 2PC (two-phase commit — everyone first votes "ready," then a coordinator tells all to commit or all to abort, so partial commits can't happen) plus a distributed timestamp authority to order it against each tenant's local SSI timeline — exactly the complexity the tenant boundary buys us out of. Quarantined in v1. Cross-tenant work is application-level, eventually-consistent, and idempotent by contract.
Single region only
Quorum survives node loss, not whole-region loss — all three Journal replicas live in one region. Surviving a region outage means a per-tenant multi-region premium tier (cross-region quorum, higher commit latency), which is future work, never the v1 default.
The guarantee ledger
| Guarantee | How we get it | What it costs |
|---|---|---|
| Serializable (SSI) | Stock Postgres SSI inside the single writer cell — one logical timeline per tenant | Conflicting txns abort + retry; tenant capped at one-writer throughput |
| Read-your-writes | Same cell serves reads and writes; no replica lag within a tenant | No horizontal read fan-out inside a tenant (warm replicas are a later tier) |
| Durability | Majority-commit of WAL across 3 Journal replicas before COMMIT returns | Commit latency = slowest of the fastest 2 replicas (a network round-trip) |
| No split-brain | Monotonic epoch lease + Journal rejects stale-epoch appends (logical STONITH) | Conductor + consensus store (etcd/Raft) on the lease path; brief write-unavailability during an epoch bump |
| No stale read on resume | Waking cell replays WAL to quorum-committed LSN before accepting queries | Cold-start latency (WAL replay) added to first-query time; Gateway holds the connection |
| Cross-tenant atomicity | Forbidden in v1 — would require 2PC + distributed clock | Cross-tenant flows must be app-level, idempotent, eventually consistent |
| Region-failure survival | v1 single-region quorum survives node loss only | Whole-region loss needs the future multi-region premium tier |
What never bends
- Serializability + read-your-writes per tenant
- A commit is durable (majority-persisted) before it is acknowledged
- At most one writer can ever commit for a tenant
- No query is served against an un-replayed, stale timeline
What we spend
- Per-tenant write throughput (single writer, vertical scaling)
- Commit latency (one quorum round-trip)
- Cold-start latency (WAL replay before first query)
- Cross-tenant atomicity and region-failure survival (deferred)
09 Extensions: arbitrary, native, contained
The headline-hard requirement — let any tenant install any extension, including arbitrary native-C code — stops being terrifying the moment each tenant is its own isolated cell. A bad .so (a shared object — a chunk of compiled native code loaded into a running process, the Linux equivalent of a Windows .dll) can only break the one Postgres process that loaded it. Its blast radius is the tenant's own database, never a neighbor's. That single property is what turns "untrusted native code in production" from a non-starter into a routine feature.
Per-tenant isolation is the unlock
On shared Postgres, a custom C extension is shared-fate: a segfault, a memory stomp, or a runaway allocation takes down everyone on that instance. Because a Tessera Cell is one tenant's own OrioleDB-Postgres process inside its own sandboxed pod, "load arbitrary native C" degrades to "this tenant can crash their own DB" — a self-inflicted outage, fully recoverable by replaying WAL (the write-ahead log — an append-only journal of every change, written before the change is applied, so a crash can always be replayed forward) on a fresh cell. Arbitrary native C is finally safe to offer.
Two delivery paths
1 · Extension Catalog (fast-path)
Vetted, signed, pre-built popular extensions — pgvector, PostGIS, pg_cron, pg_stat_statements, and friends. The Conductor owns the catalog and its signing keys; a catalog hit means the cell loads a known-good binary with no cold compile. Most tenants only ever touch this path. Builds are keyed per-PG-major (see the ABI matrix below — the ABI is the binary contract a compiled extension and the Postgres it plugs into must agree on; mismatch it and the .so won't load).
2 · Raw upload (power-user path)
The tenant uploads an arbitrary native-C .so Tessera has never seen. It is scanned (signature/format checks), then loaded into the cell under the gVisor sandbox. This is the escape hatch for niche or in-house extensions — accepted precisely because containment, not trust, is what keeps it safe.
Install / load decision
[Diagram: install/load decision. An extension request first checks for a catalog hit. Yes: load the signed pre-built build, which runs in the gVisor cell. No: raw .so upload (scan + verify) → load in cell under gVisor → syscall incompatible? If no, it runs in the gVisor cell; if yes, escalate the cell to a microVM.]
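The decision flow above can be expressed as one small function. This is a sketch with invented names: `CATALOG`, `plan_install`, and the boolean inputs are hypothetical, the catalog entries are placeholders keyed by (extension, PG major), and the scan and syscall-compatibility results are passed in rather than computed.

```python
from enum import Enum


class Runtime(Enum):
    GVISOR = "gvisor"
    MICROVM = "microvm"


# Hypothetical catalog of signed, pre-built builds keyed by (extension, PG major).
CATALOG = {("pgvector", 16), ("postgis", 16), ("pg_cron", 16)}


def plan_install(name, pg_major, scan_passed=False, syscall_incompatible=False):
    """Return (delivery path, runtime) for an extension install request."""
    if (name, pg_major) in CATALOG:
        # Fast path: known-good signed binary, default sandbox.
        return ("catalog", Runtime.GVISOR)
    if not scan_passed:
        raise PermissionError(f"raw upload of {name!r} failed signature/format checks")
    # Raw upload: containment, not trust, decides where it runs.
    runtime = Runtime.MICROVM if syscall_incompatible else Runtime.GVISOR
    return ("raw_upload", runtime)
```

For example, `plan_install("pgvector", 16)` takes the catalog path, while the same name on a PG major with no catalog build falls through to the raw-upload path, which mirrors the per-major keying described for the catalog.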
Containment: the isolation onion
The default sandbox is gVisor, a userspace kernel (the "sentry") that intercepts the cell's syscalls (the calls a program makes to ask the OS kernel to do something privileged: open a file, send a packet, allocate memory) and services most of them itself instead of passing them to the host kernel. A malicious or buggy .so therefore has no direct path to the host kernel, the Pagestore/S3 tier, the Conductor, or any other tenant. The only thing it can touch is the Postgres backend it lives inside: its own cell.
dlopens the extension into its own address space.
The honest edge: gVisor doesn't implement every syscall
gVisor re-implements the Linux syscall surface in userspace — and it does not cover all of it. Exotic extensions that lean on io_uring, certain mmap/hugepage/direct-IO tricks (low-level, high-performance ways of talking to disk and memory that bypass the kernel's normal slower paths), or some perf-counter interfaces may run noticeably slower under emulation, or fail outright when they hit an unimplemented or restricted call. This is a real, known limitation, not a theoretical one — most extensions are fine, but the long tail is not.
The escape hatch: microVM escalation tier
When a tenant's extension trips gVisor's syscall surface — or simply demands maximum isolation — the Conductor reschedules that tenant's cell onto a Kata / Firecracker microVM: a lightweight VM running a real Linux kernel under hardware virtualization. Full syscall compatibility (it's an actual kernel, so io_uring and friends just work) plus stronger hardware-virt isolation — paid for in higher per-cell cost, lower density, and a slower cold start. gVisor stays the default for density and wake speed; the microVM is the compatibility-and-extra-isolation tier you escalate into, not the baseline.
| Dimension | gVisor (default) | microVM (escalation) |
|---|---|---|
| Isolation strength | Strong — userspace kernel, narrow host surface | Strongest — hardware-virt, real kernel boundary |
| Cold start | Fast — light sandbox, quick wake | Slower — boot a microVM kernel |
| Density | High — many cells per node | Lower — VM overhead per cell |
| Syscall compatibility | Partial — exotic calls slow or unsupported | Full — it's a real Linux kernel |
| Use as | Baseline for ~all tenants | Opt-in for exotic .so or premium isolation |
Operational guardrails
Containment is not only the sandbox. Three guardrails apply to every cell regardless of tier, so an arbitrary extension can't quietly become an exfiltration or cost problem.
Default-deny egress
A cell's network is closed by default. A crypto-miner has nowhere to dial; a data-exfil extension has no route out. Allow-listed destinations are opened explicitly per tenant.
Quotas + cost caps
Per-cell CPU/memory cgroup limits and spend caps. A runaway native loop OOM-kills or throttles inside its own cell — it self-limits, and the bill can't surprise you.
ABI / version matrix
An .so built against PG16's ABI won't load in PG17. The catalog is therefore per-PG-major, and a major-version upgrade forces a rebuild/relink of every native extension.
Net
Catalog for the 95% who want pgvector/PostGIS with zero compile; raw upload + gVisor for the power user; microVM for the exotic tail. Default-deny egress, quotas, and a per-major ABI matrix keep it honest. The cell boundary guarantees the worst case is one tenant breaking their own database.
10 Security: authn, authz, blast radius
Tessera's security model is a consequence of its architecture, not a layer bolted on top. Because every tenant is its own isolated Postgres cell — its own process, its own keys, its own network identity — isolation is not a feature you configure; it is the shape of the system. The job of this chapter is to make that containment explicit: who proves identity, who is allowed what, what a single breach can reach, and which component you must protect hardest.
Isolation is the product.
A shared-Postgres multi-tenant system spends enormous effort keeping tenants apart inside one process. Tessera never puts them together: there is no shared buffer cache, no shared catalog, no shared superuser, no shared WAL (the write-ahead log — an append-only journal of every change, written before the change lands, so a crash can be replayed). The default blast radius of almost every failure is exactly one tenant, because that is the only granularity the substrate knows about.
Defense in depth — five concentric layers
Each layer assumes the one outside it may have failed. An attacker who defeats the edge still hits a default-deny network; one who defeats the network still lands in a syscall-filtered sandbox; one who escapes the sandbox still reaches data encrypted under a key they do not hold.
Identity & authentication
The Gateway is the only component that ever sees a raw client connection. It terminates TLS, runs SCRAM-SHA-256 (challenge-response, so the password never crosses the wire), and resolves the tenant from the verified identity before routing. Enterprise tenants can layer mTLS (client presents a certificate) or delegate to an external OIDC identity provider — both validated at the edge so a cold cell never has to handle untrusted handshakes during wake. All credentials and signing material live in a secret store and arrive as injected environment values; nothing sensitive is baked into an image or committed to code.
Authorization & the no-superuser-surface win
Inside a cell, authorization is ordinary Postgres: per-tenant roles, GRANTs, row policies — but scoped to a database only that tenant occupies. The subtle, structural win is what is absent. In shared multi-tenant Postgres, a privileged hook (e.g. a pg_tle clientauth/passcheck callback) runs cluster-wide as superuser, so one tenant's malicious hook can intercept every tenant's logins. In Tessera each cell's superuser is scoped to that one isolated instance. There is no cluster-wide hook plane to hijack — the escalation target simply does not exist. The Conductor's own control-plane API is itself authenticated and least-privileged: it issues fencing leases (a time-limited token that crowns exactly one writer — like a baton only one runner may hold, so a stale ex-leader can't keep writing) and placement decisions, and nothing else can.
The path a request is forced to take
[Diagram: the forced request path. Client → Gateway (TLS + SCRAM, tenant resolve) → default-deny netpol → Tenant Cell (gVisor pod). From the Cell, only two paths lead onward: its own stream in the Journal (tenant WAL) and getPage@LSN to the Pagestore (encrypted S3). The Conductor (lease + placement) fences the Cell's epoch.]
Data at rest — crypto isolation
The Pagestore writes every page and WAL segment to S3 under per-tenant envelope encryption: a per-tenant data key encrypts the bytes, and that key is itself wrapped by a root key in a KMS. Decryption authority is therefore scoped per tenant. An object accidentally exposed, an S3 ACL misconfigured, or a backup copied to the wrong bucket all degrade to "ciphertext for one tenant" rather than "plaintext for everyone." Crypto isolation is the backstop for the failure modes that bypass the network and process layers entirely.
Threat register — what each attack can actually reach
| Threat | Containment | Blast radius |
|---|---|---|
| Malicious / buggy extension | gVisor syscall filter; microVM escalation tier; runs as the tenant's own scoped superuser | One tenant — no cluster-wide hook plane to abuse |
| Noisy neighbor (CPU / IO / memory) | Separate pods with K8s resource limits; one ordered WAL stream per cell; no shared buffer cache | One tenant — throttled, not co-resident |
| Single cell fully compromised | Default-deny netpol (own Journal + Pagestore only); no tenant-to-tenant routes; scoped creds + keys | One tenant's live data; cannot pivot laterally |
| Stolen tenant credentials | SCRAM/mTLS at Gateway; per-tenant roles; rotate + revoke at edge; optional OIDC re-auth | That one tenant, until credentials are rotated |
| Conductor compromise | Consensus-backed (etcd/Raft), audited, least-privilege; holds fencing + placement, never tenant plaintext | Worst case — fleet-wide control; protected hardest |
The crown jewel: protect the Conductor hardest.
The Conductor is the only component whose compromise is fleet-wide: it issues the single-writer epoch lease (the fence) and decides cell placement, so an attacker who owns it can mis-fence writers or schedule cells onto hostile nodes. It does not hold tenant data-encryption keys — but it holds the keys to the kingdom of orchestration. Mitigations are deliberately conservative: consensus-backed state (a majority of nodes must agree before anything counts as decided — like requiring 2 of 3 signatures, so no single machine can act alone) with no single-node authority, a least-privilege and fully audited control-plane API, and the smallest possible attack surface. We accept the rest of the system being "merely" well-isolated; the Conductor gets the paranoid budget.
What this buys you
Blast radius defaults to one tenant for every threat except control-plane compromise. No shared superuser, no cross-tenant network path, no shared plaintext at rest — three classic multi-tenant escalation routes are structurally absent.
Where the sharp edges are
The Conductor is a genuine single point of catastrophic failure and must be over-invested in. Stolen tenant creds still expose that tenant until rotation. gVisor narrows but does not eliminate kernel-escape risk — hence the microVM escalation tier for the riskiest extensions.
11 The Kubernetes object model
Tessera is not "Postgres that happens to run in a pod." It is a small set of custom resources reconciled by an operator. Two CRDs carry all desired state — Tenant (what the customer asked for) and Cell (a concrete running or parked instance of that tenant) — and the Conductor's job is to make the cluster match them. Everything else is a deliberate choice of which stock Kubernetes object best fits each plane's failure model.
The two CRDs
Tenant is desired state: plan (free / pro / dedicated), size class, region, and the extension set the tenant wants loaded. It changes rarely — when a customer upgrades a plan or adds pgvector. Cell is the materialization: a pod-shaped instance keyed by tenant_id, carrying its current phase (warm / parking / parked / waking), its epoch/term lease, and its scheduled node. The operator owns the Tenant→Cell mapping; humans only ever touch Tenant.
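As a rough model of the two resources, the fields named above map onto two small record types. The class names, the `desired_cell` helper, and the choice to create a new Cell in the parked phase at epoch 0 are assumptions made for illustration, not the operator's real schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

PLANS = {"free", "pro", "dedicated"}


class Phase(Enum):
    WARM = "warm"
    PARKING = "parking"
    PARKED = "parked"
    WAKING = "waking"


@dataclass(frozen=True)
class Tenant:
    """Desired state: what the customer asked for. Humans edit only this."""
    tenant_id: str
    plan: str
    size_class: str
    region: str
    extensions: frozenset

    def __post_init__(self) -> None:
        if self.plan not in PLANS:
            raise ValueError(f"unknown plan: {self.plan}")


@dataclass
class Cell:
    """Materialization: a concrete running or parked instance of one tenant."""
    tenant_id: str
    phase: Phase
    epoch: int
    node: Optional[str]


def desired_cell(tenant: Tenant, existing: Optional[Cell]) -> Cell:
    """Operator-owned Tenant-to-Cell mapping: exactly one Cell per Tenant."""
    if existing is not None:
        if existing.tenant_id != tenant.tenant_id:
            raise ValueError("Cell belongs to a different tenant")
        return existing
    # Assumption: a brand-new Cell starts parked, with no node and no lease yet.
    return Cell(tenant.tenant_id, Phase.PARKED, epoch=0, node=None)
```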
The Conductor is a Kubernetes operator — a reconcile loop watching Tenant and Cell objects — backed by a consensus store (etcd/Raft — a cluster of nodes that vote on every change so a majority always agrees on one official answer) that holds the things Kubernetes' own object store is too weak to hold safely: the single-writer epoch leases (the fence that guarantees exactly one writer per tenant), topology, and placement decisions. The K8s API server stores intent; etcd stores the facts that must be linearizable (every reader sees the latest write — no stale answers, as if there were one global clock).
[Diagram: Tenant CRD (plan · size · region · extensions) → Conductor; Cell CRD (phase · epoch lease · node) → Conductor; Conductor (Operator + etcd) → RuntimeClass (gVisor / Kata) → Cell Pods (OrioleDB-PG · single-writer); Conductor → Gateway (Deployment + HPA), Journal (StatefulSet · quorum), Pagestore (StatefulSet · NVMe cache); Cell Pods → Journal → Pagestore → S3 (bottomless truth).]
Why each object, not a generic one
Cells are operator-managed Pods — not a vanilla StatefulSet. (A StatefulSet is K8s' built-in way to run pods with stable, numbered identities and their own attached disks — think a fixed roster of named players rather than interchangeable workers.) Following the CloudNativePG pattern, the operator owns each pod's identity directly: it decides the name, the node, the lease epoch, and when the pod dies and is replaced. A StatefulSet's ordinal-and-template model is too rigid here — a tenant's cell is launched, parked to zero, rescheduled onto a different node, and woken under a fresh epoch, and the operator needs to drive each of those transitions explicitly while handing out the single-writer fence. The cell holds no durable local state (local NVMe is a cache; truth is Journal + S3), so the pod is disposable: kill it and it recovers by replaying WAL (the write-ahead log — an append-only journal of every change, written before it is applied, so any crash can be replayed forward).
RuntimeClass is the sandbox lever. Each Cell pod sets runtimeClassName — gvisor by default (the user-space kernel that intercepts syscalls for arbitrary native-C extensions), or kata for the escalation tier (a real microVM for extensions that trip gVisor's syscall surface or need maximum isolation). The Conductor picks the class per cell from the Tenant's extension set, so isolation strength is data-driven, not baked into one cluster-wide runtime.
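A minimal sketch of that data-driven choice follows. `runtimeClassName` is a real Pod spec field; the extension name in `NEEDS_MICROVM`, the helper names, and the image reference are placeholders invented for the example.

```python
# Hypothetical classification: extensions known to need the microVM tier.
NEEDS_MICROVM = frozenset({"example_raw_io_ext"})


def runtime_class_for(extensions, needs_microvm=NEEDS_MICROVM) -> str:
    """Pick the sandbox per cell from the Tenant's extension set."""
    if any(ext in needs_microvm for ext in extensions):
        return "kata"    # escalation tier: a real microVM
    return "gvisor"      # default: user-space kernel sandbox


def cell_pod_manifest(tenant_id: str, extensions) -> dict:
    """Build a Pod manifest (as a plain dict) for one tenant's Cell."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"cell-{tenant_id}", "labels": {"tenant": tenant_id}},
        "spec": {
            "runtimeClassName": runtime_class_for(extensions),
            # Placeholder image name, not a published artifact.
            "containers": [{"name": "postgres", "image": "registry.invalid/tessera-cell:dev"}],
        },
    }
```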
Scale-to-zero is the operator's job, not an HPA's
Plain Horizontal Pod Autoscaling scales on CPU/memory metrics and assumes a load-balanced pool — wrong model for a single-writer cell. Tessera parks a cell when the Gateway signals "no active connections" for this tenant (the Knative-activator pattern): the operator scales that one pod to zero. Wake is connection-driven — a new client arrives, the Gateway holds the socket, and the operator must boot the pod and complete the epoch-lease handshake before traffic flows. That ordered wake + fence is something an HPA cannot express, so the operator owns park/wake end to end.
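The ordering constraint (lease first, traffic second) is easiest to see as a small state machine. The sketch below assumes the four phases named in the Cell CRD and a hypothetical `acquire_lease` callback standing in for the epoch-lease handshake; it is illustrative only.

```python
class CellLifecycle:
    """Park/wake transitions for one tenant's cell."""

    _ALLOWED = {
        ("warm", "parking"),
        ("parking", "parked"),
        ("parked", "waking"),
        ("waking", "warm"),
    }

    def __init__(self, tenant_id: str) -> None:
        self.tenant_id = tenant_id
        self.phase = "parked"
        self.lease_epoch = None

    def _move(self, new_phase: str) -> None:
        if (self.phase, new_phase) not in self._ALLOWED:
            raise RuntimeError(f"illegal transition {self.phase} -> {new_phase}")
        self.phase = new_phase

    def park(self, active_connections: int) -> bool:
        """Park only when the Gateway reports no active connections."""
        if active_connections > 0:
            return False
        self._move("parking")
        self.lease_epoch = None  # the write lease is dropped before the pod goes away
        self._move("parked")
        return True

    def wake(self, acquire_lease) -> None:
        """Boot, then complete the lease handshake, and only then go warm."""
        self._move("waking")
        self.lease_epoch = acquire_lease(self.tenant_id)
        self._move("warm")

    def accepts_traffic(self) -> bool:
        return self.phase == "warm" and self.lease_epoch is not None
```

A metrics-driven autoscaler has no place to put the `acquire_lease` step, which is the reason the text gives for keeping park/wake inside the operator.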
The durable planes get the stock objects whose semantics already match. The Gateway is a stateless Deployment + HPA — an always-on front door that genuinely is a load-balanced pool, exactly what HPA is for. The Journal is a StatefulSet: WAL durability needs quorum members (a write only counts once a majority of replicas agree — like needing 2 of 3 signatures) with stable network identity and durable PVCs (PersistentVolumeClaims — a pod's request for a disk that survives the pod being killed) so a restarted Journal member (a "safekeeper" in Neon's terms) rejoins as the same voter. The Pagestore is also a StatefulSet, its PVCs holding the NVMe page cache that fronts S3 — stable identity lets a restarted pagestore reuse its warm cache instead of re-fetching from object storage. The four control-plane workers — placement, park/wake, catalog + signing, and the orphan-claim reaper — are plain Deployments: stateless reconcilers that read truth from etcd.
| Component | K8s object | Scaling | Why this object |
|---|---|---|---|
| Tenant Cell | Operator-managed Pod | Scale-to-zero (operator park/wake) | Operator owns pod identity + epoch lease; disposable, replays WAL |
| Sandbox selection | RuntimeClass | Per-cell | gvisor default, kata escalation — set from Tenant extension set |
| Gateway | Deployment + HPA | Horizontal on load | Stateless front door; genuine load-balanced pool |
| Journal | StatefulSet | Fixed quorum members | Stable identity + durable PVCs for WAL voters |
| Pagestore | StatefulSet | By cache footprint | NVMe cache PVCs; stable identity keeps cache warm across restarts |
| Conductor | Operator + etcd | Leader-elected | Reconciles CRDs; etcd holds linearizable leases + topology |
| Control workers | Deployments | Horizontal | Stateless reconcilers (placement, park/wake, catalog+signing, reaper) |
Sharp edge: two stores of truth
Desired state lives in the K8s API server (the CRDs); the linearizable facts — epoch leases, who-is-writer-now — live in etcd. These can disagree during a partition: the API server might still show a Cell as warm while etcd has already revoked its lease. The fence resolves it safely — the Journal rejects stale-epoch writes regardless of what the CRD says — but the operator must always treat etcd, never the CRD status, as the authority on who currently holds the write lease.
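One way to express "etcd is the authority" is to never consult the CRD phase when asking who may write. The sketch below uses plain dicts as stand-ins for the lease store and the CRD status list; the record shapes and function names are assumptions.

```python
def holds_write_lease(pod_name: str, tenant_id: str, lease_store: dict) -> bool:
    """Authority check: reads only the consensus-backed lease store."""
    lease = lease_store.get(tenant_id)
    return lease is not None and lease["holder"] == pod_name


def stale_cells(crd_statuses: list, lease_store: dict) -> list:
    """Cells whose CRD status still says warm but whose lease is gone or reassigned."""
    return [
        status["pod"]
        for status in crd_statuses
        if status["phase"] == "warm"
        and not holds_write_lease(status["pod"], status["tenant_id"], lease_store)
    ]


# Example: a partition left tenant "a" looking warm in the CRD while the
# lease store has already moved the lease to a replacement pod.
crd = [
    {"tenant_id": "a", "pod": "cell-a-old", "phase": "warm"},
    {"tenant_id": "b", "pod": "cell-b", "phase": "warm"},
]
leases = {
    "a": {"holder": "cell-a-new", "epoch": 5},
    "b": {"holder": "cell-b", "epoch": 2},
}
print(stale_cells(crd, leases))
```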
12 The tradeoffs ledger
Every choice in this blueprint is a bet, and an honest bet has a bill. This chapter puts the whole bill on the table at once. The stance is not "we are fast and consistent and cheap" — that is a lie every database tells. The stance is narrow and deliberate: throughput and speed may be sacrificed; consistency and reliability may not. Tessera spends raw single-tenant write scale, cross-tenant atomicity, warm density, and tail latency on purpose, in exchange for guarantees it refuses to bend. What follows is the ledger — and, importantly, why it is a coherent bet rather than a pile of unrelated compromises.
The stance, restated
Consistency and reliability are load-bearing; throughput is the budget we pay out of. Every row in the ledger below moves in that one direction — never the reverse. If a future change asked us to trade a guarantee for speed, it would violate the bet, not refine it.
Why these spends are one bet, not seven accidents
The tenant-as-isolated-cell decision is the keystone, and almost every cost below is its shadow. A single-writer cell is what makes serializability free — and the same single writer is exactly why there is no intra-tenant write scale-out. Per-tenant Postgres processes are what make arbitrary native-C extensions and scale-to-zero possible — and the same per-process isolation is what makes warm density worse than schemas-in-one-database. Parking a cell to $0 is what crushes the long-tail bill — and the same parking is what creates the cold-start tail on the first query back. You cannot keep the wins and refund the costs; they are the front and back of the same coin. That is the test of a real design: the sacrifices are load-bearing, not lazy.
What never bends
- Strong per-tenant consistency — serializable (SSI — the strongest isolation level: transactions run as if one-at-a-time, never seeing each other's half-done work) + read-your-writes, one logical timeline per tenant
- Durability & reliability — quorum-committed (a majority of replicas must store a write before it counts — like needing 2 of 3 signatures) WAL (the write-ahead log — an append-only journal of every change, written before the change is applied so a crash can be replayed) before COMMIT acks, epoch fencing against split-brain (when two nodes both think they're the leader; fencing locks the stale one out, like revoking an old keycard so only the current holder can write), replay-before-serve on resume
- Hard isolation — one tenant per Postgres process, gVisor (a sandbox that intercepts a container's kernel calls and runs them in a userspace shim, so a compromised tenant can't reach the host kernel directly) — or microVM — sandbox, per-tenant crypto and keys
- Scale-to-zero — a parked tenant costs $0 of compute; idle is free, not discounted
- Arbitrary extensibility — vetted catalog and raw native-C .so upload (a compiled shared-library file that loads straight into the database process — native machine code, not sandboxed script), because each cell is a real Postgres
- Full-Postgres SQL surface — not a wire-compatible subset; the actual engine, the actual planner
What we spend
- Raw single-tenant throughput — one writer per tenant; no intra-tenant write scale-out in v1
- Cross-tenant ACID — forbidden in v1; cross-tenant work is app-level & idempotent
- Multi-region by default — single-region baseline; geo-redundancy is a per-tenant premium tier
- Warm density — isolated cells cost more than schemas-in-one-PG while warm
- Cold-start tail latency — the first query after a park pays the wake cost
- Operational complexity — many small instances + a real control plane to herd them
- Bleeding-edge risk — OrioleDB storage engine is beta; gVisor has syscall quirks
The ledger: what we sacrifice, to gain, and how we blunt it
A sacrifice with no mitigation is a flaw; a sacrifice with a named mitigation is a design. Each row pairs the cost with what it buys and the specific mechanism that keeps the cost from becoming a wound.
| We sacrifice | To gain | Mitigation |
|---|---|---|
| Intra-tenant write scale-out (single writer per tenant) | Serializable isolation + read-your-writes essentially free — one timeline, no distributed snapshot oracle | Vertical scaling per cell; future warm read-replica tier for read fan-out; per-tenant single-writer caps are rarely the bottleneck below the very largest tenants |
| Cross-tenant atomic transactions (forbidden in v1) | The tenant stays the hard transaction boundary — no 2PC (two-phase commit — the handshake that makes several databases commit or abort together; slow and failure-prone across machines), no distributed clock, no global ordering authority | App-level orchestration with idempotent, eventually-consistent contracts; cross-tenant flows are explicit and out of the commit path |
| Multi-region durability by default (single-region quorum) | Low, predictable commit latency — quorum is one local round-trip, not a cross-ocean one | Quorum survives any single node loss today; whole-region survival is an opt-in per-tenant premium tier (cross-region quorum, higher commit latency) |
| Warm density (isolated cells > schemas-in-one-PG while warm) | True isolation: per-tenant extensions, crashes, CPU, memory, and security blast-radius contained to one tenant | Scale-to-zero on the long tail — the many idle tenants cost $0, so the density penalty applies only to the warm working set, not the whole fleet |
| Cold-start tail latency (first query after park) | Parked = $0 compute; the long tail of idle tenants is free instead of perpetually provisioned | Gateway holds the client connection during wake (Knative-activator pattern — a proxy parks the incoming request, triggers a cold start, then forwards once the instance is live, so the caller just sees a slow first response); local NVMe page cache + Pagestore warm prefetch shrink replay; aggressive parking only on cells past an idle threshold |
| Operational complexity (many small instances + control plane) | Independent lifecycle, placement, and failure domain per tenant — no noisy-neighbor coupling | Conductor (K8s operator + etcd/Raft — Raft is a consensus protocol that keeps a cluster of nodes agreeing on one source of truth even as some fail, the way a committee follows a single elected chair) automates reconcile (the operator continuously nudges actual state toward the declared desired state), lease/fence, wake/park, placement, and the orphan-claim reaper; the complexity is concentrated in one well-tested plane, not smeared across tenants |
| Bleeding-edge storage risk (OrioleDB beta) | No vacuum, no freeze, no XID wraparound: the maintenance subsystem ceases to exist, which is what makes cold-parking truly free. (Vacuum is Postgres's background garbage collector that reclaims dead row versions; freeze is the housekeeping pass that stops the 32-bit transaction counter from wrapping around and corrupting data, the failure known as XID wraparound. A parked tenant would otherwise still owe all of these chores.) | S3 is the bottomless source of truth (engine-independent durable substrate); WAL replay rebuilds any cell; pin known-good versions, stage upgrades per-tenant, and keep a standard-Postgres fallback path open |
| gVisor syscall surface quirks (a syscall is how a program asks the kernel to do something privileged — open a file, send a packet; gVisor reimplements these, so a few rarely-used ones behave differently) | Default-strong sandbox for arbitrary native-C extensions without a full VM per cell | microVM escalation tier (Kata/Firecracker — lightweight virtual machines that boot in milliseconds, giving each cell its own real kernel instead of a shared shim) for extensions that need the real kernel ABI (the binary contract between a program and the kernel — exact syscall numbers, struct layouts, calling conventions) or maximum isolation — same cell contract, heavier sandbox |
Where the bet could be wrong
Honesty cuts both ways. Three edges are worth naming plainly. First, OrioleDB is beta — the entire no-vacuum win rides on a storage engine that has not been battle-hardened at the scale a multi-tenant platform implies; the mitigation (S3 as engine-independent truth, a standard-Postgres fallback) is real but not free to exercise. Second, cold-start tail latency is a felt cost, not a hidden one — a user hitting a parked tenant after a long idle will wait for wake + replay; we make that wait correct and bounded, but we do not make it zero. Third, the single-writer cap is a hard ceiling for a genuinely write-heavy mega-tenant — vertical scaling and read replicas push it out, but a tenant that needs sharded writes is, by design, not a fit for v1. None of these are accidents. They are the agreed price of a database that will not lie to you about consistency or durability — and naming them is what makes the bet trustworthy.
The coherent bet
Spend throughput, density, breadth-of-geography, and a little maturity. Keep consistency, durability, isolation, extensibility, and the right to scale to $0. The spends all flow from the same keystone choice as the wins — which is exactly why this reads as one decision with a bill attached, not seven separate retreats.
13 Build vs. borrow
Here is the reassuring headline before the roadmap: almost nothing in Tessera is genuinely from scratch. The overwhelming majority is integrating battle-tested open source. The real novelty is concentrated in exactly one property — sandboxed self-service native extensions — plus the welding that fuses the proven parts into one tenant-first lifecycle. Every component below is sorted into one of three buckets: ♻️ Reuse, 🔧 Inspired, or 🏗️ From scratch.
| Component | Bucket | What you actually do | Reference / code to pull |
|---|---|---|---|
| Cell SQL engine | ♻️ Reuse | Adopt Postgres unchanged. | PostgreSQL |
| No-vacuum storage engine | ♻️ Reuse | Adopt it as the cell engine — you don't write a no-vacuum engine. | OrioleDB (PostgreSQL License) |
| Trusted-language extensions | ♻️ Reuse | Covers ~90% of extension needs for free, safely. | pg_tle (Apache-2.0) |
| Sandbox runtimes | ♻️ Reuse | Configure RuntimeClasses; you don't build a sandbox. | gVisor + Kata / Firecracker |
| Consensus / leases / topology | ♻️ Reuse | Store the epoch leases & cell placement. | etcd / the Kubernetes API |
| Operator framework | ♻️ Reuse | Scaffolding to build the Conductor on. | kubebuilder / controller-runtime |
| Object + local storage | ♻️ Reuse | Durable truth + NVMe cache. | S3 / MinIO · OpenEBS / Mayastor (CSI) |
| Auth & network primitives | ♻️ Reuse | TLS, SCRAM, OIDC, default-deny, encryption-at-rest. | cert-manager · SCRAM (in PG) · Cilium / Calico · KMS / Vault |
| Pooler base for the Gateway | ♻️ Reuse | The foundation you build the Gateway on top of. | PgBouncer / pgcat / Supavisor |
| Journal (WAL quorum durability) | 🔧 Inspired | Fork or reimplement; rework for per-tenant streams + OrioleDB WAL + the epoch fence. | Neon safekeeper (Apache-2.0) |
| Pagestore (page materialization) | 🔧 Inspired | Neon's pageserver replays stock-heap WAL; OrioleDB's format breaks that, so this goes custom (or build on OrioleDB's beta S3 mode). | Neon pageserver / OrioleDB S3 mode |
| Gateway (route + hold-during-wake) | 🔧 Inspired | Add the connection-hold + tenant routing + Conductor signalling on top of a pooler. | Neon-proxy / Supavisor pattern |
| Conductor (operator + lifecycle) | 🔧 Inspired | Build the Tenant/Cell CRDs + reconciler; borrow "operator owns pod identity." | CloudNativePG |
| Scale-to-zero (park / wake) | 🔧 Inspired | Wire park/wake to the epoch-lease + replay-before-serve rule. | Xata's CNPG scale-to-zero plugin (Apache-2.0) + Knative activator |
| Epoch / lease fencing protocol | 🔧 Inspired | Postgres SSI gives serializability free; you build the fence protocol around it. | Neon generation numbers / Raft fencing literature |
| Self-service native-C extensions, sandbox-isolated | 🏗️ From scratch | The upload → scan → sign → ABI/PG-version build → load-in-gVisor → syscall-compat probe → auto-escalate to microVM pipeline. Zero prior art — this is the moat. | — (gVisor / Kata are the only off-the-shelf inputs) |
| The tenant-cell lifecycle "weld" | 🏗️ From scratch | The exact state machine fusing park/wake + epoch-fence + storage-attach + replay-before-serve into one correct flow. The primitives are reusable; this integration is new. | — |
The novel surface is tiny — that's the good news
Only two rows are marked 🏗️ From scratch. The native-C-extension-in-a-sandbox pipeline is the one thing the open-source survey found zero projects doing (managed or not), and the lifecycle weld is bespoke glue over otherwise-proven primitives. Everything else is "assemble parts that already run in production." You are integrating, not inventing — which is exactly what makes a one-person "for fun" build tractable rather than a research moonshot.
Killing vacuum is what costs you the most reuse
Notice the trade hiding in the table: had you kept stock-heap Postgres, the Journal and Pagestore would largely drop to ♻️ Reuse — you could fork Neon's Apache-2.0 safekeeper and pageserver almost directly. Choosing the OrioleDB no-vacuum engine is what pushes the whole storage tier up into 🔧 custom, because Neon's pageserver assumes stock-heap WAL. The no-vacuum elegance is paid for in lost reuse — the same tension flagged in chapter 15 (risks), where the fallback is to keep this architecture on stock heap if OrioleDB blocks.
The roadmap, next, sequences this so the borrowed pieces land first (cheapest, highest-confidence) and the two from-scratch pieces come last — by which point everything they depend on is already standing.
14 How you’d actually build it
A blueprint is only buildable if it has a build order. This is the one we’d ship — phased so each step de-risks the next and proves a load-bearing claim before the next claim depends on it. The ordering principle is blunt: the riskiest, highest-ROI component goes first, and the first thing that is genuinely hard and genuinely reusable is WAL-durability-as-a-service (the WAL, or write-ahead log, is an append-only journal of every change — written before the change is applied, so a crash can always be replayed). Everything else (statelessness, scale-to-zero, extensions) is bought by the disaggregation that the Journal forces into existence.
First shippable slice = P0–P2
A single durable, stateless tenant: a Cell that writes through a quorum Journal (a quorum means a majority of replicas must acknowledge a write before it counts — like requiring 2 of 3 signatures), materializes pages in a Pagestore backed by S3, and can be killed and rescheduled onto another node where it replays WAL and resumes. That slice is a real product — one tenant, no scale-to-zero, no extensions, no fancy Gateway — and it proves the consistency + durability + relocatability core. Everything after P2 is leverage on top of a substrate that already works.
The staircase
Each phase below states what it de-risks — the single claim it exists to prove — so you can stop, ship, and learn at any step rather than betting the whole design on a big-bang integration.
P0 Foundations — Cell, pass-through Gateway, one CRD
Build the Cell image: stock Postgres with the OrioleDB storage engine (undo-log MVCC + 64-bit XIDs — no vacuum, no freeze) running against a local disk. MVCC (multi-version concurrency control) means readers see a consistent earlier version of a row while a writer changes it, so readers never block writers; OrioleDB keeps those earlier versions in an undo log rather than as dead row copies in the table, and the 64-bit transaction ids dodge the housekeeping (vacuum/freeze) that stock Postgres needs to avoid running out of 32-bit ids. Stand up a pass-through Gateway that just speaks the Postgres wire protocol (the byte-level language a Postgres client and server use to talk) and forwards bytes, plus one Tenant CRD that a hand-rolled controller turns into a pod. No durability story yet, no statelessness — just prove one tenant boots, serves SQL, and persists locally.
P1 Journal — WAL-durability-as-a-service
The highest-ROI component, built first. The Cell stops trusting its local disk for durability and instead appends its single ordered WAL stream to a quorum Journal keyed by tenant_id; a commit acks only after majority-persist. Ship the epoch/term lease fence in the same phase — the Journal rejects appends from a stale epoch, which is what makes split-brain structurally impossible. If this phase is right, the rest of the system inherits correctness; if it’s wrong, nothing above it matters.
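The two rules in this phase (ack only after a majority persists, reject any append from a stale epoch) fit in a short in-memory model. This is a sketch under simplifying assumptions: the class names are hypothetical, replicas are Python objects rather than networked nodes, and the fence lives in one place instead of being checked by every replica.

```python
class StaleEpoch(Exception):
    """Raised when a writer presents an epoch below the current fence."""


class Replica:
    def __init__(self, up: bool = True) -> None:
        self.up = up
        self.log = []

    def persist(self, record) -> bool:
        if not self.up:
            return False
        self.log.append(record)
        return True


class Journal:
    def __init__(self, replicas) -> None:
        self.replicas = list(replicas)
        self._fence = {}  # tenant_id -> highest epoch ever granted

    def grant(self, tenant_id: str, epoch: int) -> None:
        """Record a new lease epoch; epochs must strictly increase."""
        if epoch <= self._fence.get(tenant_id, 0):
            raise StaleEpoch(f"epoch {epoch} is not newer than the fence")
        self._fence[tenant_id] = epoch

    def append(self, tenant_id: str, epoch: int, lsn: int, payload: bytes) -> bool:
        """Return True only if a majority persisted the record (safe to ack COMMIT)."""
        if epoch < self._fence.get(tenant_id, 0):
            raise StaleEpoch(f"writer epoch {epoch} has been fenced")
        record = (tenant_id, epoch, lsn, payload)
        acks = sum(replica.persist(record) for replica in self.replicas)
        # A real Journal must also clean up records left on a minority.
        return acks >= len(self.replicas) // 2 + 1
```

Once a newer epoch is granted, the old writer's appends fail no matter what it believes about its own lease.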
P2 Pagestore + S3 — disaggregate the pages
Stand up the Pagestore: it ingests committed WAL from the Journal, materializes versioned pages, and serves getPage@LSN back to Cells — an LSN (log sequence number) is just a monotonic position in the WAL, like a byte-offset bookmark, so “page at this LSN” means “this page as of that exact point in history.” S3 becomes the bottomless source of truth; the Cell’s local NVMe is demoted to a pure cache. Now a Cell holds no durable local state — kill it, reschedule it onto any node, replay WAL to the quorum-committed LSN, and it resumes. This is the phase that makes the Cell truly stateless, which is the precondition for everything that parks or moves a tenant.
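The "page as of that point in history" lookup reduces to a sorted search over each page's versions. A minimal in-memory sketch, assuming page images arrive in LSN order; `ToyPagestore` and its method names are illustrative, and real page materialization from WAL records is omitted.

```python
import bisect
from collections import defaultdict


class ToyPagestore:
    """Keeps every version of every page, indexed by the LSN that produced it."""

    def __init__(self) -> None:
        self._lsns = defaultdict(list)    # page_id -> ascending LSNs
        self._images = defaultdict(list)  # page_id -> images, parallel to _lsns

    def ingest(self, page_id: int, lsn: int, image: bytes) -> None:
        lsns = self._lsns[page_id]
        if lsns and lsn <= lsns[-1]:
            raise ValueError("WAL must be ingested in increasing LSN order")
        lsns.append(lsn)
        self._images[page_id].append(image)

    def get_page_at_lsn(self, page_id: int, lsn: int) -> bytes:
        """Return the newest version of the page written at or before `lsn`."""
        lsns = self._lsns.get(page_id, [])
        idx = bisect.bisect_right(lsns, lsn)
        if idx == 0:
            raise KeyError(f"page {page_id} does not exist at LSN {lsn}")
        return self._images[page_id][idx - 1]


store = ToyPagestore()
store.ingest(page_id=1, lsn=100, image=b"v1")
store.ingest(page_id=1, lsn=200, image=b"v2")
print(store.get_page_at_lsn(1, 150))  # the version written at LSN 100
```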
P3 Gateway as a real control point
Promote the pass-through proxy into the routing plane: terminate TLS, do authn (SCRAM-SHA-256) + authz, and route each connection to the right tenant’s Cell. Add the trick that the whole economics later depends on — holding the client connection while a target Cell is unavailable (the Knative-activator pattern). At P3 the hold covers reschedule/replay; at P4 the exact same hold covers cold-wake.
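The hold itself can be sketched with `asyncio`: the first connection for an unavailable tenant triggers one wake, later connections wait on the same wake, and every caller gives up after a budget. `HoldingGateway`, `wake_cell`, and `hold_budget_s` are hypothetical names, and wire-protocol handling is left out entirely.

```python
import asyncio


class HoldingGateway:
    """Holds callers while a tenant's cell becomes available."""

    def __init__(self, wake_cell, hold_budget_s: float) -> None:
        self._wake_cell = wake_cell      # async callable: tenant_id -> cell address
        self._budget = hold_budget_s
        self._wakes = {}                 # tenant_id -> in-flight or finished wake task

    async def route(self, tenant_id: str) -> str:
        task = self._wakes.get(tenant_id)
        if task is None:
            # First caller starts the wake; concurrent callers reuse it.
            # Simplification: a failed wake stays cached in this sketch.
            task = asyncio.create_task(self._wake_cell(tenant_id))
            self._wakes[tenant_id] = task
        try:
            # shield() keeps the wake running even if this caller times out.
            return await asyncio.wait_for(asyncio.shield(task), self._budget)
        except asyncio.TimeoutError:
            raise ConnectionError(f"cell for {tenant_id} not ready within budget") from None


wake_calls = []


async def fake_wake(tenant_id: str) -> str:
    wake_calls.append(tenant_id)
    await asyncio.sleep(0.01)  # stands in for boot + lease handshake + replay
    return f"cell-{tenant_id}:5432"


async def demo() -> list:
    gateway = HoldingGateway(fake_wake, hold_budget_s=1.0)
    return await asyncio.gather(gateway.route("a"), gateway.route("a"))


results = asyncio.run(demo())
```

Because the same `route` call covers both reschedule and cold wake, nothing in the client-facing path needs to change between P3 and P4.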
P4 Scale-to-zero — park & wake
The park/wake state machine, driven by the Conductor: idle tenants are cold-parked (the Cell pod goes away entirely; truth stays in Journal + S3), and the next connection wakes one. Add warm-catalog snapshots so a waking Cell restores relcache/catalog fast instead of cold-replaying everything, and pin a cold-start budget the Gateway-hold must stay under. No vacuum/freeze (P0) is what makes parking free — there’s no maintenance debt to settle before a tenant can sleep.
P5 Extensions — catalog, sandbox, escalation
Ship the Extension Catalog (vetted, signed, pre-built popular extensions — pgvector, PostGIS — as a cold-compile-skip fast-path) plus a raw-upload path for arbitrary native-C .so files (a .so is a shared library — compiled machine code loaded straight into the database process, so a bad one can do anything the process can). Containment is the hard half: gVisor sandbox by default (gVisor wraps the code in a lightweight intercepting layer that fakes the operating-system calls, so it can’t touch the real host kernel), a microVM (Kata/Firecracker) escalation tier — a stripped-down throwaway virtual machine, stronger isolation than a container — for extensions that trip gVisor’s syscall surface, default-deny egress from every Cell, and a published ABI/version matrix so a binary built for one Postgres minor can’t silently load into another.
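The ABI/version matrix can be read as an allow-list keyed by extension and exact Postgres version: a binary loads only if a build was published for the cell's version and its digest matches. The sketch below is one assumed shape for that check; `AbiMatrix` and the version strings are hypothetical, and signing, scanning, and the sandbox probe are out of scope.

```python
import hashlib
import hmac


class AbiMatrix:
    """Allow-list of extension binaries, keyed by (extension, exact PG version)."""

    def __init__(self) -> None:
        self._digests = {}

    def publish(self, extension: str, pg_version: str, binary: bytes) -> None:
        self._digests[(extension, pg_version)] = hashlib.sha256(binary).hexdigest()

    def may_load(self, extension: str, cell_pg_version: str, binary: bytes) -> bool:
        """True only for the exact binary published for this exact PG version."""
        expected = self._digests.get((extension, cell_pg_version))
        if expected is None:
            return False  # no build for this version: refuse rather than guess
        actual = hashlib.sha256(binary).hexdigest()
        return hmac.compare_digest(expected, actual)


matrix = AbiMatrix()
build = b"\x7fELF placeholder bytes for an extension build"
matrix.publish("pgvector", "17.2", build)
print(matrix.may_load("pgvector", "17.2", build))  # True
print(matrix.may_load("pgvector", "17.3", build))  # False
```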
P6 Conductor hardening
Move the epoch lease onto a real consensus store (etcd/Raft) — consensus is how a cluster of machines agrees on one answer even as some crash; Raft is a popular recipe for it, and etcd is a small datastore built on Raft — so the fence itself can’t split-brain, and add the orphan-claim reaper that bumps the epoch and re-leases a tenant whose Cell died mid-claim (a Railway-redeploy / OOM class of failure we’ve learned to plan for). Wire full observability — per-tenant sweep/park/wake state, lease churn, commit latency — and run chaos testing: kill Cells, partition the Journal, pause the Conductor, and assert no lost-commit and no split-brain.
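A sketch of the reaper's core decision, with the lease store modelled as a dict. The record shape, the grace period, and the choice to leave the lease unheld (so the next wake claims it under the bumped epoch) are assumptions; a real reaper would perform each update as a compare-and-swap against the consensus store.

```python
def reap_orphans(leases: dict, live_pods: set, now: float, grace_s: float) -> list:
    """Bump the epoch on every lease whose holder is gone past the grace period.

    leases maps tenant_id -> {"epoch": int, "holder": str or None, "claimed_at": float}.
    Returns the tenant ids that were reaped, in sorted order.
    """
    reaped = []
    for tenant_id in sorted(leases):
        lease = leases[tenant_id]
        holder = lease["holder"]
        if holder is None or holder in live_pods:
            continue
        if now - lease["claimed_at"] <= grace_s:
            continue  # may be a pod that is still starting; wait it out
        # Bumping the epoch fences the dead holder even if it later reappears.
        leases[tenant_id] = {"epoch": lease["epoch"] + 1, "holder": None, "claimed_at": now}
        reaped.append(tenant_id)
    return reaped


leases = {
    "a": {"epoch": 3, "holder": "cell-a-1", "claimed_at": 0.0},
    "b": {"epoch": 1, "holder": "cell-b-1", "claimed_at": 0.0},
    "c": {"epoch": 9, "holder": "cell-c-1", "claimed_at": 90.0},
}
reaped = reap_orphans(leases, live_pods={"cell-b-1"}, now=100.0, grace_s=30.0)
```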
Future — multi-region, replicas, sharding
A per-tenant multi-region premium tier (cross-region quorum, higher commit latency, opt-in — never the v1 default). Read replicas / branching built atop Pagestore’s versioned pages (a branch is a copy-on-write LSN view, not a data copy — copy-on-write means the branch shares the original pages for free and only makes a private copy of a page the instant something writes to it). And intra-tenant sharding for the rare tenant that outgrows a single writer — the deliberate escape hatch for the one assumption (one writer per tenant) we otherwise hold firm.
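The "branch is an LSN view, not a data copy" idea can be modelled with a timeline that stores only its own writes and falls back to its parent, capped at the fork point. `Timeline` and its methods are illustrative names; this is a toy model of copy-on-write, not the Pagestore's design.

```python
class Timeline:
    """A chain of page versions; a branch shares its parent's pages until it writes."""

    def __init__(self, parent=None, fork_lsn=None) -> None:
        self.parent = parent
        self.fork_lsn = fork_lsn
        self._pages = {}  # page_id -> list of (lsn, image), ascending by lsn

    def write(self, page_id: int, lsn: int, image: bytes) -> None:
        self._pages.setdefault(page_id, []).append((lsn, image))

    def read(self, page_id: int, lsn: int) -> bytes:
        # Prefer this timeline's own versions at or before the requested LSN.
        for version_lsn, image in reversed(self._pages.get(page_id, [])):
            if version_lsn <= lsn:
                return image
        if self.parent is not None:
            # Never look past the fork point on the parent.
            return self.parent.read(page_id, min(lsn, self.fork_lsn))
        raise KeyError(f"page {page_id} does not exist at LSN {lsn}")

    def branch(self, at_lsn: int) -> "Timeline":
        """Creating a branch copies nothing; it only records where it forked."""
        return Timeline(parent=self, fork_lsn=at_lsn)


main = Timeline()
main.write(1, 10, b"a")
main.write(1, 20, b"b")
dev = main.branch(at_lsn=15)   # sees page 1 as of LSN 15
before_write = dev.read(1, 100)
dev.write(1, 30, b"c")          # first private copy appears only on write
after_write = dev.read(1, 100)
```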
Why this order, in one picture
The dependency chain is almost linear: durability (P1) is what lets you trust a remote page store (P2); statelessness (P2) is what lets you park and reschedule (P4); a connection-holding Gateway (P3) is what hides both reschedule and cold-wake latency from the client; and only once a tenant is a disposable, sandboxable unit do arbitrary extensions (P5) become safe to offer.
[Diagram: P0 Foundations (local disk) → P1 Journal (quorum WAL + fence) → P2 Pagestore + S3 (stateless cell) → P3 Gateway (TLS + route + hold) → P4 Scale-to-zero (park / wake) → P5 Extensions (gVisor + microVM) → P6 Conductor (consensus + chaos).]
The sequencing risk to respect
Do not build the Gateway connection-hold (P3) or scale-to-zero (P4) before the Cell is genuinely stateless (P2). If a parked-then-woken Cell can still depend on local disk, “wake elsewhere” silently corrupts — the most expensive bug class to find late. P2 is the gate; nothing that relies on moving a tenant ships before it passes.
15 Where this is genuinely hard
A blueprint that only lists wins is a brochure. This chapter is the risk register: the seven places Tessera is actually hard, ranked by how much they can hurt, each paired with a mitigation we can build and — where one exists — a fallback that costs us a feature but not the architecture. Two are sharp enough to flag in red. The honest summary at the end: every constituent piece is proven in production; the novelty (and therefore the risk) is the combination under a strict tenant-first, consistency-first frame.
Sharpest edge 1 — OrioleDB is beta, and it owns the on-disk format
The "vacuum is gone" property comes from adopting an OrioleDB-style storage engine (undo-log MVCC + 64-bit XIDs — old row versions go to a separate undo log instead of piling up as dead rows that a background "vacuum" must later sweep, and the transaction counter is wide enough never to wrap around). OrioleDB is a real, working table access method (a pluggable backend that decides how Postgres physically stores rows on disk), but it is beta: it changes on-disk semantics across versions, its single-writer-to-S3 storage path is younger than stock Postgres, and its branching / read-replica story is weaker than Neon's mature copy-on-write (cheap instant clones that share storage and only diverge when written to — like forking a git branch). We are betting the cell's durable substrate on software that has not yet had a decade of production hardening. That is the single biggest "this could move under us" risk in the design.
Sharpest edge 2 — the Conductor is the reliability ceiling
The control plane holds fencing authority (the power to cut off an old writer so a stale node can't keep writing — like changing the locks so a fired keyholder's key stops working): it issues the single-writer epoch lease (a time-limited, numbered "you alone may write" grant, where each new grant carries a higher number so the old one is provably stale) that makes split-brain (two nodes both believing they're the live writer) structurally impossible (Ch. 07). That power cuts both ways — the entire system's reliability is bounded by the Conductor's own reliability. A Conductor that loses consensus, hands out two live leases, or can't bump an epoch promptly is the one component that can break the core invariant. This is not a place for clever code; it must be consensus-backed (etcd/Raft — a protocol where a majority of replicas must agree before any decision counts, so no single node can hand out a conflicting answer), boring, and battle-hardened. Operationally, reliability is won or lost here.
The risk register
| Risk | Severity | Mitigation | Fallback |
|---|---|---|---|
| OrioleDB maturity — beta engine; changing on-disk semantics; young single-writer→S3 path; branching/replicas weaker than Neon | High | Pin engine versions; version-gated migrations; conformance + crash-recovery test suite per release before any tenant moves | Keep the identical architecture on stock heap + a Neon-style pageserver. You lose only "vacuum is literally gone" — you'd reintroduce freeze-before-park (freezing is the maintenance pass stock Postgres must run to stop its transaction counter wrapping around; here you'd do it before idling a tenant). Everything else (cells, fencing, Journal, scale-to-zero) is unchanged. |
| Control-plane criticality — Conductor owns the fence; whole-system reliability is bounded by it | High | Consensus-backed lease store; small auditable control surface; chaos-test epoch bumps + Conductor failover; the fence is a durable Journal high-water mark, not in-memory state | None — there is no architecture without an authority for the single-writer lease. This component must simply be hardened, not designed around. |
| gVisor syscall compatibility — some native-C extensions won't run, or run slow, under the user-space syscall layer (gVisor is a sandbox that intercepts a program's kernel calls — syscalls — in software, so untrusted code can't touch the real host kernel; the cost is that some syscalls are unsupported or slower) | Medium | microVM (Kata/Firecracker — a lightweight real virtual machine with its own kernel, heavier than a sandbox but with full compatibility) escalation tier for extensions that trip gVisor; a compatibility test harness in the catalog pipeline classifies each extension gVisor-OK / needs-microVM / unsupported before it ships | Affected tenants run on the microVM tier. Residual cost: a heavier per-tenant footprint (full kernel) for those specific tenants — correctness intact, density worse. |
| Cold-start tail latency — catalog/relcache warm + WAL replay can make P99 wake painful for big-schema tenants | Medium | Periodic cell snapshots (skip full replay); partial / bounded WAL replay to the committed LSN (the WAL is the write-ahead log — an append-only journal of every change written before it's applied; replaying it up to a given LSN, a position marker in that log, rebuilds state after a cold start); keep-warm pool for predicted-hot tenants; Gateway holds the client connection during wake (Knative-activator pattern — a front door that parks the caller's request while the backend spins up from zero, then forwards it) | Tier large-schema or latency-sensitive tenants as always-warm (never park). They simply opt out of scale-to-zero; the cold path stays for the long tail. |
| Long-tail density economics — "millions of tenants" only pencils out if the tail truly parks to zero | Medium | Aggressive park policy; honest warm-set sizing; bin-pack warm cells per node; continuous measurement of park-ratio vs. revenue-per-tenant | Raise the price floor or the minimum tenant size so the unit economics hold even at a worse park-ratio. A pricing lever, not an architecture change. |
| Single-writer write ceiling — a tenant outgrowing one node's WRITE capacity has no in-tenant write scale-out in v1 | Medium | Vertical scale-up of the cell (bigger node); read fan-out via warm replicas (a later read-only tier) offloads reads, not writes | Out of scope for v1, flagged loudly. Intra-tenant sharding is a separate, heavier future project; until then a write-hot mega-tenant is steered to a dedicated large node. |
| Per-tenant catalog / DDL sprawl — catalog and schema changes are per-cell; metadata multiplies with tenant count | Low | Per-cell catalog is exactly what buys clean isolation; monitor catalog size + DDL rate as first-class metrics; cap pathological per-tenant object counts via quota | Inherent to the 1-tenant-1-Postgres model — accepted as the deliberate price of isolation, watched at scale rather than removed. |
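
The cold-start mitigation in the register (a periodic snapshot plus bounded WAL replay to the committed LSN) can be illustrated with a toy model. Everything here is an assumption made for illustration: `WalRecord` and `wake` are hypothetical names, and state is a plain key/value dict, whereas real WAL records describe page-level changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalRecord:
    lsn: int    # position in the log
    key: str
    value: str


def wake(snapshot_state: dict, snapshot_lsn: int, wal: list, committed_lsn: int):
    """Rebuild state from a snapshot plus only the WAL in (snapshot_lsn, committed_lsn].

    Returns the rebuilt state and the number of records actually replayed.
    `wal` is assumed to be sorted by LSN.
    """
    state = dict(snapshot_state)
    replayed = 0
    for rec in wal:
        if rec.lsn <= snapshot_lsn:
            continue  # already reflected in the snapshot
        if rec.lsn > committed_lsn:
            break     # past the committed position: do not apply
        state[rec.key] = rec.value
        replayed += 1
    return state, replayed


wal = [
    WalRecord(10, "a", "1"),
    WalRecord(20, "b", "2"),
    WalRecord(30, "a", "3"),
    WalRecord(40, "c", "4"),  # written but beyond the committed LSN
]

# Full replay from an empty state versus replay from a snapshot taken at LSN 20.
full_state, full_count = wake({}, 0, wal, committed_lsn=30)
snap_state, snap_count = wake({"a": "1", "b": "2"}, 20, wal, committed_lsn=30)
```

Both paths reach the same state, but the snapshot path replays one record instead of three. That gap is the whole mitigation, and it only holds if snapshots are taken often enough to keep the replay window short.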
Which risks threaten correctness vs. cost
It matters where each risk lands. Only one, the Conductor, can threaten the core consistency invariant; the rest threaten footprint, latency, or unit economics, which we are explicitly allowed to spend (Ch. 07's stance: throughput and cost may bend, consistency may not). The OrioleDB bet is unusual: it threatens neither correctness (the fallback engine preserves every guarantee) nor cost directly. What it puts at risk is maturity, a schedule-and-operations risk you buy down with testing and version discipline.
No free mitigations
Snapshots, partial replay, keep-warm pools, microVM escalation, bin-packing — these are listed as mitigations, not solved problems. Each is genuine engineering with its own operational surface. Treating them as "already handled" is how a credible blueprint quietly becomes an optimistic one.
Honest verdict
Pull the design apart and every load-bearing piece is individually proven in production: Neon ships safekeepers (the Journal pattern) + a pageserver (the Pagestore pattern) + scale-to-zero at scale; OrioleDB ships a working no-vacuum, undo-log storage engine; gVisor runs untrusted code under hostile multi-tenancy at Google scale; CloudNativePG and Knative prove the operator + activator model that the Conductor and Gateway lean on. None of these is speculative. The genuine novelty — and the genuine risk — is the combination: assembling them under a strict tenant-first, consistency-first frame where a tenant is one isolated cell, the writer is fenced by epoch, and the storage engine has erased vacuum. That synthesis has not been built before in exactly this shape. It is ambitious. But it is grounded: the parts are real, the invariants are precise, the fallback for the riskiest bet (OrioleDB) costs one feature and zero correctness, and the one component that can break consistency (the Conductor) is named, isolated, and assigned to be hardened. Ambitious, not hand-wavy.
16 Plain-language glossary
Every piece of jargon in this blueprint, gathered in one place — one plain sentence each, with an analogy where it sharpens. Skim it once, or jump back here whenever a term in an earlier chapter trips you up. Nothing here is new architecture; it is purely a decoder ring for the vocabulary the rest of the document leans on.
A reference, not required reading
This page exists to be looked up, not read front-to-back. Each row is self-contained — skip straight to the term you need.
Postgres internals
| Term | In plain language |
|---|---|
| MVCC (multi-version concurrency control) | Each writer makes a fresh copy of a row instead of overwriting it, so readers never wait on writers and vice versa. |
| WAL (write-ahead log) | Like a flight recorder: every change is written to an append-only log before it is applied, so a crash can always be replayed. |
| LSN (log sequence number) | A monotonic byte offset that names a position in the WAL: the page number of the change log, always counting up. |
| heap | The default on-disk layout where Postgres physically stores table rows, one after another in unordered pages. |
| dead tuple & bloat | An old row version that MVCC left behind; let enough pile up and the table balloons with garbage — "bloat" is the wasted space. |
| vacuum / autovacuum | The background janitor that sweeps away dead row versions to reclaim space; autovacuum is just Postgres running it for you automatically. |
| freeze | A vacuum chore that stamps very old rows as "permanently visible" so their transaction numbers can be safely recycled. |
| XID wraparound (32-bit vs 64-bit) | Stock Postgres numbers transactions with a 32-bit counter that eventually wraps around and reuses old numbers, which is what forces freeze work; a 64-bit counter is so wide it never wraps in practice. |
| undo-log MVCC | An alternative where old row versions go to a separate "undo" log instead of cluttering the table — so there is nothing for vacuum to sweep. |
| snapshot / visibility / ProcArray | A snapshot is each transaction's frozen view of "which rows count right now"; the ProcArray is the in-memory list of live transactions used to decide that visibility. |
| clog (commit log) | A tiny on-disk ledger recording whether each transaction committed or aborted: the lookup table that tells readers if a row's writer actually succeeded. |
| buffer manager | Postgres's in-memory cache of disk pages; it decides what stays in RAM and what gets read from or written back to disk. |
| system catalog / relcache | The catalog is Postgres's own tables describing your tables, columns and indexes; the relcache is the hot in-memory copy of that metadata. |
| postmaster & backend | The postmaster is the parent process that accepts new connections; a backend is the one child process it spawns to serve a single connecting client. |
| shared_preload_libraries / hooks | A startup list of native extensions Postgres loads into itself; "hooks" are the internal callback points those extensions splice into to change core behavior. |
| table access method (TAM) | A pluggable backend that decides how Postgres physically stores rows on disk — swap it and you change the storage engine without changing SQL. |
| getPage@LSN | A request that asks the storage layer for the exact version of a page "as of" a given log position: a point-in-time page fetch keyed by LSN rather than by wall-clock time. |
| fsync | The system call that forces buffered data all the way down to physical disk, so a power cut can't lose a write the OS only pretended to save. |
| Postgres wire protocol | The byte-level language a Postgres client and server speak over the network — the handshake and message format any tool must talk to look like Postgres. |
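
To make the getPage@LSN entry concrete, here is a minimal sketch of "newest page version at or before a given LSN". It is a toy under stated assumptions: `PageVersions`, `put`, and `get_page_at_lsn` are hypothetical names, and page images are kept whole in memory, whereas a real page server would reconstruct them from stored images plus WAL.

```python
import bisect


class PageVersions:
    """Toy version store: per page, images kept sorted by the LSN that produced them."""

    def __init__(self):
        self._lsns = {}    # page_id -> sorted list of LSNs
        self._images = {}  # page_id -> list of images, parallel to _lsns

    def put(self, page_id: int, lsn: int, image: bytes) -> None:
        lsns = self._lsns.setdefault(page_id, [])
        images = self._images.setdefault(page_id, [])
        i = bisect.bisect_right(lsns, lsn)
        lsns.insert(i, lsn)
        images.insert(i, image)

    def get_page_at_lsn(self, page_id: int, lsn: int) -> bytes:
        lsns = self._lsns.get(page_id, [])
        # Count of versions whose LSN is <= the requested LSN.
        i = bisect.bisect_right(lsns, lsn)
        if i == 0:
            raise KeyError(f"page {page_id} has no version at or before LSN {lsn}")
        return self._images[page_id][i - 1]


store = PageVersions()
store.put(7, 100, b"v1")
store.put(7, 250, b"v2")

at_100 = store.get_page_at_lsn(7, 100)   # exactly the first version
at_249 = store.get_page_at_lsn(7, 249)   # still the first version
at_250 = store.get_page_at_lsn(7, 250)   # the second version becomes visible
```

The lookup is keyed by log position, not by clock time: a reader pinned to LSN 249 keeps seeing `v1` no matter when it asks.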
Consistency & durability
| Term | In plain language |
|---|---|
| serializable / SSI | The strongest correctness level — transactions behave as if run one at a time; SSI is Postgres's optimistic way of getting it by aborting whichever conflicting transaction loses. |
| isolation levels | The dial for how much one in-flight transaction can see of another's uncommitted work — from loose-and-fast down to strict serializable. |
| read-your-writes | A guarantee that a query always sees the writes you just made — no lagging copy can hand you back a stale version of your own data. |
| durability | Once a commit is acknowledged, it survives crashes and restarts — the "D" in ACID; written in stone, not just in memory. |
| quorum / majority commit | A write counts only once more than half the replicas have stored it — like requiring 2 of 3 signatures before a cheque clears. |
| consensus (Paxos / Raft) | A protocol letting several machines agree on one value despite failures; Paxos and Raft are the two famous recipes, Raft being the more readable one. |
| two-phase commit (2PC) | A way to commit across machines: everyone first votes "ready," then a coordinator tells all to commit or all to abort — so no one commits halfway. |
| single-writer | The rule that exactly one process may write to a given tenant at a time, which dissolves most distributed-database hard problems before they appear. |
| epoch / lease | A time-bounded grant of "you are the writer right now," tagged with an ever-increasing epoch number — like a numbered ticket where only the highest is honored. |
| fencing / STONITH | Cutting off a suspect old writer so it can't corrupt data; STONITH ("shoot the other node in the head") is the brute-force version, here done logically by rejecting its epoch. |
| split-brain | The failure where two nodes both believe they're in charge and both write — the disaster fencing exists to make structurally impossible. |
| timestamp ordering | Using clocks or counters to put operations into one agreed sequence across machines — the global-ordering machinery a single-writer tenant gets to skip. |
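
The quorum row rests on one piece of arithmetic: any two majorities of the same replica set share at least one member, so two conflicting decisions can never both be accepted. A small sketch, with hypothetical helper names, checks that by brute force for small clusters.

```python
from itertools import combinations


def quorum_size(replicas: int) -> int:
    """Smallest count that is strictly more than half of `replicas`."""
    return replicas // 2 + 1


def is_committed(acks: int, replicas: int) -> bool:
    """A write counts once a majority of replicas have stored it."""
    return acks >= quorum_size(replicas)


def majorities_always_overlap(replicas: int) -> bool:
    """Brute-force check that every pair of minimal majorities shares a node."""
    q = quorum_size(replicas)
    quorums = [set(c) for c in combinations(range(replicas), q)]
    return all(a & b for a in quorums for b in quorums)


two_of_three = is_committed(acks=2, replicas=3)
one_of_three = is_committed(acks=1, replicas=3)
overlap_3 = majorities_always_overlap(3)
overlap_5 = majorities_always_overlap(5)
```

Note that an even-sized cluster gains nothing: four replicas need three acknowledgements, so they tolerate one failure, the same as three replicas do.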
Isolation & sandboxing
| Term | In plain language |
|---|---|
| gVisor | A lightweight sandbox that intercepts a program's system calls in user space, walling off untrusted code from the real host kernel — a software airlock. |
| microVM (Kata / Firecracker) | A tiny, fast-booting virtual machine running a real Linux kernel under hardware isolation — heavier than a sandbox, but a genuine wall instead of a fence. |
| syscall | A request a program makes to the operating-system kernel to do something privileged — read a file, send a packet — the boundary sandboxes police. |
| shared object (.so) / ABI | A .so is a chunk of compiled native code loaded into a running process (Linux's .dll); the ABI is the exact binary contract it must match to plug in. |
| envelope encryption | Encrypting data with a per-item key, then encrypting that key with a master key — like locking each box, then locking all the box-keys in one safe. |
| default-deny egress | Outbound network traffic is blocked unless explicitly allowed — guilty until proven innocent, so leaked code can't phone home. |
| io_uring / hugepages / direct IO | Low-level, high-performance ways of talking to disk and memory that bypass the kernel's normal slower paths — fast lanes that some sandboxes can't fully emulate. |
| copy-on-write / branching | Sharing one copy of data until someone writes, then splitting off only the changed bits — cheap instant clones, like forking a git branch. |
Kubernetes & infrastructure
| Term | In plain language |
|---|---|
| CRD (custom resource definition) | A way to teach Kubernetes a new kind of object so you can declare, say, a "Tenant" the same way you declare a built-in pod. |
| operator / reconcile | A program that watches your declared objects and continuously nudges reality to match them — a robot that keeps re-reading the spec and fixing drift. |
| StatefulSet | A Kubernetes controller for pods that keep stable identities and their own storage — the right shape for databases, not interchangeable workers. |
| Deployment | A Kubernetes controller for stateless, interchangeable pods you can freely add, kill, or replace — the right shape for web servers and gateways. |
| HPA (horizontal pod autoscaler) | The knob that adds or removes pod copies automatically as load rises and falls. |
| PVC (persistent volume claim) | A pod's request for a piece of durable disk that outlives the pod itself — a reserved locker that survives the tenant being moved. |
| RuntimeClass | A label that tells Kubernetes which sandbox or VM to boot a pod inside — the switch that picks gVisor versus a microVM. |
| etcd | Kubernetes's small, consensus-backed key-value store holding the cluster's source-of-truth state — the strongbox the whole cluster trusts. |
| object store / S3 | Cheap, effectively bottomless cloud storage addressed by key rather than by filesystem path — a giant cloud bucket; S3 is the original. |
| NVMe | The fast, locally-attached flash-disk standard — the cell's high-speed scratch drive, used as a cache rather than the source of durability. |
| Knative activator | A component that holds an incoming request while a scaled-to-zero service wakes up, then forwards it — the doorman who stalls the guest until the room is ready. |
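
The "operator / reconcile" row describes a loop that compares declared state with observed state and emits only the actions needed to close the gap. A minimal sketch of that idea follows. `TenantSpec`, the `"running"` / `"parked"` states, and the action strings are assumptions made for illustration, not Tessera's CRD schema or any Kubernetes API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantSpec:
    """Desired state, as a hypothetical Tenant object might declare it."""
    name: str
    desired: str  # "running" or "parked"


def reconcile(spec: TenantSpec, observed: dict) -> list:
    """Return the actions that move observed state toward the spec (empty if converged)."""
    actual = observed.get(spec.name, "parked")
    if actual == spec.desired:
        return []
    verb = "wake" if spec.desired == "running" else "park"
    return [f"{verb} {spec.name}"]


def apply(actions: list, observed: dict) -> None:
    """Pretend to carry out each action by updating the observed state."""
    for action in actions:
        verb, name = action.split()
        observed[name] = "running" if verb == "wake" else "parked"


observed = {"tenant-a": "parked", "tenant-b": "running"}
specs = [TenantSpec("tenant-a", "running"), TenantSpec("tenant-b", "running")]

first_pass = [a for s in specs for a in reconcile(s, observed)]
apply(first_pass, observed)
second_pass = [a for s in specs for a in reconcile(s, observed)]
```

The second pass returns nothing, which is the property that matters: reconcile is idempotent, so the loop can run forever and only acts when reality drifts from the spec.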