Architecture

3 ways of running

rivetkit

rivetkit is the typescript library used for both local development & to connect your application to rivet
a rivetkit instance is called a “runner.” you can run multiple runners to scale rivetkit horiziotnally. read omre about runners below.

local development

in local development, rivetkit provides a full actor environment for single-node deployments

drivers

rivetkit supports multiple drivers. currently supports: file system (default in local dev), memory, rivet engine (used for rivet cloud & self-hosting), cloudflare durable objects (does not rely on rivet engine)
drivers are very flexible to enable you to write your actors once and plug in to any system that fits your architecture adequately
see the driver interface
- actordriver https://github.com/rivet-dev/rivet/blob/eeb01fc4d9ca0e06f2e740d267bd53280ca7330e/rivetkit-typescript/packages/rivetkit/src/actor/driver.ts
- managerdriver https://github.com/rivet-dev/rivet/blob/eeb01fc4d9ca0e06f2e740d267bd53280ca7330e/rivetkit-typescript/packages/rivetkit/src/manager/driver.ts

rivet cloud

provides multi-region and highest performance out of the box
accessible at dashboard.rivet.dev and the api is avialble at api.rivet.dev

rivet self-hosted

available as a standalone rust binary or a docker contianer
can be configured ot persist to postgres or rocksdb
can scale horiziontally across multipe nodes and can scale across multiple regions
see self-hosting docs (link to docs)

actors

Actors for long-lived processes with durable state, realtime, and hibernate when not in use. read more about actors at a high level at (link to actors/index)

actor-per-entity

actors are designed to have an actor-per-entity
you can think about actors a bit like objects in object-oriented programming where ach is responsible for their own state and expose methods (ie actions in our case)
examples incldue
- actor per user
- actor per user session
- actor per document
- actor per game room
- actor per tenant
- actor per rate limit topic

architecting for scale

actors scale by:
- having isolated state to each acotr that combines compute and storage for in-memory reads and writes
- communication is stndardized based on actions & events
- scale horizontally
read more about scalign at (link to scaling doc)

horizontal scaling

actors can run across multiple rivetkit runners. this is orchestrated by rivetkit itself.

lifecycle

actors have create, destroy, wake, and sleep lifecycle hooks that you can implement to modify behavior. see the lifecycle docs for reference on actor lifecycel hook sequences

actor sleeping

actors sleep when not in use
an actor is considered not in use when there are no active network connections to the actor (or the network connections are hibernatable websockets, see below) and there are no actions in flight
actors have a sleep timeout (configured in options.onSleepTimeout) that decides how long to keep the actor in memory with no recent actions
sleep can be disabled with options.noSleep

wake events

actors can wake to any of the follwoing events:
- network requests
- websocket messages
- alarms (see scheduling docs)

live actor migration

live actor migrations lets your application ugprade, crash, or hot reload cahnges without interruption to your user or application (including websockets)
this is powered by hibernating websockets for live websocket migraiton & our fault tolerance mechanism (read more below)

coldstart performance

actors have negligible coldstart performance. the code to run the actor is already started (ie the runner), so creating/starting an actor is incredibly cheap.
creating new actors with a key requires some overhead to communicate with other regions in order to reserve the actor’s key (see below). actors can be created without keys with near-0 latency.

multi-region, globally unique actor keys

acotrs can optionally have a globally unique “key”
when creating an actor with a key
this system is highly optimized to reduce wan round trips using a process call EPaxos with a custom database called Epoxy (link to https://github.com/rivet-dev/rivet/tree/main/engine/packages/epoxy)
limitation: when creating an actor with a given key, that key will always be pinned to that region even if the actor is destroyed. creating a new actor with the same key will always live in the same region.
see the acotr keys document

input

actors have input data that can be passed to them when constructed
this is similar to apssing data to a constructor in an object

generic parameters

actor definitions include the following generic parameters that you’ll see frequently in the code:

state
conn state
conn params
ephemeral variables
input data
(experimental) database connector

persistence

state automatically flushes to storage intelligently
to force a state flush and wait for it to finish, call (TODO: look this up in state document)
read more about state persistence in the state document (link to document)
state is stored in the same place as where the actor lives. loading an actor in to memory has comparable performance to network attached storage, and once in memory, has performance of any standard in-memory read/write like a variable.

scheduling & alarms

actors have a scheduling api to be able to wake up at any time in the indefinite future
think of this like setTimeout but without a max timeout
rivet is responsible for waking the actor when this timeout wakes

ephemeral variables

actors have the ability to create ephemrla variables for things that you do not want to be persisted with the actor’s state
this is useful for non-serializable data like a utility class like a pubsubs erver or something (TODO extra info)
link to ephemeral variables docs

actions

for stateless clients, actions are sent as http requests via POST /gateway/{actor id}/actions/{action name}
for stateful clients, actions are sent as websocket messages

events & subscriptions

events are sent as websocket messages

error handling

this is different than fault tolerance:
- error handling is a user error
- fault tolerance is something goes wrong that your applciation was not built to handle (ie hard crash, oom, network error)
rivet provdies a special UserError class to throw custom errors that will be returned to the client
all other errors are returned as a generic “internal error”
this is becuase leaking error deatils is a common security hole, so we default to expose-nothing errors

logging

rivet uses pino for logging
we expose a scoped child logger for each actor at c.log that automatically logs the actor id + key
this allows you to search lgos easily by actor id without having to log the actor id frequently
logs can be configured via the RIVET_LOG_LEVEL env var

fault tolerance

actors are fault tolerant, meaning that the host machine can crash and the actors will proceed to operate as if nothing happened
runners maintain a socket with rivet engine. when this socket closes or takes to long to ping, actors will reschedule
hibernating websockets (enabled by default) will live-migrate to the new actor as if nothing happened

crash policy

there are 3 crash policies: sleep, restart, and destroyed
- sleep (default, usually the option you want):
  - when to use: actors that need high-performance in-memory logic.
  - when not to use: you need this actor running at all times no matter what, even if idle
  - examples: (list commone xamples)
- destroy:
  - when to use: actors that need to run once until completion. on crash, do not try to reschedule.
  - when not to use: if you want your actor to have fault tolerance and be able to run transaprenlty to the underlying runner
  - examples: batch jobs, image conversions, ephemeral jobs, (TODO come up with better eaxmples)
- restart:
  - when to use: actors that should be running at all times
  - when not to use: if you don’t absolutely need something running at all times, since this consumes needless compute resources. considure using the scheduling api instead.
  - examples: maintain outbound sockets, daemons, always-running jobs, (TODO come up with better examples)

the behavior for each is described below:

Event	Restart	Sleep	Destroy
Graceful exit (StopCode::Ok)	Destroy	Destroy	Destroy
Crash (non-Ok exit)	Reschedule	Sleep	Destroy
Lost (runner disappeared)	Reschedule	Sleep	Destroy
Lost + force_reschedule	Reschedule	Reschedule	Reschedule
GoingAway (runner draining)	Reschedule	Sleep	Destroy
No capacity (allocation)	Queue (wait)	Sleep	Queue (wait)
No capacity + serverless	Queue (wait)	Queue (wait)	Queue (wait)
Wake signal (while sleeping)	Reschedule	Reschedule	Reschedule

inspector

actors provide an inspector api to implement the:
- repl
- state read/write
- network inspector
- event log
this is impelmented over a websocket over bare

http api

see the http api document on actors

multi-region

actors can be scheduled across multiple regions
each actor has an actor id which embeds which region it lives in
networking is automatically routed to the region that an actor lives in
limitation: actors curretnly cannot migrate across regions

backpressure

no runner capacity

this is how actors with different crash policies behave when when there’s backpressure:
- sleep = sleeps (sheds load by not rescheduling)
- restart = queues
- destroy = queues
see the above matrix for more details on actor crash policy on how it handles no capacity.
the actor queue is built to withstand high amounts of backpressure on rivet, so queueing actors is fine here
a large queue means it’ll take more time for your application to process the queue to catch up with demand when it comes online.

per-actor cpu & networking exhaustion

actors are isolated, so they each have their own individual bottleneck. you can think of this like a process thread where each thread can only do so much.
there is no durable message queue/“mailbox” for actors. if the actor cannot respond in time, then the request is dropped.
if an actor exhauses its cpu or networking, then the runner
returns service unavailble (503) if the actor fails to respond to a request in time
there is no hard cap on the networking or cpu usage for each actor at the moment
if your actor is resource intensive, it’s common to use a separate mailbox actor to act as a queue

runners

regular vs serverless runners

there are 2 types of runners:

regular: these are standard nodejs processes connected to rivet that rivet can orchestrate actors to and send network requests to at any time
serverless: rivet works with serverless platforms. when an actor is created, it has a request-per-actor model where it opens a long-running request on the serverless platofrm to run a given actor.

runner pool

runners are pooled together by sharing a common name (ie “default”)
when an acotr is created, it chooses the pool by selecting the runner name to run on
rivet will automatically load balance actors across these runners

runner key

not relevnat for serverless runners
each runner has a unique key that it provides when connecting. keys are unique to the instace the runner is running on and should be the same if the runner is restarted.
this can be the: machine’s ip, k8s pod name, etc
if there is an existing runner connected with a given key, the runner will disconnect the old runner and replace it
rivet is designed to handle network partitions by waiting for runners to miss a ping, indicating it’s no longer alive. however, often times runners restart immediately after a hard crash and reconnect. in this case, the runner will reconnect on restart and terminate the old runner in order to prevent further actors from scheduling to the crashed runner.

capacity

not relevnat for serverless runners
each runner can be assigned a capacity of how many actors it can run
rivet will schedule with spread (not binpacking) in order to spread load over actors

usefulness of capacity = 1

setting a capacity of 1 is helpful for situations where you have cpu-intensive apps that should not run with any other actors
examples include game servers, ffmpeg jobs, etc

versions & upgrading code

each runner has a version index
actors are always scheduled to the highest verison index (see runner priority below)
this means that when a new runner is deployed:
1. runners with higher index come online
2. actors schedule to the highest index, stop scheduling to the older index
3. old index runners start draining and migrating actors to new index
4. all old runners are now shut down
websocekts are live migrated to the new version when upgrading using hibernating websockets to users see no hiccup in their applications
this is important because actors should never downgrade their runner. they should always move to a newer version of code in order to prevent corruption.

runner scheduling prioroty

actors are scheduled to runners sorted by priority of (version DESC, remaining capacity ASC)

multi-region

TODO

shutdown sequence

runner shutdown is important to ensure that actors do not get unexpectedly terminated when either:
- upgrading your applciation and taking down old pods
- scaling down your runners horizontally (ie from an hpa)
- pressing ctrl-c when in development
on shutdown:
1. tell rivet the runner is stopping
2. rivet tells all the actors on this runner to migrate
3. runner waits for all actors to finish migrating
4. runner exits process

reconnection

runners can handle temporary network partitions
they’ll automatically reconnect and replay missed commands/events between rivet and the runner
this happens transparenlty to the user
if disconnected for too long (indicating a network partition), the runner will shut itself down and exit

autoscaling

not relevant to serverless
runners currently autoscale on cpu. more intelligent scaling is coming soon.
tune your runner total slots capacity accordingly
it’s up to you to configure your hpa/etc to work like this. see the Connect guides (link to index page) for reference on hwo to configure this.

serverless timeouts

serverless runners take in to account the maximum run duration of the serverless platform
the runners will mgirate actors to a new request before the request times out
this is completely transparent to you and the user because of the fault tolerance and websocket migraiton characteristics
it’s common for actors to go sleep before hitting the serverless timeout

networking

web standards

everything in rivet is built on webstandards by default
nothing in rivet requires you to use our sdk, our sdks are meant to be a convenience. it’s built to be as easy to use raw http/websocket endpoints from scratch.
actions, events, etc are all built on simple, well-documented http/websocket under the hood (link to openapi & asyncapi docs).
you can use low-level request handlers (lnk to dock) and low-level weboscket handlers (link to doc) to handle low-level primtivies yourself

encoding

rivetkit’s action/events api supports communicating via [VBARE](link to github repo, see the blog post for the link), CBOR, or JSON
- VBARE: high-perofrmance & compact, optimal use case
- CBOR: descent encoding/decoding perf + portable libraries, good for implemnting high-ish performance on other platforms
- JSON: good for fast implementations & debugging (easy to read)

tunneling

when a runner connects it opens a tunnel to rivet to allow incoming traffic
- this is simila to systems like tailscale, ngrok, or other vpns
- we do this for security & configuraiton simplicity since it means that you don’t have to manage exposing your rivetkit applications’ networkig to rivet. instead, anything that can open a socket to rivet can accept inbound traffic to actors.

gateway

incoming traffic to actors come to the Rivet gateway and are routed to the appropriate runner
the rivet gateway automatically handles:
- multi-region routing to route traffic to the correct reigon for an actor
- automatically waking the actor if needed
- sending traffic over the runner

hibernating websockets

hibernating web sockets are a core component of live actor migration & fault tolerance. it allows us to maintain an open websocket while the actor crashes, upgrades, or moves to another runner. TODO: copy the rest of this from low-level webosckets document and rephrase

actor health endpoin

actors provide a simple, utility health endpoint at /health that lets you check if your actor is reachable (e.g. curl https://api.rivet.dev/gateway/{actor id}/health)

multi-region

networking

actors may live in different regions than inbound requests
rive tuses the Epoxy (link again) system to handle global routing to route traffic to the correct region with high performance
this is completely transparent to you. your app sends traffic to https://api.rivet.dev/gatewa/* and it automatically routes to the correct actor in the appropirate region

globally unique actor keys

acotr keys are globally unique to be able to benefit from multi-region capabilities without any extra work
see more about globally uniuqe actor keys above
see the actor keys document

regional endpoints

each reigon has a regional endpoint
this endpoint is used specifically for connecting runners (for example https://api-us-east-1.rivet.dev), opt to use api.rivet.dev for all other traffic
runners are required to connect to the regional endpoints
- this is because runners are sensitive to latency to the rivet regional datacenter
- we add datacenters regularly so each runner needs to be pinned to a single datacenter in order to ensure your availble datacneter list doesn’t change sporadically without your consent

persistence

data is always persisted in the same region that is written
this is important for minimal coldstarts & data locality laws

namespaces

rivet provides namespaces to run multiple actor systems in isolation
this makes it really easy to have prod/staging environments or completely different applications running on the same rivet instance
when you connect to rivet, you can specify which namespace you’re connecting to
self-hotsed rivet defaults to namespace “default”
rivet cloud provdies isolated tokens for each namespace

manager api

rivet provides a standard rest api for managing actors
- useful endoints include:
  - get /actors
  - delete /actors/
  - get /actors/names -> get all actor types available

comparison to prior art for actors

runtime

there are very few serious actor implementation targeted at the javascirpt eocsystem. rivet is arguably the most serious open-source actor implementation for typescript out there.

library vs orchestrator

some actor systems opt to be purely a library while rivet opts to have an orchestrator (i.e. the single rust binary)
this lets us to a lot of things other actor systems can’t:
- separating orchestration, persistence, and proxy lets us isoalte the core to be incredibly reliable while the fast-changing applications that ocnnect to rivet can be more error-prone safely. with a library, the blast radius of your application also affects the entire actor system.
- support for serverless platforms to benefit from cost, multi-region, blitz scaling, and relibaiblity benefits
- optimize fault tolerance since we can make more assumtions about application state when the rivet core does not crash and your app does

scheduling

actors is a loose term, but there are generally 2 types of schedulign in practice:

ephemeral actors
- examples: erlang/otp, akka, swift
- provides no persistence or sleeping mechanism by defualt
- relies on supervisors for managing persistence
virtual actors
- an extension of the actor pattern that provides actors that can hibernate (“sleep”) when not in use
- examples: orleans, dapr, durable objects

rivet has similarities with both to provide more flexibility:

crash policies provdie for 3 types of actors:
- sleep -> most similar to virutal actors
- restart -> most similar to ephemeral actors but with a supervisor to auto-restart, however still has a durable queue ot handle backpressure
- crash -> most similar to traditional actors but with no supervisor to restart, however still has a durable queue to handle backpressure

communication

many actor frameworks use inbox patterns (think: queue-per-actor) to handle sending messages between actors
there is no callback mechanims, instead you need to send messages back to the actual actor
rivet opts to behave like web standards instead of using the message pattern
- actors can impelment the inbox pattern optionally
- but we provide lower-level networking to be able to be compatible with more techniologies
- rivet assumes the same serial concurerntly that other actors do (by the nature of javascript being single-threaded) but we allow you to run promises in parallel or handl eyour own concurrency control (which some other actor frameworks might require a spawning new actor to do)