Architecture
3 ways of running
rivetkit
- rivetkit is the typescript library used for both local development & to connect your application to rivet
- a rivetkit instance is called a “runner.” you can run multiple runners to scale rivetkit horiziotnally. read omre about runners below.
local development
- in local development, rivetkit provides a full actor environment for single-node deployments
drivers
- rivetkit supports multiple drivers. currently supports: file system (default in local dev), memory, rivet engine (used for rivet cloud & self-hosting), cloudflare durable objects (does not rely on rivet engine)
- drivers are very flexible to enable you to write your actors once and plug in to any system that fits your architecture adequately
- see the driver interface
- actordriver https://github.com/rivet-dev/rivet/blob/eeb01fc4d9ca0e06f2e740d267bd53280ca7330e/rivetkit-typescript/packages/rivetkit/src/actor/driver.ts
- managerdriver https://github.com/rivet-dev/rivet/blob/eeb01fc4d9ca0e06f2e740d267bd53280ca7330e/rivetkit-typescript/packages/rivetkit/src/manager/driver.ts
rivet cloud
- provides multi-region and highest performance out of the box
- accessible at dashboard.rivet.dev and the api is avialble at api.rivet.dev
rivet self-hosted
- available as a standalone rust binary or a docker contianer
- can be configured ot persist to postgres or rocksdb
- can scale horiziontally across multipe nodes and can scale across multiple regions
- see self-hosting docs (link to docs)
actors
- Actors for long-lived processes with durable state, realtime, and hibernate when not in use. read more about actors at a high level at (link to actors/index)
actor-per-entity
- actors are designed to have an actor-per-entity
- you can think about actors a bit like objects in object-oriented programming where ach is responsible for their own state and expose methods (ie actions in our case)
- examples incldue
- actor per user
- actor per user session
- actor per document
- actor per game room
- actor per tenant
- actor per rate limit topic
architecting for scale
- actors scale by:
- having isolated state to each acotr that combines compute and storage for in-memory reads and writes
- communication is stndardized based on actions & events
- scale horizontally
- read more about scalign at (link to scaling doc)
horizontal scaling
- actors can run across multiple rivetkit runners. this is orchestrated by rivetkit itself.
lifecycle
actors have create, destroy, wake, and sleep lifecycle hooks that you can implement to modify behavior. see the lifecycle docs for reference on actor lifecycel hook sequences
actor sleeping
- actors sleep when not in use
- an actor is considered not in use when there are no active network connections to the actor (or the network connections are hibernatable websockets, see below) and there are no actions in flight
- actors have a sleep timeout (configured in options.onSleepTimeout) that decides how long to keep the actor in memory with no recent actions
- sleep can be disabled with
options.noSleep
wake events
- actors can wake to any of the follwoing events:
- network requests
- websocket messages
- alarms (see scheduling docs)
live actor migration
- live actor migrations lets your application ugprade, crash, or hot reload cahnges without interruption to your user or application (including websockets)
- this is powered by hibernating websockets for live websocket migraiton & our fault tolerance mechanism (read more below)
coldstart performance
- actors have negligible coldstart performance. the code to run the actor is already started (ie the runner), so creating/starting an actor is incredibly cheap.
- creating new actors with a key requires some overhead to communicate with other regions in order to reserve the actor’s key (see below). actors can be created without keys with near-0 latency.
multi-region, globally unique actor keys
- acotrs can optionally have a globally unique “key”
- when creating an actor with a key
- this system is highly optimized to reduce wan round trips using a process call EPaxos with a custom database called Epoxy (link to https://github.com/rivet-dev/rivet/tree/main/engine/packages/epoxy)
- limitation: when creating an actor with a given key, that key will always be pinned to that region even if the actor is destroyed. creating a new actor with the same key will always live in the same region.
- see the acotr keys document
input
- actors have input data that can be passed to them when constructed
- this is similar to apssing data to a constructor in an object
generic parameters
actor definitions include the following generic parameters that you’ll see frequently in the code:
- state
- conn state
- conn params
- ephemeral variables
- input data
- (experimental) database connector
persistence
- state automatically flushes to storage intelligently
- to force a state flush and wait for it to finish, call (TODO: look this up in state document)
- read more about state persistence in the state document (link to document)
- state is stored in the same place as where the actor lives. loading an actor in to memory has comparable performance to network attached storage, and once in memory, has performance of any standard in-memory read/write like a variable.
scheduling & alarms
- actors have a scheduling api to be able to wake up at any time in the indefinite future
- think of this like setTimeout but without a max timeout
- rivet is responsible for waking the actor when this timeout wakes
ephemeral variables
- actors have the ability to create ephemrla variables for things that you do not want to be persisted with the actor’s state
- this is useful for non-serializable data like a utility class like a pubsubs erver or something (TODO extra info)
- link to ephemeral variables docs
actions
- for stateless clients, actions are sent as http requests via
POST /gateway/{actor id}/actions/{action name} - for stateful clients, actions are sent as websocket messages
events & subscriptions
- events are sent as websocket messages
error handling
- this is different than fault tolerance:
- error handling is a user error
- fault tolerance is something goes wrong that your applciation was not built to handle (ie hard crash, oom, network error)
- rivet provdies a special UserError class to throw custom errors that will be returned to the client
- all other errors are returned as a generic “internal error”
- this is becuase leaking error deatils is a common security hole, so we default to expose-nothing errors
logging
- rivet uses pino for logging
- we expose a scoped child logger for each actor at
c.logthat automatically logs the actor id + key - this allows you to search lgos easily by actor id without having to log the actor id frequently
- logs can be configured via the
RIVET_LOG_LEVELenv var
fault tolerance
- actors are fault tolerant, meaning that the host machine can crash and the actors will proceed to operate as if nothing happened
- runners maintain a socket with rivet engine. when this socket closes or takes to long to ping, actors will reschedule
- hibernating websockets (enabled by default) will live-migrate to the new actor as if nothing happened
crash policy
- there are 3 crash policies: sleep, restart, and destroyed
- sleep (default, usually the option you want):
- when to use: actors that need high-performance in-memory logic.
- when not to use: you need this actor running at all times no matter what, even if idle
- examples: (list commone xamples)
- destroy:
- when to use: actors that need to run once until completion. on crash, do not try to reschedule.
- when not to use: if you want your actor to have fault tolerance and be able to run transaprenlty to the underlying runner
- examples: batch jobs, image conversions, ephemeral jobs, (TODO come up with better eaxmples)
- restart:
- when to use: actors that should be running at all times
- when not to use: if you don’t absolutely need something running at all times, since this consumes needless compute resources. considure using the scheduling api instead.
- examples: maintain outbound sockets, daemons, always-running jobs, (TODO come up with better examples)
- sleep (default, usually the option you want):
the behavior for each is described below:
| Event | Restart | Sleep | Destroy |
|---|---|---|---|
| Graceful exit (StopCode::Ok) | Destroy | Destroy | Destroy |
| Crash (non-Ok exit) | Reschedule | Sleep | Destroy |
| Lost (runner disappeared) | Reschedule | Sleep | Destroy |
| Lost + force_reschedule | Reschedule | Reschedule | Reschedule |
| GoingAway (runner draining) | Reschedule | Sleep | Destroy |
| No capacity (allocation) | Queue (wait) | Sleep | Queue (wait) |
| No capacity + serverless | Queue (wait) | Queue (wait) | Queue (wait) |
| Wake signal (while sleeping) | Reschedule | Reschedule | Reschedule |
inspector
- actors provide an inspector api to implement the:
- repl
- state read/write
- network inspector
- event log
- this is impelmented over a websocket over bare
http api
- see the http api document on actors
multi-region
- actors can be scheduled across multiple regions
- each actor has an actor id which embeds which region it lives in
- networking is automatically routed to the region that an actor lives in
- limitation: actors curretnly cannot migrate across regions
backpressure
no runner capacity
-
this is how actors with different crash policies behave when when there’s backpressure:
- sleep = sleeps (sheds load by not rescheduling)
- restart = queues
- destroy = queues
-
see the above matrix for more details on actor crash policy on how it handles no capacity.
-
the actor queue is built to withstand high amounts of backpressure on rivet, so queueing actors is fine here
-
a large queue means it’ll take more time for your application to process the queue to catch up with demand when it comes online.
per-actor cpu & networking exhaustion
- actors are isolated, so they each have their own individual bottleneck. you can think of this like a process thread where each thread can only do so much.
- there is no durable message queue/“mailbox” for actors. if the actor cannot respond in time, then the request is dropped.
- if an actor exhauses its cpu or networking, then the runner
- returns service unavailble (503) if the actor fails to respond to a request in time
- there is no hard cap on the networking or cpu usage for each actor at the moment
- if your actor is resource intensive, it’s common to use a separate mailbox actor to act as a queue
runners
regular vs serverless runners
there are 2 types of runners:
- regular: these are standard nodejs processes connected to rivet that rivet can orchestrate actors to and send network requests to at any time
- serverless: rivet works with serverless platforms. when an actor is created, it has a request-per-actor model where it opens a long-running request on the serverless platofrm to run a given actor.
runner pool
- runners are pooled together by sharing a common name (ie “default”)
- when an acotr is created, it chooses the pool by selecting the runner name to run on
- rivet will automatically load balance actors across these runners
runner key
- not relevnat for serverless runners
- each runner has a unique key that it provides when connecting. keys are unique to the instace the runner is running on and should be the same if the runner is restarted.
- this can be the: machine’s ip, k8s pod name, etc
- if there is an existing runner connected with a given key, the runner will disconnect the old runner and replace it
- rivet is designed to handle network partitions by waiting for runners to miss a ping, indicating it’s no longer alive. however, often times runners restart immediately after a hard crash and reconnect. in this case, the runner will reconnect on restart and terminate the old runner in order to prevent further actors from scheduling to the crashed runner.
capacity
- not relevnat for serverless runners
- each runner can be assigned a capacity of how many actors it can run
- rivet will schedule with spread (not binpacking) in order to spread load over actors
usefulness of capacity = 1
- setting a capacity of 1 is helpful for situations where you have cpu-intensive apps that should not run with any other actors
- examples include game servers, ffmpeg jobs, etc
versions & upgrading code
- each runner has a version index
- actors are always scheduled to the highest verison index (see runner priority below)
- this means that when a new runner is deployed:
- runners with higher index come online
- actors schedule to the highest index, stop scheduling to the older index
- old index runners start draining and migrating actors to new index
- all old runners are now shut down
- websocekts are live migrated to the new version when upgrading using hibernating websockets to users see no hiccup in their applications
- this is important because actors should never downgrade their runner. they should always move to a newer version of code in order to prevent corruption.
runner scheduling prioroty
- actors are scheduled to runners sorted by priority of (version DESC, remaining capacity ASC)
multi-region
TODO
shutdown sequence
- runner shutdown is important to ensure that actors do not get unexpectedly terminated when either:
- upgrading your applciation and taking down old pods
- scaling down your runners horizontally (ie from an hpa)
- pressing ctrl-c when in development
- on shutdown:
- tell rivet the runner is stopping
- rivet tells all the actors on this runner to migrate
- runner waits for all actors to finish migrating
- runner exits process
reconnection
- runners can handle temporary network partitions
- they’ll automatically reconnect and replay missed commands/events between rivet and the runner
- this happens transparenlty to the user
- if disconnected for too long (indicating a network partition), the runner will shut itself down and exit
autoscaling
- not relevant to serverless
- runners currently autoscale on cpu. more intelligent scaling is coming soon.
- tune your runner total slots capacity accordingly
- it’s up to you to configure your hpa/etc to work like this. see the Connect guides (link to index page) for reference on hwo to configure this.
serverless timeouts
- serverless runners take in to account the maximum run duration of the serverless platform
- the runners will mgirate actors to a new request before the request times out
- this is completely transparent to you and the user because of the fault tolerance and websocket migraiton characteristics
- it’s common for actors to go sleep before hitting the serverless timeout
networking
web standards
- everything in rivet is built on webstandards by default
- nothing in rivet requires you to use our sdk, our sdks are meant to be a convenience. it’s built to be as easy to use raw http/websocket endpoints from scratch.
- actions, events, etc are all built on simple, well-documented http/websocket under the hood (link to openapi & asyncapi docs).
- you can use low-level request handlers (lnk to dock) and low-level weboscket handlers (link to doc) to handle low-level primtivies yourself
encoding
- rivetkit’s action/events api supports communicating via [VBARE](link to github repo, see the blog post for the link), CBOR, or JSON
- VBARE: high-perofrmance & compact, optimal use case
- CBOR: descent encoding/decoding perf + portable libraries, good for implemnting high-ish performance on other platforms
- JSON: good for fast implementations & debugging (easy to read)
tunneling
- when a runner connects it opens a tunnel to rivet to allow incoming traffic
- this is simila to systems like tailscale, ngrok, or other vpns
- we do this for security & configuraiton simplicity since it means that you don’t have to manage exposing your rivetkit applications’ networkig to rivet. instead, anything that can open a socket to rivet can accept inbound traffic to actors.
gateway
- incoming traffic to actors come to the Rivet gateway and are routed to the appropriate runner
- the rivet gateway automatically handles:
- multi-region routing to route traffic to the correct reigon for an actor
- automatically waking the actor if needed
- sending traffic over the runner
hibernating websockets
- hibernating web sockets are a core component of live actor migration & fault tolerance. it allows us to maintain an open websocket while the actor crashes, upgrades, or moves to another runner. TODO: copy the rest of this from low-level webosckets document and rephrase
actor health endpoin
- actors provide a simple, utility health endpoint at
/healththat lets you check if your actor is reachable (e.g.curl https://api.rivet.dev/gateway/{actor id}/health)
multi-region
networking
- actors may live in different regions than inbound requests
- rive tuses the Epoxy (link again) system to handle global routing to route traffic to the correct region with high performance
- this is completely transparent to you. your app sends traffic to https://api.rivet.dev/gatewa/* and it automatically routes to the correct actor in the appropirate region
globally unique actor keys
- acotr keys are globally unique to be able to benefit from multi-region capabilities without any extra work
- see more about globally uniuqe actor keys above
- see the actor keys document
regional endpoints
- each reigon has a regional endpoint
- this endpoint is used specifically for connecting runners (for example https://api-us-east-1.rivet.dev), opt to use api.rivet.dev for all other traffic
- runners are required to connect to the regional endpoints
- this is because runners are sensitive to latency to the rivet regional datacenter
- we add datacenters regularly so each runner needs to be pinned to a single datacenter in order to ensure your availble datacneter list doesn’t change sporadically without your consent
persistence
- data is always persisted in the same region that is written
- this is important for minimal coldstarts & data locality laws
namespaces
- rivet provides namespaces to run multiple actor systems in isolation
- this makes it really easy to have prod/staging environments or completely different applications running on the same rivet instance
- when you connect to rivet, you can specify which namespace you’re connecting to
- self-hotsed rivet defaults to namespace “default”
- rivet cloud provdies isolated tokens for each namespace
manager api
- rivet provides a standard rest api for managing actors
- useful endoints include:
- get /actors
- delete /actors/
- get /actors/names -> get all actor types available
- useful endoints include:
comparison to prior art for actors
runtime
- there are very few serious actor implementation targeted at the javascirpt eocsystem. rivet is arguably the most serious open-source actor implementation for typescript out there.
library vs orchestrator
- some actor systems opt to be purely a library while rivet opts to have an orchestrator (i.e. the single rust binary)
- this lets us to a lot of things other actor systems can’t:
- separating orchestration, persistence, and proxy lets us isoalte the core to be incredibly reliable while the fast-changing applications that ocnnect to rivet can be more error-prone safely. with a library, the blast radius of your application also affects the entire actor system.
- support for serverless platforms to benefit from cost, multi-region, blitz scaling, and relibaiblity benefits
- optimize fault tolerance since we can make more assumtions about application state when the rivet core does not crash and your app does
scheduling
actors is a loose term, but there are generally 2 types of schedulign in practice:
- ephemeral actors
- examples: erlang/otp, akka, swift
- provides no persistence or sleeping mechanism by defualt
- relies on supervisors for managing persistence
- virtual actors
- an extension of the actor pattern that provides actors that can hibernate (“sleep”) when not in use
- examples: orleans, dapr, durable objects
rivet has similarities with both to provide more flexibility:
- crash policies provdie for 3 types of actors:
- sleep -> most similar to virutal actors
- restart -> most similar to ephemeral actors but with a supervisor to auto-restart, however still has a durable queue ot handle backpressure
- crash -> most similar to traditional actors but with no supervisor to restart, however still has a durable queue to handle backpressure
communication
- many actor frameworks use inbox patterns (think: queue-per-actor) to handle sending messages between actors
- there is no callback mechanims, instead you need to send messages back to the actual actor
- rivet opts to behave like web standards instead of using the message pattern
- actors can impelment the inbox pattern optionally
- but we provide lower-level networking to be able to be compatible with more techniologies
- rivet assumes the same serial concurerntly that other actors do (by the nature of javascript being single-threaded) but we allow you to run promises in parallel or handl eyour own concurrency control (which some other actor frameworks might require a spawning new actor to do)