Failure Recovery Loop
The Problem: The "Half-Finished" State Trap
In imperative scripting (Python, Bash, Go), consider a script that does three things:
1. Shut down the primary database.
2. Migrate the data to a new server.
3. Update the DNS routing.
If the script crashes at Step 2 (maybe the network drops, or the server runs out of memory), you are left in a catastrophic "Half-Finished" state. The primary database is off, the new one isn't ready, and the script is dead. To recover, a human has to manually figure out exactly where it failed and carefully reverse the steps.
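The trap can be reproduced in a few lines of Python. The function names and the simulated network failure here are illustrative, not a real migration tool:

```python
# Hypothetical three-step migration script. Step 2 raises mid-run,
# leaving the system in a half-finished state.

def shutdown_primary(state):
    state["primary"] = "down"

def migrate_data(state):
    # Simulate the crash: the network drops during migration.
    raise ConnectionError("network dropped during migration")

def update_dns(state):
    state["route"] = "replica"

def run_migration(state):
    shutdown_primary(state)   # Step 1 succeeds
    migrate_data(state)       # Step 2 crashes...
    update_dns(state)         # ...so Step 3 never runs

state = {"primary": "up", "route": "primary"}
try:
    run_migration(state)
except ConnectionError:
    pass

# Half-finished: the primary is down, but traffic still routes to it.
print(state)  # {'primary': 'down', 'route': 'primary'}
```

The script has no record of how far it got; recovering requires a human to inspect the state and reverse the completed steps by hand.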
The Code: The Failure Recovery Loop in Ved
Because Ved doesn't execute step-by-step scripts, it doesn't have "half-finished" states. It models the system as a continuous loop that can gracefully degrade and recover.
Here is how Ved handles a critical infrastructure failover and recovery loop:
```ved
domain DatabaseRouting {
    state {
        primary_status: string = "healthy"
        active_route: string = "primary"
        alert_sent: bool = false
    }

    // The Domain only rests if it is in one of two acceptable realities:
    // Reality A: Everything is fine.
    // Reality B: We are safely degraded, using the replica, and humans are notified.
    goal RoutingStable {
        predicate
            (primary_status == "healthy" && active_route == "primary") ||
            (primary_status == "failed" && active_route == "replica" && alert_sent == true)
    }

    transition ExecuteFailover {
        step {
            // If the primary dies, immediately route to the replica
            if primary_status == "failed" && active_route == "primary" {
                emit Network.UpdateDNS("replica")
                active_route = "replica"
            }
        }
    }

    transition TriggerPagerDuty {
        step {
            // If the primary dies, alert the humans
            if primary_status == "failed" && alert_sent == false {
                emit Alerts.PageEngineers("Primary DB Down")
                alert_sent = true
            }
        }
    }

    transition AutoRestore {
        step {
            // If the primary comes back online, heal the system
            if primary_status == "healthy" && active_route == "replica" {
                emit Network.UpdateDNS("primary")
                active_route = "primary"
                alert_sent = false // Reset the alert flag for the future
            }
        }
    }
}
```
How it Executes (The Chaos Scenario)
- Normal Operations: `primary_status` is `"healthy"` and `active_route` is `"primary"`. Goal Reality A is met. The Domain sleeps.
- The Crash: A massive cloud outage takes down the primary database. The Effect Adapter notifies the mailbox, and the state updates: `primary_status = "failed"`.
- The Degradation: The goal is instantly broken. The Ved runtime wakes up and evaluates the transitions. Both `ExecuteFailover` and `TriggerPagerDuty` are now valid.
- Parallel Execution: The runtime emits intents to update the DNS and page the engineers. Once both succeed, the state is updated.
- Safe Fallback: The Domain evaluates the goal again. It now matches Reality B (safely degraded and alerted). The Domain goes back to sleep, having successfully stopped the bleeding.
- The Auto-Recovery: Two hours later, an engineer fixes the primary DB. The Effect Adapter updates `primary_status` to `"healthy"`. Reality B is suddenly broken! The runtime wakes up, triggers `AutoRestore`, routes traffic back to the primary, and returns to Reality A.
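The control loop described above can be approximated in plain Python. This is a sketch of the idea under the semantics this section describes, not Ved's actual runtime; the `stabilize` loop and the `emitted` list stand in for the scheduler and the intent mailbox:

```python
# Plain-Python sketch of the Ved reconciliation loop. The guards and
# effects mirror the DatabaseRouting domain; runtime mechanics
# (mailbox, parallel intents, retries) are simplified away.

state = {"primary_status": "healthy", "active_route": "primary", "alert_sent": False}
emitted = []  # stands in for emitted intents (DNS updates, pages)

def goal_met(s):
    # Reality A: everything fine. Reality B: safely degraded and alerted.
    return ((s["primary_status"] == "healthy" and s["active_route"] == "primary") or
            (s["primary_status"] == "failed" and s["active_route"] == "replica"
             and s["alert_sent"]))

def execute_failover(s):
    if s["primary_status"] == "failed" and s["active_route"] == "primary":
        emitted.append("Network.UpdateDNS(replica)")
        s["active_route"] = "replica"

def trigger_pagerduty(s):
    if s["primary_status"] == "failed" and not s["alert_sent"]:
        emitted.append("Alerts.PageEngineers")
        s["alert_sent"] = True

def auto_restore(s):
    if s["primary_status"] == "healthy" and s["active_route"] == "replica":
        emitted.append("Network.UpdateDNS(primary)")
        s["active_route"] = "primary"
        s["alert_sent"] = False

def stabilize(s):
    # Keep firing whichever transitions' guards hold until the goal is met.
    # (A real runtime would also bound retries; omitted for brevity.)
    while not goal_met(s):
        for transition in (execute_failover, trigger_pagerduty, auto_restore):
            transition(s)

state["primary_status"] = "failed"   # the crash
stabilize(state)                     # converges to Reality B
state["primary_status"] = "healthy"  # the fix, two hours later
stabilize(state)                     # converges back to Reality A
print(state["active_route"], state["alert_sent"])  # primary False
```

Note that there is no crash path: every external event is just a state change, and `stabilize` converges to whichever acceptable reality is reachable.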
Behavior
- Failover, alerting, and restoration happen automatically
- The system always stabilizes in one of its declared goal states
Key Takeaways:
1. Rejection of "Panic" States
In traditional code, an error is an exception that causes the thread to panic and crash. In Ved, an error is just a change in state. The system doesn't panic. It simply says, "Okay, the state has changed. Are my goals still met? No? What valid moves do I have to reach a new goal?" This keeps the control plane entirely rational, even when the underlying infrastructure is on fire.
2. Codified "Graceful Degradation"
Normally, fallback logic is an afterthought tucked inside a catch block. In this Ved program, the fallback state (Reality B) is elevated to a first-class mathematical Goal. The compiler understands that running off a replica database is a perfectly valid, stable way for the system to exist, provided the engineers have been paged.
3. Independent Recovery Actions
Notice that ExecuteFailover and TriggerPagerDuty are separate transitions. Why is this important? Because if PagerDuty's API is temporarily down, the TriggerPagerDuty transition will fail and retry. But because it is decoupled, it does not block the DNS failover. The system will still save the database routing, even if the alerting system is broken. In a top-to-bottom Python script, a failed PagerDuty API call might crash the script before it ever reaches the DNS failover command.
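The decoupling can be demonstrated in a few lines of Python: when each action is attempted independently and a failure is recorded rather than propagated, a dead paging API cannot block the DNS update. The function names are illustrative:

```python
# Illustrative: each recovery action is attempted independently,
# so one failing dependency (here, a paging API) cannot block the other.

def update_dns():
    return "dns-updated"

def page_engineers():
    raise TimeoutError("paging API unreachable")

results = {}
for name, action in [("failover", update_dns), ("alert", page_engineers)]:
    try:
        results[name] = action()
    except Exception as exc:
        results[name] = f"retry later: {exc}"  # queued, not fatal

print(results["failover"])  # dns-updated -- succeeded despite the failed page
```

A sequential script (`page_engineers(); update_dns()`) would have died before the DNS call; here the failed action is simply left for the next pass of the loop.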
4. "Self-Healing" is Free
The AutoRestore transition is beautiful because it requires no human intervention. Once the engineer fixes the physical database server, they don't have to remember to run a "restore-dns.sh" script. The Ved control loop notices the health check pass and autonomously walks the system back up the stairs to its optimal state.
Summary
Recovery becomes continuous stabilization: the system is never stuck halfway, only ever converging toward one of its declared goal states.