Security

How Failkit is built.

Last updated: 2026-04-28

1. Per-tenant isolation

Every Failkit customer gets their own Kubernetes namespace, their own Postgres database, and their own subdomain (<customer>.failkit.com). Workloads do not share processes, network policies, or DNS records with other tenants.

Compute: dedicated namespace with NetworkPolicy denying cross-namespace traffic by default.
Database: dedicated CNPG-managed Postgres database with a tenant-scoped role; cross-database access is not granted.
Defense-in-depth: table-level Row-Level Security (RLS) policies enforce an app.current_tenant GUC even within a tenant’s own DB, so if a code path forgets to scope a query, RLS blocks it.
Identity: tenant admin accounts are created in the tenant’s own DB; the management plane mints short-lived bootstrap tokens (5-minute TTL, replay-protected) to seed them.
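The bootstrap-token flow can be illustrated with a minimal Python sketch. It assumes an HMAC-SHA256-signed token of the form `tenant.issued_at.nonce.signature` and an in-memory set of redeemed nonces. The token format, the `BootstrapTokenIssuer` class, and its method names are hypothetical, not Failkit’s actual implementation; only the 5-minute TTL and the single-use (replay-protected) property come from this page.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional, Set

TOKEN_TTL_SECONDS = 5 * 60  # 5-minute TTL, as stated on this page


class BootstrapTokenIssuer:
    """Hypothetical issuer of single-use, short-lived bootstrap tokens."""

    def __init__(self, signing_key: bytes) -> None:
        self._key = signing_key
        # Nonces already redeemed; in-memory only, so this sketch forgets them on restart.
        self._redeemed: Set[str] = set()

    def _sign(self, payload: str) -> str:
        return hmac.new(self._key, payload.encode(), hashlib.sha256).hexdigest()

    def mint(self, tenant: str, now: Optional[float] = None) -> str:
        issued_at = int(time.time() if now is None else now)
        nonce = secrets.token_hex(16)  # random per-token value used for replay protection
        payload = f"{tenant}.{issued_at}.{nonce}"
        return f"{payload}.{self._sign(payload)}"

    def redeem(self, token: str, tenant: str, now: Optional[float] = None) -> bool:
        """Accept a token at most once, for the right tenant, within the TTL."""
        try:
            tok_tenant, issued_at_s, nonce, sig = token.rsplit(".", 3)
            issued_at = int(issued_at_s)
        except ValueError:
            return False  # malformed token
        expected = self._sign(f"{tok_tenant}.{issued_at_s}.{nonce}")
        if not hmac.compare_digest(sig.encode(), expected.encode()):
            return False  # forged or tampered
        if tok_tenant != tenant:
            return False  # minted for a different tenant
        current = time.time() if now is None else now
        if not 0 <= current - issued_at <= TOKEN_TTL_SECONDS:
            return False  # expired (or issued in the future)
        if nonce in self._redeemed:
            return False  # replay of an already-used token
        self._redeemed.add(nonce)
        return True
```

A production version would presumably keep redeemed nonces in shared storage so that replay protection survives restarts and holds across management-plane replicas.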

Enterprise customers can opt into dedicated tenancy: a separate cluster on isolated infrastructure with no multi-tenant components.

2. Encryption

In transit

All public traffic uses TLS 1.2+ with modern cipher suites. Certificates are issued by Let’s Encrypt via DNS-01 ACME and renewed automatically.
Replication agents communicate with the management plane over mTLS using a per-tenant client certificate signed by an internal CA.
Internal pod-to-pod traffic is constrained by Calico NetworkPolicies; sensitive paths additionally use mTLS via the service mesh.
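As a rough illustration of the agent-to-management-plane mTLS described above, Python’s standard `ssl` module can configure both ends of the connection. The function names, the file parameters, and the convention of carrying the tenant ID in the client certificate’s common name are assumptions made for this sketch, not a description of Failkit’s code.

```python
import ssl
from typing import Optional


def agent_client_context(ca_file: Optional[str] = None,
                         cert_file: Optional[str] = None,
                         key_file: Optional[str] = None) -> ssl.SSLContext:
    """Agent side: present the tenant's client certificate and verify the plane."""
    # PROTOCOL_TLS_CLIENT enables certificate and hostname verification by default
    # and, unlike create_default_context(), loads no system CAs.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)  # trust the internal CA only
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def plane_server_context(ca_file: Optional[str] = None,
                         cert_file: Optional[str] = None,
                         key_file: Optional[str] = None) -> ssl.SSLContext:
    """Management-plane side: require and verify a client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED  # handshake fails without a valid client cert
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)  # client certs must chain to the internal CA
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def tenant_from_peer_cert(cert: dict) -> Optional[str]:
    """Read the tenant ID from the subject CN in SSLSocket.getpeercert() output (assumed convention)."""
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None
```

Loading only the internal CA on both sides means a certificate from a public CA, however valid, is not accepted as an agent identity.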

At rest

Postgres data is encrypted at rest using cloud-provider KMS (AES-256). Backups are encrypted with the same key envelope.
Replicated VM payloads are encrypted by the agent before leaving Customer’s environment, using keys stored in Customer’s own cloud-provider KMS. Failkit cannot decrypt those payloads.
Object-store buckets used as replication targets sit in Customer’s own cloud account, under Customer’s own IAM policies.

3. Access control

Customer access

Tenant portals support email + password by default (bcrypt-hashed with a cost factor of 12).
Professional and Enterprise tiers include SSO via SAML 2.0 and OIDC.
Role-based access control inside the tenant: owner, admin, operator, auditor. The auditor role cannot trigger drills or modify protection groups.
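A deny-by-default permission check for these roles might look like the following Python sketch. Only the auditor restriction (no drill triggers, no protection-group changes) comes from this page; the action names and the owner, admin, and operator permission sets are hypothetical placeholders.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Role(Enum):
    OWNER = "owner"
    ADMIN = "admin"
    OPERATOR = "operator"
    AUDITOR = "auditor"


# Hypothetical permission matrix. Only the AUDITOR row reflects this page;
# the other rows are illustrative assumptions.
PERMISSIONS: Dict[Role, FrozenSet[str]] = {
    Role.OWNER: frozenset({"view_reports", "trigger_drill", "modify_protection_group", "manage_users"}),
    Role.ADMIN: frozenset({"view_reports", "trigger_drill", "modify_protection_group", "manage_users"}),
    Role.OPERATOR: frozenset({"view_reports", "trigger_drill", "modify_protection_group"}),
    Role.AUDITOR: frozenset({"view_reports"}),
}


def is_allowed(role: Role, action: str) -> bool:
    """Deny by default: an action is allowed only if it is listed for the role."""
    return action in PERMISSIONS.get(role, frozenset())
```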

Failkit employee access

Failkit staff cannot access Customer Data without an explicit, time-bound support request from Customer.
Production access is granted via short-lived (8-hour) credentials brokered by an SSO-fronted bastion. All access is logged.
Production secrets are managed in HashiCorp Vault; no long-lived credentials are stored on workstations.
Mandatory hardware MFA (FIDO2) for all employees with production access.

4. Replication & drill data path

The replication path is designed so Failkit’s management plane never holds Customer VM payloads:

The on-prem agent reads VM blocks from the source hypervisor (vSphere or Hyper-V).
Blocks are encrypted client-side and uploaded directly to the customer-controlled object-store bucket (S3, Azure Blob, GCS).
The agent reports metadata to failkit.com: byte counts, checksums, and observed RPO. The bytes themselves never traverse Failkit infrastructure.
During a drill, the orchestrator instructs the customer’s cloud account to spin up VMs from the replicated blocks. Drill VMs run on Customer’s own compute, billed to Customer’s own cloud account.
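The agent side of this data path can be sketched in Python. Encryption and upload are injected as callables (`encrypt`, `upload`) standing in for a customer-KMS-backed cipher and a client for the customer’s bucket; both hooks, the object-key layout, and the `ReplicationReport` fields are hypothetical. The sketch also assumes that checksums are taken over the encrypted objects and that observed RPO is the time from snapshot capture to upload completion. Its point is structural: the report returned for the management plane carries counts, checksums, and RPO, but no payload bytes.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class ReplicationReport:
    """Metadata sent to the management plane. Note: no field for payload bytes."""
    vm_id: str
    bytes_uploaded: int
    checksums: Tuple[str, ...]  # SHA-256 of each uploaded (encrypted) object
    observed_rpo_seconds: float


def replicate(vm_id: str,
              blocks: Iterable[bytes],
              encrypt: Callable[[bytes], bytes],
              upload: Callable[[str, bytes], None],
              snapshot_taken_at: float,
              now: Optional[float] = None) -> ReplicationReport:
    total = 0
    sums = []
    for i, block in enumerate(blocks):
        ciphertext = encrypt(block)  # hypothetical hook backed by the customer's KMS key
        upload(f"{vm_id}/block-{i:08d}", ciphertext)  # hypothetical hook: customer's bucket
        total += len(ciphertext)
        sums.append(hashlib.sha256(ciphertext).hexdigest())
    finished = time.time() if now is None else now
    # Only this metadata object is meant to be sent to the management plane.
    return ReplicationReport(vm_id, total, tuple(sums), finished - snapshot_taken_at)
```

Because `encrypt` runs before `upload`, only ciphertext reaches the bucket, and only the small report object is left for the agent to send to failkit.com.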

A single Failkit incident cannot expose Customer VM data, because we don’t hold it.

5. Auditability

Every drill produces a signed PDF report with a timestamp, the measured RPO/RTO, a VM-by-VM boot result, and a SHA-256 hash chained to the prior report. Signatures use Ed25519 with a per-tenant key.
Reports are retained per the tenant’s configured retention policy (default 7 years for Enterprise, 1 year otherwise) and are downloadable directly from the portal.
All administrative actions (user changes, policy changes, drill triggers) emit immutable audit events stored alongside the tenant’s own DB; auditors can be granted read-only access.
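The report chaining described in the first item can be illustrated with a short Python sketch that links each report to its predecessor by SHA-256 and then verifies the chain. It hashes a canonical JSON rendering of a report rather than the PDF itself, and it starts each tenant’s chain from a hypothetical all-zeros genesis value. The Ed25519 signature step is omitted because Python’s standard library has no Ed25519 implementation; in practice each entry’s hash would also be signed with the tenant’s key.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # hypothetical anchor for the first report in a tenant's chain


def report_digest(report: dict, prev_hash: str) -> str:
    # Canonical JSON (sorted keys, fixed separators) so identical reports hash identically.
    body = json.dumps({"report": report, "prev": prev_hash},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode()).hexdigest()


def append(chain: List[dict], report: dict) -> dict:
    """Add a report, binding it to the hash of the previous entry."""
    prev = chain[-1]["hash"] if chain else GENESIS
    entry = {"report": report, "prev": prev, "hash": report_digest(report, prev)}
    chain.append(entry)
    return entry


def verify(chain: List[dict]) -> bool:
    """Recompute every link; any edited or reordered report breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != report_digest(entry["report"], prev):
            return False
        prev = entry["hash"]
    return True
```

Altering any earlier report changes its recomputed hash, so `verify` fails at that entry, and every later report depends on it through its `prev` field.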

6. Vendor & sub-processor list

The list of sub-processors and the data each handles is on the Privacy page.

7. Compliance roadmap

SOC 2 Type I: target Q4 2026. Auditor selection in progress.
SOC 2 Type II: target H1 2027.
ISO 27001: target 2027 following SOC 2 Type II.
HIPAA: a Business Associate Agreement (BAA) is available for Enterprise on request. The platform’s data path is compatible (PHI never traverses Failkit infrastructure), but a BAA is signed per customer.
FedRAMP / IL-4: not on the near-term roadmap. Contact us if you need it.

8. Incident response

Failkit maintains a documented incident-response runbook. We will notify affected customers within 24 hours of confirming a security incident affecting their data, with a written postmortem within 14 days. We plan to publish the status of ongoing incidents at status.failkit.com.

9. Reporting a vulnerability

We welcome coordinated disclosure. Email [email protected] with details. The PGP key fingerprint will be published here once the inbox has been signed off. We commit to:

Acknowledging receipt within 2 business days.
Triaging within 5 business days.
Not pursuing legal action against good-faith researchers who follow this process.
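The acknowledgement and triage windows above are counted in business days. A minimal Python sketch of that arithmetic follows; it assumes business days are Monday through Friday, ignores public holidays, and starts counting on the day after receipt. The function names are illustrative.

```python
from datetime import date, timedelta
from typing import Dict


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (Mon-Fri; holidays ignored)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


def disclosure_deadlines(received: date) -> Dict[str, date]:
    return {
        "acknowledge_by": add_business_days(received, 2),
        "triage_by": add_business_days(received, 5),
    }
```

For a report received on Friday 1 May 2026, this gives an acknowledgement deadline of Tuesday 5 May and a triage deadline of Friday 8 May.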

Out of scope: denial-of-service testing, social engineering of Failkit employees, physical attacks on Failkit facilities or vendors.