Mid-cap fintech

SRE platform for a real-time trading system

Built an SLO framework, an observability stack and a chaos-engineering practice, cutting mean time to recovery from 47 to 6 minutes.

At a glance

Client: Mid-cap fintech
Industry: Financial services
Project duration: 6 months + ongoing operations
Team: 2 SREs, 1 engineer

Starting point & goal

A mid-sized fintech runs a real-time trading system where every minute of downtime costs money directly. Incidents were handled reactively and ad hoc, with no clear metrics and no dependable alerting.

Challenges

  • Mean time to recovery of 47 minutes for critical incidents
  • No shared understanding of availability: SLOs were missing
  • Alert fatigue from hundreds of unprioritised alerts
  • High load peaks at market open

Implementation

We established an SLO framework that makes availability measurable and built an observability stack that surfaces causes rather than symptoms. Runbooks, on-call processes and regular chaos-engineering exercises keep the team ready to act before an emergency happens.
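To illustrate what "making availability measurable" can look like in practice, here is a minimal Go sketch of an error budget and burn-rate calculation. It is not the client's implementation: the `SLO` type, its field, the 99.9% target and the request counts are all hypothetical values chosen for the example.

```go
package main

import "fmt"

// SLO describes an availability target for a service.
// The type and its field are hypothetical, for illustration only.
type SLO struct {
	Target float64 // e.g. 0.999 means 99.9% of requests must succeed
}

// ErrorBudget returns the fraction of requests that may fail
// without violating the target.
func (s SLO) ErrorBudget() float64 {
	return 1 - s.Target
}

// BurnRate reports how fast the error budget is being consumed
// in an observation window. A value of 1.0 means failures arrive
// exactly at the rate the budget allows; higher values mean the
// budget would run out early.
func (s SLO) BurnRate(failed, total float64) float64 {
	if total == 0 {
		return 0 // no traffic, nothing burned
	}
	return (failed / total) / s.ErrorBudget()
}

func main() {
	// Assumed example numbers: 50 failed requests out of 10,000.
	slo := SLO{Target: 0.999}
	rate := slo.BurnRate(50, 10000)
	fmt.Printf("burn rate: %.1f\n", rate)
}
```

Alerting on burn rate rather than on individual symptoms is one common way to prioritise pages: a sustained high burn rate pages someone, while a slow burn can become a ticket.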

Tech stack

Infrastructure

  • Kubernetes
  • Terraform

DevOps & observability

  • Prometheus
  • Grafana
  • Loki
  • PagerDuty

Languages & frameworks

  • Go
  • TypeScript

Results

01  Mean time to recovery reduced from 47 to 6 minutes

02  99.98% availability in the first year of operations

03  Alert volume cut by 80%