Engineering Systems
Building an Engineering Operating System That Scales
Reduced recurring delivery friction by making architecture decisions, release expectations, and team context easier to share and reuse.
Situation
At FinanceOps, over the course of roughly three years with a team of 10 to 20 engineers, we were shipping, but too much of the work still depended on memory, interpretation, and experienced people manually filling gaps. That is common in growing environments, but it becomes expensive as complexity and expectations rise.
An earlier architectural choice illustrated this cost clearly. The original stack optimized for speed and familiarity, but for a platform with transactional and reporting-heavy requirements, the data model became a poor fit over time.
Join-heavy data patterns led to performance issues that eventually required a full database migration — a costly undertaking during which machines had to be overprovisioned just to keep basic workflows running. The migration to the new architecture has been underway for roughly three months, with another two to three months expected before it is fully complete. New clients are onboarded directly onto the new architecture, while legacy integrations are still being ported.
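The incremental cutover described above can be sketched as a thin routing layer: each request goes to whichever backend currently owns that client. This is a minimal illustration, not the actual FinanceOps code; all names (`MIGRATED_CLIENTS`, `legacy_backend`, `new_backend`) are hypothetical.

```python
# Hypothetical sketch of routing during an incremental migration.
MIGRATED_CLIENTS = {"client-b"}  # clients already ported to the new stack


def legacy_backend(client_id: str) -> str:
    return f"legacy:{client_id}"


def new_backend(client_id: str) -> str:
    return f"new:{client_id}"


def route(client_id: str, is_new_client: bool) -> str:
    # New clients onboard directly onto the new architecture; existing
    # clients stay on the legacy stack until their integration is ported.
    if is_new_client or client_id in MIGRATED_CLIENTS:
        return new_backend(client_id)
    return legacy_backend(client_id)


print(route("client-a", is_new_client=False))  # legacy:client-a
print(route("client-b", is_new_client=False))  # new:client-b
print(route("client-c", is_new_client=True))   # new:client-c
```

Porting a legacy integration then reduces to adding its client to the migrated set, which keeps the cutover reviewable and reversible.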
Stakes
Without a stronger operating model, quality becomes uneven, important decisions get re-litigated, and execution starts feeling more expensive than it should. Without built-in observability, the team relied on hunches instead of data until custom tooling was built — a gap that made every operational decision slower than it needed to be.
My role and scope
I focused on improving the mechanisms around the team:
- Decision making
- Documentation
- Testing expectations
- Release confidence
- Architectural clarity
- Stakeholder visibility
Constraints
- The team still had to ship
- There was limited room for heavy process
- Improvements had to be adopted naturally, not enforced mechanically
- The operating model had to support different kinds of work
Approach
I looked for repeated points of friction and designed lighter-weight systems to reduce them.
Examples included:
- Clearer documentation for important technical decisions
- Stronger expectations around what "ready to ship" means
- More consistent quality and smoke-test thinking
- Better visibility into risks and assumptions
- More reusable context so less depended on tribal knowledge
Key decisions
Write down the important things
Reusable context is one of the highest-leverage forms of engineering infrastructure. When decisions, rationale, and context live in writing, they compound over time instead of fading with memory.
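One lightweight way to make decisions compound is an architecture decision record (ADR): a short, numbered document capturing context, decision, and consequences. A minimal scaffolding sketch follows; the template fields and layout are illustrative, not the team's actual convention.

```python
# Hypothetical ADR scaffolding: render a new decision record from a template.
from datetime import date

ADR_TEMPLATE = """\
# {number:03d}. {title}

Date: {today}
Status: Proposed

## Context
What problem are we solving, and which constraints apply?

## Decision
What we chose, stated plainly.

## Consequences
Trade-offs we accept, and what becomes easier or harder.
"""


def render_adr(number: int, title: str) -> str:
    # Callers would write the result to e.g. a docs/decisions/ folder.
    return ADR_TEMPLATE.format(number=number, title=title, today=date.today())


print(render_adr(7, "Move reporting reads to the event store"))
```

The value is less in the tooling than in the habit: a reviewable artifact exists for every consequential decision.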
Improve decision quality before adding process
Many teams add meetings before fixing clarity. I prefer the reverse. Sharper inputs — clearer problem statements, better-scoped options, explicit trade-offs — tend to resolve more than an extra calendar invite ever will.
Approach rewrites with intention, not frustration
After three years on the original architecture, a clearer understanding of client types, workloads, and usage patterns made a rewrite viable. I approached it with a set of principles:
- Involve stakeholders earlier so alignment happens before code is written
- Document first to force clarity on goals and boundaries
- Define boundaries early so teams know what they own
- Design for observability from day one rather than retrofitting it later
- Share the architecture rather than impose it — discuss individual pieces with the people who will own them
- Make drift harder by anchoring decisions in written, reviewable artifacts
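"Design for observability from day one" can mean something as small as every operation emitting a structured, timed event by default. The sketch below shows one way to do that with a context manager; the field names and the `port_legacy_integration` operation are assumptions for illustration.

```python
# Minimal sketch: every operation emits a structured event with timing,
# rather than bolting logging on later. Field names are illustrative.
import json
import time
from contextlib import contextmanager

EVENTS = []  # in production this would ship to a log/metrics pipeline


@contextmanager
def observed(operation: str, **fields):
    start = time.monotonic()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "error"
        raise
    finally:
        record = {
            "operation": operation,
            "outcome": outcome,
            "duration_ms": round((time.monotonic() - start) * 1000, 2),
            **fields,
        }
        EVENTS.append(record)
        print(json.dumps(record))


with observed("port_legacy_integration", client="client-a"):
    time.sleep(0.01)  # stand-in for real work
```

Because the wrapper records success, failure, and duration uniformly, operational questions can be answered from data instead of hunches.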
Stakeholders aligned earlier, drift became harder, and the team felt genuine ownership.
Make safer execution the easier path
The best standards are the ones teams can follow naturally. When the safe choice is also the easy choice, compliance becomes a side effect of good tooling rather than a burden on discipline.
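In practice this can be as simple as one command that runs the release smoke checks, so shipping safely takes no extra discipline. The individual checks below are illustrative placeholders, not the team's actual gates.

```python
# Hypothetical release gate: one entry point runs all smoke checks.
def check_migrations_applied() -> bool:
    return True  # e.g. compare schema version against the migration folder


def check_health_endpoint() -> bool:
    return True  # e.g. probe the release candidate's health endpoint


def check_critical_flow() -> bool:
    return True  # e.g. create and settle a test transaction


SMOKE_CHECKS = [check_migrations_applied, check_health_endpoint, check_critical_flow]


def ready_to_ship() -> bool:
    failures = [check.__name__ for check in SMOKE_CHECKS if not check()]
    for name in failures:
        print(f"FAILED: {name}")
    return not failures


print("ship" if ready_to_ship() else "hold")
```

When the checklist is executable, "ready to ship" stops being a matter of interpretation.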
Outcomes
- Less re-discussing of past decisions
- Clearer expectations around release readiness
- Safer handoffs across engineering and stakeholders
- Lower dependence on tribal knowledge to keep delivery moving
- Faster queries and lower infrastructure costs after migrating services to the new architecture
- Higher throughput and capacity — the new event-based architecture is scalable, extensible, and maintainable, allowing new requirements without disrupting mission-critical services
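The extensibility claim above rests on a simple property of event-based designs: a new requirement becomes a new subscriber, with no change to existing handlers. A minimal publish/subscribe sketch, with hypothetical event names and handlers:

```python
# Sketch of event-driven extensibility: new behavior plugs in as a new
# subscriber without touching existing, mission-critical handlers.
from collections import defaultdict

SUBSCRIBERS = defaultdict(list)


def subscribe(event_type: str):
    def register(handler):
        SUBSCRIBERS[event_type].append(handler)
        return handler
    return register


def publish(event_type: str, payload: dict) -> list:
    # Fan the event out to every registered handler, in registration order.
    return [handler(payload) for handler in SUBSCRIBERS[event_type]]


@subscribe("transaction.settled")
def update_ledger(payload):
    return f"ledger += {payload['amount']}"


# A later requirement plugs in alongside, without editing update_ledger:
@subscribe("transaction.settled")
def refresh_report(payload):
    return f"report refreshed for {payload['client']}"


results = publish("transaction.settled", {"amount": 100, "client": "client-a"})
print(results)
```

Because producers never know who consumes an event, new features land as additions rather than edits, which is what keeps existing services undisturbed.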
What I learned
A strong engineering culture is not only values and hiring. It is also the set of repeatable mechanisms that help a team think, decide, and ship well under pressure.
Not all shortcuts are equal. Optimizing for familiarity is fine only when the decision is easy to reverse later. Foundational choices around data model and observability become very expensive if treated like temporary shortcuts.
And rewrites should be driven by clearer understanding, not frustration — architecture should be intentional and shared.