Engineering Systems
Building an Engineering Operating System That Scales
Reduced recurring delivery friction by making architecture decisions, release expectations, and team context easier to share and reuse.
Situation
At FinanceOps, over the course of roughly three years with a team of 10 to 20 engineers, we were shipping, but too much of the work still depended on memory, interpretation, and experienced people manually filling gaps. That is common in growing environments, but it becomes expensive as complexity and expectations rise.
An earlier architectural choice illustrated this cost clearly. The original stack optimized for speed and familiarity, but for a platform with transactional and reporting-heavy requirements, the data model became a poor fit over time.
Join-heavy data patterns led to performance issues that eventually required a full database migration — a costly undertaking during which machines had to be overprovisioned just to keep basic workflows running. The migration to the new architecture has been underway for roughly three months, with another two to three months expected before it is fully complete. New clients are onboarded directly onto the new architecture, while legacy integrations are still being ported.
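The incremental cutover described above can be sketched as a thin routing layer: each request goes to whichever backend currently owns that client. This is a minimal illustration, not the actual FinanceOps code; all names (`MIGRATED_CLIENTS`, `legacy_backend`, `new_backend`) are hypothetical.

```python
# Hypothetical sketch of routing during an incremental migration.
MIGRATED_CLIENTS = {"client-b"}  # clients already ported to the new stack


def legacy_backend(client_id: str) -> str:
    return f"legacy:{client_id}"


def new_backend(client_id: str) -> str:
    return f"new:{client_id}"


def route(client_id: str, is_new_client: bool) -> str:
    # New clients onboard directly onto the new architecture; existing
    # clients stay on the legacy stack until their integration is ported.
    if is_new_client or client_id in MIGRATED_CLIENTS:
        return new_backend(client_id)
    return legacy_backend(client_id)


print(route("client-a", is_new_client=False))  # legacy:client-a
print(route("client-b", is_new_client=False))  # new:client-b
print(route("client-c", is_new_client=True))   # new:client-c
```

Porting a legacy integration then reduces to adding its client to the migrated set, which keeps the cutover reviewable and reversible.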
Stakes
Without a stronger operating model, quality becomes uneven, important decisions get re-litigated, and execution starts feeling more expensive than it should. Without built-in observability, the team relied on hunches instead of data until custom tooling was built — a gap that made every operational decision slower than it needed to be.
My role and scope
I focused on improving the mechanisms around the team:
- Decision making
- Documentation
- Testing expectations
- Release confidence
- Architectural clarity
- Stakeholder visibility
Constraints
- The team still had to ship
- There was limited room for heavy process
- Improvements had to be adopted naturally, not enforced mechanically
- The operating model had to support different kinds of work
Approach
I looked for repeated points of friction and designed lighter-weight systems to reduce them.
Examples included:
- Clearer documentation for important technical decisions
- Stronger expectations around what "ready to ship" means
- More consistent quality and smoke-test thinking
- Better visibility into risks and assumptions
- More reusable context so less depended on tribal knowledge
Key decisions
Write down the important things
Reusable context is one of the highest-leverage forms of engineering infrastructure. When decisions, rationale, and context live in writing, they compound over time instead of fading with memory.
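One lightweight way to make decisions compound is an architecture decision record (ADR): a short, numbered document capturing context, decision, and consequences. A minimal scaffolding sketch follows; the template fields and layout are illustrative, not the team's actual convention.

```python
# Hypothetical ADR scaffolding: render a new decision record from a template.
from datetime import date

ADR_TEMPLATE = """\
# {number:03d}. {title}

Date: {today}
Status: Proposed

## Context
What problem are we solving, and which constraints apply?

## Decision
What we chose, stated plainly.

## Consequences
Trade-offs we accept, and what becomes easier or harder.
"""


def render_adr(number: int, title: str) -> str:
    # Callers would write the result to e.g. a docs/decisions/ folder.
    return ADR_TEMPLATE.format(number=number, title=title, today=date.today())


print(render_adr(7, "Move reporting reads to the event store"))
```

The value is less in the tooling than in the habit: a reviewable artifact exists for every consequential decision.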
Improve decision quality before adding process
Many teams add meetings before fixing clarity. I prefer the reverse. Sharper inputs — clearer problem statements, better-scoped options, explicit trade-offs — tend to resolve more than an extra calendar invite ever will.
Approach rewrites with intention, not frustration
After three years on the original architecture, a clearer understanding of client types, workloads, and usage patterns made a rewrite viable. I approached it with a set of principles:
- Involve stakeholders earlier so alignment happens before code is written
- Document first to force clarity on goals and boundaries
- Define boundaries early so teams know what they own
- Design for observability from day one rather than retrofitting it later
- Share the architecture rather than impose it — discuss individual pieces with the people who will own them
- Make drift harder by anchoring decisions in written, reviewable artifacts
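"Design for observability from day one" can mean something as small as every operation emitting a structured, timed event by default. The sketch below shows one way to do that with a context manager; the field names and the `port_legacy_integration` operation are assumptions for illustration.

```python
# Minimal sketch: every operation emits a structured event with timing,
# rather than bolting logging on later. Field names are illustrative.
import json
import time
from contextlib import contextmanager

EVENTS = []  # in production this would ship to a log/metrics pipeline


@contextmanager
def observed(operation: str, **fields):
    start = time.monotonic()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "error"
        raise
    finally:
        record = {
            "operation": operation,
            "outcome": outcome,
            "duration_ms": round((time.monotonic() - start) * 1000, 2),
            **fields,
        }
        EVENTS.append(record)
        print(json.dumps(record))


with observed("port_legacy_integration", client="client-a"):
    time.sleep(0.01)  # stand-in for real work
```

Because the wrapper records success, failure, and duration uniformly, operational questions can be answered from data instead of hunches.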
Stakeholders aligned earlier, drift became harder, and the team felt genuine ownership.
Make safer execution the easier path
The best standards are the ones teams can follow naturally. When the safe choice is also the easy choice, compliance becomes a side effect of good tooling rather than a burden on discipline.
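In practice this can be as simple as one command that runs the release smoke checks, so shipping safely takes no extra discipline. The individual checks below are illustrative placeholders, not the team's actual gates.

```python
# Hypothetical release gate: one entry point runs all smoke checks.
def check_migrations_applied() -> bool:
    return True  # e.g. compare schema version against the migration folder


def check_health_endpoint() -> bool:
    return True  # e.g. probe the release candidate's health endpoint


def check_critical_flow() -> bool:
    return True  # e.g. create and settle a test transaction


SMOKE_CHECKS = [check_migrations_applied, check_health_endpoint, check_critical_flow]


def ready_to_ship() -> bool:
    failures = [check.__name__ for check in SMOKE_CHECKS if not check()]
    for name in failures:
        print(f"FAILED: {name}")
    return not failures


print("ship" if ready_to_ship() else "hold")
```

When the checklist is executable, "ready to ship" stops being a matter of interpretation.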
Outcomes
- Less re-discussing of past decisions
- Clearer expectations around release readiness
- Safer handoffs across engineering and stakeholders
- Lower dependence on tribal knowledge to keep delivery moving
- Faster queries and lower infrastructure costs after migrating services to the new architecture
- Higher throughput and capacity — the new event-based architecture is scalable, extensible, and maintainable, allowing new requirements without disrupting mission-critical services
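The extensibility claim above rests on a simple property of event-based designs: a new requirement becomes a new subscriber, with no change to existing handlers. A minimal publish/subscribe sketch, with hypothetical event names and handlers:

```python
# Sketch of event-driven extensibility: new behavior plugs in as a new
# subscriber without touching existing, mission-critical handlers.
from collections import defaultdict

SUBSCRIBERS = defaultdict(list)


def subscribe(event_type: str):
    def register(handler):
        SUBSCRIBERS[event_type].append(handler)
        return handler
    return register


def publish(event_type: str, payload: dict) -> list:
    # Fan the event out to every registered handler, in registration order.
    return [handler(payload) for handler in SUBSCRIBERS[event_type]]


@subscribe("transaction.settled")
def update_ledger(payload):
    return f"ledger += {payload['amount']}"


# A later requirement plugs in alongside, without editing update_ledger:
@subscribe("transaction.settled")
def refresh_report(payload):
    return f"report refreshed for {payload['client']}"


results = publish("transaction.settled", {"amount": 100, "client": "client-a"})
print(results)
```

Because producers never know who consumes an event, new features land as additions rather than edits, which is what keeps existing services undisturbed.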
What I learned
A strong engineering culture is not only values and hiring. It is also the set of repeatable mechanisms that help a team think, decide, and ship well under pressure.
Not all shortcuts are equal. Optimizing for familiarity is fine only when the decision is easy to reverse later. Foundational choices around data model and observability become very expensive if treated like temporary shortcuts.
And rewrites should be driven by clearer understanding, not frustration — architecture should be intentional and shared.