There’s a question I ask at almost every first meeting with a new client. I ask it deliberately — almost provocatively.
“How do you find out when something goes wrong with your product?”
Over the years, I’ve heard dozens of variations of the same answer. Sometimes it’s “well, we have dashboards,” sometimes “our DevOps guy monitors the servers,” sometimes just a confused pause. But the most honest — and, unfortunately, the most common — answer sounds like this: “A customer messaged us.”
That means one thing: by the time you found out about the problem, it had already reached your customers. Someone couldn’t complete a payment, place an order, upload a file, or get a result. Someone wasted their time and left. And you found out from a message in a chat app or a negative review.
That’s exactly where the right conversation about monitoring begins.
In most companies, the word “monitoring” is associated with something technical and boring. It’s a job for a DevOps engineer, a set of graphs with CPU and memory metrics, uptime alerts at 3 a.m. Something important but peripheral — in the category of “need to have, but don’t need to understand.”
But the question isn’t “is the system running.” The question is “is the business logic working the way it’s supposed to.”
These are fundamentally different questions. And most monitoring systems only answer the first one.
Imagine you own a restaurant chain. You have thermometers on all the refrigerators, electricity meters, security cameras at the entrances. You know the equipment is running.
But that doesn’t mean the chef didn’t mix up ingredients. That the waiter didn’t forget table number seven. That the cashier didn’t make a mistake on the bill. That the customer who waited forty minutes for their order didn’t walk out and leave a review.
Technical monitoring tells you: “The fridge is working, the lights are on, the doors open.”
Business monitoring tells you: “The customer sat down, ordered, received their meal in 18 minutes, paid, and left a tip.”
One without the other is an incomplete picture. But most digital companies live with only the first part.
At Gart Solutions, we’ve arrived at a simple model that helps our clients see exactly where their blind spot is.
The first level — infrastructure.
This is the foundation. Servers, processors, memory, network, uptime. It answers one question: is the system on at all? Without this level, everything else is meaningless. But on its own, it gives you nothing beyond the basic assurance that “the light bulb hasn’t burned out.”
The second level — platform.
This is more complex. Databases: how many connections, how fast are queries executing, are there any delays? Message queues: how quickly are tasks being processed, is there a backlog building up? Load balancers, API gateways, inter-service communication. This level answers the question: “Are the gears turning?” Context starts to appear here — an overloaded queue signals that processes can’t keep up; a slow database explains rising latency. But even here, there’s still no answer to the question the business actually cares about.
The third level — business logic.
This is where the most important work begins — and where there’s most often a gaping hole. How many transactions successfully completed in the last hour? What share of users reaches the final step in the key scenario? How long does a critical operation take — and does that match what you promised the customer? Where exactly in the user flow do errors occur?
This level isn’t standardized. For an e-commerce store, it might be cart-to-payment conversion. For a SaaS product — time to first “aha moment” for a new user. For a logistics platform — the percentage of orders that moved from “accepted” to “delivered” without manual intervention. Every product has its own business logic, which is exactly why this level has to be built by hand, with a real understanding of how value is actually created.
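To make the third level concrete, here is a minimal sketch of one such metric: the share of users who reach each step of a key scenario. The step names and the shape of the event log are assumptions made for illustration, not taken from any particular product.

```python
from collections import defaultdict

# Hypothetical key scenario for an e-commerce store (assumed step names).
FUNNEL = ["cart", "checkout", "payment"]

def funnel_share(events, funnel=FUNNEL):
    """Return, per step, the share of users who entered the funnel
    and reached that step after passing all earlier steps.

    `events` is an iterable of (user_id, step) pairs.
    """
    reached = defaultdict(set)
    for user_id, step in events:
        reached[step].add(user_id)

    entered = reached[funnel[0]]
    if not entered:
        return {step: 0.0 for step in funnel}

    shares = {}
    survivors = set(entered)
    for step in funnel:
        survivors &= reached[step]  # keep only users who also hit this step
        shares[step] = len(survivors) / len(entered)
    return shares

events = [
    ("u1", "cart"), ("u1", "checkout"), ("u1", "payment"),
    ("u2", "cart"), ("u2", "checkout"),
    ("u3", "cart"),
    ("u4", "cart"),
]
shares = funnel_share(events)
```

A number like the share for the last step is what a business-level dashboard would track over time. Which steps count as the key scenario is a product decision, not an engineering one.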
One of our projects that illustrates this approach is elandfill.io, a platform developed by ReSource International.
At first glance, it sounds unusual: a digital platform for landfill management. But behind it sits a very serious business problem. Landfills are complex sites with strict regulatory requirements, environmental risks, and massive volumes of data. One of the core tasks is predicting methane emissions — which directly affects both safety and regulatory reporting.
To do this, the platform collects data from drones, converts it into 3D models of landfill sites, and provides real-time analytics. It sounds straightforward — but under the hood, it’s a complex distributed system with several interdependent components.
Picture this scenario: a specialist goes out to a landfill site, a drone captures a survey, and after returning, they upload the data into the system. The files are anything but small — between 2 and 10 gigabytes per survey.
After upload, a multi-stage process kicks off:
First, the file is received by the system and goes through initial validation. Then comes compression: the data is optimized for further processing. Next, transformation into a 3D model launches — the heaviest part of the entire process. The finished model is integrated with geospatial data and displayed on a map. Only after all of that can an analyst work with the result: assessing waste volumes, comparing surveys over time, forecasting environmental indicators.
This entire chain isn’t “press a button and wait.” It’s a sequence of dependent steps where a failure at any stage means no result at all. And the client’s expectation is concrete: they uploaded a file and expect to see a finished model within a reasonable amount of time.
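The chain of dependent steps can be sketched as a simple sequential runner that records how far each task got. This is an illustration only: the stage identifiers, the `TaskRun` structure, and the handler functions are hypothetical names, and the real platform coordinates these steps asynchronously through a message queue.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Assumed identifiers for the stages described above.
STAGES = ["validation", "compression", "transformation_3d", "geo_integration"]

@dataclass
class TaskRun:
    task_id: str
    completed: List[str] = field(default_factory=list)
    failed_stage: Optional[str] = None

    @property
    def has_result(self) -> bool:
        # A model is available only if every stage finished.
        return self.failed_stage is None and self.completed == STAGES

def run_pipeline(task_id: str, handlers: Dict[str, Callable[[str], None]]) -> TaskRun:
    """Run the stages in order and stop at the first failure."""
    run = TaskRun(task_id)
    for stage in STAGES:
        try:
            handlers[stage](task_id)
        except Exception:
            run.failed_stage = stage  # later stages never start
            break
        run.completed.append(stage)
    return run

def ok(task_id: str) -> None:
    pass

def broken(task_id: str) -> None:
    raise RuntimeError("compression failed")

handlers = {stage: ok for stage in STAGES}
good_run = run_pipeline("survey-001", handlers)
bad_run = run_pipeline("survey-002", {**handlers, "compression": broken})
```

The point of the sketch is the record it leaves behind: for every task you know which stages completed and where it stopped, which is the information a process-level view needs.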
The platform’s architecture is distributed across several components: the frontend where the user works; backend services that coordinate the process; a message queue for asynchronous tasks; data processing services; and a dedicated 3D engine that handles the heaviest part of the work.
That last component is special. The 3D engine doesn’t run continuously. It starts up for a specific task, consumes significant CPU and GPU resources, and shuts down when finished. The system is dynamic: services appear and disappear depending on load.
And this is exactly where classical monitoring simply doesn’t work. If a service disappears — is that normal or a problem? If the 3D engine isn’t responding — is it still processing a task, or has it crashed? If the queue is empty — is that good, or does it mean tasks aren’t reaching the queue at all?
Without the context of the business process, any of these situations can look like either a normal state or a disaster.
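A minimal sketch of how task context resolves these ambiguities: the same component state is judged against what the process expects at that moment. The function and parameter names are hypothetical, and the two rules are simplified examples rather than the platform’s actual logic.

```python
from typing import List

def interpret_state(engine_running: bool, tasks_at_3d_stage: int,
                    queue_depth: int, uploads_awaiting_queue: int) -> List[str]:
    """Return findings that indicate a problem; an empty list means
    the observed state is consistent with the process."""
    findings = []
    # An absent 3D engine is normal when nothing needs it.
    if not engine_running and tasks_at_3d_stage > 0:
        findings.append("3D engine is down while tasks are waiting at the 3D stage")
    # An empty queue is normal unless accepted uploads never reached it.
    if queue_depth == 0 and uploads_awaiting_queue > 0:
        findings.append("queue is empty but accepted uploads were never enqueued")
    return findings

idle = interpret_state(engine_running=False, tasks_at_3d_stage=0,
                       queue_depth=0, uploads_awaiting_queue=0)
crashed = interpret_state(engine_running=False, tasks_at_3d_stage=2,
                          queue_depth=0, uploads_awaiting_queue=0)
lost_uploads = interpret_state(engine_running=True, tasks_at_3d_stage=1,
                               queue_depth=0, uploads_awaiting_queue=3)
```

In the first call the engine is off and the queue is empty, yet nothing is wrong. In the other two, the same raw signals point to a failure once the state of the tasks is taken into account.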
The key decision we made at the outset: shifting the focus from monitoring services to monitoring the process.
We built a unified observability system — part of a broader Resource Management Framework — that reflects not individual components, but the complete lifecycle of every task.
The central dashboard shows in real time: the status of each platform service, application versions, the last-seen timestamp for each component, the state of each processing stage for current tasks, execution time for each operation and deviation from expected norms, and errors and bottlenecks.
At the same time, new services that appear dynamically in the system are automatically registered in monitoring — without manual configuration. The platform can scale, and the observability system scales with it.
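One common way to get automatic registration is to let services announce themselves through heartbeats, so that the first heartbeat from an unknown service creates its entry. The sketch below assumes that pattern; the `ServiceRegistry` class, its method names, and the 60-second staleness threshold are all hypothetical, and the source does not describe how registration is actually implemented.

```python
from typing import Dict, List

class ServiceRegistry:
    """Tracks version and last-seen time per service (illustrative only)."""

    def __init__(self, stale_after_s: float = 60.0):
        self.stale_after_s = stale_after_s  # assumed threshold
        self._services: Dict[str, dict] = {}

    def heartbeat(self, name: str, version: str, now: float) -> bool:
        """Record a heartbeat. Returns True if the service was not known
        before, i.e. it has just been registered without manual setup."""
        is_new = name not in self._services
        self._services[name] = {"version": version, "last_seen": now}
        return is_new

    def last_seen(self, name: str) -> float:
        return self._services[name]["last_seen"]

    def stale(self, now: float) -> List[str]:
        """Names of services that have not reported within the threshold."""
        return sorted(
            name for name, info in self._services.items()
            if now - info["last_seen"] > self.stale_after_s
        )

registry = ServiceRegistry(stale_after_s=60.0)
first = registry.heartbeat("backend", "1.4.2", now=0.0)
second = registry.heartbeat("backend", "1.4.2", now=30.0)
registry.heartbeat("engine-3d", "0.9.0", now=10.0)
stale_at_80 = registry.stale(now=80.0)
```

Note that a stale entry is only a signal, not a verdict: for a component that is expected to shut down between tasks, "not seen recently" still has to be read against the task context.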
We implemented an approach where monitoring, observability, and automation work as a single system.
Observability — collecting and structuring data across the entire chain, from UI to processing. Not just logs from individual services, but a contextual picture: exactly what’s happening with a specific task right now and what stage it’s at.
Monitoring — tracking key parameters tied to business norms. Not just “CPU at 80%,” but “file processing at stage three has been running for 12 minutes against a 5-minute baseline.” Not just “the queue isn’t empty,” but “15 tasks have accumulated in the queue and none have been processed for 3 minutes.”
Alerting — automatic notifications when deviations occur. An important nuance: alerts are integrated with Microsoft Teams, so the team receives them where they already work — not in some separate tool no one opens. The alerts carry context: not just “something went wrong,” but “task ID 4821 is stuck at the compression stage, waiting 8 minutes, expected baseline is 2 minutes.”
Automation: this is probably the most important layer. The system doesn’t just signal; it acts. When there’s a load spike, an optimization script launches automatically. When a service fails, it is restarted. When one node is overloaded, tasks are routed to another. This is what we call a self-healing system: one that doesn’t just detect a problem but resolves it, or minimizes its impact, before a human even notices.
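The last three layers can be sketched together: compare a stage’s running time with its baseline, build an alert that carries context, and pick an automated response. Everything named here is an assumption for illustration. The 2-minute compression baseline echoes the alert example above, the 15-minute figure is a placeholder, and the action names stand in for whatever scripts a real system would run. Delivery to a chat tool is left out on purpose.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed baselines in minutes per stage.
BASELINE_MIN = {"compression": 2.0, "transformation_3d": 15.0}

# Hypothetical mapping from a detected condition to an automated response.
REMEDIATION = {
    "load_spike": "run_optimization_script",
    "service_failed": "restart_service",
    "node_overloaded": "reroute_tasks",
}

@dataclass
class Deviation:
    task_id: int
    stage: str
    elapsed_min: float
    baseline_min: float

def check_stage(task_id: int, stage: str, elapsed_min: float) -> Optional[Deviation]:
    """Monitoring: return a Deviation if the stage exceeds its baseline."""
    baseline = BASELINE_MIN[stage]
    if elapsed_min <= baseline:
        return None
    return Deviation(task_id, stage, elapsed_min, baseline)

def alert_text(d: Deviation) -> str:
    """Alerting: a message that says which task, where, and by how much."""
    return (f"task ID {d.task_id} is stuck at the {d.stage} stage, "
            f"waiting {d.elapsed_min:g} minutes, "
            f"expected baseline is {d.baseline_min:g} minutes")

def choose_action(condition: str) -> str:
    """Automation: known conditions get a scripted response,
    anything else is escalated to people."""
    return REMEDIATION.get(condition, "notify_team")

deviation = check_stage(4821, "compression", 8)
message = alert_text(deviation)
within_norm = check_stage(4822, "compression", 1.5)
```

The fallback in `choose_action` is a deliberate design choice in this sketch: automation handles only the conditions it has an explicit rule for, and everything else goes to the team.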
The results of this approach turned out to be far deeper than simply “fewer outages.”
First — complete process transparency. The entire journey from file upload to finished 3D model is visible in one place. No need to piece together the picture from five different tools and logs.
Second, fast diagnosis. When something goes wrong, the team immediately sees the stage where the problem occurred, how long the deviation has lasted, and which tasks are at risk. Time from “something’s not working” to “here’s what and here’s why” dropped from hours to minutes.
Third, SLA control. It’s now possible not just to promise a client that “processing will take a few minutes,” but to define and enforce specific parameters: compression in under 3 minutes, 3D transformation in under 15 minutes depending on file size. And to react instantly when reality deviates from what was promised.
Fourth — cost optimization. Resource-intensive components — the 3D engine in particular — are now tracked precisely: when they start and how much they consume. This makes it possible to optimize infrastructure costs without sacrificing service quality.
Thanks to all of this, the platform was able to enter international markets, with deployments in Iceland, France, and Turkey. Not because it became “technically more stable” in some abstract sense, but because the team gained a tool that lets them make commitments to clients with confidence and keep them.
I want to be honest: building monitoring the right way isn’t easy. And not because there’s a shortage of tools — there are more than enough. The difficulty comes from somewhere else.
To monitor business logic, you first have to articulate it. Which processes are actually critical? What’s the baseline for each one? What exactly constitutes a “problem” — versus a natural variation that doesn’t require a response?
These aren’t technical questions. They’re questions about understanding your product. And they need to be answered not just by engineers — but together with product managers, business owners, sometimes even customers.
That’s why at Gart Solutions, we start every monitoring project not with choosing tools, but with a workshop: together with the client’s team, we map out the key business scenarios, define baselines, and agree on what counts as a deviation. Only after that does it become clear what exactly to monitor — and then the technical implementation goes much more smoothly.
Monitoring is not a technical function. It’s a management tool.
When built correctly, it gives you a helicopter view: you can see where processes are running smoothly, where there’s tension, where problems are emerging — and you see it before they become noticeable to the customer.
It changes how the team works, too. Instead of “something’s broken, we need to figure it out,” you get “the delay at processing stage three has doubled over the last 20 minutes, here’s the cause, here’s the fix.” Instead of reactive firefighting — proactive service quality management.
The best time to build this system is before the first serious incident. After it, the cost of not having one becomes clear, but the time to prepare has already been lost.
Gart Solutions. We build monitoring, observability, and automation systems for technology companies that are scaling.