Debug consistently: Easy way to instantly detect and correct rare inconsistencies in distributed systems

Key insight: Designing an integrity tool for transaction failures across a distributed system reduces the time it takes to resolve issues, because detection, verification, and correction become tested, repeatable code. Once the tool is built, the challenge is deciding which errors qualify for incorporation into it.

Challenge

As discussed in a previous case study, we built multiple startups rapidly by leveraging reusable components. Because our system was built from reusable service components and the overall design was a distributed system, we encountered the challenges of the Saga pattern, which lacks the ACID guarantees of a typical monolithic system.

This means that a transaction which spans multiple services can sometimes fail before it completes successfully, and software bugs that did not anticipate failures can leave transactions in invalid states. How would you detect these effectively? How can you recover, or revert the transaction to a valid state?

How did we respond to the potential for uncommitted transactions, lack of transaction isolation, and lack of rollback capabilities across multiple independent services?
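
To make this concrete, below is a minimal, hypothetical sketch in Python (the service names, steps, and helpers are illustrative, not our actual code) of a saga whose second step fails after the first step has already committed in another service:

    # Hypothetical saga: each step commits locally in a different service,
    # so there is no global rollback across the system.

    class SagaStepFailed(Exception):
        pass

    def charge_card(order):       # commits in a payments service
        order["charged"] = True

    def reserve_payout(order):    # commits in a separate payout service
        raise ConnectionError("payout service timed out")

    def notify_partner(order):    # emits a third-party webhook
        order["notified"] = True

    def run_remittance_saga(order):
        completed = []
        for name, step in [("charge_card", charge_card),
                           ("reserve_payout", reserve_payout),
                           ("notify_partner", notify_partner)]:
            try:
                step(order)
                completed.append(name)
            except Exception as exc:
                # charge_card has already committed elsewhere. If the
                # compensation path is buggy, or the process dies before
                # it runs, the transaction is stranded in an invalid state.
                raise SagaStepFailed(f"{name} failed after {completed}") from exc

Failing to compensate the already-committed charge is exactly the kind of invalid intermediate state the rest of this post is about.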

What we did

Not all data failures are grave enough to merit being designed into the integrity tool. However, as complexity increases, the strength of our tool's detection, analysis, and correction must increase commensurately.

Post-error detection and investigation – We didn't recognize the data inconsistencies at first; we discovered them later by tracing event logs of both internal and external states (e.g. third-party webhook event messages). After identifying the inconsistencies and creating a timeline of transaction changes across the distributed system, we performed our investigations to understand what could have caused the issues.

Resolution and stability – After identifying the key problem, we put together unit tests that meet the expected failure and success criteria on the failing service. This ensured we could replicate the exact failure scenarios and verify that our software meets the expected behavior.

Preventative maintenance and verification – Finally, to ensure that our distributed systems’ data states are valid, we designed an integrity tool that verifies the state of a distributed transaction. Each verification is an “integrity check” composed of assertions/expectations about a valid transaction state. The integrity tool runs a selected integrity check, returns the number of transactions in an invalid state, and lists each offending transaction. It also details the exact failure scenario (the classification logic is encoded in the tool).
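
As a rough sketch of what one integrity check might look like (the transaction fields and invariants below are hypothetical stand-ins, not our actual schema):

    # A hypothetical integrity check: each assertion encodes an invariant
    # of a valid transaction state, and each violation is classified.

    def check_remittance_integrity(transactions):
        """Return invalid transactions along with their failure classes."""
        invalid = []
        for txn in transactions:
            failures = []
            # Invariant 1: a completed transaction must have a payout reference.
            if txn["status"] == "completed" and not txn.get("payout_ref"):
                failures.append("completed_without_payout")
            # Invariant 2: a refunded transaction must not keep a captured charge.
            if txn["status"] == "refunded" and txn.get("charge_captured"):
                failures.append("refunded_with_captured_charge")
            if failures:
                invalid.append({"id": txn["id"], "failures": failures})
        return invalid

    sample = [
        {"id": 1, "status": "completed", "payout_ref": "P-100"},
        {"id": 2, "status": "completed", "payout_ref": None},
        {"id": 3, "status": "refunded", "charge_captured": True},
    ]
    report = check_remittance_integrity(sample)
    print(f"{len(report)} invalid transaction(s): {report}")
    # -> 2 invalid transaction(s): ids 2 and 3, each with its failure class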

Automatic data correction – Instead of manually tweaking data in a selected environment, the integrity tool was also designed to rapidly apply data corrections to the affected transactions. This let us apply the changes in the dev environment and confirm they were applied successfully before making any permanent changes on a remote server.
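
The correction side might look like the following sketch, using the preview-first workflow described above (the dry_run flag and fields are, again, hypothetical):

    # Hypothetical correction for one failure class: release the captured
    # charge on refunded transactions. A dry run previews the change set,
    # which is how we'd verify the fix in the dev environment first.

    def correct_refunded_with_captured_charge(transactions, dry_run=True):
        corrected = []
        for txn in transactions:
            if txn["status"] == "refunded" and txn.get("charge_captured"):
                corrected.append(txn["id"])
                if not dry_run:
                    txn["charge_captured"] = False  # persist via the real store
        verb = "would correct" if dry_run else "corrected"
        print(f"{verb} {len(corrected)} transaction(s): {corrected}")
        return corrected

    txns = [{"id": 3, "status": "refunded", "charge_captured": True}]
    correct_refunded_with_captured_charge(txns)                  # preview only
    correct_refunded_with_captured_charge(txns, dry_run=False)   # apply for real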

Results

Resolution time for a recurring complex transaction failure has a time complexity of O(n * m), where:

  • n is the frequency of failure
  • m is the total time to handle one occurrence; m = d + v + c
    • d is the average time to detect that the transaction has an issue
    • v is the average time to verify what type of error has occurred; and
    • c is the average time to correct the problem

Introducing an integrity tool for transaction failures that meet pre-defined complexity/risk criteria reduces this to O(n), because m becomes a small constant: the crafted code handles detection, verification, and correction in a tested, repeatable manner.

Reflections on cost

Given the strategy of designing an integrity tool that has the above capabilities, the tradeoff that our engineering leads considered was the following:

How do we effectively evaluate or approximate whether it is worth spending developer time (measured, for example, in sprint points or hours) to achieve greater confidence in our understanding of the system’s data state, to verify and correct the system’s data consistency, and to reduce technical debt?

At the time we were addressing this problem, we roughly estimated the impact of financial and operational risks to the business, considering operational failures, deteriorating user experience, and financial impact such as loss of revenue or costs relating to refunds. See “How we could have improved” for how this analysis could have been better.

Reflections on usage

In distributed systems, engineering leads should distinguish which models qualify for evaluation by the integrity tool. Some data models are more isolated or bound within the context of a single service, while other data models/transactions appear in multiple bounded contexts (i.e. the Saga pattern).


An integrity tool makes more sense for data models tied to the core application logic rather than to auxiliary services. These models carry the most complexity: they occur across multiple services, which introduces more potential coordination errors and network failures. You rarely encounter these issues in a monolithic or isolated service that is kept simple; data models that span services, by contrast, are exposed to inter-service transaction failures and may need inter-service rollback capabilities.

Shortcomings/How we could have improved

The integrity tool is not a silver bullet. It’s post-sickness medicine, not prevention. Arguably, great software would not need an integrity tool, just as a healthy diet and exercise minimize your need for multivitamins, weight-loss pills, and sleeping pills.

Good software/system design, thoughtful error-case handling, and effective software testing are preventative and efficient. Aim for these.

Nonetheless, when I reflect on how we performed and implemented this solution, I would say that there were the following shortcomings:

  • Consistent qualification – We could have been more specific about the criteria for when a transaction failure qualifies to be designed into the integrity tool, so that we have consistency in its usage and design policy.
  • Evidence-based and quantified – At the time, we were only approximating the impact of these failures, but often followed a “pursue 100% correctness” approach. However, persisting in a hard-working push for full correctness can prove to be an expensive and inefficient way for a business to address the problem. A process we could have employed was to create some form of scoring metric for both risk impact and probability.

So how might we be more evidence-based or data-driven in evaluating whether or not to pursue the development of this tool?

How to justify designing and using the tool

Remember the time complexity formula from earlier? Before the integrity tool is designed or implemented, you can manually record the average time for each of the variables above to quantify the cost of investigating each problem by hand. That way, you can calculate the ROI: the cost of implementing one specific integrity tool detector/fix (measured in developer work-hours) versus the time saved (i.e. developer work-hours recovered, translated into savings).

Formula

O(n * m) where m = d + v + c

Example

In the past month, you’ve encountered the transaction failure 15 times (n = 15). On average, it takes your team 4 work-hours to find out that there is an issue (d = 4), 12 work-hours to verify what kind of issue it is, whether old or new (v = 12), and 2 work-hours to fix the problem (c = 2). In this month, your team would have spent (15)(4 + 12 + 2) = 270 work-hours, or 33.75 eight-hour work-days, addressing this problem manually. Each occurrence costs 18 work-hours (2.25 work-days) to investigate, detect, classify, and fix.
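
The arithmetic above is simple enough to script. Here is a minimal sketch, assuming 8-hour work-days as in the example (the build-cost estimate is a hypothetical placeholder):

    # Reproduce the worked example: manual cost of a recurring failure.

    def manual_cost(n, d, v, c, hours_per_day=8):
        """n = occurrences; d, v, c = avg hours to detect, verify, correct."""
        m = d + v + c                    # work-hours per occurrence
        total_hours = n * m              # O(n * m) for the period
        return m, total_hours, total_hours / hours_per_day

    m, hours, days = manual_cost(n=15, d=4, v=12, c=2)
    print(f"{m} h per occurrence; {hours} h total ({days} work-days)")
    # -> 18 h per occurrence; 270 h total (33.75 work-days)

    # Break-even test: building the tool pays off within the period if it
    # costs fewer developer work-hours than the manual process consumes.
    build_cost_hours = 120  # hypothetical estimate from your team
    print("worth building this period:", build_cost_hours < hours)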

Having the integrity tool would mean that if, next month, you wanted to verify whether the selected problem reoccurred, which transactions were affected, and how to fix them, it would take minutes to fix everything, instead of 2.25 work-days per occurrence. If the problem occurred 30 times next month, the tool would detect and flag it in minutes, not 67.5 work-days.

Now as an engineering lead, the question to evaluate is: can your team reliably spend less than 270 work-hours (33.75 work-days) this month to design and build the integrity tool component/fix for this problem?

The probability of a transaction failure can be approximated from the frequency of occurrences over total transactions per period (e.g. an average of 4 transaction failures out of 200 transactions in a week gives a 2% failure rate).

To quantify the risk impact, there are various teams that you can collaborate with:

Marketing and Product – These teams work closely together: they pay attention to what users want and need, to how we introduce and sell our products, and to users’ onboarding and usage experiences. How many of our users fall into the use case of this transaction? Is this a core transaction or not? (e.g. remittance in a remittance platform is a core app feature; remittance cancellation is an outlier, infrequent capability). How much, precisely, are our users affected? (e.g. conversion rates dropped from 30% to 5% due to transaction failures)

Finance – When a transaction fails, what’s the financial impact on our business? Do we have regulatory compliance requirements, or supplier or partner compliance requirements such as Service Level Agreements? What are the financial consequences of not meeting them?

Operations – How do these transaction failures affect our customer support and operations teams? How many work-hours are committed to responding to and fixing these transaction failures operationally? How many operations work-hours are affected, and how many engineering work-hours can we spend to solve the problem?

Example

Marketing: This is a core feature that affects 30% of our users. We have 2,000 active users monthly, which means roughly 600 are affected. Compared with the same period last month, sales dropped 20% (a loss of $240K), from $1.2M to $960K.

Finance: This transaction failure will cause us to incur fines of $250 per failed transaction below $400, and $800 per failed transaction below $1,000.

Operations: Our team is overworked trying to respond to and handle these failures. We’ve identified 240 affected transactions this month. These issues spiked from 2% (16) of our handled tickets per month to 30% (240) this month, preventing us from handling other important issues and cases.

These are example questions that can be used to quantify the cost of these failures, so that the business can evaluate if investing in the design of an “integrity check” within an integrity tool is worth it or not.
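
To fold these inputs into the scoring metric suggested under “How we could have improved”, a sketch like the following could work (the normalization and threshold are illustrative assumptions, not a standard):

    # Hypothetical risk score: failure probability times normalized impact.

    def risk_score(failure_rate, monthly_impact_usd, max_impact_usd=1_000_000):
        """Score in [0, 1]; higher means a stronger case for an integrity check."""
        impact = min(monthly_impact_usd / max_impact_usd, 1.0)
        return failure_rate * impact

    # Inputs drawn from the examples above: a 2% failure rate and roughly
    # $240K/month in lost sales (fines and ops hours could be added in).
    score = risk_score(failure_rate=0.02, monthly_impact_usd=240_000)
    print(f"risk score: {score:.4f}")   # -> 0.0048

    # Qualification policy (illustrative): build the integrity check when
    # the score clears a threshold agreed with finance and operations.
    qualifies = score >= 0.002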
