The cost of data consistency across distributed data stores
What is distributed data?
Distributed data tends to be a consequence of distributed systems. The computing activities a software system needs to meet user requirements sometimes span multiple systems, for reasons ranging from performance and scale to security and reliability.
Backups and redundancy are one such reason: if one of your data stores crashes, your system can keep working by switching to a different one, which helps prevent a single point of failure.
Distributed data can also improve performance in cases where data can be processed in parallel instead of sequentially, one item at a time in a queue.
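To make the failover idea concrete, here is a minimal sketch in Python. The `InMemoryStore` class, the `StoreUnavailable` exception and the `read_with_failover` helper are hypothetical names invented for this illustration; a real system would rely on its database driver's own failover and health-check mechanisms.

```python
class StoreUnavailable(Exception):
    """Raised by a store that cannot serve requests (hypothetical)."""


class InMemoryStore:
    """Toy stand-in for a real data store."""

    def __init__(self, name, data, healthy=True):
        self.name = name
        self.data = dict(data)
        self.healthy = healthy

    def get(self, key):
        if not self.healthy:
            raise StoreUnavailable(self.name)
        return self.data[key]


def read_with_failover(stores, key):
    """Try each store in order and return the first successful read."""
    for store in stores:
        try:
            return store.get(key)
        except StoreUnavailable:
            continue  # this store is down, so try the next replica
    raise StoreUnavailable("all stores are down")


# The primary has crashed, but the replica still holds a copy of the data.
primary = InMemoryStore("primary", {"user:1": "Ada"}, healthy=False)
replica = InMemoryStore("replica", {"user:1": "Ada"})
result = read_with_failover([primary, replica], "user:1")
```

The read only succeeds with the right value if the replica was kept in sync with the primary, which is exactly where consistency starts to matter.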
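As a rough illustration of parallel processing, the sketch below splits a made-up data set into three shards and processes them concurrently with Python's standard `concurrent.futures` module. The shard contents and the `shard_total` function are assumptions for the example. Threads mainly help when the work is I/O-bound, such as querying separate stores; CPU-bound work would usually call for separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_total(shard):
    """Stand-in for work done against one data store (e.g. a query)."""
    return sum(shard)


# Hypothetical data set split across three stores.
shards = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Sequential: each shard waits for the previous one to finish.
sequential_total = sum(shard_total(s) for s in shards)

# Parallel: shards are handled concurrently, then partial results are merged.
with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    parallel_total = sum(pool.map(shard_total, shards))
```

Both approaches produce the same total; the parallel version simply does not make one shard wait for another.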
Why is data consistency important?
When data that represent the same entity are stored in multiple locations, it may be critical to make sure they're all in sync.
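One simple way to check whether two copies are in sync is to compare them record by record. The sketch below assumes each store can be read into a plain Python dictionary; the `billing` and `crm` stores, their contents and the `find_inconsistencies` helper are all made up for illustration.

```python
MISSING = "<missing>"  # label used when a key exists in only one store


def find_inconsistencies(store_a, store_b):
    """Return keys whose values differ or that exist in only one store."""
    diffs = {}
    for key in sorted(store_a.keys() | store_b.keys()):
        a = store_a.get(key, MISSING)
        b = store_b.get(key, MISSING)
        if a != b:
            diffs[key] = (a, b)
    return diffs


# Two hypothetical systems holding email addresses for the same customers.
billing = {"cust-1": "ada@example.com", "cust-2": "bob@example.com"}
crm = {
    "cust-1": "ada@example.com",
    "cust-2": "robert@example.com",
    "cust-3": "eve@example.com",
}

diffs = find_inconsistencies(billing, crm)
```

Here `diffs` flags `cust-2` (conflicting values) and `cust-3` (present in only one store). A full comparison like this gets expensive on large data sets, so real systems often compare checksums or change logs instead.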
Data is the basis for good strategic planning, and inaccurate data can result in misinformed business plans. Especially when aggregating data from various internal or external sources, businesses need to ensure data accuracy so that they can make decisions confidently and effectively and reliably deliver services to their customers.
What could be the cost implications of inconsistency?
Recent Gartner research found that organizations believe poor data quality is responsible for an average of $15 million per year in losses.
Let’s take a bank’s data system as an example. As pointed out in this MarkLogic article, inconsistent information across data silos in an organization leads to transactional risks such as inaccurate or even fraudulent transactions. Fake and fraudulent accounts should be caught early by processes that clean or detect dirty data. When they aren’t, the bank is put at risk, and its reputation is damaged.
Here are some consequences of data inconsistency that may, by and large, lead to financial loss:
1. Inconsistent data can leave room for your system to be exploited by malicious attackers.
2. As cited in the bank example above, inconsistent data can lead to loss of money. This can happen either:
- Directly - for example, an error in financial transactions in a payment system
- Indirectly - for example, costly unproductive team efforts motivated by incorrect assumptions drawn from inconsistent data
3. As a consequence of the above, the integrity of your business or solution can suffer, and you may lose users, including prospects you never got to convert.
This list is not exhaustive, but we can already start to see how these costs could add up to an average financial loss of $15 million per year for organizations.
Distributed data is not uncommon these days; some popular design patterns (e.g. microservices) even necessitate it. Anticipating the related cost is critical to ensuring service delivery and business continuity - both important ingredients of a robust IT infrastructure strategy.