Single source of truth in data

This post is about something I learned in one of my client projects a few years ago.

Wikipedia defines single source of truth as:

… the practice of structuring information models and associated schemata such that every data element is stored exactly once. Any possible linkages to this data element (possibly in other areas of the relational schema or even in distant federated databases) are by reference only.

This definition sounds overly complicated to me. So my definition of single source of truth is:

Anytime you store or retrieve data, it should come from one place and one place only. If you retrieve it twice or more, each instance of the data retrieved should be exactly the same.

Adhering to the single source of truth principle when designing your system seems like common sense that isn’t difficult to follow. Unfortunately (or fortunately, depending on how one might decide to treat this experience), I worked on a project where this seemingly simple rule was violated.

I won’t mention the name of the project due to confidentiality reasons. I also wasn’t part of the initial design of the project, but instead came in once the project was completed and in the maintenance phase.

System Design of the project

The main application was a mobile app for both iOS and Android, with the API provided by the client. The client provided three different APIs that all served the same data but returned different values. This is where the bulk of the problem began.

For example, let’s say that the API returned data on restaurants. You’re making a request for a specific restaurant whose primary key you have, so you send that request to all three endpoints, asking for information on the same restaurant. But for some odd reason, the type of the restaurant (Brazilian, Italian, French, and so on) that you get back from the three different endpoints is different each time. How can this be?
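To make the problem concrete, here is a hypothetical illustration (the restaurant data, endpoint paths, and field names are made up) of what querying the same record against three supposedly equivalent endpoints looked like:

```ruby
# Hypothetical responses for the SAME restaurant from three endpoints
# that were supposed to agree. Paths and values are illustrative only.
responses = {
  "/api/v1/restaurants/42" => { id: 42, name: "Trattoria Roma", cuisine: "Italian" },
  "/api/v2/restaurants/42" => { id: 42, name: "Trattoria Roma", cuisine: "Brazilian" },
  "/api/v3/restaurants/42" => { id: 42, name: "Trattoria Roma", cuisine: "French" }
}

# Collect the distinct values reported for the same field.
cuisines = responses.values.map { |r| r[:cuisine] }.uniq
puts "Conflicting values for the same record: #{cuisines.inspect}" if cuisines.size > 1
```

Three answers to one question, with no way to tell from the outside which one, if any, was correct.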

It turned out that the client’s system was designed more or less like this.

The reason for this was that the client was using the three different APIs in their existing web application and wanted to display different versions of the same data depending on the page the user was viewing. I’m not sure why this was the case, but that’s how it was working live in production and what the client’s stakeholders wanted.

The APIs also had extremely slow response times, which was a real problem for the mobile development team. The solution that was proposed and implemented to deal with both issues (three different endpoints and slow responses) was to develop an intermediary Rails API that would query the three existing APIs at set intervals throughout the day, aggregate the data, and decide which values to store based on specific business rules provided by the client.
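As a minimal sketch of the kind of aggregation involved (this is not the actual project code; the source names and the precedence rule are assumptions for illustration), the intermediary API essentially had to pick a winner per field from conflicting upstream responses:

```ruby
# Assumed precedence: prefer admin_api, then web_api, then legacy_api.
# Real business rules were far more numerous and changed over time.
SOURCE_PRECEDENCE = %i[admin_api web_api legacy_api].freeze

def merge_restaurant(responses)
  # responses: { admin_api: {...}, web_api: {...}, legacy_api: {...} }
  fields = responses.values.compact.flat_map(&:keys).uniq
  fields.each_with_object({}) do |field, merged|
    # Take the value from the highest-precedence source that has one.
    source = SOURCE_PRECEDENCE.find { |s| responses[s]&.key?(field) }
    merged[field] = responses[source][field]
  end
end

merged = merge_restaurant(
  admin_api:  { id: 42, cuisine: "Italian" },
  web_api:    { id: 42, cuisine: "Brazilian", name: "Trattoria" },
  legacy_api: { id: 42, cuisine: "French", phone: "555-0100" }
)
```

Even in this toy form, the merged record is a guess: the precedence order encodes a business rule, and every rule change means touching this logic.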

So the resulting system design ended up something like this.

This sounds “workable” in theory. You introduce another API that acts as the “single source of truth” for the mobile apps, and the response times of this new Rails API were fast. Problem solved, right? Unfortunately, a few problems came along with it.

1 – Complicated business rules

Not only were there way too many business rules to account for, the business rules changed every now and then, which meant the huge amount of business logic in the Rails API had to be changed as well, increasing the chance of introducing bugs. In fact, due to holes in the logic, many duplicate entries of the same data were stored in the Rails API. This resulted in users seeing duplicate data in the mobile apps, and sometimes just seeing values like “null” as the names of records. The only way to alleviate this was to log into the production database console, pick out the incorrect data, and delete it manually.
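The manual cleanup amounted to something like the following sketch (the records, the shared external key, and the “first copy wins” assumption are all illustrative, not the project’s actual data or rules):

```ruby
# Toy dataset: two rows imported for the same upstream restaurant.
restaurants = [
  { id: 1, external_key: "rest-42", name: "Trattoria Roma" },
  { id: 2, external_key: "rest-42", name: "null" },          # bad import
  { id: 3, external_key: "rest-77", name: "Chez Pierre" }
]

# Group rows sharing an external key; flag every copy after the first.
duplicates = restaurants
  .group_by { |r| r[:external_key] }
  .values
  .select { |group| group.size > 1 }
  .flat_map { |group| group.drop(1) }
```

Of course, in practice “keep the first one” was not a safe rule; deciding which copy was the real one was exactly the judgment call that had to be made by hand, record by record.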

2 – Race against time

As mentioned above, this system design produced a lot of duplicate data that had to be manually deleted in production. The worst part was that because the Rails API fetched data from the three APIs at fixed intervals, the duplicate data would be stored again and again throughout the day. The situation went something like this:

  1. Damn it, there’s a duplicate record in the database. Okay, carefully pick out the incorrect one so that I don’t accidentally delete the real data (but again, how do I know which one is real and which is fake?).
  2. Okay, deleted it safe and sound, whew! Ah shit, there’s another one? Okay, delete that too.
  3. Shit, the background scheduler is going to run in 10 minutes, which means more duplicate data. !@#!@#!@#!@#!@#.
  4. Try my best to patch up the business rules (which then changed again and broke more things) and delete duplicate data throughout the day.

Maintaining the system was a race against time, like playing a video game that was impossible to win.

What should have been the ideal solution?

The tech teams on both sides (our company and the client) proposed a solution in which the client would provide a single endpoint from which the Rails API would retrieve its data. This way, the responsibility for keeping the data consistent would rest with the API designed by the client, not with data fetched from three different endpoints and aggregated in the Rails API.

Unfortunately, I left the company before the solution was implemented, so I have no idea whether it succeeded. Looking back on it now, I don’t think keeping the Rails API would have been the correct solution. The ideal solution would have been for the client to provide a proper API with correct data, managed by the client, that the mobile apps would make requests to directly. Having the Rails API in the middle was an additional layer of complexity that was completely unnecessary.

Lessons learned

The lesson learned from this project is an obvious one that I knew beforehand, but the knowledge solidified after the experience. Always, always, ALWAYS (I can’t emphasize this enough) have your data come from a single source. Do not make your source of data a guessing machine (in this project, the Rails API was a giant guessing machine). And if you have to manually go into your production database to delete duplicate data because your system stores duplicates throughout the day, the system needs to be redesigned.