Case: Two Scenarios

How to save a tech demo in 72 hours

Disclaimer: The data used in this case study is based on publicly available sources, and details of the scenario have been adjusted for illustrative purposes.

Technologies used

  • MySQL
  • Python

Background

The client, Two Scenarios, needed to import historical pharmaceutical pricing data into their own database to be used in Tableau.

With the original tech demo postponed due to timeline challenges, we were brought in the following Monday to help complete the project and ensure everything was ready in time for the rescheduled demo by the end of the week. Along the way, we also introduced several optimisations to the codebase to improve maintainability and performance.

Challenges

We encountered the following challenges during the project:

  • Non-UTF-8 encoded data: pandas and other Python libraries assume UTF-8 encoding by default, so importing the data raised errors whenever non-UTF-8 byte sequences were encountered. The previous developer had attempted to fix the encoding problems by manually replacing characters, without success. We had to do some detective work to identify the original encoding and then re-encode the files to UTF-8.
  • Tight coupling in the existing code: The existing codebase was tightly coupled, with FTP file fetching, parsing, and database insertion all happening in the same loop. This made it difficult to separate the concerns and implement the missing functionality.
  • Suboptimal database schema: The original solution exported a CSV file from pandas and inserted it into the database as a single table. This resulted in significant data duplication, as the static information for every company and every product was repeated for every single timestep. Given the volume of data and Tableau's difficulties with large datasets, we needed to significantly reduce the amount of data stored in the database, without losing any information, so that queries in Tableau would remain performant.

Solution

First, we needed to identify the root cause of the non-UTF-8 encoding issue, since it was unsustainable to filter out characters and check whether strings were valid UTF-8 at every step. The three offending byte values, which turned out to represent the letters Æ, Ø and Å, were all invalid in UTF-8. Our first guess was ISO-8859-1, an older encoding commonly used for Danish text, but the byte range in question is also undefined in ISO-8859-1. Some further research led us to CP865, which matched the files one-to-one. We could then read the files as CP865 and write them back out as UTF-8, which resolved the majority of the issues the original developer had run into on this project.
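A minimal sketch of the re-encoding step, assuming the source files are CSVs in a local directory (the paths and file pattern are illustrative, not the client's actual layout):

```python
from pathlib import Path

SOURCE_DIR = Path("data/raw")    # hypothetical location of the CP865 source files
TARGET_DIR = Path("data/utf8")   # hypothetical output location for re-encoded files
TARGET_DIR.mkdir(parents=True, exist_ok=True)

for source in SOURCE_DIR.glob("*.csv"):
    # Decode with CP865; errors="strict" (the default) makes sure we notice
    # any byte that does not fit the assumed encoding instead of hiding it.
    text = source.read_text(encoding="cp865")
    (TARGET_DIR / source.name).write_text(text, encoding="utf-8")
```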

Before doing any more work, we needed to identify which columns were static and which actually changed over time. This revealed that only the pricing and the "packing code" varied between snapshots. It also surfaced a potential data anomaly: if a company changed its name, its older data would no longer link to its newer data. We then designed a new normalised database schema, as shown below:

[Diagram: Two Scenarios data import schema]
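The production schema is not reproduced here, but a simplified sketch of the normalised structure could look like the following, with illustrative table and column names and placeholder connection details (using the mysql-connector-python package):

```python
import mysql.connector

# Illustrative table and column names: the static company and product
# attributes live in their own tables, and only the values that change
# over time (price and packing code) are stored per snapshot.
DDL = [
    """
    CREATE TABLE company (
        company_id INT AUTO_INCREMENT PRIMARY KEY,
        name       VARCHAR(255) NOT NULL
    )
    """,
    """
    CREATE TABLE product (
        product_id INT AUTO_INCREMENT PRIMARY KEY,
        company_id INT NOT NULL,
        name       VARCHAR(255) NOT NULL,
        FOREIGN KEY (company_id) REFERENCES company (company_id)
    )
    """,
    """
    CREATE TABLE price_snapshot (
        product_id   INT NOT NULL,
        valid_from   DATE NOT NULL,
        price        DECIMAL(12, 2) NOT NULL,
        packing_code VARCHAR(32),
        PRIMARY KEY (product_id, valid_from),
        FOREIGN KEY (product_id) REFERENCES product (product_id)
    )
    """,
]

# Placeholder credentials; replace with the real connection details.
conn = mysql.connector.connect(host="localhost", user="demo", password="demo", database="pharma")
cur = conn.cursor()
for statement in DDL:
    cur.execute(statement)
conn.commit()
conn.close()
```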

This change removed the potential for data anomalies and reduced the database size by an order of magnitude. The savings translated directly into shorter load times in Tableau.

Another insight was that every new file from the data source contained a full snapshot of all prices and codes, even for products that had not changed since the previous snapshot. This meant we could get additional savings through deduplication, storing only the rows that differed from the previous snapshot. Combined with the normalised schema, this reduced the total size by about 33x, as can be seen in the graph below:

[Graph: Storage costs]
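The deduplication itself can be sketched in a few lines of pandas, assuming each snapshot is loaded into a DataFrame keyed by a product identifier (the column names here are illustrative):

```python
import pandas as pd

def changed_rows(previous: pd.DataFrame, current: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows in `current` whose price or packing code
    differ from the previous snapshot, plus rows that are entirely new."""
    merged = current.merge(
        previous,
        on="product_id",
        how="left",
        suffixes=("", "_prev"),
    )
    changed = (
        merged["price"].ne(merged["price_prev"])
        | merged["packing_code"].ne(merged["packing_code_prev"])
    )
    return merged.loc[changed, ["product_id", "price", "packing_code"]]

# Tiny example: only product 2 changed, so only that row needs to be stored.
prev = pd.DataFrame({"product_id": [1, 2], "price": [10.0, 20.0], "packing_code": ["A", "B"]})
curr = pd.DataFrame({"product_id": [1, 2], "price": [10.0, 25.0], "packing_code": ["A", "B"]})
print(changed_rows(prev, curr))
```

Only the rows returned for a given snapshot need to be inserted; the unchanged values can be reconstructed from earlier snapshots.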

We then split the tightly coupled logic into a file-fetching service, a database-insertion service, and a parsing module that handles the data format. This made the individual components easier to test, and it also sped up the application itself, since the file fetching could now run concurrently with the database insertion using a producer-consumer pattern.
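The real services are not shown here, but a minimal producer-consumer sketch using only the standard library illustrates the shape of the solution (fetch, parse and insert are stand-ins for the FTP, parsing and database components):

```python
import queue
import threading

# A sentinel object signals the consumer that no more files are coming.
DONE = object()
work_queue = queue.Queue(maxsize=10)

def fetch_files(file_names, fetch):
    """Producer: downloads files (e.g. over FTP) and hands them to the consumer."""
    for name in file_names:
        work_queue.put(fetch(name))   # blocks if the consumer falls behind
    work_queue.put(DONE)

def insert_rows(parse, insert):
    """Consumer: parses each fetched file and writes the result to the database."""
    while True:
        item = work_queue.get()
        if item is DONE:
            break
        insert(parse(item))

# The callables below are stubs so the sketch is runnable on its own.
producer = threading.Thread(target=fetch_files, args=(["a.csv", "b.csv"], lambda n: f"contents of {n}"))
consumer = threading.Thread(target=insert_rows, args=(lambda raw: raw.upper(), print))
producer.start()
consumer.start()
producer.join()
consumer.join()
```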

This also made it possible to cache the immutable source files instead of re-fetching them over and over. Critically, the decoupling let us process the data in batches, from most recent to oldest, so the database could be populated incrementally in the run-up to the tech demo. This mattered because having the most recent data available as soon as possible was worth more than having the complete dataset available later.
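Because the source files are immutable snapshots, the cache can be as simple as a directory of previously fetched copies. A sketch along those lines, with a hypothetical fetch callable standing in for the FTP service above:

```python
from pathlib import Path

CACHE_DIR = Path("cache")   # hypothetical local cache directory
CACHE_DIR.mkdir(exist_ok=True)

def fetch_cached(name: str, fetch) -> bytes:
    """Fetch a file once and reuse the local copy on subsequent runs.
    Safe here because the source files never change after publication."""
    cached = CACHE_DIR / name
    if cached.exists():
        return cached.read_bytes()
    data = fetch(name)          # e.g. the FTP fetch from the producer sketch
    cached.write_bytes(data)
    return data
```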

Everything but the row-deduplication was completed within the first 48 hours of the project.

Conclusion

The project was completed within budget and with enough time left for the client to build the tech demo. Beyond making it possible to finish the demo on time, the refactoring left the codebase significantly more maintainable and versatile. The more efficient schema also allows the client to process more data without overloading Tableau or sacrificing data quality.

Some takeaways

  • Investigate bugs and find the root cause of problems instead of patching them.
  • Identify what assumptions can be made about your data, and use them to your advantage.
  • Separate your concerns to improve readability and maintainability.