On Data Engineering code reviews
6 min read
By Julien Kervizic

Code reviews are essential in Data Engineering. They provide a good foundation for the future, especially when looking at real-time use cases, and a way to avoid regressions now.

Data Engineering code review can be similar to code review for Software Engineering, but since Data Engineering deals with a higher degree of the unknown, the coding style often needs to be more defensive. The focus and priorities are also often quite different from those of software engineering.

The four pillars of data engineering code review

There are four pillars of code review for Data Engineering code: Conformance, Engineering, Logic, and Scoping. The focus differs from traditional engineering in that the data dictates how some of these should behave, and the data therefore needs to be incorporated into each of these aspects.

First Pillar — Conformance (Style & conventions)

The first pillar of code review for data engineers is accessible for feedback from even the most junior engineers. It focuses on code style, consistency, applying the correct conventions, readability of the code, comments, documentation, and appropriate naming.

Code style matters: The importance of having a consistent code style is well understood within software engineering. Data Engineering doesn't differ in that respect, and linter and formatter tooling such as SQLFluff or Black typically helps.
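
As a purely illustrative sketch (the load() helper and its arguments below are invented, not from the article), this is the kind of stylistic diff a formatter such as Black produces, normalizing quoting and spacing so that reviews can focus on substance rather than formatting:

```python
# Hypothetical before/after showing the kind of change a formatter such as
# Black applies; the load() helper and its arguments are placeholders.
import pandas as pd

# Before formatting (as originally written):
# def load(path,columns = ['id','amount']):
#     return pd.read_csv(path,usecols = columns)

# After running `black` over the file (spacing and quoting normalized):
def load(path, columns=["id", "amount"]):
    return pd.read_csv(path, usecols=columns)
```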

The importance of naming in Data Engineering: As in software engineering, naming is important within the code and for the different data assets created as part of the data engineering process.

Contrary to Software Engineering, in Data Engineering the specific names used might need to be typed constantly by the data consumers. It might also be more challenging to change these names. Unlike traditional web APIs, Data Engineering does not have a universal and well-accepted way to deal with the versioning of data structures/assets. Schema registries exist and support versioning, but they are not ubiquitous across the data landscape.

A ubiquitous need for documentation: There is a strong need for documentation in data engineering, whether to explain the specific transformations applied, to explain the handling of data exceptions, or to describe the different data assets created and meant for consumption.

Second Pillar — Engineering

The second pillar of code review for data engineers focuses on the more engineering aspects of the matter. It does not require much understanding of either the data or the business context associated with the code. Therefore, it is a pillar where new starters with a decent level of engineering craft can reasonably start contributing.

There are many aspects to consider within the engineering pillar: performance, testing, code duplication, approaches to dealing with problems, and patterns and methodologies.

Performance: With regard to the engineering aspects, performance is crucial to look at: examining query explain plans, how the implementation works with distributed systems, and whether there are issues in the code such as abusive use of collect() statements in Spark. It also means asking how the code would scale to an increased volume of data, whether the tables created need an index or partition added, and whether there is a data skew that needs to be addressed.
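
As a rough sketch of what a reviewer might look for in practice, the PySpark snippet below applies those ideas; the paths and the events/event_date names are illustrative, not from the article:

```python
# Illustrative PySpark sketch: inspect the plan rather than guessing, and write
# results out partitioned instead of pulling them to the driver with collect().
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("review-example").getOrCreate()
events = spark.read.parquet("/data/events")  # placeholder input path

daily = events.groupBy("event_date").agg(F.count(F.lit(1)).alias("n_events"))

# Reviewers can ask for the physical plan to spot shuffles or skewed joins.
daily.explain()

# Writing partitioned output scales; daily.collect() on a large result would not.
daily.write.mode("overwrite").partitionBy("event_date").parquet("/data/daily_events")
```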

Testing: Data Engineering comes with a complete set of testing strategies, yet testing is often forgotten as part of Data Engineering practice. Unit testing pipelines and data quality testing are vital for data engineers; the latter can be seen as a hybrid form of end-to-end testing and monitoring.
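
A minimal sketch of the two, assuming a hypothetical add_order_value transformation (the function and column names are invented for illustration): a unit test pins down the logic, while an assertion-style check reads like a data quality rule that could also run against production data:

```python
# Hypothetical transformation and tests; runnable with pytest.
import pandas as pd

def add_order_value(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative transformation under test: derive a value per order line."""
    out = df.copy()
    out["order_value"] = out["quantity"] * out["unit_price"]
    return out

def test_add_order_value():
    df = pd.DataFrame({"quantity": [2, 0], "unit_price": [5.0, 3.0]})
    result = add_order_value(df)
    assert list(result["order_value"]) == [10.0, 0.0]

def test_order_value_never_negative():
    # Reads like a data quality rule; the same check can run against real data.
    df = add_order_value(pd.DataFrame({"quantity": [1], "unit_price": [4.0]}))
    assert (df["order_value"] >= 0).all()
```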

Code and data structure duplication: Removing duplicate code and/or abstracting it into reusable structures.
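
For instance (a hypothetical sketch, with invented helper and column names), the same cleanup repeated inline across pipelines can be pulled into one shared function:

```python
# Illustrative refactoring of duplicated cleanup logic into a reusable helper.
import pandas as pd

def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Shared column cleanup applied identically across pipelines."""
    out = df.copy()
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    return out

# Each pipeline calls the helper instead of re-implementing the renaming inline.
orders = standardize_columns(pd.DataFrame({"Order ID": [1], "Unit Price": [4.0]}))
customers = standardize_columns(pd.DataFrame({"Customer ID": [7]}))
```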

Design Patterns & Data Modeling Methodologies: Adherence to design patterns in programming and to data modeling methodologies in table design, and how they are implemented, is essential to review. For Data Engineering table design, the concept of granularity is of particular importance: at what level of granularity is your dataset, do the transformations conform to that level of granularity, and is that grain adequate for what is being modeled? Another critical component is the application of surrogate keys and how they fit into the bigger data lake/warehouse landscape.
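
A small sketch of how both of those concerns can be made explicit in code, assuming an invented order-line table: first assert that the declared grain actually holds, then derive a hash-based surrogate key from the natural key columns:

```python
# Hypothetical grain check and surrogate key derivation; names are illustrative.
import hashlib
import pandas as pd

order_lines = pd.DataFrame(
    {"order_id": [1, 1, 2], "line_number": [1, 2, 1], "amount": [10.0, 5.0, 7.0]}
)

# Declared grain: one row per (order_id, line_number).
grain = ["order_id", "line_number"]
assert not order_lines.duplicated(subset=grain).any(), "declared grain violated"

# Hash-based surrogate key built from the natural key columns.
order_lines["order_line_sk"] = (
    order_lines[grain]
    .astype(str)
    .apply("|".join, axis=1)
    .map(lambda key: hashlib.md5(key.encode()).hexdigest())
)
```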

Alternative Approaches: Is there an alternative approach that might be more suitable? Would the problem be more easily solved by, for instance, using an SCD2 table as a base or an allocated fact table?
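
For reference, a minimal, hypothetical example of the SCD2 shape mentioned above (column names are illustrative): each natural key keeps a history of rows bounded by validity dates, so current and point-in-time views become simple filters rather than separate snapshot tables:

```python
# Illustrative SCD2-style dimension; customer_id/segment/valid_* are invented.
import pandas as pd

customer_scd2 = pd.DataFrame(
    {
        "customer_id": [42, 42],
        "segment": ["new", "loyal"],
        "valid_from": pd.to_datetime(["2023-01-01", "2024-03-01"]),
        "valid_to": pd.to_datetime(["2024-03-01", None]),  # open-ended current row
        "is_current": [False, True],
    }
)

# The current view is a filter on the history rather than a separate table.
current_segments = customer_scd2[customer_scd2["is_current"]]
```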

Omissions: It is often the case that some necessary code change didn't end up in the merge request, either because it wasn't saved or pushed correctly or simply because it wasn't considered during development.

Third Pillar — Logic

Similar to what we have in software engineering, performing a code review for data engineering involves double-checking the logic used.

Does the code do what it intends to do overall, i.e., does it match the requirements? Does the code handle the different edge cases possible? Does the code change the logic of what has been implemented — and what is the downstream impact of such a change?

When looking at the logic aspect of code review for data engineering, there is, however, an added layer of complexity in that the logic being implemented also needs to tie up with the data available, as well as be robust to new incoming data, i.e., how does the code deal with the “unknown.”

In traditional software engineering, a layer of validation typically happens at the API layer, providing direct feedback as to whether a request is considered valid. While similar validation can also occur in Data Engineering, it is usually not addressed in the same way due to the nature of the use cases and a preference for different tradeoffs.

In a reporting use case, for instance, rejecting a given record means we are trading off completeness for the accuracy of individual records — this is usually not the case in a product application, as the application could be considered the system of record and the source of truth. On the application side, other tradeoffs would exist, some of which weigh a monetary impact against the accuracy of these records.
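
One common defensive pattern for that tradeoff, sketched here with invented column names, is to quarantine records that fail validation rather than silently dropping them, which keeps reporting complete while leaving the exceptions visible for follow-up:

```python
# Hypothetical validation that routes bad records to a quarantine set instead
# of rejecting them outright; order_id/amount are placeholder columns.
import pandas as pd

records = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, -5.0, None]})

valid_mask = records["amount"].notna() & (records["amount"] >= 0)
clean = records[valid_mask]          # flows into the reporting tables
quarantined = records[~valid_mask]   # kept aside for review and reconciliation
```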

Fourth Pillar — Requirements & Scoping

Something to look for when reviewing data engineering code is requirements and scoping.

Some of the things to pay particular attention to: are the requirements used to complete the merge request complete, or is there something that appears to be missing? While the same is true for Software Engineering, this aspect is of particular importance for Data Engineering in that the requirements need to be correlated to the data. Exceptions could exist in the data that fundamentally change certain things, and these should be addressed. Think, for example, of some data having a different granularity than expected for some records. Not taking into account how these records should be treated could create JOIN explosions downstream, impacting numerous datasets.
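
As a small, hypothetical illustration of that JOIN explosion risk (table and column names are invented), pandas' validate argument is one way to fail loudly when a lookup table is not at the grain the join assumes:

```python
# Illustrative guard against a join key that is not unique on the lookup side.
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2], "amount": [10.0, 20.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "segment": ["new", "loyal"]})

# validate="many_to_one" makes pandas raise MergeError if `customers` ever
# stops being unique on customer_id, instead of silently multiplying rows.
enriched = orders.merge(customers, on="customer_id", how="left", validate="many_to_one")
```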

It is also essential to look at the potential extensibility of the code or data structures for new requirements. Anticipating future requirements is vital to minimize the need for extensive backfills. This also applies to the placement of business logic. Requirements can change, and the business logic would need to be updated — placing the business logic as far downstream as possible limits the number of data structures that would need to be backfilled and the tables/datasets that would need to be updated.

Doing Data Engineering code reviews

A data engineering code review is more involved than just looking at the code; it requires reading the requirements, running the code, looking at the input and output data created, and seeing if there are any gaps within.

Reading the code provides one layer of value. It allows the reviewer to give feedback on the overall approach taken, to check whether the code matches the code style and approach agreed within the team, to identify some logic mistakes, or to propose alternative methods that could end up being more efficient.

Reading and analyzing the requirements provides another layer of safety, helping ensure the code will be able to handle the necessary use cases and be more robust to future changes.

Running the code is another layer. It allows the reviewer first to see whether the code can be run in a different environment, whether the code is well documented or there are setup steps missing, and to identify gaps such as files not being correctly committed.

Last but not least is the importance of looking at the data, both inputs and outputs. Compared to Software Engineering, there is less opportunity to control inputs in Data Engineering. Data Engineers must pay particular attention to understanding the incoming data in the pipelines and how it is transformed. Doing data profiling or reconciliations is quite typical as part of data engineering code reviews.
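
As a rough sketch of what that can look like (with small in-memory stand-ins for the pipeline's real input and output tables), a reconciliation during review can be as simple as comparing row counts and a key measure, plus a quick profile of the incoming data:

```python
# Illustrative reconciliation between a pipeline's input and output; in a real
# review these frames would be read from the source system and the built table.
import pandas as pd

source = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 5.0]})
target = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]})

print("row count delta:", len(source) - len(target))
print("amount delta:", source["amount"].sum() - target["amount"].sum())

# A quick profile of the input highlights unexpected values feeding the change.
print(source["amount"].describe())
```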
