Python’s Data Classes: a Data Engineer’s Best Friend
By Julien Kervizic

Data classes are a relatively new addition to Python, first released in Python 3.7. They provide an abstraction layer that leverages type annotations to define container objects for data. Compared to a normal Python class, data classes add some syntactic sugar for instantiation, and there are a number of areas where data classes can add value to data engineering.

Understanding Data Classes

Data classes

The dataclasses module introduces a lightweight way to define objects, generating methods such as __init__, __repr__, and __eq__ for the fields defined within them.

It relies on the decorator pattern: the @dataclass decorator wraps a class and enriches it with specific features.
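To make the decorator's effect concrete, here is a minimal sketch comparing a hand-written class with a data class that behaves the same way. The class names ManualCustomer and Customer are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


class ManualCustomer:
    """Hand-written container: every method is spelled out."""

    def __init__(self, customer_id: int):
        self.customer_id = customer_id

    def __repr__(self):
        return f"ManualCustomer(customer_id={self.customer_id!r})"

    def __eq__(self, other):
        if other.__class__ is not self.__class__:
            return NotImplemented
        return self.customer_id == other.customer_id


@dataclass
class Customer:
    """Same behaviour: the decorator generates __init__, __repr__ and __eq__."""

    customer_id: int


print(Customer(customer_id=220))                               # Customer(customer_id=220)
print(Customer(customer_id=220) == Customer(customer_id=220))  # True
```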

Data class and field definitions

The data class leverages a series of fields defined within the class, along with their Python type annotations.

The class can then be instantiated by providing customer_id as a constructor argument.
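A minimal sketch of such a definition and its instantiation follows. The class name SampleClass is an assumption made for illustration; only the customer_id field comes from the text.

```python
from dataclasses import dataclass


@dataclass
class SampleClass:
    customer_id: int  # a field is an annotated class-level name


# The generated __init__ accepts the field by keyword or by position.
by_keyword = SampleClass(customer_id=220)
by_position = SampleClass(220)

# Omitting a field that has no default raises a TypeError.
try:
    SampleClass()
    missing_field_error = None
except TypeError as exc:
    missing_field_error = str(exc)
```

Note that the annotation is not enforced at runtime: SampleClass(customer_id="220") would be accepted as well.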

Each field defined within the data class becomes a regular instance attribute, so the data can be read and assigned directly with dot notation (unless the data class is declared with frozen=True, which blocks assignment).
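Reading and writing fields can be sketched as follows. SampleClass and FrozenSample are hypothetical names; the frozen variant is included only to show how assignment can be disabled.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass
class SampleClass:
    customer_id: int


@dataclass(frozen=True)
class FrozenSample:
    customer_id: int


sample = SampleClass(customer_id=220)
current_id = sample.customer_id  # plain attribute read
sample.customer_id = 221         # plain attribute assignment

frozen = FrozenSample(customer_id=220)
try:
    frozen.customer_id = 221     # rejected on a frozen data class
    assignment_blocked = False
except FrozenInstanceError:
    assignment_blocked = True
```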

The list of fields defined within the data class can be extracted from its __annotations__ attribute.

The __annotations__ attribute provides the raw annotations. There are, however, cleaner ways to resolve the field types within a data class, such as dataclasses.fields() and typing.get_type_hints().
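The difference between raw and resolved annotations can be sketched as below. The class SampleClass and its email field are assumptions for illustration; customer_id is deliberately annotated with a string to show what "raw" means.

```python
from dataclasses import dataclass, fields
from typing import Optional, get_type_hints


@dataclass
class SampleClass:
    customer_id: "int"            # string annotation, stored as written
    email: Optional[str] = None


# Raw annotations: the string "int" is returned unchanged.
raw = SampleClass.__annotations__

# dataclasses.fields() returns Field objects carrying name, type, default, metadata.
field_names = [f.name for f in fields(SampleClass)]

# typing.get_type_hints() evaluates string annotations into real types.
resolved = get_type_hints(SampleClass)
```

Here raw["customer_id"] is the string "int", whereas resolved["customer_id"] is the int type itself.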

Data class as definition objects:

Data classes can also serve as definition objects. The init argument of the decorator determines whether an __init__ method is generated for the data class. To use the data class in a full definition mode, it is also necessary to disable repr, since the field values are by default output as part of the instance's string representation.
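A minimal sketch of a data class used purely as a definition object, assuming a hypothetical SampleDefinition class with two fields:

```python
from dataclasses import dataclass, fields


@dataclass(init=False, repr=False)
class SampleDefinition:
    customer_id: int
    email: str


# The field definitions remain available without any instance data.
schema = {f.name: f.type for f in fields(SampleDefinition)}

# No __init__ is generated, so the class is instantiated without arguments
# and the instance carries no field values.
definition = SampleDefinition()
```

With repr left enabled, printing such an instance would fail with an AttributeError, because the generated __repr__ reads field values that were never set.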

Data class and meta fields:

Data classes can leverage extra properties defined through field() to add features such as default values, default factories, or, more importantly when using data classes as definition objects, metadata. The information defined within a field's metadata can be retrieved from the class through dataclasses.fields().
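The sketch below shows defaults, a default factory, and metadata side by side. The field names other than customer_id, and the "description" metadata key, are assumptions chosen for illustration; the metadata mapping is free-form and not interpreted by the dataclasses module itself.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class SampleClass:
    customer_id: int = field(metadata={"description": "Unique customer identifier"})
    country: str = field(default="NL", metadata={"description": "ISO country code"})
    tags: List[str] = field(default_factory=list)  # a fresh list per instance


# Metadata is read from the Field objects, not from instances.
descriptions = {f.name: f.metadata.get("description") for f in fields(SampleClass)}

first = SampleClass(customer_id=220)
second = SampleClass(customer_id=221)
first.tags.append("vip")  # does not affect second.tags
```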

Leveraging Data classes for Data Applications

Type validation:

We can use data classes to implement type validation. A specific library, dataclass-type-validator, exists to support this use case.
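Without relying on that library, the underlying idea can be sketched with a hand-rolled check in __post_init__. This is not the dataclass-type-validator API. It is a simplified stand-in that only handles annotations that are plain classes (no Optional, List[...], and so on), and the SampleClass fields are assumptions.

```python
from dataclasses import dataclass, fields


@dataclass
class SampleClass:
    customer_id: int
    email: str

    def __post_init__(self):
        # Called by the generated __init__ once all fields are set.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} must be {f.type.__name__}, "
                    f"got {type(value).__name__}"
                )


valid = SampleClass(customer_id=220, email="jane@example.com")

try:
    SampleClass(customer_id="220", email="jane@example.com")
    validation_error = None
except TypeError as exc:
    validation_error = str(exc)
```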

We can also leverage data classes to validate data we would like to ingest, for example after having loaded it into a data frame.
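One way to sketch this is to validate row dictionaries against the data class, for example the output of DataFrame.to_dict("records") in pandas. The helper validate_records and the sample rows are hypothetical, and the check again assumes plain class annotations. Errors are collected rather than raised, so a whole batch can be reported at once.

```python
from dataclasses import dataclass, fields


@dataclass
class SampleClass:
    customer_id: int
    email: str


def validate_records(records, schema):
    """Return a list of (row_index, column, problem) tuples."""
    expected = {f.name: f.type for f in fields(schema)}
    errors = []
    for index, record in enumerate(records):
        for name, expected_type in expected.items():
            if name not in record:
                errors.append((index, name, "missing"))
            elif not isinstance(record[name], expected_type):
                found = type(record[name]).__name__
                errors.append(
                    (index, name, f"expected {expected_type.__name__}, got {found}")
                )
    return errors


# Rows as they might come out of df.to_dict("records").
rows = [
    {"customer_id": 220, "email": "a@example.com"},
    {"customer_id": "221", "email": "b@example.com"},
    {"customer_id": 222},
]
problems = validate_records(rows, SampleClass)
```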

dtypes specifications:

Sometimes it is important not only to leverage the native type annotations but also to enrich them with specific dtypes, for instance when the data is loaded into a pandas DataFrame. The pandas read_csv function lets us provide a dictionary of {"column_name": "column_dtype"} through its dtype parameter when reading a file into a data frame. These dtypes can be derived from a data class when they are specified, for example, in a metadata property.
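A minimal sketch of deriving such a dictionary, assuming the dtype is stored under a hypothetical "dtype" metadata key and that the columns are named after the fields. The helper name dtypes_from and the file name in the comment are illustrative only.

```python
from dataclasses import dataclass, field, fields


@dataclass
class SampleClass:
    customer_id: int = field(metadata={"dtype": "Int64"})    # nullable integer
    country: str = field(metadata={"dtype": "category"})
    amount: float = field(metadata={"dtype": "float32"})
    comment: str = field(default="")                          # no dtype: left to pandas


def dtypes_from(schema):
    """Build a {column_name: dtype} mapping from field metadata."""
    return {
        f.name: f.metadata["dtype"]
        for f in fields(schema)
        if "dtype" in f.metadata
    }


column_dtypes = dtypes_from(SampleClass)

# With pandas installed, the mapping can be passed straight to read_csv:
# df = pd.read_csv("customers.csv", dtype=column_dtypes)
```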

This can be used to get the right dtypes, for instance when needing to specify nullable or non-nullable integer values, and it can be more memory efficient than the automatic type inference of pandas.

SQLAlchemy Models & DDL:

It is also possible to generate a SQLAlchemy model dynamically out of a data class.

The model generated is a pure SQLAlchemy model and can be instantiated like a normal model: SampleModel(customer_id=220). This model can also be used through Alembic to generate schema migrations.
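Independently of SQLAlchemy, the idea of deriving a table definition from a data class can be sketched with the standard library alone. The example below does not build a SQLAlchemy model. It swaps in plain SQL strings and sqlite3 to show how field types and metadata can drive DDL. The type mapping, the "primary_key" metadata key, and the table name are all assumptions.

```python
import sqlite3
from dataclasses import dataclass, field, fields

# Hypothetical mapping from Python annotations to SQLite column types.
SQL_TYPES = {int: "INTEGER", str: "TEXT", float: "REAL"}


@dataclass
class SampleClass:
    customer_id: int = field(metadata={"primary_key": True})
    email: str = field(default="")


def create_table_ddl(schema, table_name):
    """Render a CREATE TABLE statement from the data class fields."""
    columns = []
    for f in fields(schema):
        column = f"{f.name} {SQL_TYPES[f.type]}"
        if f.metadata.get("primary_key"):
            column += " PRIMARY KEY"
        columns.append(column)
    return f"CREATE TABLE {table_name} ({', '.join(columns)})"


ddl = create_table_ddl(SampleClass, "sample")

# Execute the DDL against an in-memory database to confirm it is valid.
connection = sqlite3.connect(":memory:")
connection.execute(ddl)
column_names = [row[1] for row in connection.execute("PRAGMA table_info(sample)")]
connection.close()
```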

The Pydantic framework provides a more direct way to leverage the type annotations to generate similar models: a decorator from the library is sufficient to generate the model.

Another use of the SQLAlchemy annotations in the data class is to leverage them to write a pandas data frame to a table with specific types. This can be done with the pandas to_sql method and its dtype parameter.

Protobuf:

A specific library called pure-protobuf allows translating data classes into Protobuf. Protobuf (Protocol Buffers) is a serialization format that facilitates data exchange across applications and programming languages.

Simple ETL

Simple ETL processes can also be defined and described as part of a data class. Operations such as type casting and renaming, among others, can be declared on the fields of the data class.
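As a minimal sketch of this idea, renaming and casting rules can be stored in field metadata and applied by a small generic function. The metadata keys "source" and "cast", the transform helper, and the raw column names are all hypothetical.

```python
from dataclasses import dataclass, field, fields


@dataclass
class SampleClass:
    # "source" is the column name in the raw data; "cast" converts the raw value.
    customer_id: int = field(metadata={"source": "Customer ID", "cast": int})
    email: str = field(metadata={"source": "E-mail", "cast": str.lower})


def transform(raw, schema):
    """Rename and cast a raw record according to the schema's field metadata."""
    kwargs = {}
    for f in fields(schema):
        source = f.metadata.get("source", f.name)   # default: same name
        cast = f.metadata.get("cast", lambda v: v)  # default: no conversion
        kwargs[f.name] = cast(raw[source])
    return schema(**kwargs)


raw_record = {"Customer ID": "220", "E-mail": "Jane@Example.com"}
clean_record = transform(raw_record, SampleClass)
```

Here clean_record holds customer_id as the integer 220 and email in lower case, so the data class doubles as documentation of the transformation.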

API

With FastAPI and Pydantic, it is possible to leverage data classes to build APIs in a streamlined fashion.

Summary

Data classes provide a versatile abstraction for dealing with data schemas and their downstream transformations. Through adapters, it is possible to leverage them for schema validation, DDL, APIs, or message passing. They should form part of the Swiss Army knife that data engineers working in Python rely on.

© 2024 WiseAnalytics