With increasing digitalization and the data use cases stemming from it, the field of data engineering is in high demand. Yet more often than not, hiring managers and companies don't fully grasp the nuances of the field. There are many different data engineer archetypes, and while true generalists exist, a Data Engineer will typically have particular expertise in and affinity towards one area of Data Engineering.
The data warehouse archetype covers Data Engineers who primarily deal with databases; their focus tends to be on data integration and data modeling.
They primarily work with RDBMSs such as MS SQL Server, Oracle, or Postgres; know the ins and outs of ACID properties, transactions, and data modeling methodologies such as Kimball or Inmon; and optimize queries by reading explain plans, applying indices, partitioning tables, etc. This archetype sometimes gets involved in database administration tasks, including user provisioning, backup, recovery, migration, etc.
Data Engineers of this archetype typically work with tools such as SQL Server Integration Services (SSIS), procedural SQL, … although some companies are progressively migrating towards a modern data stack, including tools such as dbt for data modeling purposes. With the move towards the modern data stack, this archetype is being pushed in the direction of analytics engineering.
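As a concrete flavour of the query-tuning side of this archetype, here is a minimal sketch in Python using psycopg2 against a hypothetical Postgres sales table: read the explain plan first, then add an index if the plan reveals a full scan. The connection string, table, and column names are illustrative assumptions.

```python
# Minimal query-tuning sketch: inspect the plan, then add an index if needed.
# The DSN, table, and column names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=etl_user")  # hypothetical DSN
with conn, conn.cursor() as cur:
    # Read the plan to see whether the filter triggers a sequential scan.
    cur.execute("EXPLAIN ANALYZE SELECT * FROM sales WHERE customer_id = %s", (42,))
    for (plan_line,) in cur.fetchall():
        print(plan_line)

    # If the plan shows a full table scan, an index on the filtered column
    # is a typical first remediation.
    cur.execute("CREATE INDEX IF NOT EXISTS idx_sales_customer ON sales (customer_id)")
conn.close()
```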
The Data Integration archetype typically works on bringing data onto a data platform (ETL/ELT) or moving data out of a data platform (reverse ETL).
Engineers fitting this archetype typically use frameworks and technologies such as Singer taps and targets, orchestration tools such as Airflow, Azure Data Factory, or Logic Apps, CDC tooling such as Debezium or AWS Database Migration Service, and reverse ETL tooling such as RudderStack.
Some of the work typically done by these engineers involves calling APIs to source or push data, creating FTP feeds, or setting up data crawlers. They may also be quite familiar with integrating the typical file formats used to exchange data, or the specialized formats used in medical data exchange or financial flows.
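To make the API-sourcing work more tangible, here is a minimal sketch of a single extract-and-land step, assuming a hypothetical REST endpoint and S3 landing bucket; a production pipeline would add pagination, retries, incremental bookmarks, and schema validation.

```python
# Minimal API-to-landing-zone ingestion sketch; endpoint, token, and bucket
# names are illustrative assumptions.
import datetime
import json

import boto3
import requests

API_URL = "https://api.example.com/v1/orders"  # hypothetical source endpoint
BUCKET = "raw-data-landing-zone"               # hypothetical landing bucket

def extract_and_land(api_token: str) -> str:
    """Pull one batch of records from the API and land it as JSON in S3."""
    response = requests.get(
        API_URL, headers={"Authorization": f"Bearer {api_token}"}, timeout=30
    )
    response.raise_for_status()
    records = response.json()

    key = f"orders/ingest_date={datetime.date.today():%Y-%m-%d}/batch.json"
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=json.dumps(records))
    return key
```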
It is becoming increasingly important to process data in (near) real time, with use cases ranging from the most basic, offering live dashboards to customers, to providing personalized recommendations or setting up complex alerting and processing rules.
This gave rise to a specific archetype of Data Engineering focused on real-time processing. These engineers leverage processing frameworks such as Apache Spark, Storm, Flink, or Beam, usually coupled with message broker technologies such as Kafka, Pulsar, Kinesis, or the more traditional RabbitMQ.
On the datastore side, they understand how to leverage NoSQL datastores, search engines, caching technologies such as Memcached or Redis, in-memory data grids such as Apache Ignite, and time-series databases such as Druid.
On the modeling side, they understand the specificities of dealing with real-time datasets: window types (tumbling, hopping, sliding, session), streaming join types (stream-stream, stream-table, …), and architectural patterns suited to real-time processing such as Kappa and Lambda.
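As an illustration of tumbling windows and late-data handling, here is a minimal PySpark Structured Streaming sketch reading from Kafka; the broker address, topic, and event schema are illustrative assumptions, and the Spark/Kafka connector package must be available on the cluster.

```python
# Minimal tumbling-window aggregation over a Kafka stream with PySpark
# Structured Streaming; broker, topic, and field names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("clickstream-windows").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "clickstream")                # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Tumbling 5-minute windows; the watermark bounds how late out-of-order
# events may arrive before a window is finalized.
counts = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "5 minutes"), col("user_id"))
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```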
Data Engineers working with real-time data typically have strong software engineering skills; these are required to handle the complexity of real-time data, both in terms of performance and the specific challenges that come along with it (e.g., out-of-sequence processing), as well as the challenges tied to operating on a live production system.
The rise of technologies such as IoT and sensors is creating an increasing need for this type of expertise in the market.
Data Engineers fitting within this archetype are often referred to as machine learning engineers. Their focus is on the productionization of Machine Learning or Data Science use cases.
Machine Learning engineers typically spend their time productionizing features, integrating them into a feature store, automating model training, working on any of the different means of serving predictions, and on overall model maintenance and performance monitoring.
They typically leverage technologies such as AWS SageMaker, Kubeflow, and MLflow to automate some of the different steps of data transformation, model training, and inference. Depending on the data size and the complexity of the model, they might leverage libraries such as Pandas, Spark, or Keras/TensorFlow.
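Here is a minimal sketch of what "automating the model training" can look like, using scikit-learn for the model and MLflow for tracking; the feature file, target column, and experiment name are illustrative assumptions.

```python
# Minimal training-and-tracking sketch; dataset path, target column, and
# experiment name are hypothetical.
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_parquet("features/churn_features.parquet")  # hypothetical feature table
X, y = df.drop(columns=["churned"]), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

mlflow.set_experiment("churn-prediction")
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("test_auc", auc)
    # The logged model can later be promoted to a serving endpoint or batch job.
    mlflow.sklearn.log_model(model, "model")
```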
With increasing digitalization, more and more opportunities for automated decision-making are driving high demand in this area. It is an area closely coupled with revenue and profit growth: enriching raw data with insights typically improves both topline and bottom-line performance for businesses.
Typical use cases implemented by Machine Learning engineers range from recommendations and propensity modeling, such as churn prediction, to dynamic pricing…
With the increasing amount of data being collected, different methods are needed to manage and process large datasets. This gave rise to specific technologies such as MapReduce, Hive, MPP engines (Presto, Snowflake, Redshift), or Spark.
Data Engineers falling within this archetype need a thorough understanding of distributed computing: the different job stages of their specific tools, the underlying technology landscape and its components such as ZooKeeper, Hive metastores, …, and how to optimize execution when challenged by topics such as data skew. These data engineers also need to be aware of the different file formats used for Big Data (Parquet, ORC, Avro) and their recent evolutions (Delta Lake, Iceberg, Hudi).
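As an example of tackling data skew, here is a minimal PySpark sketch of the common key-salting technique for a join against a hot key; the input paths, the country key, and the number of salt buckets are illustrative assumptions.

```python
# Minimal key-salting sketch for a skewed join; paths, keys, and bucket
# counts are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, explode, floor, lit, rand, sequence

spark = SparkSession.builder.appName("skew-salting").getOrCreate()
SALT_BUCKETS = 16

events = spark.read.parquet("s3://lake/events/")        # large, skewed side
countries = spark.read.parquet("s3://lake/countries/")  # small dimension side

# Spread the hot key over N salted variants on the large side...
salted_events = events.withColumn(
    "salted_key",
    concat_ws("_", col("country"), floor(rand() * SALT_BUCKETS).cast("string")),
)

# ...and replicate every dimension row N times so each salted variant still matches.
salted_countries = (
    countries.withColumn("salt", explode(sequence(lit(0), lit(SALT_BUCKETS - 1))))
    .withColumn("salted_key", concat_ws("_", col("country"), col("salt").cast("string")))
)

joined = salted_events.join(salted_countries, on="salted_key", how="inner")
```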
Dealing with big data requires data engineers to think about concepts specific to distributed systems and about computer science problems, such as probabilistic data structures, that are not often encountered in traditional software or data engineering roles.
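One such probabilistic data structure is the Bloom filter, which answers "have we seen this key before?" in constant memory at the cost of occasional false positives. Below is a minimal, illustrative Python sketch (sizes and hash choices are arbitrary); in practice engineers typically reach for a library or an engine built-in instead.

```python
# Minimal Bloom filter sketch: membership testing with false positives
# but no false negatives. Sizes and hashing are illustrative choices.
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        # Derive several bit positions per key by seeding the hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely not seen"; True means "probably seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user-123")
assert bf.might_contain("user-123")
```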
Cloud infrastructure is becoming increasingly relevant for Data Engineers. With the move away from on-premises Hadoop cluster solutions like Hortonworks, Cloudera, and MapR, and onto cloud PaaS solutions such as AWS, Azure, or GCP, comes an increased ability for Data Engineers to self-serve and orchestrate their own tooling and processing clusters.
Particularly in smaller companies, Data Engineers often have to manage infrastructure themselves, from working with serverless applications, to handling EMR or Spark clusters, to setting up data or Machine Learning tooling such as Airflow, Kubeflow, or dbt. In specific situations, Data Engineers might need to set up and maintain their own distributed datastore clusters or set up message brokers to transfer data between different applications.
The technology mastered by these data engineers typically centers on infrastructure as code, with knowledge of CloudFormation, ARM templates, Terraform, Ansible, or Helm. They might have specific experience setting up, configuring, and maintaining cloud services.
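The tools listed above are mostly declarative; as a Python-flavoured stand-in for them, here is a minimal AWS CDK sketch that synthesizes a CloudFormation template for a raw-data landing bucket. The stack and bucket names are illustrative assumptions.

```python
# Minimal infrastructure-as-code sketch with the AWS CDK (v2): one stack
# containing a versioned S3 landing bucket. Names are hypothetical.
from aws_cdk import App, Stack, aws_s3 as s3
from constructs import Construct

class DataLakeStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # A versioned bucket for raw landed data.
        s3.Bucket(self, "RawLandingBucket", versioned=True)

app = App()
DataLakeStack(app, "data-lake-dev")
app.synth()  # emits the CloudFormation template; deploy with `cdk deploy`
```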
This archetype is involved in building the tooling and frameworks necessary to process and expose data. These engineers build custom software, for instance, to run A/B testing, to run simulations, or to build microservice APIs that surface data. They might contribute to larger companies' internal data infrastructure tools or to open-source tools to bring about particular improvements.
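As a small illustration of a microservice API surfacing data, here is a minimal FastAPI sketch; the SQLite file and customer_metrics table are stand-ins for a real warehouse connection, and a production service would add pooling, authentication, and observability.

```python
# Minimal data-serving microservice sketch; the database file, table, and
# columns are hypothetical stand-ins for a real warehouse.
import sqlite3

from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.get("/metrics/{customer_id}")
def read_metrics(customer_id: int) -> dict:
    conn = sqlite3.connect("analytics.db")  # stand-in for a real warehouse connection
    row = conn.execute(
        "SELECT customer_id, lifetime_value FROM customer_metrics WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    conn.close()
    if row is None:
        raise HTTPException(status_code=404, detail="customer not found")
    return {"customer_id": row[0], "lifetime_value": row[1]}
```

Saved as, say, metrics_service.py, it could be served locally with `uvicorn metrics_service:app`.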
The Data Engineers fitting within this archetype typically bring a solid command of their respective programming language, coupled with a mastery of software engineering principles such as SOLID, design patterns, …