Data engineers need to understand the different file formats they read from and write to, as well as the limitations and pitfalls of each. They need to be able to judge which format is most suitable for the task at hand and know how to interact with each of these file formats.
When ingesting data, data engineers tend to deal with a handful of typical file formats: CSV/TSV, JSON, XML, and XLSX. Other formats such as Parquet, Avro, or ORC are also widely used, but they are more typically found in a data lake setting than for the exchange of data; these will be the subject of a future post.
CSV is a text format, with fields typically separated by a comma ',' (although other separators are possible) and rows separated by a newline '\n' (typically). TSV is a similar format, with fields separated by a tab '\t'.
The CSV format allows optional quoting of fields, with a customizable quote character.
CSV is one of the most ubiquitous file formats, and it is supported by a large number of applications and connectors.
Another advantage of CSVs is that they can be read iteratively, which keeps memory consumption under control. They also have no inherent limits on file size or number of records, although software consuming the data might impose restrictions.
The CSV format is uncompressed, resulting in larger file sizes. Large CSV files are therefore often compressed further (typically with zip or gzip) to mitigate this issue. Pandas' read_csv function supports both compressed and uncompressed CSV files.
CSV contains only 'text' data and doesn't support binary data natively, although base64 encoding is often used as a workaround to include this type of information. Because it is a text format with special characters such as the separator and delimiter, any entry has the potential to 'break' the file, especially when text fields are unquoted. Particular attention needs to be paid to sanitizing fields to minimize this risk.
Another downside of CSV/TSV is poor support for structured metadata. Header rows can provide some metadata, but there is no commonly agreed format or set of pre-defined attributes for them. Even basic metadata such as column names needs to be handled explicitly when the file is processed.
There are multiple ways to deal with CSVs in data engineering: loading the data directly into a database, leveraging query engines to scan collections of CSV files, processing them with pandas, or leveraging connectors for ingestion.
Databases tend to natively support loading data from CSVs; Postgres, for instance, provides the COPY command for this purpose.
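Below is a minimal sketch of what this can look like from Python with psycopg2; the connection string, the table name "policies", and the file name are placeholders, and the COPY options assume a comma-separated file with a header row.

```python
# Bulk-load a local CSV file into a Postgres table using psycopg2.
# The table "policies" is assumed to already exist with matching columns.
import psycopg2

conn = psycopg2.connect("dbname=analytics user=etl")
with conn, conn.cursor() as cur, open("policies.csv") as f:
    # COPY ... FROM STDIN streams the local file through the client connection
    cur.copy_expert("COPY policies FROM STDIN WITH (FORMAT csv, HEADER true)", f)
```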
They are also supported by query engines such as AWS Athena, which can, for instance, expose a CSV dataset stored on S3 as a table.
A query can then be performed on the policy table using the tool.
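The sketch below shows both steps with boto3; the S3 locations, database name, and columns are placeholders, and only the "policy" table name is taken from the example above.

```python
# Declare a CSV dataset on S3 as an Athena table, then query it.
import boto3

athena = boto3.client("athena", region_name="us-east-1")
output = {"OutputLocation": "s3://example-bucket/athena-results/"}

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS ingestion.policy (
    policy_id string,
    holder_name string,
    premium double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/policies/'
TBLPROPERTIES ('skip.header.line.count'='1')
"""
athena.start_query_execution(QueryString=ddl, ResultConfiguration=output)

# Once the table is declared, standard SQL can be run against the CSV files
athena.start_query_execution(
    QueryString="SELECT policy_id, premium FROM ingestion.policy LIMIT 10",
    ResultConfiguration=output,
)
```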
Pandas supports reading and writing CSVs through the pd.read_csv and df.to_csv methods. It supports loading files either in full or in chunks, as well as reading/writing compressed files (zip, gzip, …).
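A minimal sketch of both features is shown below; the file names and the "amount" column are illustrative.

```python
import pandas as pd

# Read a gzip-compressed CSV in chunks of 100,000 rows to keep memory bounded
chunks = pd.read_csv("events.csv.gz", compression="gzip", chunksize=100_000)
filtered = pd.concat(chunk[chunk["amount"] > 0] for chunk in chunks)

# Write the result back out, compressed, without the index column
filtered.to_csv("filtered_events.csv.gz", index=False, compression="gzip")
```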
Singer is an ETL framework offering connectors to different data sources and sinks, which it calls 'taps' and 'targets', respectively. Singer provides good support for CSV with an S3 CSV tap and a CSV target. Below is an example of extracting data from Facebook to a CSV file.
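Singer taps and targets are separate executables that communicate over standard input/output, so a pipeline is usually just one process piped into another. The sketch below wires them together from Python; the "tap-facebook" and "target-csv" executable names and the --config flag follow the usual Singer conventions and should be treated as assumptions about your installation, with the equivalent shell command shown in the comment.

```python
# Pipe a Singer tap into a Singer target, equivalent to the shell command:
#   tap-facebook --config facebook_config.json | target-csv --config csv_config.json
import subprocess

tap = subprocess.Popen(
    ["tap-facebook", "--config", "facebook_config.json"],
    stdout=subprocess.PIPE,
)
subprocess.run(
    ["target-csv", "--config", "csv_config.json"],
    stdin=tap.stdout,
    check=True,
)
tap.stdout.close()
tap.wait()
```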
JSON is a text format derived from JavaScript that is used to store and transmit data objects. It is meant to represent more complex structures, which can be serialized and deserialized at will.
JSON is leveraged for data transfer in many applications; APIs in particular use it to exchange data. It is also often the format of choice for passing messages, for instance onto Kafka.
The JSON format offers a flexible schema and allows that schema to evolve. Compared to CSV, which only supports 'text' values, JSON supports a handful of basic types (string, number, object, array, true, false, and null), which lets it handle more complex data structures natively. JSON also has a binary counterpart (Binary JSON, or BSON) supporting additional native types (e.g., datetime).
Due to the nature of its serialization and deserialization, JSON tends to be more robust than CSV.
One disadvantage of JSON files is that they need to be read in full, unless 'line-delimited JSON' is used (requiring lines=True in pandas, which then supports reading in chunks).
Similar to CSV, the files are not compressed by default but can be placed into archives. Being a text format, JSON doesn't properly support binary data, and base64 encoding is used to transmit this type of information. Metadata support is also limited, although a separate schema definition file can be used (see the JSON Schema specification).
Data engineers can use quite a few tools to deal with JSON.
JSON is natively supported by programming languages such as Python. In the example below, you can see how to export an object to JSON text.
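A minimal illustration with the standard-library json module (the record itself is made up):

```python
import json

record = {"policy_id": 42, "holder": "Jane Doe", "active": True, "tags": ["auto", "new"]}

as_text = json.dumps(record, indent=2)  # Python object -> JSON text
parsed = json.loads(as_text)            # JSON text -> Python object

print(as_text)
assert parsed == record
```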
jq is a command-line tool for operating on JSON files. It allows querying the inner content of JSON messages and retrieving information even from nested objects.
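As a sketch, the snippet below shells out to jq to pull a nested field out of a JSON file; the filter and file name are illustrative.

```python
# Equivalent to running: jq '.policy.holder.name' policy.json
import subprocess

result = subprocess.run(
    ["jq", ".policy.holder.name", "policy.json"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```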
Pandas supports JSON files and can read and write them through the read_json and to_json functions. It converts the file to a data frame based on an 'orient' setting, and different orientations result in differently shaped data frames.
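A small sketch of how the orient setting and line-delimited reading come together (file names are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"policy_id": [1, 2], "premium": [100.0, 250.0]})

# 'records' orient writes a list of {column: value} objects
df.to_json("policies.json", orient="records")
back = pd.read_json("policies.json", orient="records")

# Line-delimited JSON (one object per line) can be read in chunks for larger files
df.to_json("policies.jsonl", orient="records", lines=True)
for chunk in pd.read_json("policies.jsonl", orient="records", lines=True, chunksize=1000):
    print(len(chunk))
```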
Iglu is a schema registry for JSON schema used by Snowplow. Snowplow is one of the most common clickstream event ingestion pipelines and collectors. It leverages Iglu and JSON Schema to validate that the incoming data is structured as expected.
Most databases now provide some form of support for JSON. Postgres supports it through specific operators, functions, and types. NoSQL databases such as DynamoDB or MongoDB also tend to provide good support for this data format.
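As an illustration with Postgres and psycopg2 (the "events" table and its jsonb "payload" column are assumptions for the example):

```python
# Extract a single field from a jsonb column; ->> returns the value as text
import psycopg2

conn = psycopg2.connect("dbname=analytics user=etl")
with conn, conn.cursor() as cur:
    cur.execute("SELECT payload->>'user_id' FROM events LIMIT 10")
    for (user_id,) in cur.fetchall():
        print(user_id)
```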
The above code shows an example of how to extract data from a specific JSON field.
XML was built to separate data from HTML. It is now used in a variety of places, such as SOAP APIs, the OData (Open Data) protocol, and within other formats such as Excel's XLSX.
XML is a text format that offers a flexible schema and allows embedding complex types. XML also provides schema validation, for instance using a Document Type Definition (DTD).
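As a sketch of what DTD validation can look like in Python using lxml (the document and DTD are minimal, made-up examples):

```python
from io import StringIO
from lxml import etree

# A tiny DTD: a <policy> element containing a single <holder> text element
dtd = etree.DTD(StringIO("<!ELEMENT policy (holder)> <!ELEMENT holder (#PCDATA)>"))
doc = etree.fromstring("<policy><holder>Jane Doe</holder></policy>")

print(dtd.validate(doc))  # True when the document conforms to the DTD
```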
XML can be quite verbose and lead to larger file sizes. This is exacerbated by the fact that XML is usually not compressed; compression needs to be obtained through zip, gzip, or similar archiving.
XML can be hard to process iteratively depending on its structure. However, it typically fares better than JSON here, as parsers such as SAX and StAX enable some form of streaming.
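In Python, a similar streaming effect can be obtained with the standard library's iterparse, sketched below on a hypothetical records.xml file made up of <record> elements.

```python
import xml.etree.ElementTree as ET

# Process one <record> element at a time without loading the whole document
for event, elem in ET.iterparse("records.xml", events=("end",)):
    if elem.tag == "record":
        print(elem.attrib.get("id"))
        elem.clear()  # release the parsed element to keep memory usage flat
```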
Most databases offer a way to process XML documents; Oracle, for instance, supports this with XQuery and XMLTable. This can serve to handle more complex types and data structures than those natively supported by the database.
Although XML is natively supported in Python (using the ElementTree API), other libraries such as lxml exist, offering additional functionality and typically higher performance. Similar to JSON, pandas provides a way to interact with XML documents (using lxml under the hood), and this is probably one of the most common ways data engineers interact with the format for data processing.
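A minimal sketch with pandas.read_xml (available in pandas 1.3+ and backed by lxml by default); the document and XPath expression are illustrative.

```python
from io import StringIO
import pandas as pd

xml = """<policies>
  <policy><id>1</id><premium>100.0</premium></policy>
  <policy><id>2</id><premium>250.0</premium></policy>
</policies>"""

# Each node matched by the XPath expression becomes a row in the DataFrame
df = pd.read_xml(StringIO(xml), xpath=".//policy")
print(df)
```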
XML is also used with SOAP APIs; in Python, the Zeep library provides a client for dealing with this type of API.
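A sketch of what a Zeep client looks like; the WSDL URL and the GetPolicy operation are placeholders for whatever the service actually exposes.

```python
from zeep import Client

client = Client("https://example.com/service?wsdl")

# Operations described in the WSDL become methods on client.service
response = client.service.GetPolicy(policyId=42)
print(response)
```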
XLSX has been the default native file format for Microsoft Excel since its introduction in Excel 2007. At the time, it provided several advantages over the previous default file format, XLS, such as supporting an increased number of rows per sheet. A similar file format, XLSM, adds support for macros.
The file format is compressed: an XLSX file is a collection of XML files packaged in a ZIP archive. Renaming the file to .zip and decompressing it reveals the XML content within it.
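This is easy to verify from Python by treating the file as a ZIP archive (the file name is a placeholder):

```python
import zipfile

# List the XML parts that make up the workbook
with zipfile.ZipFile("report.xlsx") as archive:
    for name in archive.namelist():
        print(name)  # e.g. xl/workbook.xml, xl/worksheets/sheet1.xml, ...
```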
Excel is a popular tool, and since XLSX is its native format, it is often the natural export format for datasets produced in Excel; planning data, for instance, is often created there. The XLSX format also makes it very easy for users to change the data it contains by simply opening the file in Excel.
It also offers the ability to contain different datasets (sheets) within one file and to compute values from others. The XLSX format supports some metadata, and additional metadata can be added in a separate sheet.
The Excel format provides a certain level of typing. For instance, values can be defined as numerical, text, or dates. It is also possible to separate how a value is displayed from how it is stored (custom formatting).
The XLSX format has strict limits on the number of rows and columns each sheet can contain: 1,048,576 rows and 16,384 columns per sheet. The number of sheets in a workbook is, in practice, limited by available memory.
There can be issues dealing with inconsistent data types, since individual cells can be changed by hand. The tendency for files in this format to be produced manually rather than by a system can make it quite tricky to use for ingestion purposes.
The format is meant to be read entirely in memory rather than iteratively by rows or columns, making it inappropriate for dealing with larger datasets.
Pandas provides utility functions to read and write Excel XLSX files directly to/from a data frame. Under the hood, pandas leverages openpyxl, a Python library to read and write XLSX and XLSM files. Despite some feature limitations, pandas offers an easy way to interact with Excel datasets. One of its limitations is its inability to natively write to Excel tables (also called data tables). This can be mitigated by leveraging libraries such as PandasPyXL, which provides a wrapper around pandas and openpyxl for such use cases.
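A minimal sketch of Excel I/O with pandas and openpyxl; the file and sheet names are illustrative.

```python
import pandas as pd

# Read a single sheet into a DataFrame
df = pd.read_excel("planning.xlsx", sheet_name="Forecast")

# Write two DataFrames to separate sheets of a new workbook
with pd.ExcelWriter("output.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Forecast", index=False)
    df.describe().to_excel(writer, sheet_name="Summary")
```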
CSV/TSV, JSON, XML, and Excel files are some of the most common file formats data engineers encounter in data ingestion tasks. They form part of a wide array of file formats, each with specific advantages and disadvantages.
Most of these formats are well supported in databases and data processing libraries such as pandas. However, data engineers need to understand the pitfalls of each of these formats in order to choose the most appropriate one for the use case at hand.