WiseAnalytics | As a Data Engineer, I wish Postgres could offer these features

Insights

As a Data Engineer, I wish Postgres could offer these features

5 min read

By Julien Kervizic

Postgres as a database is a very versatile database, with a high degree of extensibility. It can be extended through extensions, UDFs, UDAF, UDT. There are quite a few features not currently available within the native implementation. Not all extensibility options are supported in PaaS (platform as a service) implementations, AWS for instance, doesn’t support PL/Python as part of AWS Relational Databases (RDS).

Some companies such as Uber have explained why they have been migrating their operating data stores (ODS ) off of Postgres, but for Data Engineers, different functionality for a database used as a data warehouse than one used as an operational datastores.

A data warehouse's feature needs come in different flavors; syntactic language simplifies analytical modeling, data structures, and data types that either simply data storage/retrieval or increase performance and general performance optimization for analytical workload.

Postgres has gaps in each of these areas. When choosing a database for an analytical workload, it is crucial to understand each database's key missing features to make an informed decision.

Syntactic language

Functions

Distinct supports in window functions: Currently, distinct count is not supported as part of window functions:

The lack of support of distinct in window functions forces the use of additional subqueries and joins in queries that could have been very simple if distinct was supported.

Map/Reduce and filter: Postgres still lacks ways to do transformations on arrays without unnesting,

DATE_DIFF: Postgres doesn’t offer a similar function natively to date_diff in SQL server. It provides a way to retrieve a time interval by subtracting two dates/timestamp but doesn’t provide the automatic unit conversion offered by the DATE_DIFF function.

NVL2: NVL2 provides a short-hand way to handle both cases when an expression evaluates to null and when it doesn’t. It would be necessary to create a CASE statement in Postgres to provide this functionality, much more verbose.

Clauses

MODEL Clause: Oracle, through the model clause, provides an interface for handling, updating, or complementing datasets. The model clause allows to creates specific rules to be applied on the datasets and provides a way to reference cells within the dataset to define these rules. The model clause can be particularly handy when providing summary time-series forecasts or making inferences on missing data.

FETCH Clause: Postgres implements the fetch clause, although it does not fully support it. One key feature missing is fetching by percent, FETCH PERCENT.

MATCH_RECOGNIZE: The match recognize clause provides a way to do pattern matching. It provides a more straightforward and concise way to define advanced association rules for records than traditional windows functions and CTEs. This can be useful for use cases such as sessionization.

Data Structure & Datatypes

Tables & Table Structures:

Temporal tables: They provide a practical way to handle slowly changing dimensions. Temporal tables offer, at the same time, the lastest view by default and the ability to query the data as of a particular time. Although an extension for Postgres exists with that name, its functionality is somewhat more limited than a native implementation. The features of temporal tables are particularly handy when dealing with training machine learning algorithms and needing to provide reproducible results.

Treatment of views: The treatment of views in Postgres isn’t the best. It lacks supports for predicate pushdown in CTEs containing aggregate functions [1] [2].

Incremental view maintenance (IVM): provide a way to manage better-materialized views. When changes are made in the underlying tables, unlike with the traditional view materialization, which requires a full refresh through the command REFRESH MATERIALIZED VIEW TBL_NAME; Incremental View maintenance calculates and automatically applies only the deltas. [1] [2]

In-memory tables: Some databases, such as MySQL, provide the ability to load entire tables into memory. They can provide significant performance when needing to leverage tables that can fit in RAM across multiple sets of operations.

Clustered index: This type of index forces the ordering of the data contained within the table, providing some performance benefits. Postgres does allow tables to be clustered through the CLUSTER command and does provide a flag for this within the metadata, but this ordering is not strictly enforced and the CLUSTER command need to be manually run.

Buckets: HIVE and Spark offer the ability to “bucket” tables, essentially providing a way to dynamically partition the data based on the values contained in a given column(s). In these big data tools, the use of buckets can yield improved performance when joining different datasets by 1) facilitating data locality 2) increasing the odds that the join will turn into a map join. In a traditional RDBMS, the need for buckets is significantly less, but some operations could still benefit from bucketed tables.

Datatypes

SQL2016 JSON Conformance: There are still quite a few features around SQL/JSON from the SQL 2016 missing from Postgres SQL. JSON_ARRAYAGG with ORDER BY, or support for the PASSING clause to query paths dynamically.

JSON JMSEPATH: Postgres 12 added some additional support for JSON by introducing JSON path query. It is, however, not always straightforward to deal with JSON data directly in SQL. Applying certain transformations requires to unnest the JSON data and recombine it at a later point. JMSE-Path provides a way to not only query the data within the JSON but also apply simple transformations.

JSON Explicit Schema: On Microsoft SQL Server, it is possible to define projections for JSON data using explicit schema. Postgres currently lacks a similar feature

Generalized unstructured data type: JSON/JSONB offers limited type supports. A more generalized unstructured data types would allow retaining complete schema information. [1]

Optimization:

Query hints: Although available through an extension, query/optimizer hints are not available natively in Postgres.

Optimizer: There are a few optimizations that the optimizer currently doesn’t support, such as join elimination, predicate merging

Other

Query variables: Variables that can be used inside a single query without using procedural code. Postgres doesn’t currently offer this feature, which forces the use of CTEs or inlining these parameters in an external code template (e.g., using jinja templates).

‍