CosmosDB is a fully managed, globally distributed database on Azure. It offers several APIs to access the data: SQL, MongoDB, Cassandra, Gremlin, and more. The API, however, must be chosen at creation time; after that, the database can only be used with the selected API.
The Cosmos DB Mongo API is available in two interface versions: 3.2 and 3.6.
The 3.2 interface supports only MongoDB's basic features, though these can be extended with "3.4" preview features that enable, for instance, unique indexes and the aggregation pipeline. This version has its issues, however: aggregations, for example, are limited to 40 MB of memory.
The 3.6 interface added compound indexes, removed the 40 MB aggregation limit, provided a change feed, and enabled the creation of unsharded collections in databases with shared throughput. It also introduced preview features for autoscaling throughput and for analytical queries through Spark. Still, the interface does not offer MongoDB's full feature set; features such as multiple join conditions in $lookup are missing.
Not all MongoDB commands are available as part of the API. With db.getCollection('foo').stats(), for instance, not all the statistics are surfaced.
Renaming collections (db.getCollection('foo').renameCollection('bar')) is not supported at all by Cosmos; to achieve the same result, you need to create a new collection and import/export the data into it.
Commands such as db.getCollection('foo').getShardDistribution() and db.getCollection('foo').getShardVersion() are not supported. getShardDistribution, for instance, reports that the collection is unsharded even when it is in fact sharded. The sharding status is, however, visible in the Azure portal.
Some operations within the database cannot be performed with plain MongoDB commands and must instead go through Cosmos DB's MongoDB extension commands; this is notably the case for the creation of sharded collections.
In native MongoDB, the same operation would have been a single command:
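As a sketch, assuming pymongo and the shape of Microsoft's documented extension commands (the database, collection, shard-key names, and throughput value below are illustrative):

```python
# Sketch of creating a sharded collection through the Cosmos DB Mongo API.
# The "customAction" document follows Microsoft's extension-command shape;
# the names and throughput value here are illustrative only.

def create_sharded_collection_cmd(collection, shard_key, throughput=1000):
    """Build the Cosmos DB extension command for a sharded collection."""
    return {
        "customAction": "CreateCollection",
        "collection": collection,
        "shardKey": shard_key,
        "offerThroughput": throughput,
    }

cmd = create_sharded_collection_cmd("events", "user_id")
# With a live pymongo connection this would be sent as:
#   client["mydb"].command(cmd)
# The native MongoDB equivalent is a single shell command:
#   sh.shardCollection("mydb.events", {"user_id": 1})
```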
Cosmos takes an unusual approach to throttling with its MongoDB interface: writes are not retried server-side, so the retry logic has to live in the application.
The exception that gets surfaced is not the traditional 429 error but a generic exception whose message contains the suggested retry delay (RetryAfterMs), which needs to be parsed to be used effectively.
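A minimal sketch of that application-side retry logic, assuming the error text carries the RetryAfterMs hint as described above (the message shape and helper names are illustrative):

```python
import re
import time

def retry_after_ms(message):
    """Extract the RetryAfterMs hint from a Cosmos throttling error,
    or return None when the error is not a throttle."""
    match = re.search(r"RetryAfterMs=(\d+)", message)
    return int(match.group(1)) if match else None

def with_retries(operation, max_attempts=5):
    """Run operation(); on a throttling error, wait as instructed and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:  # pymongo surfaces this as OperationFailure
            delay = retry_after_ms(str(exc))
            if delay is None or attempt == max_attempts - 1:
                raise  # not a throttle, or out of attempts
            time.sleep(delay / 1000.0)

# Abbreviated shape of the message Cosmos returns when throttling:
throttle = "Request rate is large. Please retry. RetryAfterMs=34"
```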
Because of the lack of full MongoDB compatibility, some of the usual administration tools may not work quite as expected. This is the case for the mongoimport and mongoexport commands: mongoimport might silently skip records because of throttling without retries, and mongoexport might refuse to export some collections for lack of RUs.
Support is better when directly leveraging Microsoft's tooling, such as Azure Data Factory.
Sharding is important in distributed systems, but Cosmos DB doesn't automatically shard collections by default. If you haven't provided a partition key at creation time, you may run into the 10 GB size limit for a given collection's partition; the size is calculated based on uncompressed data.
Even with a shard key set up, Cosmos may sporadically raise "Partition key" errors.
Lookup queries (e.g., find with $in) and joins across collections whose shard keys belong to different partitions produced erratic results, even with a strong consistency setting and a replica-set connection string.
Speed and performance within CosmosDB are partially configurable through the RU/s setting. The 3.6 Mongo interface brings a preview feature that autoscales the number of RU/s within a specified range, such as from 2k to 20k RU/s.
Besides tuning this performance setting, there are, however, a few other factors that can improve performance, such as how the application is designed to leverage CosmosDB's capacity.
Given that Cosmos is a cloud service that can't be colocated with the application, the latency of individual operations can be high. Using a mini-batch approach boosts performance significantly.
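A sketch of the mini-batch idea, assuming pymongo-style bulk writes; the batch size is a tunable guess, not a recommended value:

```python
def batches(docs, size=100):
    """Yield fixed-size slices of docs so they can be written in bulk,
    paying the network round-trip once per batch rather than per document."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]

# With a live pymongo collection this would drive, for example:
#   for chunk in batches(docs, size=100):
#       collection.insert_many(chunk, ordered=False)
```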
CosmosDB provides a bulk executor library meant to speed up certain operations. Unfortunately, it is available only in Java and .NET, and only for the SQL and Gremlin APIs.
Indexes help limit the number of RUs required for specific operations. They come at the cost of additional storage, however.
Sharding has an impact on performance: the provisioned RUs are split equally across the different shards. Coupled with data skew and uneven partition sizes, this can reduce performance for some queries. Cross-partition queries also end up being less efficient.
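The effect of the even split can be sketched with a back-of-the-envelope calculation (the partition count below is hypothetical):

```python
def ru_per_shard(total_ru, physical_partitions):
    """Cosmos splits provisioned throughput evenly across physical
    partitions, so each shard is capped well below the total."""
    return total_ru / physical_partitions

# 10,000 RU/s spread over 5 physical partitions leaves only 2,000 RU/s
# for a hot partition, even while the other four sit mostly idle.
```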
Cosmos, by default, isn't well suited to analytical-style queries: even a simple count on a large collection can exceed the available RUs and throw an error.
Cosmos has a preview feature that runs a Spark engine against what Microsoft calls "analytical storage," a form of blob storage.
For the same price, a self-hosted MongoDB cluster delivers drastically better performance than Cosmos DB, on the order of a 5–10x multiplier for demanding workloads. CosmosDB does, however, offer very good per-query latency that is hard to beat.
Running Cosmos can get expensive. Usage is charged through the RU/s metric, with a minimum of 100 RU/s required per collection and a minimum of 400 RU/s provisioned per database. The storage used also affects the number of RUs that need to be provisioned.
There are ways with Cosmos to partially manage usage costs. It is possible to provision shared-throughput databases, which lets collections pool the provisioned capacity and make more efficient use of it. The autoscale preview feature is also there to increase capacity on the database so that you only pay for extra capacity when you need it available.
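As a rough sketch of how throughput drives the bill (the hourly rate below is an assumption based on Azure's published single-region pricing at the time of writing, not a guaranteed figure):

```python
def monthly_throughput_cost(provisioned_ru, rate_per_100ru_hour=0.008,
                            hours_per_month=730):
    """Estimate the monthly cost of provisioned throughput alone, in USD.
    Storage is billed separately; the default rate is an assumption."""
    return provisioned_ru / 100 * rate_per_100ru_hour * hours_per_month

# Even the 400 RU/s database minimum accrues a cost around the clock,
# which is what makes shared throughput and autoscale attractive.
```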
Whether you are developing a new application or porting an existing one from MongoDB to Cosmos, you cannot simply assume that Cosmos will behave like Mongo.
If you have some flexibility in the choice of API and language, the SQL API seems more mature than the Mongo API, and among the SDKs, .NET appears better supported than Node.js.
Cosmos offers a fully managed database with a MongoDB interface, and if that is what you need on Azure, there is currently no other choice. If you are instead looking for a true MongoDB experience, you might be better off deploying a MongoDB cluster yourself.