Shiro's blog

How to sort lexicographically in MongoDB ?

TL;DR

Introduction

This solution relies on case-insensitive indexes to execute the sort on the database side.

Let's start with a basic sort and see what changes we need to make to change it into a lexicographic sort.

from pymongo import MongoClient

NAMES = ["Alice", "pomme", "Paris", "Bob", "oreiller", "chariot"]


def main():
    collection = MongoClient()["lexicographical_sort_test"]["test"]
    collection.insert_many([{"name": name} for name in NAMES])
    for doc in collection.find().sort("name"):
        print(doc["name"])
    collection.drop()


if __name__ == '__main__':
    main()

The above snippet connects to a local mongodb instance, insert a few documents, print them sorted and drop the collection.

As you can see if you run it, documents' names are not printed in expected order, but with capitalized words first.

Implementing lexicographic sort

Feature compatibility

First, you have to make sure to have a MongoDB version at 3.4 or above. But this is not enough if you had an earlier version installed. You will need to tell mongodb that he can use its new features from the 3.4 version, using the setFeatureCompatibilityVersion command.

To summarize, you have to run the following command in MongoDB's shell:

db.adminCommand( { setFeatureCompatibilityVersion: "3.4" } )

We are now allowed to use v2 indexes, which will be very useful in the next part.

Make a case-insensitive index

With the new indexes, it is possible to have a different collation from the collection. This allow us to create a case-insensitive index by changing the strength of the collation to 2. This tell MongoDB to compare strings without taking case into account.

collection.create_index("field", v=2, collation={"locale": "en", "strength": 2})

Apply collation to query

With the index created, you could think that our work is done and that we can go home.

Wrong. There's one last step.

You need to apply the same collation in the query as the one used in the index. This can be done easily by applying the collation method on the cursor.

cursor.collation({"locale": "en", "strength": 2})

Final snippet

We now have a working lexicographic sort, computed in MongoDB. Here is the snippet you get by applying the above modifications to the snippet at the beginning.

from pymongo import MongoClient

NAMES = ["Alice", "pomme", "Paris", "Bob", "oreiller", "chariot"]
COLLATION = {"locale": "fr", "strength": 2}


def main():
    collection = MongoClient()["lexicographical_sort_test"]["test"]
    collection.insert_many([{"name": name} for name in NAMES])
    collection.create_index("name", v=2, collation=COLLATION)
    for doc in collection.find().sort("name").collation(COLLATION):
        print(doc["name"])
    collection.drop()


if __name__ == '__main__':
    main()

Conclusion

There are other valid solutions, like storing the same field in lowercase to sort using it. But I've seen too little documentation on this method and I wanted to put an easy guide out there.

This was also my first occasion to write a blog post, so I hope you enjoy it!