Django ORM Intricacies

Nick Wurzer

Recently I did some work on our models and ingestion script and learned about some unique cases in Django’s Object Relational Model (ORM) which I’d like to describe here. For an overview of the ingestion process see Priya’s blog post.

There were three main goals to this work.

Firstly relational constraints should be enforced in the database so that it is not up to the developer to maintain relational integrity. If the developer tries to make a change to the database that is incorrect, the database should throw an error.
Secondly these errors should be handled for both Postgres and SQLite3 because we use SQLite3 in development but Postgres in the production environment.
Lastly, the spreadsheet being ingested should be accepted or rejected as a whole so that it is known which changes from the spreadsheet have been applied to the database.

Constraints and Relational Integrity

To address the first main goal of trying to create objects and relying on relational constraints to check if the object creation is valid, UniqueConstraints and CheckConstraints were used.

UniqueConstraint

Our models had the following relationships that needed to be enforced:

A site has a unique and required code field.
A site may contain many Excavations, and each excavation’s required name field must be unique within the site.
Similarly, an excavation may have many Samplings, and each sampling’s name field must be unique within the excavation.
A sampling’s name is not required, but there can only be one NULL sampling name per excavation. The NULL sampling is one which is dubbed combined and it contains all bones in the excavation.

If these conditions are not met, then ingestion will interpret the object as a duplicate and try to verify that the data in the spreadsheet matches the database’s.

I’ll only cover the constraints for a sampling because it can be extrapolated to sites and excavations too.

The constraints under the Samplings model are:

constraints = [
            models.UniqueConstraint("name",
                "excavation",
                name="unique_name_excavation"),
            models.UniqueConstraint("excavation",
                condition=models.Q(name__isnull=True),
                name="only_one_NULL")
        ]

The first constraint, aptly named unique_name_excavation, requires that the combination of a sampling’s Excavations and name must be unique. So the first condition that a sampling must be unique within an excavation is met. This same concept was applied to the Sites and Excavations models.

The second constraint requires that a sampling’s excavation must be unique on the condition that the sampling’s name is NULL. So now there can only be one NULL sampling name per excavation.

The first constraint could be thought of as a compound primary key, since it requires uniqueness in two attributes from different entities. Django doesn’t support compound (or composite) keys though, so it has to be enforced with UniqueConstraints.

Choices and CheckConstraint

One of the relational constraints that I thought was enforced in the database out of the box, but is not, is Django’s choices field option. So in this update to models and ingestion I made a CheckConstraint to do that.

This project has a model called Excavations which, crudely put, represents a hole in the ground which archeologists have excavated. The archeologists take the excavated soil and screen it through a wire mesh which is broadly catagorized as coarse or fine. So in the excavations model there is a field

screen_size = models.CharField(max_length=6, choices=SCREEN_SIZE_CHOICES)

where

SCREEN_SIZE_CHOICES = [
        ("FINE", "Fine"),
        ("COARSE", "Coarse")
    ]

This field definition is great because it enforces the two choices in any forms, but it does not actually make a constraint on the database. Django assumes that the developer will only use the proper values for the screen size in any direct updates to the database. This is problematic for the direct updates that happen with ingestion because any value will be ignested without throwing an error. One solution is to write a checkConstraint for the field as follows:

models.CheckConstraint(check=models.Q(screen_size__in=ScreenSize.values),
                name="valid_screen_size",)

Now nothing other than FINE or COARSE can be stored in the database and an attempt to do so will result in an IntegrityError with message: CHECK constraint failed: valid_screen_size.

Handling Errors with Different Databases

To address the second main goal of making ingestion more resilient, errors had to be caught for SQLite3 and Postgres. Django categorizes each error that the database throws into one of the subclasses of DatabaseError. This way the same error semantically is also the same Django error type even if the database errors differ.

For example one subclass of DatabaseError is IntegrityError which we saw as a violation of a UniqueConstraint. Unfortunately, though Django helpfully classifies all violations of UniqueConstraint as IntegrityError, there is no way of knowing which constraint was violated without checking the original database error message.

If the error was caught with except IntegrityError as e: then the error message could be cast to a string with str(e.__cause__) and the message can be compared to the expected message.

Accepting or Rejecting the Entire Spreadsheet

The third main goal was to rollback any changes and reject the whole spreadsheet if any errors occur during ingestion. This was acheived by wrapping the whole ingestion code in a with transaction.atomic(): block. I quickly found out that catching DatabaseErrors inside this block resulted in a TransactionManagementError. I’ll do my best to explain it but you can read more in the ‘Avoid catching exceptions inside atomic!’ section of this Django documentation page.

Basically, Django expects you to catch errors outside of an atomic block so that it knows how far to roll back changes. If you catch a DatabaseError inside of an atomic block, then you are hiding from Django the fact that a DatabaseError occured. If this happens, then Django cannot properly roll back the changes in the ORM to reflect the changes in the database.

The remedy is simple. Wherever the DatabaseError (or its subclasses) occurs, wrap it in an atomic block so that Django knows that the error occured. This may result in nested atomic blocks and that’s fine. If an exception is caught between an inner and outer atomic block, then the changes from the inner block will be rolled back, but the outer transaction will be free to proceed and make changes to the database as long as no other errors occur in its scope.

Merge Migrations

Although slightly unrelated to the main goals of this work, I also came across a merge migration while doing this work so I thought it would be a good idea to cover that. I thought it might be scary, but it’s not at all.

Say I’m making changes to the Sites model field code and my coworker Jim is making changes to the Excavations model field name. The last migration made was migration number 10, so we both make a migration number 11 that depends on migration 10. Jim merges his code to the main project and then I do. We created migrations files named two different things so there is no merge conflict in git. When we try to migrate changes to the model in the main branch though, Django will complain that there are two migrations with the same number and that you need to resolve this. There are many solutions to this, but one is to use the management command makemigrations --merge. What this does is create a third migration (number 12) that has no operations, only dependencies. Migration 12 depends on both Jim’s migration 11 and my migration 11. It is okay that there are two migration number 11s because they have different names and there no conflicts in the files because we edited different fields. This way the migrations have been linearized and all dependencies remain intact.

A problem could arise if Jim and I edited the same field of the same model. At that point I don’t think Django would know how to resolve the conflict. I don’t have experience with this but I think one way to resolve it would be to delete the new migrations and remake them with makemigrations. Alternatively, you could manually edit the migrations files to resolve the conflict.