Generic spreadsheet ingestion
Bhavy RaiAs many may have gathered from our previous blog posts, ZooDB is a project that:
“…presents decades of bone count data from zooarchaeological surveys at various sites around British Columbia.”
However, the format of this data might not be what you expect. In academia, it is quite common to collect trial data in Microsoft Excel files due to its simplicity and ease of use. Managing a full-fledged database can be a significant hassle and is understandably not a priority for most researchers, as nobody wants to be bogged down with mundane tasks.
The challenge with Excel files is that programmatically handling them can be cumbersome. We prefer to perform CRUD operations (💩? Not quite, it stands for create, read, update, and delete) on a database instance to ensure data integrity, scalability, and accessiblity, rather than directly in an Excel file. So, how can this be achieved? Let’s dive into it below!
Motivation
As discussed in the introduction, Excel files are prevalent in our field of work. Therefore, the logic used to ingest files in ZooDB will not be the last time we encounter them. However, the current code for ZooDB’s ingestion is not easily adaptable to new projects, as there are numerous precautions taken before modifying a database entry. Despite this, the core logic for Excel file ingestion will remain consistent across projects.
To avoid duplicating large amounts of code across projects, we have focused on
developing a flexible and generic solution. This led to the creation of our
database-spreadsheet integration library, dbssin
, which we’ll refer to moving
forward.
The dbssin
library aims to streamline the process of Excel file ingestion,
providing a robust and reusable framework that can be easily integrated into
various projects. By standardizing the ingestion logic and reducing the need
for repetitive code, dbssin
not only saves development time but also ensures
consistency and reliability in handling Excel files across different
applications. Our goal is to make dbssin
a versatile tool that addresses the
common challenges faced when working with Excel files, thereby enhancing
productivity and maintaining high-quality data management standards.
Reading from Excel to a database
I won’t delve deeply into how to read Excel files with Python, as Priya covered
this extensively in her blog post
about a year ago. However, one detail I can add is that there are many great
alternatives to OpenPyXL
depending on your use case. For instance, if your
application deals with large amounts of data, consider using the
python-calamine
library, which provides Python bindings for a Rust library used to read from
large Excel files. It also integrates well with the well-established
Pandas
library, in
addition to OpenPyXL
, which is fantastic!
Constraints with generic spreadsheets
Much to my dismay, it’s not feasible nor possible (🤖?) to have it so we ingest
an Excel spreadsheet, create the correct model schema for Django, and then read
all our data into the correct fields during run-time. So, instead, we opted
to leave the model definitions to the user, and simply have it so we map fields
in a model, to individual rows of a sheet in the Excel file. Not only is this
more simplistic in an algorithmic sense, but it also gives the user the freedom
to define their constraints, and have dbssin
deal with the validation aspect.
Along with this approach, we also expect that naming is consistent and
respected. Meaning, that if you have a sheet in your Excel file named
Contacts
we expect there to be a model in your models.py
named Contacts
as well, otherwise there is no real way of knowing where each data entry
belongs.
How can you map a model field to a sheet in Excel?
Quite simple actually! When you define a model in the models.py
file like so:
class Contact(Model):
name = CharField(..., help_text='A', ...)
number = CharField(..., help_text='B', ...)
address = CharField(..., help_text='C', ...)
You can pass an optional parameter referred to as help_text
that allows you
to pass in any arbitrary string – in our case, the corresponding row in an
Excel sheet. Additionally, since we established that the name of the model
corresponds to the name of a sheet in the Excel file, we now have the exact
location of a row, and using OpenPyXL is quite trivial in its implementation.
But, if you want to see how we did it, you can check out Priya’s blog post
linked above!
help_text
is a field that is used to provide a human-readable description of
the field, and it is not used by Django for any internal operations. Therefore,
it is a good starting place to scaffold a solution for storing metadata that
can be used by other parts of your application. Many other fields in Django can
be used to store metadata, such as verbose_name
or default
, and you can read
more about them and other such fields in the
official documentation.
Note: help_text
is used in the Django admin interface to provide a tooltip
when hovering over a field, as well as in forms to provide a label for the field,
so be mindful of its usage.
Eventually, we plan to abandon using help_text
in favour of a more robust
solution, but for now, it serves its purpose well, allowing for us to rapidly
prototype and test our ideas.
Creating a database record for a model using Django’s ORM
Django’s ORM
(object-relational mapper) gives a lot of access to the inner workings of its
primitives. For example, if you want to retrieve the name of all your models
you can add the following lines of code to your views.py
:
from django.apps import apps
MODELS = [
model.__name__
for model in apps.get_models()
]
Note: this will not just return the names of those models you defined
in your models.py
, but also all the dependent models that Django uses under
the hood, such as for sessions, so further refinement, i.e. using regex
, is
required to obtain just the information you’re interested in.
To create an entry in the Contacts
model with a set of data, you can
do it in the following way:
from django.apps import apps
# Needs to be consistent with the model definition in models.py
row = {
"name": "Bhavy",
"number": "XXX-XXX-XXXX",
"address": "XXXX XXXXXXXX XXXXXX"
}
# Replace <your_app_label> and <your_model_name> with their respective values
Model = apps.get_model(app_label=<your_app_label>, <your_model_name>)
instance = Model(**row)
instance.save()
So, let’s break this code snippet down:
row
is a dictionary that contains all of the necessary data required by our model, in this case, theContacts
model.- To get the instance of your chosen model, you need to pass in the name of
your Django app, i.e.
dbssin
, along with the name of your model, i.e.Contacts
, to theget_model()
function of theapps
module. - The
**
operator is used to unpack the dictionaryrow
into keyword arguments, which are then passed to the model’s constructor, i.e. passing**row
toModel
is equivalent to passingname="Bhavy", number="XXX-XXX-XXXX", address="XXXX XXXXXXXX XXXXXX"
. .save()
commits our newly formed entry to the database, and with either an SQLite database viewer or access to the PostgreSQL instance, we can see a new entry appearing in our database.
This is a very simplified version of how to create an entry in a model using Django’s ORM, and I would recommend reading the official documentation to gain a deeper understanding of how to interact with the ORM. It’s a powerful tool that can be used to perform a wide range of operations on your database with ease, not limited to just creating entries, but any CRUD operation you can think of!
Conclusion
In this blog post, we discussed the motivation behind creating a generic
spreadsheet ingestion library, dbssin
, and how it can be used to streamline
the process of reading data from Excel files into a database. We also explored
the constraints associated with handling generic spreadsheets and how we
overcame them by mapping model fields to sheets in Excel. Finally, we provided
a brief overview of how to create an entry in a model using Django’s ORM. By
leveraging the capabilities of dbssin
and Django’s ORM, you can simplify the
ingestion process and ensure data integrity in your applications.
I hope that this post has provided you with valuable insights into the
challenges of working with Excel files and how dbssin
can help simplify the
ingestion process.
References
- ZooDB project page
- OpenPyXL ingestion
- Calamine engine for reading Excel
- Pandas documentation
- Django model fields documentation
- Django’s ORM
- Django model instances documentation