Turning a Spreadsheet Into a Postgres Database

Stephen Neale

Hello! Stephen here. From September to October, I have been working on a database of excavations and bone counts for the archaeological department here at the University of Victoria, a project dubbed “ZooDB”. In this blog post, I will go over how I tackled one of my first tasks: automatically turning a spreadsheet of data into a Postgres database.

The finished “inputsites” script and tests, along with instructions, can be found in the ZooDB git repository.

Ingestion via Python

The first step of this project is to create a script that can automatically ingest a spreadsheet into workable SQL code, which we will later use to build the database. For this project, I am using Python 3. With Python being an incredibly popular language, there are multiple powerful libraries that can help streamline this work. Personally, I chose openpyxl for its ability to easily iterate through the rows and columns of a spreadsheet and extract whatever it finds. Let’s go over a very basic example of how to read a spreadsheet with openpyxl and turn it into a SQL insert.

First off, we need to open the spreadsheet as a workbook via openpyxl so we can start iterating through it. Assuming the spreadsheet’s path is passed in as a command-line argument, we can do this with:

import sys
import openpyxl

wb = openpyxl.load_workbook(sys.argv[1])

From here, we need to grab the active worksheet with:

sh = wb.active

Now we are ready to start pulling data and outputting SQL code. The output can just be printed to standard output, and then piped into the Postgres database later.

The main way to iterate through this worksheet is to call the methods .iter_rows() or .iter_cols(). From there, we can pick out the cell index we are looking for and drop its value into some SQL code. A typical example (using .iter_rows()) would look like the following:

for row in sh.iter_rows(min_row=1, min_col=1):
   if row[2].value is not None:
      # row is a tuple of cells, so row[2] is the third column
      cell_value = row[2].value
      print("INSERT INTO testtable (value)\n"
            f"VALUES ({cell_value});")

If the cell value on the spreadsheet is “1”, the output of this would then be:

INSERT INTO testtable (value)
VALUES (1);

This output can then be piped into a Postgres database. (Note that for real data, string values would need quoting and escaping before being interpolated into the SQL like this.)

Testing the ingestion

As is customary when developing any code, it is extremely important to be able to test this process: both to make sure the ingestion script never picks up the wrong data, and to verify that a new addition hasn’t broken it. For this test script, I used Python’s built-in unittest features to create a conventional, easy-to-understand test suite. In addition, the psycopg2 library is used to let the tester connect to the Postgres database and assert that it contains the correct data.
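For reference, the test module shown below relies on a few imports at the top of the file:

import sys
import unittest

import psycopg2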

To begin, let’s create a basic Python unittest test case with predefined class attributes for the Postgres connection and cursor:

class InputSiteTester(unittest.TestCase):
   conn = None
   cur = None

Next, we need a setUpClass method that runs once before the entire test suite begins. In it, we make a connection via psycopg2 to the database that we can then test against.

@classmethod
def setUpClass(cls) -> None:
   try:
      cls.conn = psycopg2.connect(
         host="YOUR HOST",
         user="YOUR USER",
         password="YOUR POSTGRES PASSWORD")
   except psycopg2.OperationalError as error:
      sys.exit(f"Unable to connect to postgres: \n{error}")

Now that we have a connection to the Postgres database, the test suite should run the ingestion script from the command shell and pipe its output into a file, ingest.sql. So that stale data never interferes with a test, it’s a good idea (provided the database isn’t too large) to drop and recreate the tables from the schema before each test, and then load the ingested SQL. An example of this can be done as follows:

def setUp(self):
   self.cur = self.conn.cursor()
   # Rebuild the tables from the schema, then load the ingested data
   with open("schema.psql", "r", encoding="utf8") as schema_file:
      self.cur.execute(schema_file.read())
   with open("ingest.sql", "r", encoding="utf8") as test_sql:
      self.cur.execute(test_sql.read())

This will run before every individual test, resetting the database so that changes made by one test don’t leak into the next.
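As for producing ingest.sql itself, one option is to shell out to the ingestion script once before the suite runs, for example at the top of setUpClass. Here is a minimal sketch, where inputsites.py and test_sites.xlsx are placeholder names for the ingestion script and a known test spreadsheet:

import subprocess
import sys

# Run the ingestion script and capture its printed SQL into ingest.sql.
# "inputsites.py" and "test_sites.xlsx" are hypothetical names for this sketch.
with open("ingest.sql", "w", encoding="utf8") as out:
   subprocess.run(
      [sys.executable, "inputsites.py", "test_sites.xlsx"],
      stdout=out,
      check=True)  # fail loudly if the ingestion script errors out

This way, every test run regenerates the SQL from the spreadsheet, so the tests always exercise the current version of the ingestion script.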

Moving on, let’s write a basic test case that checks whether the contents of the “testtable” from the previous example match a text file. This text file must first be checked for accuracy by hand, since it’s what the tester asserts against. More examples of this can be found in the “inputsites” folder in the ZooDB repository.

def test1_select_test_table(self):
   """Select test table and assert equal to test text file"""
   with open("test1.txt", "r", encoding="utf8") as testfile:
      expected = testfile.read()

   self.cur.execute("SELECT * FROM testtable")
   result = str(self.cur.fetchall())

   self.assertEqual(result, expected)

In test1.txt, we need exactly the following text, with no trailing newline:

[(1, 1)]

This matches the string that str(self.cur.fetchall()) produces when the psycopg2 cursor selects everything from the table called “testtable”: a list containing one row tuple with the primary key and the inserted value (assuming the schema gives testtable a serial primary key as its first column).

Finally, once the suite finishes, we need to close the cursor and tear down the connection to Postgres. We can do this with a tearDownClass method from Python’s unittest.

@classmethod
def tearDownClass(cls):
   if cls.cur is not None:
      cls.cur.close()
   if cls.conn is not None:
      cls.conn.close()
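
To make the tester runnable directly from the command line, the standard unittest entry point can be added at the bottom of the file:

if __name__ == "__main__":
   unittest.main()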

Conclusion

In a nutshell, this is the basic path I took to parse a spreadsheet and turn it into a Postgres database (and test it!). This is by no means the only or best procedure to go about it, but that’s the beauty of code: one thing can be implemented in many different ways. In any case, thank you for making it this far! This may be a bit much for a first blog post, but I’m excited to finally share my process and what I’ve been working on here at ARC Software Development and the anthropology department at UVic. Next time, I hope to share a bit of my thinking on using the Django web framework to build the base functionality of the actual ZooDB webpage.