Splitting a string list in Python

Drew Leske

I’m definitely a fan of list comprehensions in Python (as well as dict comprehensions), but I have some pretty specific ideas about how they should be formatted to be readable. Dict comprehensions are more complex, but even the much simpler list comprehensions can be abused, or can be misleading even when used well.

I have been caught out a couple of times when coming across the following in a code review:

mylist = [s for s in mystr.split(',') if s]

Every time I come across that, I think to myself, “Why is that a list comprehension? split() already returns a list. It must be an artifact of older, more complex code.” It will usually occur to me shortly thereafter that the if s eliminates empty strings:

mystr = 'a,, b, '
print(f"Initial string: '{mystr}'")

mylist_js = mystr.split(',')
print("Just splitting: ", mylist_js)

mylist_lc = [s for s in mystr.split(',') if s]
print("Splitting and list comprehension: ", mylist_lc)

Gives:

Initial string: 'a,, b, '
Just splitting:  ['a', '', ' b', ' ']
Splitting and list comprehension:  ['a', ' b', ' ']

But I don’t like the readability: it’s caught me several times, and I’d argue that removing the empty strings looks like a side effect even if it isn’t. Most importantly, it interrupts my reading of the code.

In this case it’s not just style or readability. As the example above shows, the initial string contained more commas than necessary, leaving an empty element and an almost-empty, whitespace-only element after the split, neither of which we want; the second element also carries an unnecessary leading space, which we don’t want either. My first instinct is to split on a regular expression:

import re

re_split = re.compile(r'\s*,\s*')
mylist_re = re_split.split(mystr)
print("Splitting on RE: ", mylist_re)

This gives us two empty strings, as might have been predicted:

Splitting on RE:  ['a', '', 'b', '']

But I can use the list comprehension:

print("Splitting on RE, as a list comprehension: ",
  [ x for x in mylist_re if x ])

And get just the two strings. However, as I stated at the beginning of all this, I don’t love the readability. The first thing that occurs to me is to use filter(). Now, I don’t use this very often: it’s a function, not a method on list or other iterable types, and it returns an iterator, not a list or other collection, so the results then need to be converted. So let’s give that a shot:

print("Splitting on RE, and using filter: ", list(filter(lambda x: x,
mylist_re)))

filter() will include only the elements for which the function returns a truthy value, so the identity lambda is intended to exploit that. (It turns out filter() also accepts None in place of the function, as a shortcut for exactly this.) The results are identical for the two approaches:

Splitting on RE, as a list comprehension:  ['a', 'b']
Splitting on RE, and using filter:  ['a', 'b']
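As an aside, and not part of the comparison just above, the None shortcut would look like this (reusing mylist_re from the earlier snippet); it keeps exactly the truthy elements, so I’d expect the same ['a', 'b']:

print("Splitting on RE, and using filter(None): ",
  list(filter(None, mylist_re)))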

Okay, great, but the “improvement” is less readable and now I’m starting to wonder about the performance cost as well. So what’s more “pythonic”?

At least one answer on Stack Overflow states that the list comprehension is more Pythonic. And a comment suggests an easy way to get rid of extra whitespace is to use strip(). Is that better or worse than my regular expression?

print("Splitting on ',' and stripping whitespace as part of comprehension: ",
  [ x.strip() for x in mystr.split(',') if x.strip() ])

This is readable; no better or worse than the regular expression splitting I was using before. I’m stripping each element twice (maybe Python optimizes this behind the scenes, I don’t know), which is unlovely but hard to avoid in a plain comprehension.
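As an aside, on Python 3.8 or later an assignment expression can avoid the second strip(). A minimal sketch, and not something I used in the timings below:

mylist_walrus = [ stripped for x in mystr.split(',') if (stripped := x.strip()) ]
print("Splitting on ',' with a single strip per element: ", mylist_walrus)

Whether that actually reads better is debatable, so I’ll stick with the plain comprehension. Okay, so how about performance?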

Performance

From that same Stack Overflow question I learned of a timing library I haven’t used before: timeit. It apparently “avoids a number of common traps for measuring execution times”–I don’t know what these are, which almost certainly means I’m falling into all of them, twice. So let’s give that library a shot and compare the various methods for performance.

Cobbling together the key previous bits into one script with timeit calls, I get:

import re
from timeit import timeit

mystr = 'a,, b, '
re_split = re.compile(r'\s*,\s*')
execs = 1000000

print(f"Splitting on string: '{mystr}'")
globals = {
  'mystr': mystr,
  're_split': re_split
}
print("Splitting on RE, filtering with a list comprehension: ",
  timeit(
    '[ x for x in re_split.split(mystr) if x ]',
    number=execs, globals=globals)
  )
print("Splitting on RE, filtering with filter(lambda): ",
  timeit(
    'list(filter(lambda x: x, re_split.split(mystr)))',
    number=execs, globals=globals)
  )
print("Splitting on RE, filtering with filter(None): ",
  timeit(
    'list(filter(None, re_split.split(mystr)))',
    number=execs, globals=globals)
  )
print("Splitting on string, stripping as we go: ",
  timeit(
    '[ x.strip() for x in mystr.split(",") if x.strip() ]',
    number=execs, globals=globals)
  )

The results:

Splitting on string: 'a,, b, '
Splitting on RE, filtering with a list comprehension:  0.43394716699999997
Splitting on RE, filtering with filter(lambda):  0.584588917
Splitting on RE, filtering with filter(None):  0.45027712500000017
Splitting on string, stripping as we go:  0.34553225

Well.

From this we can see that my initial filter with a lambda expression carries a significant cost over what I had thought was equivalent, filter(None, iterable). They each get the job done, and filter(None) is almost as fast as the list comprehension. Given how small the differences are, I would say to go with whichever is more readable; this code isn’t time-critical, and as mentioned previously I’m definitely leaning towards the list comprehension. It also turns out the list comprehension without the regular expression, stripping whitespace from each element twice, is significantly faster than any of the regular expression methods.

Okay, but this is with a very small string. In the particular use case driving this exploration, the strings are likely to be few, but longer, so let’s try that:
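The only change to the script is the test string which, reconstructing from the output below, is something along the lines of (give or take the exact whitespace):

mystr = 'myhost.mydept.example.org, google.com,example.org, yourhost.yourdept.example.com'

The results: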

Splitting on string: 'myhost.mydept.example.org, google.com,example.org,
yourhost.yourdept.example.com'
Splitting on RE, filtering with a list comprehension:  1.44834375
Splitting on RE, filtering with filter(lambda):  1.595106542
Splitting on RE, filtering with filter(None):  1.444467542
Splitting on string, stripping as we go:  0.5286096660000004

Archie, after reviewing this entry, mused about what would happen if the string to be split contained 10,000 items. Oh, Archie, you rascal: how did you know I am powerless against the pull to procrastinate with hypotheticals?

If my estimations are correct (and the times grow linearly), the test would take half a day. So I tried 1,000 randomly generated, fake “domains” (see the end of this post for how I built the string), for a total string size of 25,811 characters. The results:

Splitting string: ' computation.usa. ,,...'
Splitting on RE, filtering with a list comprehension:  426.811523458
Splitting on RE, filtering with filter(lambda):  454.63292695800004
Splitting on RE, filtering with filter(None):  422.367086417
Splitting on string, stripping as we go:  132.27743525000005

This seems pretty much in line with the results on the smaller strings, except that a basic analysis of the numbers suggests the time for the simple split grew less with the additional terms and string length than it did for the RE methods:

                     manual   generated   factor
characters               80       25811    322.6
terms                     4        1298    324.5
RE/comp                1.44      426.81    296.3
RE/filter (lambda)     1.61      454.63    282.2
RE/filter (None)       1.45      422.37    290.7
Split/strip            0.53      132.28    251.6

So there we have it: a list comprehension with individual stripping is the way to go.

Lessons learned

The cleanest method, in my opinion, was to split on a regular expression that swallows the superfluous whitespace. It turns out the cost of this is relatively high.

The fastest, by a significant margin, is the least sophisticated: splitting on a comma, stripping whitespace (twice!) on each token.

Yeah, I’m surprised, but also sort of pleased: the most readable method is the fastest. And sophistication at this cost just isn’t worth it. So this is another case of simplest being best.

References

How I built the 1,000 terms string

For the interested reader. Or me, in 18 months, when I need something similar. Stranger things have happened.

A handy file to have around is a basic list of English words, one per line in a straight text file. I use this for multiple purposes, mostly for testing–this sort of thing. Or I suppose I could create lorem ipsum text, but there are “controversial” words in there.

I absorbed this file into an array in a bash script, then constructed “domains” out of 2-4 words separated by periods. Then I randomly stuck spaces (or didn’t) around each term and its separating comma. It was a complete hack job:

#!/bin/bash

# Word list, one entry per element (elided here).
words=(
  ...
)

numwords=${#words[*]}
for ((i=0; i<1000; i++))
do
  # Build a fake "domain" of 2-4 randomly chosen words, each followed by a period.
  fqdn=""
  sz=$((RANDOM % 3 + 2))
  for ((j=0; j<sz; j++))
  do
    r=$((RANDOM % numwords))
    fqdn="${fqdn}${words[$r]}."
  done
  # Randomly pad the term and its separating comma with whitespace (or don't).
  spaces=$((RANDOM % 6))
  case $spaces in
    0) echo -n " $fqdn," ;;
    1) echo -n "$fqdn, " ;;
    2) echo -n " $fqdn ," ;;
    3) echo -n ", ,$fqdn," ;;
    *) echo -n "$fqdn," ;;
  esac
done

The script is complete garbage, but it served its purpose; moreover, the way the IANA is going these days, those are probably all valid domains.
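And since this is a Python post, here’s a rough Python sketch of the same idea; it’s not what I actually ran, and words.txt is a stand-in for wherever your word list lives:

#!/usr/bin/env python3
# Rough Python equivalent of the bash generator above (a sketch, not what was
# actually run). 'words.txt' stands in for the one-word-per-line list.
import random

with open('words.txt') as f:
    words = [line.strip() for line in f if line.strip()]

out = []
for _ in range(1000):
    # Fake "domain": 2-4 random words, each followed by a period.
    fqdn = ''.join(w + '.' for w in random.choices(words, k=random.randint(2, 4)))
    # Randomly pad the term and its separating comma with whitespace (or don't).
    out.append(random.choice([f' {fqdn},', f'{fqdn}, ', f' {fqdn} ,', f', ,{fqdn},', f'{fqdn},']))

print(''.join(out), end='')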