Accidentally Quadratic (with Lists).#
Certain computations can be conveniently be expressed as for each item do
something. Sometimes it’s a trap. If it’s obvious that the complexity of the
total calculation is linear (or N log(N)), then this conceptually easy
approach can turn accidentally quadratic if the “something” doesn’t run in
constant (or logarithmic) time. A typical example is concatenating strings as
follows:
ret = ""
for s in strings:
ret = ret + s
return ret;Because ret slowly grows on each iteration, each concatenation is linear
in the sum of ret and s. Therefore, if len(strings) is big and the
longest string in strings is short, the operation is quadratic. Obviously, it
can be done in linear time, if one preallocates sum(len(s) for s in strings)
and then copies each s to its place in the output buffer.
It’s slightly vicious because:
ret = []
for s in strings:
for c in s:
ret.append(c)is not (typically) accidentally quadratic. In Python it’s linear, because
we can append to a list in O(1). In C++ if [] is an std::vector, it’s
also linear because of how a vector grows by doubling it’s capacity.
The HighFive Variant.#
In HDF5 a dataset is a multi-dimensional array, e.g. size [n0, ..., nN].
Using regular hyperslabs, one can select regular sub-regions, i.e. the
product of intervals, or more simply put “rectangles”. One can then select
arbitrary parts of the dataset by computing the union of regular sub-regions.
It’s not important to consider that they could overlap, we ignore that aspect.
We know that HDF5 stores the selection as a sorted list of regular hyperslabs.
Which is great because if it’s sorted we can use bisection to find where we
need to insert the new sub-region in O(log(N)) and because it’s a list we can
insert in O(1).
The fallacy: lists don’t have random access. Hence, unless they support
something like skip-lists (which HDF5 doesn’t), bisection doesn’t work and
instead it’s O(N) to find the location to insert the item.
Overall, iteratively computing the union of regular hyperslabs takes O(N**2).
The Fix.#
The pragmatic way of avoiding the issue is by learning that HDF5 supports
computing the union of two arbitrarily complex hyperslabs via
H5Scombine_select. This opens up the path for divide and conquer.
We’re belabouring the point too much already, but here’s the commit that implements the change in HighFive.