This post is going to be a small attempt from my side to understand indexing ragged arrays and making that work with Python. Indexing a data structure is accessing its elements by making memory efficient. A ragged array or a jagged array is an array of arrays which the member arrays can be of different lengths. This is quite straightforward to implement when we think through Python.
Let’s try to write a simple custom RaggedArray
class in Python which supports indexing.
Here, we can access arr[103847]
to get the sequential 103847
index, or arr[287, 326]
to get element 326
of subarray 287
. We can do arr[i, j]
in O(1)
and arr[k]
in O(n*n)
time complexity. chain.from_iterable
creates a flattened array from the non-rectangular 2D array.
Let’s try to make this more efficient. We used Python lists in this example. Numpy is widely known for its multidimensional array object for holding homogeneously-typed numerical data and manipulating large numbers of values without writing inefficient python loops. Instead of creating copies, numpy uses views to access elements in O(1)
time. However, a numpy array having arrays of different shapes is considered as a normal Python list and it’s not possible to form a view out of it since the data are not contiguous in memory and operations are not vectorized. Awkward array module supports views for ragged arrays. Now, let’s modify our RaggedArray
class to use awkward arrays instead of Python lists.
This awkward
module lets us to create a ragged array from a list using awkward.fromiter()
function. Here, we also extended the indexing to 4 dimensions. Since we are using flatten()
operation, it can access the elements in O(1)
time.
This code is still inefficient since we’re using integers for indexing. We can use boolean arrays for more efficiency. Let’s try to write a simple Bitset
class in Python which can support AND
, OR
operations, takes a set of integers and prints the boolean values on call.
It internally uses a boolean list created from a set of integers. We can use this instead of integers as index. A more popular compressed index is Enhanced Word-Aligned Hybrid (EWAH) index. Its C++ implementation can be found here. It is used by a lot of popular softwares including git, yt, etc. yt uses it as a Cython wrapper. It is available as a Cython package here.