New Issue: Design for Apache Arrow Module In Chapel

18074, "ShreyasKhandekar", "Design for Apache Arrow Module In Chapel", "2021-07-16T18:48:44Z"

This issue is meant to generate discussion about the design of the high-level interface for Apache Arrow in Chapel, which will be built on top of the low-level Chapel interface.

We aim to make things less verbose and more abstract while still retaining full functionality and control.
A pivotal aspect of the module's design is being able to create Arrow data structures in a succinct manner.

Currently, creating an Arrow record batch (the smallest useful data structure; essentially a table) is very tedious: it involves building up arrays element by element and putting them together using a schema that we build in a similar fashion.

This whole ordeal takes ~200 lines of Chapel code (transliterated from the C interface of Apache Arrow, which takes about the same amount of code).
A primary goal of this design is to get that number down to 5 lines or fewer without compromising usability or clarity.

The PyArrow Python interface is a major source of inspiration for the design of this Chapel module.

Here are some of the ideas I have:

Creating Arrays

Here are some ways we should be able to create Arrow arrays.

From Chapel Arrays

We take a Chapel array and use it to create an Arrow array by mapping it down to the low-level builder calls (garrow_<inferred type>_array_builder_append_values) to append the values.
We might also accept an optional validity argument that describes the NULL values, at least until Chapel arrays can hold null values themselves.

That argument need not be a complete array with a boolean for each item: it could be an array of the indices that are NULL, or a bit vector.
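For completeness, the per-item boolean form (which none of the options below uses) could look like:

chplarrow.array(["foo", "bar", "baz", "invalid"], validity = [true, true, true, false]);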
Option 1:

chplarrow.array([1,2,3,4]);

Option 1.1:

chplarrow.array(["foo", "bar", "baz", "invalid"], validity = [3]);

Option 1.2:

chplarrow.array(["foo", "bar", "baz", "invalid"], validity = 1110);

We don't technically have to pick just one, but ideally there should be a single way to do each thing.
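For concreteness, here is a rough sketch of how an inferring array() might map down to the low-level interface, for the int case. The garrow_* names follow the Arrow GLib C API, but the extern declarations and the wrapper itself are only assumptions about what our low-level bindings could look like:

use CTypes;

// Assumed low-level bindings; the names mirror the Arrow GLib C API,
// but these Chapel declarations are only a guess at our bindings.
extern proc garrow_int64_array_builder_new(): c_ptr(void);
extern proc garrow_int64_array_builder_append_values(builder: c_ptr(void),
    values: c_ptr(int(64)), valuesLength: int(64),
    isValids: c_ptr(c_int), isValidsLength: int(64),
    ref error: c_ptr(void)): c_int;
extern proc garrow_array_builder_finish(builder: c_ptr(void),
    ref error: c_ptr(void)): c_ptr(void);

// Hypothetical high-level wrapper: the element type int(64) is taken
// from the Chapel array, and all values are appended in one call
// (validity handling omitted for brevity).
proc array(values: [] int(64)): c_ptr(void) {
  var builder = garrow_int64_array_builder_new();
  var error: c_ptr(void);
  var buf = values; // local copy so we can take a C pointer to the data
  garrow_int64_array_builder_append_values(builder, c_ptrTo(buf),
                                           buf.size, nil, 0, error);
  return garrow_array_builder_finish(builder, error);
}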

If we would rather not rely on type inference, we could instead provide more functions so that the user has to specify the type of the array:

Option 2:

chplarrow.int32Array([1,2,3,4]);

Option 2.1:

chplarrow.stringArray(["foo", "bar", "baz", "invalid"], validity = [3]);

Option 2.2:

chplarrow.boolArray([true, false, true, false], validity = 1011);

Optionally, we could also overload the array() function to accept a (possibly mandatory) type argument as well.
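A minimal sketch of what such an overload could look like, dispatching at compile time to the typed functions from Option 2 (again, purely hypothetical):

// Hypothetical overload: the type argument picks the typed builder
// function at compile time, so no inference is involved.
proc array(type eltType, values: [] eltType) {
  if eltType == int(32) then
    return int32Array(values);
  else if eltType == string then
    return stringArray(values);
  else if eltType == bool then
    return boolArray(values);
  else
    compilerError("array(): unsupported Arrow element type");
}

With this, ca.array(bool, [true, false, true, false]) would forward to ca.boolArray.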

Note

From here on I will abbreviate chplarrow as ca.

Creating Schemas

A schema is essentially metadata about a record batch: it tells us the names and types of the record batch's columns.
Giving direct access to Arrow types is important so that we can actually build a schema.

We could use an array of tuples, like PyArrow:

ca.schema([("field0", ca.int32()),
            ("field1", ca.string()),
            ("field2", ca.bool())]);

Creating Record Batches

We can use the arrays we created above to create record batches.
We can put these Arrow arrays inside a Chapel array so that they can be passed together as one argument.

We can pass another argument specifying the schema; see Creating Schemas above.

Say we put the above 3 arrays in an outer array called data and put the above schema in a variable called schema.

We can then put them together in a record batch as follows:

ca.recordBatch(data, schema);

Here too, if we are more inclined to infer types, we can just pass in the names of the columns instead of a schema, as sketched below.
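For example, mirroring PyArrow's RecordBatch.from_arrays(arrays, names=...), a names-based variant could look like this, which would also meet the ~5-line goal from the top of the issue (names hypothetical; as with schemas, this assumes the differently typed Arrow arrays share a common class type so that data is a homogeneous Chapel array):

var data = [ca.array([1, 2, 3, 4]),
            ca.array(["foo", "bar", "baz", "invalid"], validity = [3]),
            ca.array([true, false, true, false])];
var batch = ca.recordBatch(data, names = ["field0", "field1", "field2"]);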

Please feel free to suggest other ways of doing this, or to modify any of the above options to make more sense.

Note

There are other topics that deal with some of the later steps in developing this module, and though comments and feedback on those are welcome, I would appreciate it if we could first hash out a preference for

  • Creating Arrow Arrays of different types
  • Creating Schemas
  • Creating Record Batches

before moving on to the other topics like IO, IPC, and dealing with Parquet files.