New Issue: Design for Apache Arrow Module In Chapel

18074, "ShreyasKhandekar", "Design for Apache Arrow Module In Chapel", "2021-07-16T18:48:44Z"

This issue is meant to generate discussion about the design of the high-level interface for Apache Arrow in Chapel, which will be built on top of the low-level Chapel interface.

We aim to make things less verbose and more abstract while still retaining full functionality and control.
A pivotal aspect of the module's design is being able to create Arrow data structures in a succinct manner.

Currently, creating an Arrow record batch (the smallest useful data structure; essentially a table) is very tedious: it involves building up arrays element by element and putting them together using a schema that we build in a similar fashion.

This whole ordeal takes ~200 lines of Chapel code (transliterated from the C interface of Apache Arrow, which takes about the same amount of code).
A primary goal of this design is to get that number down to 5 lines or fewer without compromising usability or clarity.

The PyArrow Python interface is a major source of inspiration for the design of this Chapel module.

Here are some of the ideas I have:

Creating Arrays

Here are some ways we should be able to create Arrow arrays.

From Chapel Arrays

We take a Chapel array and use it to create an Arrow array by mapping it down to the low-level builder calls (garrow_<inferred type>_array_builder_append_values) to append the values.
We might also accept an optional validity argument that describes the NULL values, at least until Chapel arrays can hold null values themselves.

That argument need not be a complete array with a boolean for each item: it could be an array of the indices that are NULL, or a bit vector.
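For completeness, the per-item boolean form (which none of the options below uses) could look like:

chplarrow.array(["foo", "bar", "baz", "invalid"], validity = [true, true, true, false]);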
Option 1:

chplarrow.array([1,2,3,4]);

Option 1.1:

chplarrow.array(["foo", "bar", "baz", "invalid"], validity = [3]);

Option 1.2:

chplarrow.array(["foo", "bar", "baz", "invalid"], validity = 1110);

We don't technically have to pick just one, but ideally there should be a single way to do each thing.
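For concreteness, here is a rough sketch of how an inferring array() might map down to the low-level interface, for the int case. The garrow_* names follow the Arrow GLib C API, but the extern declarations and the wrapper itself are only assumptions about what our low-level bindings could look like:

use CTypes;

// Assumed low-level bindings; the names mirror the Arrow GLib C API,
// but these Chapel declarations are only a guess at our bindings.
extern proc garrow_int64_array_builder_new(): c_ptr(void);
extern proc garrow_int64_array_builder_append_values(builder: c_ptr(void),
    values: c_ptr(int(64)), valuesLength: int(64),
    isValids: c_ptr(c_int), isValidsLength: int(64),
    ref error: c_ptr(void)): c_int;
extern proc garrow_array_builder_finish(builder: c_ptr(void),
    ref error: c_ptr(void)): c_ptr(void);

// Hypothetical high-level wrapper: the element type int(64) is taken
// from the Chapel array, and all values are appended in one call
// (validity handling omitted for brevity).
proc array(values: [] int(64)): c_ptr(void) {
  var builder = garrow_int64_array_builder_new();
  var error: c_ptr(void);
  var buf = values; // local copy so we can take a C pointer to the data
  garrow_int64_array_builder_append_values(builder, c_ptrTo(buf),
                                           buf.size, nil, 0, error);
  return garrow_array_builder_finish(builder, error);
}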

If we would rather not rely on type inference, we could instead provide more functions so that the user has to specify the type of the array:

Option 2:

chplarrow.int32Array([1,2,3,4]);

Option 2.1:

chplarrow.stringArray(["foo", "bar", "baz", "invalid"], validity = [3]);

Option 2.2:

chplarrow.boolArray([true, false, true, false], validity = 1011);

Optionally, we could also overload the array() function to accept a (possibly mandatory) type argument as well.
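A minimal sketch of what such an overload could look like, dispatching at compile time to the typed functions from Option 2 (again, purely hypothetical):

// Hypothetical overload: the type argument picks the typed builder
// function at compile time, so no inference is involved.
proc array(type eltType, values: [] eltType) {
  if eltType == int(32) then
    return int32Array(values);
  else if eltType == string then
    return stringArray(values);
  else if eltType == bool then
    return boolArray(values);
  else
    compilerError("array(): unsupported Arrow element type");
}

With this, ca.array(bool, [true, false, true, false]) would forward to ca.boolArray.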

Note

From here on I will abbreviate chplarrow as ca.

Creating Schemas

A schema is essentially metadata about a record batch: it tells us the names and types of the record batch's columns.
Giving direct access to Arrow types is important so that we can actually build a schema.

We could use an array of tuples, like PyArrow:

ca.schema([("field0", ca.int32()),
            ("field1", ca.string()),
            ("field2", ca.bool())]);

Creating Record Batches

We can use the arrays we created above to create record batches.
We can put these Arrow arrays inside a Chapel array so that they can be passed together as one argument.

We can pass another argument specifying the schema; see Creating Schemas above.

Say we put the above 3 arrays in an outer array called data and put the above schema in a variable called schema.

We can then put them together in a record batch as follows:

ca.recordBatch(data, schema);

Here too, if we are more inclined to infer types, we can just pass in the names of the columns instead of a schema, as sketched below.
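For example, mirroring PyArrow's RecordBatch.from_arrays(arrays, names=...), a names-based variant could look like this, which would also meet the ~5-line goal from the top of the issue (names hypothetical; as with schemas, this assumes the differently typed Arrow arrays share a common class type so that data is a homogeneous Chapel array):

var data = [ca.array([1, 2, 3, 4]),
            ca.array(["foo", "bar", "baz", "invalid"], validity = [3]),
            ca.array([true, false, true, false])];
var batch = ca.recordBatch(data, names = ["field0", "field1", "field2"]);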

Please feel free to suggest other ways of doing this, or to modify any of the above options to make more sense.

Note

There are other topics that deal with some of the later steps in developing this module, and though comments and feedback on those are welcome, I would appreciate it if we could first hash out a preference for

  • Creating Arrow Arrays of different types
  • Creating Schemas
  • Creating Record Batches

before moving on to the other topics like IO, IPC, and dealing with Parquet files.