Struct data types (version 1.0)#

Editor’s draft 2 March 2022

Specification URI:

https://purl.org/zarr/spec/extensions/data-types/struct/1.0

Issue tracking:

GitHub issues

Suggest an edit for this spec:

GitHub editor

Copyright 2022 Zarr core development team (@@TODO list institutions?). This work is licensed under a Creative Commons Attribution 3.0 Unported License.


Abstract#

This specification is a Zarr extension defining data types for structured arrays. It is an early draft and currently just describes existing support for NumPy-style structured arrays that already have support in zarr-python, but are not part of the core Zarr v3 spec.

Status of this document#

Warning

This document is a Work in Progress. It may be updated, replaced or obsoleted by other documents at any time. It is inappapropriate to cite this document as other than work in progress.

Comments, questions or contributions to this document are very welcome. Comments and questions should be raised via GitHub issues. When raising an issue, please add the label “struct-dtypes-v1.0”.

This document was produced by the Zarr core development team.

Document conventions#

Conformance requirements are expressed with a combination of descriptive assertions and [RFC2119] terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in [RFC2119]. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. Examples in this specification are introduced with the words “for example”.

Extension data types#

NumPy allows representation of Structured Arrays where each element of the array is actually some combination of fields, each of which may have its own unique data type. NumPy’s Record Arrays (numpy.recarray) also use this data type. The actual data is stored as an opaque sequence of bytes (i.e. a structure) as represented by (numpy.void) and thus the string representation of this dtype in NumPy is '|Vn' where n is some integer number of bytes. In order to be able to properly interpret data of this type if is necessary to store information on the fields

A concrete example of such an array from the NumPy docs is:

dogs = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
                dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])

where here dogs.dtype.kind is ‘V’ and dogs.dtype.str is '|V48' indicating the 48 bytes are needed to store each element (4 bytes each for age and weight and 4 * 10 = 40 bytes for a 10-character UTF-32 name). If we were to read such a sequence of bytes from a Zarr array, we need the dtype description to know how to properly interpret this sequence of 48 bytes. The NumPy dtype object has a descr attribute that describes this. In this case dogs.dtype.descr is [('name', '<U10'), ('age', '<i4'), ('weight', '<f4')].

Data Types added by this extension#

Data types#

Identifier

Numerical type

Size (no. bytes)

Byte order

list of (<name>, <type>) tuples

structure with named fields, each with possibly unique data type

sum over the size of the dtypes in the identifier

None

Here <field name> is the name of the struct field and <type> is any of the scalar dtypes supported by the core Zarr v3 spec or the available extensions. In the case of NumPy’s structured arrays this identifier is simply array’s .dtype.descr attribute.

References#

NumPy

NumPy Data type objects. NumPy version 1.22.0 documentation. URL: https://numpy.org/doc/1.22/reference/arrays.dtypes.html

Change log#

@@TODO