Struct data types (version 1.0)
Editor’s draft 2 March 2022
- Specification URI:
- Issue tracking:
This specification is a Zarr extension defining data types for structured arrays. It is an early draft and currently just describes existing support for NumPy-style structured arrays that already have support in zarr-python, but are not part of the core Zarr v3 spec.
Status of this document
This document is a Work in Progress. It may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than a work in progress.
Comments, questions or contributions to this document are very welcome. Comments and questions should be raised via GitHub issues. When raising an issue, please add the label “struct-dtypes-v1.0”.
This document was produced by the Zarr core development team.
Conformance requirements are expressed with a combination of descriptive assertions and [RFC2119] terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in [RFC2119]. However, for readability, these words do not appear in all uppercase letters in this specification.
All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. Examples in this specification are introduced with the words “for example”.
Extension data types
NumPy allows representation of Structured Arrays where each element of the
array is a combination of fields, each of which may have its own
data type. NumPy’s Record Arrays (numpy.recarray) also use this data type.
The actual data is stored as an opaque sequence of bytes
(i.e. a structure) as represented by numpy.void, and thus the string
representation of this dtype in NumPy is |Vn, where n is some integer
number of bytes. In order to be able to properly interpret data of this type
it is necessary to store information on the fields.
A concrete example of such an array from the NumPy docs is:
dogs = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)], dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])
dogs.dtype.kind is ‘V’ and dogs.dtype.str is ‘|V48’,
indicating that 48 bytes are needed to store each element (4 bytes each for age and
weight and 4 * 10 = 40 bytes for a 10-character UTF-32
name). If we were to read such a sequence of bytes from a Zarr array, we would
need the dtype description to know how to properly interpret this sequence of
48 bytes. The NumPy dtype object has a
descr attribute that describes this.
In this case dogs.dtype.descr is
[('name', '<U10'), ('age', '<i4'), ('weight', '<f4')].
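The attributes discussed above can be inspected directly. The following is a minimal sketch, assuming NumPy is installed; it spells out little-endian byte order explicitly (an assumption made here so that the results do not depend on the host platform) and then reinterprets the raw bytes using only the field description.

```python
import numpy as np

# Same array as in the NumPy docs example, but with explicit
# little-endian byte order (an assumption made for portability).
dogs = np.array(
    [("Rex", 9, 81.0), ("Fido", 3, 27.0)],
    dtype=[("name", "<U10"), ("age", "<i4"), ("weight", "<f4")],
)

kind = dogs.dtype.kind          # 'V': an opaque structure (numpy.void)
type_str = dogs.dtype.str       # '|V48': 48 bytes per element
itemsize = dogs.dtype.itemsize  # 40 (name) + 4 (age) + 4 (weight)
descr = dogs.dtype.descr        # list of (name, type) tuples

# The raw bytes alone carry no field information ...
raw = dogs.tobytes()

# ... so the field description is needed to interpret them again.
restored = np.frombuffer(raw, dtype=np.dtype(descr))
print(restored["name"], restored["age"], restored["weight"])
```

Without descr, the same buffer could only be viewed as two opaque 48-byte records.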
Data Types added by this extension
| Identifier | Numerical type | Size (no. bytes) |
| list of (<name>, <type>) tuples | structure with named fields, each with possibly unique data type | sum over the size of the dtypes in the identifier |
Here <name> is the name of the struct field and <type> is any of the
scalar dtypes supported by the core Zarr v3 spec or the available extensions.
In the case of NumPy’s structured arrays this identifier is simply dtype.descr.
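As an illustration of how such an identifier might be handled, the sketch below converts a NumPy dtype to a list of name/type pairs, passes it through JSON, and rebuilds the dtype. The helper names descr_to_identifier and identifier_to_dtype are hypothetical, and the choice of JSON lists as the serialized form is an assumption, not something this draft specifies. The sketch also assumes a packed struct whose fields are all scalar types.

```python
import json

import numpy as np


def descr_to_identifier(dtype):
    """Hypothetical helper: turn dtype.descr into JSON-friendly pairs.

    Assumes a packed struct with scalar fields only, so every entry
    of dtype.descr is a plain (name, type) tuple.
    """
    return [[name, type_str] for name, type_str in dtype.descr]


def identifier_to_dtype(identifier):
    """Hypothetical helper: rebuild a NumPy dtype from the pairs.

    JSON has no tuples, so the pairs come back as lists and must be
    converted to tuples before being handed to np.dtype.
    """
    return np.dtype([(name, type_str) for name, type_str in identifier])


dtype = np.dtype([("name", "<U10"), ("age", "<i4"), ("weight", "<f4")])

identifier = descr_to_identifier(dtype)
encoded = json.dumps(identifier)
decoded = identifier_to_dtype(json.loads(encoded))

# Size of the struct: sum over the sizes of the field dtypes.
size = sum(np.dtype(type_str).itemsize for _, type_str in identifier)
print(encoded, size)
```

The computed size matches the itemsize of the rebuilt dtype only because the struct is packed; a dtype with padding or explicit offsets would need additional information beyond the name/type pairs.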
NumPy Data type objects. NumPy version 1.22.0 documentation. URL: https://numpy.org/doc/1.22/reference/arrays.dtypes.html