String data types (version 1.0)

Editor’s draft 2 March 2022

Specification URI:

http://purl.org/zarr/spec/extensions/string-dtypes/1.0

Issue tracking:

GitHub issues

Suggest an edit for this spec:

GitHub editor

Copyright 2022 Zarr core development team (@@TODO list institutions?). This work is licensed under a Creative Commons Attribution 3.0 Unported License.


Abstract

This specification is a Zarr extension defining data types for strings. It is an early draft and currently just describes existing support for NumPy string types that have already worked with zarr-python, but are not part of the core v3 spec.

Status of this document

This document is a Work in Progress. It may be updated, replaced or obsoleted by other documents at any time. It is inappapropriate to cite this document as other than work in progress.

Comments, questions or contributions to this document are very welcome. Comments and questions should be raised via GitHub issues. When raising an issue, please add the label “string-dtypes-v1.0”.

This document was produced by the Zarr core development team.

Document conventions

Conformance requirements are expressed with a combination of descriptive assertions and [RFC2119] terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in [RFC2119]. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. Examples in this specification are introduced with the words “for example”.

Extension data types

Two extension data types are defined to represent zero-terminated bytestrings as well as fixed-length 32-bit unicode arrays.

Fixed length byte strings (zero-terminated)

These are fixed width strings corresponding to NumPy dtypes with kind ‘S’. For backward compatibility with Python 2’s str these are zero-terminated bytes and correspond to numpy.bytes_ <https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.bytes_> (or its alias numpy.string_ <https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.bytes_>.)

For example a = np.array(["a", "bcd", "efgh"], dtype="S4") creates an array where a.dtype.kind is ‘S’ and a.data.tobytes() is b'a\x00\x00\x00bcd\x00efgh'. Note that any elements of length less than 4 characters were padded with zeros so that each array element uses 4 bytes (as indicated by dtype="S4").

Fixed width unicode strings

These are fixed width strings corresponding to NumPy dtypes with kind ‘U’. These are zero-terminated bytes and correspond to numpy.str_ <https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.str_> (or its alias numpy.unicode_ <https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.unicode_>.)

For example a = np.array(["a", "bcd", "efgh"], dtype="U4") creates an array where a.dtype.kind is ‘U’ and each element is a sequence of 4 characters where each character occupies 4 bytes (UTF-32).

Data Types added by this extension

Data types

Identifier

Numerical type

Size (no. bytes)

Byte order

Sn

n character fixed-width byte string

n

None

<Un

n character fixed-width unicode string (UTF-32)

4n

little-endian

>Un

n character fixed-width unicode string (UTF-32)

4n

big-endian

References

UTF-32

UTF-32 on Wikipedia. documentation. URL: https://en.wikipedia.org/wiki/UTF-32

NumPy

NumPy Data type objects. NumPy version 1.22.0 documentation. URL: https://numpy.org/doc/1.22/reference/arrays.dtypes.html

Change log

@@TODO