Sharding storage transformer (version 1.0)#

Editor’s draft, 18 February 2022

Specification URI:

https://zarr-specs.readthedocs.io/en/latest/v3/array-storage-transformers/sharding/v1.0.html

Corresponding ZEP:

ZEP 2 — Sharding storage transformer

Issue tracking:

GitHub issues <https://github.com/zarr-developers/zarr-specs/labels/storage-transformers-sharding-v1.0>

Suggest an edit for this spec:

GitHub editor

Copyright 2022-Present Zarr core development team. This work is licensed under a Creative Commons Attribution 3.0 Unported License.


Abstract#

This specification defines an implementation of the Zarr storage transformer specification for sharding.

Sharding co-locates multiple chunks within a storage object, bundling them in shards.

Status of this document#

Warning

This document is a draft for review and subject to changes. It will become final when the Zarr Enhancement Proposal (ZEP) 2 is approved via the ZEP process.

Motivation#

In many cases, it becomes inefficient or impractical to store a large number of chunks as individual files or objects due to the design constraints of the underlying storage. For example, the file block size and the maximum inode number restrict the use of numerous small files on typical file systems; likewise, cloud object stores such as S3 and GCS, as well as various distributed filesystems, do not handle large numbers of small files or objects efficiently.

Increasing the chunk size works only up to a certain point, as chunk sizes must remain small to satisfy read and write efficiency requirements, for example when streaming data to browser-based visualization software.

Therefore, chunks may need to be smaller than the minimum size that can be stored efficiently under one storage key. In those cases, it is more efficient to store data at a coarser granularity than the granularity at which chunks are read.

Sharding solves this by allowing multiple chunks to be stored in one storage key, which is called a shard:

[Figure: sharding.png, illustrating multiple chunks bundled together in one shard]

Document conventions#

Conformance requirements are expressed with a combination of descriptive assertions and [RFC2119] terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in [RFC2119]. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. Examples in this specification are introduced with the words “for example”.

Configuration#

Sharding can be configured per array in the Array metadata as follows:

{
  "storage_transformers": [
    {
      "name": "sharding",
      "configuration": {
        "chunks_per_shard": [2, 2]
      }
    }
  ]
}

chunks_per_shard

An array of integers providing the number of chunks that are combined in a shard for each dimension of the Zarr array, where each shard may only start at a chunk-grid position that is divisible by chunks_per_shard per dimension, i.e. shards are aligned with the zero-origin of the chunk-grid. The length of the array must match the length of the array metadata shape entry. For example, a value [32, 2] indicates that 64 chunks are combined in one shard, 32 along the first dimension and, for each of those, 2 along the second dimension. Some valid starting positions for a shard in the chunk-grid are therefore [0, 0], [32, 2], [32, 4], [64, 2] or [96, 18].
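
As a non-normative illustration, the following Python sketch computes the shard-grid position of a chunk-grid position and checks whether a position is a valid shard start; the function names are chosen for this example only and are not part of the specification.

# Non-normative sketch for chunks_per_shard = [32, 2];
# function names are illustrative only.

def shard_position(chunk_position, chunks_per_shard):
    """Shard-grid position of the shard containing the given chunk."""
    return [c // s for c, s in zip(chunk_position, chunks_per_shard)]

def is_valid_shard_start(chunk_position, chunks_per_shard):
    """Shards may only start at positions divisible by chunks_per_shard."""
    return all(c % s == 0 for c, s in zip(chunk_position, chunks_per_shard))

assert is_valid_shard_start([96, 18], [32, 2])       # listed above as valid
assert not is_valid_shard_start([96, 19], [32, 2])   # not aligned in the second dimension
assert shard_position([96, 19], [32, 2]) == [3, 9]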

Storage transformer implementation#

Key & value transformation#

The storage transformer specification defines the abstract interface to be the same as the Abstract store interface.

The Zarr store interface is defined as a mapping of keys and values, where a key is a sequence of characters and a value is a sequence of bytes. A key-value pair is called an entry in the following.

This sharding transformer only adapts entries where the key starts with data/root, as they indicate data keys for array chunks, see Storage keys. All other entries are simply passed on.

Entries starting with data/root are grouped by their common shard, assuming storage keys from a regular chunk grid, which may use a custom-configured chunk separator: for all entries that are part of the same shard, the key is changed to the shard-key and the values are combined in the Binary shard format as described below. The new shard-key is derived from the chunk-grid position of the original key by dividing it by chunks_per_shard and flooring per dimension. For example, for chunks_per_shard=[32, 2], the chunk-grid position [96, 18] (e.g. key “data/root/foo/baz/c96/18”) is transformed to the shard-grid position [3, 9] and reassigned to the respective new key, honoring the original chunk separator (e.g. “data/root/foo/baz/c3/9”). Chunk-grid positions [96, 19], [97, 18], …, up to [127, 19] map to the same shard-grid position [3, 9].
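
As a non-normative illustration, the following Python sketch performs this key transformation, assuming a chunk separator of “/”; the helper name is chosen for this example only.

# Non-normative sketch of the chunk-key to shard-key transformation,
# assuming a "/" chunk separator; the helper name is illustrative only.

def chunk_key_to_shard_key(key, chunks_per_shard, separator="/"):
    """Map a chunk key below data/root to the key of its containing shard."""
    prefix, _, rest = key.rpartition("/c")  # split off the trailing chunk-grid part
    chunk_position = [int(p) for p in rest.split(separator)]
    shard_position = [c // s for c, s in zip(chunk_position, chunks_per_shard)]
    return prefix + "/c" + separator.join(str(p) for p in shard_position)

# Example from the text above, with chunks_per_shard = [32, 2]:
assert (chunk_key_to_shard_key("data/root/foo/baz/c96/18", [32, 2])
        == "data/root/foo/baz/c3/9")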

Binary shard format#

The only binary format at the moment is an indexed binary format as described below. Other binary formats might be added in future versions.

In the indexed binary format, chunks are written successively in a shard, where unused space between them is allowed, followed by an index referencing them. The index is placed at the end of the file and has a size of 16 bytes multiplied by the number of chunks in a shard, for example 16 bytes * 64 = 1024 bytes for chunks_per_shard=[32, 2]. The index holds an (offset, nbytes) pair of little-endian uint64 values per chunk; the chunk order in the index is row-major (C) order. For example, for chunks_per_shard=[2, 2] an index would look like:

| chunk (0, 0)    | chunk (0, 1)    | chunk (1, 0)    | chunk (1, 1)    |
| offset | nbytes | offset | nbytes | offset | nbytes | offset | nbytes |
| uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 |

Empty chunks are denoted by setting both offset and nbytes to 2^64 - 1. The index always has the full shape of all possible chunks per shard, even if they are outside of the array size.
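
As a non-normative illustration, the following Python sketch encodes and decodes such an index using numpy; the helper names and the in-memory representation are chosen for this example only.

# Non-normative sketch of the shard index encoding; names are illustrative.
import numpy as np

MAX_UINT64 = 2**64 - 1  # sentinel marking an empty chunk

def encode_index(offsets_nbytes, chunks_per_shard):
    """offsets_nbytes maps a chunk position (tuple) to its (offset, nbytes)."""
    index = np.full(tuple(chunks_per_shard) + (2,), MAX_UINT64, dtype="<u8")
    for position, (offset, nbytes) in offsets_nbytes.items():
        index[position] = (offset, nbytes)
    return index.tobytes()  # row-major (C) order, little-endian uint64

def decode_index(shard_bytes, chunks_per_shard):
    """Read the index from the end of a shard's bytes."""
    index_size = 16 * int(np.prod(chunks_per_shard))
    index = np.frombuffer(shard_bytes[-index_size:], dtype="<u8")
    return index.reshape(tuple(chunks_per_shard) + (2,))

# Example for chunks_per_shard = [2, 2], where chunk (0, 1) is empty:
index_bytes = encode_index(
    {(0, 0): (0, 100), (1, 0): (100, 80), (1, 1): (180, 120)}, [2, 2])
assert len(index_bytes) == 16 * 4
assert tuple(decode_index(index_bytes, [2, 2])[0, 1]) == (MAX_UINT64, MAX_UINT64)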

The actual order of the chunk contents within a shard is not fixed and may be chosen by the implementation, as all possible write orders are valid according to this specification and can therefore be read by any other implementation. When writing a subset of the chunks into an existing shard, no specific order of the existing chunks may be assumed. Some possible writing strategies are:

  • Fixed order: Specify a fixed order (e.g. row-major, column-major, or Morton order). When replacing existing chunks, those that are larger than or equal in size to the new chunk data may be overwritten in-place, leaving unused space up to an upper limit that might be specified. Please note that for regular-sized uncompressed data all chunks have the same size and can therefore be replaced in-place.

  • Append-only: Any chunk to write is appended to the existing shard, followed by an updated index. If previously written chunks are updated, their storage space becomes unused, as does the previous index. This might be useful for storage that only allows append-only updates; a sketch of this strategy follows the list below.

  • Other formats: Other formats that accept additional bytes at the end of the file (such as HDF) could be used for storing shards, by writing the chunks in the order the format prescribes and appending a binary index derived from the byte offsets and lengths at the end of the file.
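
As a non-normative illustration of the append-only strategy, the following Python sketch appends the updated chunks and a fresh index to an existing shard; it builds on the illustrative decode_index helper (and numpy import) sketched above, and all names are chosen for this example only.

# Non-normative sketch of the append-only writing strategy; it reuses the
# illustrative decode_index helper from the sketch above.

def append_chunks(shard_bytes, new_chunks, chunks_per_shard):
    """Append updated chunks and a fresh index to an existing shard.

    new_chunks maps a chunk position (tuple) to the new chunk bytes. The
    superseded chunk data and the old index stay in place as unused space.
    """
    index = decode_index(shard_bytes, chunks_per_shard).copy()
    body = bytearray(shard_bytes)  # keep everything, including the old index
    for position, chunk_bytes in new_chunks.items():
        index[position] = (len(body), len(chunk_bytes))
        body += chunk_bytes
    return bytes(body) + index.tobytes()  # the new index ends the shard

Since readers only consult the index at the end of the shard, chunk data and indices superseded by an append are simply never referenced again.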

Any configuration parameters for the write strategy must not be part of the metadata document; they need to be configured at runtime, as this is implementation-specific.

API implementation#

The section below defines an implementation of the Abstract store interface in terms of the operations of this storage transformer as a StoreWithPartialAccess. The term underlying store references either the next storage transformer in the stack or the actual store if this transformer is the last one in the stack. Any operations with keys not starting with data/root are simply relayed to the underlying store and not described explicitly.

  • get_partial_values(key_ranges) -> values: For each referenced key, request the indices from the underlying store using get_partial_values. For each key, range pair in key_ranges, check whether the chunk exists: it is missing if its index offset and nbytes are both 2^64 - 1. For existing keys, request the actual chunks by their ranges as read from the index using get_partial_values. This operation should be implemented using two get_partial_values operations on the underlying store, one for retrieving the indices and one for retrieving the existing chunks; see the sketch after this list.

  • set_partial_values(key_start_values): For each referenced key, check if all available chunks in a shard are referenced. In this case, a shard can be constructed according to the Binary shard format directly. For all other keys, request the indices from the underlying store using get_partial_values. All chunks that are not updated completely and exist according to the index (index offset and nbytes are not 2^64 - 1) need to be read via get_partial_values from the underlying store. For simplification purposes, a shard may also be read completely, combining the previous two get operations into one. Based on the existing chunks and the value ranges that need to be updated, new shards are constructed according to the Binary shard format. All shards that need to be updated must now be set via set or set_partial_values(key_start_values), depending on the writing strategy chosen by the implementation. Specialized store implementations that allow appending to a storage object may only need to read the index in order to update it.

  • erase_values(keys): For each referenced key, check if all available chunks in a shard are referenced. In this case, the full shard is removed using erase_values on the underlying store. For all other keys, request the indices from the underlying store using get_partial_values. Update the index, setting offset and nbytes to 2^64 - 1 for the erased chunks to mark them as missing. The updated index may be written in-place using set_partial_values(key_start_values), or a larger rewrite of the shard may be done, including the index update but also removing the value ranges corresponding to the erased chunks.

  • erase_prefix(prefix): If the prefix contains a part of the chunk-grid key, this part is translated to the referenced shards and contained chunks. For affected shards where all contained chunks are erased, the prefix is rewritten to the corresponding shard key and the operation is relayed to the underlying store. For all shards where only some chunks are erased, the affected chunks are removed by invoking the operation erase_values on this storage transformer with the respective chunk keys.

  • list(): See list_prefix with the prefix /.

  • list_prefix(prefix): If the prefix contains a part of the chunk-grid key, this part is translated to the referenced shard and contained chunks. Then, list_prefix is called on the underlying store with the translated prefix. For all listed shards, request the indices from the underlying store using get_partial_values. Existing chunks, where the index offset or nbytes is not 2^64 - 1, are then listed by their original keys.

  • list_dir(prefix): If the prefix contains a part of the chunk-grid key, this part is translated to the referenced shard and contained chunks. Then, list_dir is called on the underlying store with the translated prefix. For all retrieved prefixes (not full keys) with partial shard keys, the corresponding original prefixes covering all possible chunks in the shard are listed. For retrieved full keys, the indices are requested from the underlying store using get_partial_values. Existing chunks, where the index offset or nbytes is not 2^64 - 1, are then listed by their original keys.

    Note

    Not all listed prefixes must necessarily contain keys, as shard prefixes with partially available chunks return prefixes for all possible chunks without verifying their existence, for performance reasons. Listing those prefixes is still safe, as some chunks in their corresponding shard exist, but not necessarily under the requested prefix, possibly leading to empty responses. Please note that this only applies to returned prefixes, not to full keys referencing storage objects. Returned full keys always reflect the available chunks and are safe to request.
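
As a non-normative illustration of the two-phase read described for get_partial_values, the following Python sketch retrieves whole chunks through the sharding transformer. The underlying_store object, its (offset, length) byte-range convention (with negative offsets counting from the end of a value), and the helpers chunk_key_to_shard_key and chunk_position_of are assumptions made for this example, not normative API.

# Non-normative sketch of the two-phase read for get_partial_values.
# For brevity, whole chunks are retrieved instead of arbitrary sub-ranges.
import numpy as np

MAX_UINT64 = 2**64 - 1  # sentinel marking an empty chunk

def get_chunks(underlying_store, chunk_keys, chunks_per_shard):
    index_size = 16 * int(np.prod(chunks_per_shard))
    shard_keys = [chunk_key_to_shard_key(k, chunks_per_shard) for k in chunk_keys]

    # Phase 1: fetch the trailing index of every referenced shard in one call.
    raw_indices = underlying_store.get_partial_values(
        [(shard_key, (-index_size, index_size)) for shard_key in shard_keys])

    # Resolve each chunk's byte range within its shard via the index.
    chunk_ranges, order = [], []
    for i, (key, raw) in enumerate(zip(chunk_keys, raw_indices)):
        index = np.frombuffer(raw, dtype="<u8").reshape(
            tuple(chunks_per_shard) + (2,))
        in_shard = tuple(c % s for c, s in
                         zip(chunk_position_of(key), chunks_per_shard))
        offset, nbytes = (int(v) for v in index[in_shard])
        if offset != MAX_UINT64:  # offset and nbytes are both the sentinel if empty
            chunk_ranges.append((shard_keys[i], (offset, nbytes)))
            order.append(i)

    # Phase 2: fetch all existing chunks in a single second call.
    values = [None] * len(chunk_keys)  # None marks missing chunks
    for i, value in zip(order, underlying_store.get_partial_values(chunk_ranges)):
        values[i] = value
    return values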

References#

RFC2119

S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119

Change log#

This section is a placeholder for keeping a log of the snapshots of this document that are tagged in GitHub and what changed between them.