Sharding storage transformer (version 1.0)#
Editor’s draft, 18 February 2022
- Specification URI:
https://zarr-specs.readthedocs.io/en/latest/v3/array-storage-transformers/sharding/v1.0.html
- Corresponding ZEP:
- Issue tracking:
GitHub issues: https://github.com/zarr-developers/zarr-specs/labels/storage-transformers-sharding-v1.0
Copyright 2022-Present Zarr core development team. This work is licensed under a Creative Commons Attribution 3.0 Unported License.
Abstract#
This specification defines an implementation of the Zarr storage transformer specification for sharding.
Sharding co-locates multiple chunks within a storage object, bundling them in shards.
Status of this document#
Warning
This document is a draft for review and subject to changes. It will become final when the Zarr Enhancement Proposal (ZEP) 2 is approved via the ZEP process.
Motivation#
In many cases, it becomes inefficient or impractical to store a large number of chunks as single files or objects due to the design constraints of the underlying storage. For example, the file block size and the maximum number of inodes restrict the use of numerous small files on typical file systems; cloud object stores such as S3 and GCS, as well as various distributed file systems, also do not handle large numbers of small files or objects efficiently.
Increasing the chunk size works only up to a certain point, as chunk sizes need to be small for read and write efficiency requirements, for example to stream data in browser-based visualization software.
Therefore, chunks may need to be smaller than the minimum efficient size of one storage object. In those cases, it is efficient to store objects at a coarser granularity than that at which chunks are read.
Sharding solves this by allowing multiple chunks to be stored in one storage key, which is called a shard:

Document conventions#
Conformance requirements are expressed with a combination of descriptive assertions and [RFC2119] terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in [RFC2119]. However, for readability, these words do not appear in all uppercase letters in this specification.
All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. Examples in this specification are introduced with the words “for example”.
Configuration#
Sharding can be configured per array in the Array metadata as follows:
{
    "storage_transformers": [
        {
            "name": "sharding",
            "configuration": {
                "chunks_per_shard": [2, 2]
            }
        }
    ]
}
chunks_per_shard
An array of integers providing the number of chunks that are combined in a shard for each dimension of the Zarr array, where each shard may only start at a chunk-grid position that is divisible by chunks_per_shard per dimension, e.g. starting at the zero-origin. The length of the array must match the length of the array metadata shape entry. For example, a value of [32, 2] indicates that 64 chunks are combined in one shard, 32 along the first dimension, and for each of those 2 along the second dimension. Some valid starting positions for a shard in the chunk-grid are therefore [0, 0], [32, 2], [32, 4], [64, 2] or [96, 18].
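The following non-normative sketch illustrates the arithmetic described above for the example value [32, 2]; the function and variable names are illustrative and not part of this specification:

import math

chunks_per_shard = [32, 2]

# Number of chunks combined in one shard: 32 * 2 = 64.
chunks_in_shard = math.prod(chunks_per_shard)

def is_valid_shard_start(chunk_grid_position):
    # A shard may only start at chunk-grid positions that are divisible by
    # chunks_per_shard in every dimension.
    return all(p % c == 0 for p, c in zip(chunk_grid_position, chunks_per_shard))

assert chunks_in_shard == 64
assert is_valid_shard_start([96, 18])
assert not is_valid_shard_start([96, 19])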
Storage transformer implementation#
Key & value transformation#
The storage transformer specification defines the abstract interface to be the same as the Abstract store interface.
The Zarr store interface is defined as a mapping of keys and values, where a key is a sequence of characters and a value is a sequence of bytes. A key-value pair is called an entry in the following.
This sharding transformer only adapts entries whose key starts with data/root, as these indicate data keys for array chunks (see Storage keys). All other entries are simply passed on unchanged.
Entries starting with data/root are grouped by their common shard, assuming storage keys from a regular chunk grid which may use a custom configured chunk separator: for all entries that are part of the same shard, the key is changed to the shard-key and the values are combined in the Binary shard format as described below. The new shard-key is the chunk key divided by chunks_per_shard and floored per dimension. For example, for chunks_per_shard=[32, 2], the chunk grid position [96, 18] (e.g. key “data/root/foo/baz/c96/18”) is transformed to the shard grid position [3, 9] and reassigned to the respective new key, honoring the original chunk separator (e.g. “data/root/foo/baz/c3/9”). Chunk grid positions [96, 19], [97, 18], …, up to [127, 19] will also have the same shard grid position [3, 9].
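A minimal, non-normative sketch of this key transformation, assuming the “/” chunk separator and keys shaped like the example above; the helper name is illustrative only:

def chunk_key_to_shard_key(chunk_key, chunks_per_shard, separator="/"):
    # Split off the chunk-grid coordinates after the trailing "c" marker
    # (simplified: assumes no other "c" appears in the key).
    prefix, _, coords = chunk_key.rpartition("c")
    chunk_position = [int(part) for part in coords.split(separator)]
    # The shard position is the chunk position divided by chunks_per_shard
    # and floored per dimension.
    shard_position = [p // c for p, c in zip(chunk_position, chunks_per_shard)]
    return prefix + "c" + separator.join(str(p) for p in shard_position)

assert chunk_key_to_shard_key("data/root/foo/baz/c96/18", [32, 2]) == "data/root/foo/baz/c3/9"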
Binary shard format#
The only binary format at the moment is an indexed binary format as described below. Other binary formats might be added in future versions.
In the indexed binary format, chunks are written successively in a shard, where unused space between them is allowed, followed by an index referencing them. The index is placed at the end of the file and has a size of 16 bytes multiplied by the number of chunks in a shard, for example 16 bytes * 64 = 1024 bytes for chunks_per_shard=[32, 2]. The index holds an offset, nbytes pair of little-endian uint64 per chunk; the chunk order in the index is row-major (C) order. For example, for chunks_per_shard=[2, 2] an index would look like:
| chunk (0, 0) | chunk (0, 1) | chunk (1, 0) | chunk (1, 1) |
| offset | nbytes | offset | nbytes | offset | nbytes | offset | nbytes |
| uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 |
Empty chunks are denoted by setting both offset and nbytes to 2^64 - 1. The index always has the full shape of all possible chunks per shard, even if they are outside of the array size.
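As a non-normative illustration, the index of a shard can be decoded with the Python standard library alone; the function name is illustrative, and the shard bytes are assumed to be fully loaded already:

import itertools
import math
import struct

MISSING = 2**64 - 1  # sentinel marking an empty chunk

def read_shard_index(shard_bytes, chunks_per_shard):
    n_chunks = math.prod(chunks_per_shard)
    # The index occupies the last 16 bytes * n_chunks of the shard.
    index_bytes = shard_bytes[len(shard_bytes) - 16 * n_chunks:]
    # Each entry is an (offset, nbytes) pair of little-endian uint64.
    values = struct.unpack("<" + "Q" * 2 * n_chunks, index_bytes)
    index = {}
    # Entries are laid out in row-major (C) order over the chunk positions.
    positions = itertools.product(*(range(c) for c in chunks_per_shard))
    for position, offset, nbytes in zip(positions, values[0::2], values[1::2]):
        if not (offset == MISSING and nbytes == MISSING):
            index[position] = (offset, nbytes)  # existing chunk
    return index  # maps chunk position within the shard -> (offset, nbytes)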
The actual order of the chunk content is not fixed and may be chosen by the implementation, as all possible write orders are valid according to this specification and can therefore be read by any other implementation. When writing partial chunks into an existing shard, no specific order of the existing chunks may be expected. Some writing strategies might be:
- Fixed order: Specify a fixed order (e.g. row-major, column-major, or Morton order). When replacing existing chunks, larger or equal-sized chunks may be replaced in-place, leaving unused space up to an upper limit that might possibly be specified. Please note that for regular-sized uncompressed data all chunks have the same size and can therefore be replaced in-place.
- Append-only: Any chunk to write is appended to the existing shard, followed by an updated index. If previous chunks are updated, their storage space becomes unused, as well as the previous index. This might be useful for storage that only allows append-only updates (see the sketch after this list).
- Other formats: Other formats that accept additional bytes at the end of the file (such as HDF) could be used for storing shards, by writing the chunks in the order the format prescribes and appending a binary index derived from the byte offsets and lengths at the end of the file.
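A non-normative sketch of the append-only strategy under the indexed binary format described above; the helper names are illustrative, and the existing index is assumed to be given as a mapping from chunk positions to (offset, nbytes) pairs, as produced by the read_shard_index sketch:

import itertools
import math
import struct

MISSING = 2**64 - 1  # sentinel marking an empty chunk

def append_chunk(old_shard, old_index, chunk_position, chunk_bytes, chunks_per_shard):
    # The new chunk is appended after the existing shard content; the space of
    # the previous index (and of any replaced chunk) simply becomes unused.
    offset = len(old_shard)
    new_index = dict(old_index)
    new_index[chunk_position] = (offset, len(chunk_bytes))
    # Re-serialize the full index in row-major order, 16 bytes per possible chunk.
    entries = []
    for position in itertools.product(*(range(c) for c in chunks_per_shard)):
        off, nbytes = new_index.get(position, (MISSING, MISSING))
        entries.append(struct.pack("<QQ", off, nbytes))
    return old_shard + chunk_bytes + b"".join(entries)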
Any configuration parameters for the write strategy must not be part of the metadata document; they need to be configured at runtime, as this is implementation-specific.
API implementation#
The section below defines an implementation of the Abstract store interface in terms of the operations of this storage transformer as a StoreWithPartialAccess. The term underlying store references either the next storage transformer in the stack or the actual store if this transformer is the last one in the stack. Any operations with keys not starting with data/root are simply relayed to the underlying store and not described explicitly.
get_partial_values(key_ranges) -> values
: For each referenced key, request the indices from the underlying store using get_partial_values. For each key, range pair in key_ranges, check if the chunk exists; it is missing if the index offset and nbytes are both 2^64 - 1. For existing keys, request the actual chunks by their ranges as read from the index using get_partial_values. This operation should be implemented using two get_partial_values operations on the underlying store, one for retrieving the indices and one for retrieving existing chunks; a non-normative sketch of this two-phase read follows the note at the end of this section.

set_partial_values(key_start_values)
: For each referenced key, check if all available chunks in a shard are referenced. In this case, a shard can be constructed according to the Binary shard format directly. For all other keys, request the indices from the underlying store using get_partial_values. All chunks that are not updated completely and exist according to the index (index offset and nbytes are not both 2^64 - 1) need to be read via get_partial_values from the underlying store. For simplification purposes a shard may also be read completely, combining the previous two get operations into one. Based on the existing chunks and the value ranges that need to be updated, new shards are constructed according to the Binary shard format. All shards that need to be updated must then be set via set or set_partial_values(key_start_values), depending on the chosen writing strategy provided by the implementation. Specialized store implementations that allow appending to a storage object may only need to read the index to update it.

erase_values(keys)
: For each referenced key, check if all available chunks in a shard are referenced. In this case, the full shard is removed using erase_values on the underlying store. For all other keys, request the indices from the underlying store using get_partial_values. Update the index, using an offset and nbytes of 2^64 - 1 to mark missing chunks. The updated index may be written in-place using set_partial_values(key_start_values), or a larger rewrite of the shard may be done, including the index update but also removing value ranges corresponding to the erased chunks.

erase_prefix()
: If the prefix contains a part of the chunk-grid key, this part is translated to the referenced shard and the contained chunks. For affected shards where all contained chunks are erased, the prefix is rewritten to the corresponding shard key and the operation is relayed to the underlying store. For all shards where only some chunks are erased, the affected chunks are removed by invoking the operation erase_values on this storage transformer with the respective chunk keys.

list()
: See list_prefix with the prefix /.

list_prefix(prefix)
: If the prefix contains a part of the chunk-grid key, this part is translated to the referenced shard and the contained chunks. Then, list_prefix is called on the underlying store with the translated prefix. For all listed shards, request the indices from the underlying store using get_partial_values. Existing chunks, where the index offset or nbytes are not 2^64 - 1, are then listed by their original key.

list_dir(prefix)
: If the prefix contains a part of the chunk-grid key, this part is translated to the referenced shard and the contained chunks. Then, list_dir is called on the underlying store with the translated prefix. For all retrieved prefixes (not full keys) with partial shard keys, the corresponding original prefixes covering all possible chunks in the shard are listed. For retrieved full keys, the indices from the underlying store are requested using get_partial_values. Existing chunks, where the index offset or nbytes are not 2^64 - 1, are then listed by their original key.

Note
Not all listed prefixes must necessarily contain keys, as shard prefixes with partially available chunks return prefixes for all possible chunks without verifying their existence, for performance reasons. Listing those prefixes is still safe, as some chunks in their corresponding shard exist, but not necessarily in the requested prefix, possibly leading to empty responses. Please note that this only applies to returned prefixes, not to full keys referencing storage objects. Returned full keys always reflect the available chunks and are safe to request.
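The two-phase read described for get_partial_values can be sketched as follows. This is non-normative and assumes an underlying store object whose get_partial_values takes (key, (start, length)) pairs, with a negative start counted from the end of the value; for brevity it reads one whole chunk rather than a sub-range, and the helper name is illustrative only:

import math
import struct

MISSING = 2**64 - 1  # sentinel marking an empty chunk

def read_chunk_from_shard(underlying_store, shard_key, position_in_shard, chunks_per_shard):
    n_chunks = math.prod(chunks_per_shard)
    index_size = 16 * n_chunks
    # Phase 1: fetch only the index, which sits at the end of the shard.
    (index_bytes,) = underlying_store.get_partial_values([(shard_key, (-index_size, index_size))])
    # Locate the 16-byte (offset, nbytes) entry for this chunk (row-major order).
    flat = 0
    for p, c in zip(position_in_shard, chunks_per_shard):
        flat = flat * c + p
    offset, nbytes = struct.unpack_from("<QQ", index_bytes, 16 * flat)
    if offset == MISSING and nbytes == MISSING:
        return None  # empty chunk
    # Phase 2: fetch exactly the bytes of the requested chunk.
    (chunk_bytes,) = underlying_store.get_partial_values([(shard_key, (offset, nbytes))])
    return chunk_bytes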
References#
- RFC2119
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
Change log#
This section is a placeholder for keeping a log of the snapshots of this document that are tagged in GitHub and what changed between them.