Sharding codec (version 1.0)#

Editor’s draft 23 03 2023#

Specification URI:

https://zarr-specs.readthedocs.io/en/latest/v3/codecs/sharding-indexed/v1.0.html

Issue tracking:

GitHub issues <https://github.com/zarr-developers/zarr-specs/labels/sharding-indexed-codec-v1.0>-

Suggest an edit for this spec:

GitHub editor

Copyright 2022-Present Zarr core development team. This work is licensed under a Creative Commons Attribution 3.0 Unported License.


Abstract#

This specification defines a Zarr array -> bytes codec for sharding.

Sharding logically splits chunks (“shards”) into sub-chunks (“inner chunks”) that can be individually compressed and accessed. This allows to colocate multiple chunks within one storage object, bundling them in shards.

Motivation#

In many cases, it becomes inefficient or impractical to store a large number of chunks as separate files or objects due to the design constraints of the underlying storage. For example, the file block size and maximum inode number restrict the usage of numerous small files for typical file systems, also cloud storage such as S3, GCS, and various distributed filesystems do not efficiently handle large numbers of small files or objects.

Increasing the chunk size works only up to a certain point, as chunk sizes need to be small for read efficiency requirements, for example to stream data in browser-based visualization software.

Therefore, chunks may need to be smaller than the minimum size of one storage key. In those cases, it is efficient to store objects at a more coarse granularity than reading chunks.

Sharding solves this by allowing to store multiple chunks in one storage key, which is called a shard:

../../../_images/sharding.png

Document conventions#

Conformance requirements are expressed with a combination of descriptive assertions and [RFC2119] terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in [RFC2119]. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. Examples in this specification are introduced with the words “for example”.

Codec name#

The value of the name member in the codec object MUST be sharding_indexed.

Configuration parameters#

Sharding can be configured per array in the Array metadata as follows:

{
  codecs: [
    {
      "name": "sharding_indexed"
      "configuration": {
        "chunk_shape": [32, 32],
        "codecs": [
          {
            "name": "gzip",
            "configuration": {
              "level": 1
            }
          }
        ]
      }
    }
  ]
}

chunk_shape

An array of integers providing the shape of inner chunks in a shard for each dimension of the Zarr array. The length of the array must match the length of the array metadata shape entry. The each integer must by divisible by the chunk_shape of the array as defined in the chunk_grid Array metadata. For example, an inner chunk shape of [32, 2] with an outer chunk shape [64, 64] indicates that 64 chunks are combined in one shard, 2 along the first dimension, and for each of those 32 along the second dimension. Currently, only the regular chunk grid is supported.

codecs

Specifies a list of codecs to be used for encoding and decoding inner chunks. The value must be an array of objects, as specified in the Array metadata. An absent codecs member is equivalent to specifying an empty list of codecs.

Binary shard format#

This is an array -> bytes codec.

In the sharding_indexed binary format, chunks are written successively in a shard, where unused space between them is allowed, followed by an index referencing them. The index is placed at the end of the file and has a size of 16 bytes multiplied by the number of chunks in a shard, for example 16 bytes * 4 = 1024 bytes for shard shape of [64, 64] and inner chunk shape of [32, 32]. The index holds an offset, nbytes pair of little-endian uint64 per chunk, the chunks-order in the index is row-major (C) order. Given the example of 2x2 inner chunks in a shard, the index would look like:

| chunk (0, 0)    | chunk (0, 1)    | chunk (1, 0)    | chunk (1, 1)    |
| offset | nbytes | offset | nbytes | offset | nbytes | offset | nbytes |
| uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 |

Empty chunks are denoted by setting both offset and nbytes to 2^64 - 1. Empty chunks are interpreted as being filled with the fill value. The index always has the full shape of all possible chunks per shard, even if they extend beyond the array shape.

The actual order of the chunk content is not fixed and may be chosen by the implementation. All possible write orders are valid according to this specification and therefore can be read by any other implementation. When writing partial chunks into an existing shard, no specific order of the existing chunks may be expected. Some writing strategies might be

  • Fixed order: Specify a fixed order (e.g. row-, column-major, or Morton order). When replacing existing chunks larger or equal-sized chunks may be replaced in-place, leaving unused space up to an upper limit that might possibly be specified. Please note that, for regular-sized uncompressed data, all chunks have the same size and can therefore be replaced in-place.

  • Append-only: Any chunk to write is appended to the existing shard, followed by an updated index. If previous chunks are updated, their storage space becomes unused, as well as the previous index. This might be useful for storage that only allows append-only updates.

  • Other formats: Other formats that accept additional bytes at the end of the file (such as HDF) could be used for storing shards, by writing the chunks in the order the format prescribes and appending a binary index derived from the byte offsets and lengths at the end of the file.

Any configuration parameters for the write strategy must not be part of the metadata document; instead they need to be configured at runtime, as this is implementation specific.

Implementation notes#

The section suggests a non-normative implementation of the codec including common optimizations.

  • Decoding: A simple implementation to decode chunks in a shard would (a) read the entire value from the store into a byte buffer, (b) parse the shard index as specified above from the end of the buffer and (c) cut out the relevant bytes that belong to the requested chunk. The relevant bytes are determined by the offset,nbytes pair in the shard index. This bytestream then needs to be decoded with the inner codecs as specified in the sharding configuration applying the Decoding procedure. This is similar to how an implementation would access a sub-slice of a chunk.

    When reading all chunks of a shard at once, a useful optimization would be to read the entire shard once into a byte buffer and then cut out and decode all chunks from that buffer in one pass.

    If the underlying store supports partial reads, the decoding of single inner chunks can be optimized. In that case, the shard index can be read from the store by requesting the n last bytes, where n is 16 bytes multiplied by the number of chunks in a shard. After parsing the shard index, single chunks can be requested from the store by specifying the byte range. The bytestream, then, needs to be decoded as above.

  • Encoding: A simple implementation to encode a chunk in a shard would (a) encode the new chunk per Encoding procedure in a byte buffer using the shard’s inner codecs, (b) read an existing shard from the store, (c) create a new bytestream with all encoded chunks of that shard including the overwritten chunk, (d) generate a new shard index that is appended to the chunk bytestream and (e) writes the shard to the store. If there was no existing shard, an empty shard is assumed. When writing entire chunks, reading the existing shard first may be skipped.

    When working with chunks that have a fixed byte size (e.g., uncompressed) and a store that supports partial writes, a optimization would be to replace the new chunk by writing to the store at the specified byte range.

    Other use case-specific optimizations may be available, e.g., for append-only workloads.

References#

RFC2119(1,2)

S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119

Change log#

This section is a placeholder for keeping a log of the snapshots of this document that are tagged in GitHub and what changed between them.