API reference

VirtualiZarr has a small API surface because most of the complexity is handled by xarray functions such as xarray.concat and xarray.merge. Users rely on xarray for every step apart from reading and serializing virtual references.

User API

Reading

virtualizarr.open_virtual_dataset

open_virtual_dataset(
    file_url: str,
    object_store: ObjectStore,
    parser: Parser,
    drop_variables: Iterable[str] | None = None,
    loadable_variables: Iterable[str] | None = None,
    decode_times: bool | None = None,
    cftime_variables: Iterable[str] | None = None,
    indexes: Mapping[str, Index] | None = None,
) -> Dataset
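
Opens a single file as a virtual xarray.Dataset. For example, a minimal sketch of opening a netCDF4/HDF5 file on S3 (the bucket name, file path, and obstore S3Store options here are hypothetical):

from obstore.store import S3Store
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser

# Hypothetical bucket; adjust the store options and credentials to your own data.
store = S3Store("my-bucket", region="us-east-1", skip_signature=True)

vds = open_virtual_dataset(
    "s3://my-bucket/air_temperature.nc",
    object_store=store,
    parser=HDFParser(),
    loadable_variables=["time", "lat", "lon"],  # load small coordinates into memory
)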

virtualizarr.open_virtual_mfdataset

open_virtual_mfdataset(
    paths: str
    | PathLike
    | Sequence[str | PathLike]
    | NestedSequence[str | PathLike],
    object_store: ObjectStore,
    parser: Parser,
    concat_dim: str
    | DataArray
    | Index
    | Sequence[str]
    | Sequence[DataArray]
    | Sequence[Index]
    | None = None,
    compat: "CompatOptions" = "no_conflicts",
    preprocess: Callable[[Dataset], Dataset] | None = None,
    data_vars: Literal["all", "minimal", "different"] | list[str] = "all",
    coords="different",
    combine: Literal["by_coords", "nested"] = "by_coords",
    parallel: Literal["dask", "lithops", False] | type[Executor] = False,
    join: "JoinOptions" = "outer",
    attrs_file: str | PathLike | None = None,
    combine_attrs: "CombineAttrsOptions" = "override",
    **kwargs,
) -> Dataset

Open multiple files as a single virtual dataset.

This function is explicitly modelled after xarray.open_mfdataset, and works in the same way.

If combine='by_coords' then the function combine_by_coords is used to combine the datasets into one before returning the result, and if combine='nested' then combine_nested is used. The filepaths must be structured according to which combining function is used, the details of which are given in the documentation for combine_by_coords and combine_nested. By default combine='by_coords' will be used. Global attributes from the attrs_file are used for the combined dataset.

Parameters:

  • paths (str | PathLike | Sequence[str | PathLike] | NestedSequence[str | PathLike]) –

    Same as in xarray.open_mfdataset

  • concat_dim (str | DataArray | Index | Sequence[str] | Sequence[DataArray] | Sequence[Index] | None, default: None ) –

    Same as in xarray.open_mfdataset

  • compat ('CompatOptions', default: 'no_conflicts' ) –

    Same as in xarray.open_mfdataset

  • preprocess (Callable[[Dataset], Dataset] | None, default: None ) –

    Same as in xarray.open_mfdataset

  • data_vars (Literal['all', 'minimal', 'different'] | list[str], default: 'all' ) –

    Same as in xarray.open_mfdataset

  • coords

    Same as in xarray.open_mfdataset

  • combine (Literal['by_coords', 'nested'], default: 'by_coords' ) –

    Same as in xarray.open_mfdataset

  • parallel ("dask", "lithops", False, or a subclass of concurrent.futures.Executor, default: False ) –

    Specify whether the open and preprocess steps of this function will be performed in parallel using lithops, dask.delayed, or any executor compatible with the concurrent.futures interface, or in serial. Default is False, which will execute these steps in serial.

  • join ('JoinOptions', default: 'outer' ) –

    Same as in xarray.open_mfdataset

  • attrs_file (str | PathLike | None, default: None ) –

    Same as in xarray.open_mfdataset

  • combine_attrs ('CombineAttrsOptions', default: 'override' ) –

    Same as in xarray.open_mfdataset

  • **kwargs (optional, default: {} ) –

    Additional arguments passed on to virtualizarr.open_virtual_dataset. For an overview of some of the possible options, see the documentation of virtualizarr.open_virtual_dataset.

Returns:

  • Dataset –

    The combined virtual dataset.

Notes

The results of opening each virtual dataset in parallel are sent back to the client process, so must not be too large. See the docs page on Scaling.
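
For example, a hedged sketch of combining a year of monthly files along their shared coordinates (the bucket and file naming are hypothetical):

from obstore.store import S3Store
from virtualizarr import open_virtual_mfdataset
from virtualizarr.parsers import HDFParser

store = S3Store("my-bucket", skip_signature=True)  # hypothetical bucket

vds = open_virtual_mfdataset(
    [f"s3://my-bucket/temperature_2024-{month:02d}.nc" for month in range(1, 13)],
    object_store=store,
    parser=HDFParser(),
    combine="by_coords",
    loadable_variables=["time"],  # passed through to open_virtual_dataset
)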

Parsers

Each parser understands how to read a specific file format, and a parser must be passed to virtualizarr.open_virtual_dataset.
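
A parser is a callable: instantiating it sets parser-specific options, and calling it with a file URL and an ObjectStore returns a ManifestStore. A minimal sketch (the local path is hypothetical, and obstore's LocalStore defaults are assumed):

from obstore.store import LocalStore
from virtualizarr.parsers import HDFParser

parser = HDFParser()
store = LocalStore()  # reads from the local filesystem

# Calling the parser yields a ManifestStore: a read-only Zarr view of the file.
manifest_store = parser("file:///data/example.nc", object_store=store)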

virtualizarr.parsers.DMRPPParser

__call__

__call__(file_url: str, object_store: ObjectStore) -> ManifestStore

Parse the metadata and byte offsets from a given file to produce a VirtualiZarr ManifestStore.

Parameters:

  • file_url (str) –

    The URI or path to the input file (e.g., "s3://bucket/file.dmrpp").

  • object_store (ObjectStore) –

    An obstore ObjectStore instance for accessing the file specified in the file_url parameter.

Returns:

  • ManifestStore

    A ManifestStore that provides a Zarr representation of the parsed file.

__init__

__init__(group: str | None = None, skip_variables: Iterable[str] | None = None)

Instantiate a parser with parser-specific parameters that can be used in the __call__ method.

Parameters:

  • group (str | None, default: None ) –

    The group within the file to be used as the Zarr root group for the ManifestStore.

  • skip_variables (Iterable[str] | None, default: None ) –

    Variables in the file that will be ignored when creating the ManifestStore.

virtualizarr.parsers.FITSParser

__call__

__call__(file_url: str, object_store: ObjectStore) -> ManifestStore

Parse the contents of a FITS file to produce a ManifestStore.

Parameters:

  • file_url (str) –

    The URI or path to the input file (e.g., "s3://bucket/file.fits").

  • object_store (ObjectStore) –

    An obstore ObjectStore instance for accessing the file specified in the file_url parameter.

Returns:

  • ManifestStore

    A ManifestStore which provides a Zarr representation of the parsed file.

__init__

__init__(
    group: str | None = None,
    skip_variables: Iterable[str] | None = None,
    reader_options: Optional[dict] = None,
)

Instantiate a parser with parser-specific parameters that can be used in the __call__ method.

Parameters:

  • group (str | None, default: None ) –

    The group within the file to be used as the Zarr root group for the ManifestStore.

  • skip_variables (Iterable[str] | None, default: None ) –

    Variables in the file that will be ignored when creating the ManifestStore.

  • reader_options (Optional[dict], default: None ) –

    Configuration options used internally for kerchunk's fsspec backend.

virtualizarr.parsers.HDFParser

virtualizarr.parsers.NetCDF3Parser

__call__

__call__(file_url: str, object_store: ObjectStore) -> ManifestStore

Parse the metadata and byte offsets from a given file to produce a VirtualiZarr ManifestStore.

Parameters:

  • file_url (str) –

    The URI or path to the input file (e.g., "s3://bucket/file.nc").

  • object_store (ObjectStore) –

    An obstore ObjectStore instance for accessing the file specified in the file_url parameter.

Returns:

  • ManifestStore

    A ManifestStore that provides a Zarr representation of the parsed file.

__init__

__init__(
    group: str | None = None,
    skip_variables: Iterable[str] | None = None,
    reader_options: dict | None = None,
)

Instantiate a parser with parser-specific parameters that can be used in the __call__ method.

Parameters:

  • group (str | None, default: None ) –

    The group within the file to be used as the Zarr root group for the ManifestStore.

  • skip_variables (Iterable[str] | None, default: None ) –

    Variables in the file that will be ignored when creating the ManifestStore.

  • reader_options (dict | None, default: None ) –

Configuration options used internally for kerchunk's fsspec backend.

virtualizarr.parsers.KerchunkJSONParser

__call__

__call__(file_url: str, object_store: ObjectStore) -> ManifestStore

Parse the metadata and byte offsets from a given file to produce a VirtualiZarr ManifestStore.

Parameters:

  • file_url (str) –

    The URI or path to the input file (e.g., "s3://bucket/kerchunk.json").

  • object_store (ObjectStore) –

    An obstore ObjectStore instance for accessing the file specified in the file_url parameter.

Returns:

  • ManifestStore

    A ManifestStore that provides a Zarr representation of the parsed file.

__init__

__init__(
    group: str | None = None,
    fs_root: str | None = None,
    skip_variables: Iterable[str] | None = None,
    store_registry: ObjectStoreRegistry | None = None,
)

Instantiate a parser with parser-specific parameters that can be used in the __call__ method.

Parameters:

  • group (str | None, default: None ) –

    The group within the file to be used as the Zarr root group for the ManifestStore.

  • fs_root (str | None, default: None ) –

    The qualifier to be used for kerchunk references containing relative paths.

  • skip_variables (Iterable[str] | None, default: None ) –

    Variables in the file that will be ignored when creating the ManifestStore.

  • store_registry (ObjectStoreRegistry | None, default: None ) –

A user-defined ObjectStoreRegistry to be used for reading data when kerchunk references contain paths to multiple locations.

virtualizarr.parsers.KerchunkParquetParser

__call__

__call__(file_url: str, object_store: ObjectStore) -> ManifestStore

Parse the metadata and byte offsets from a given file to produce a VirtualiZarr ManifestStore.

Parameters:

  • file_url (str) –

    The URI or path to the input parquet directory (e.g., "s3://bucket/file.parq").

  • object_store (ObjectStore) –

    An obstore ObjectStore instance for accessing the file specified in the file_url parameter.

Returns:

  • ManifestStore

    A ManifestStore which provides a Zarr representation of the parsed file.

__init__

__init__(
    group: str | None = None,
    fs_root: str | None = None,
    skip_variables: Iterable[str] | None = None,
    reader_options: dict | None = None,
)

Instantiate a parser with parser-specific parameters that can be used in the __call__ method.

Parameters:

  • group (str | None, default: None ) –

    The group within the file to be used as the Zarr root group for the ManifestStore.

  • fs_root (str | None, default: None ) –

    The qualifier to be used for kerchunk references containing relative paths.

  • skip_variables (Iterable[str] | None, default: None ) –

    Variables in the file that will be ignored when creating the ManifestStore.

  • reader_options (dict | None, default: None ) –

    Configuration options used internally for the fsspec backend.

virtualizarr.parsers.ZarrParser

__call__

__call__(file_url: str, object_store: ObjectStore) -> ManifestStore

Parse the metadata and byte offsets from a given Zarr store to produce a VirtualiZarr ManifestStore.

Parameters:

  • file_url (str) –

    The URI or path to the input Zarr store (e.g., "s3://bucket/store.zarr").

  • object_store (ObjectStore) –

    An obstore ObjectStore instance for accessing the directory specified in the file_url parameter.

Returns:

  • ManifestStore –

    A ManifestStore which provides a Zarr representation of the parsed file.

__init__

__init__(group: str | None = None, skip_variables: Iterable[str] | None = None)

Instantiate a parser with parser-specific parameters that can be used in the __call__ method.

Parameters:

  • group (str | None, default: None ) –

    The group within the file to be used as the Zarr root group for the ManifestStore (default: the file's root group).

  • skip_variables (Iterable[str] | None, default: None ) –

    Variables in the file that will be ignored when creating the ManifestStore (default: None, do not ignore any variables).

Serialization

virtualizarr.accessor.VirtualiZarrDatasetAccessor

Xarray accessor for writing out virtual datasets to disk.

Methods on this object are called via ds.virtualize.{method}.

nbytes property

nbytes: int

Size required to hold these references in memory in bytes.

Note this is not the size of the referenced chunks if they were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.

In-memory (loadable) variables are included in the total using xarray's normal .nbytes method.
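
For example (assuming vds is a virtual dataset returned by open_virtual_dataset), the reference footprint is typically orders of magnitude smaller than the nominal size of the data it points to:

print(vds.virtualize.nbytes)  # size of the references themselves, in bytes
print(vds.nbytes)             # nominal size of the referenced arrays if loaded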

rename_paths

rename_paths(new: str | Callable[[str], str]) -> Dataset

Rename paths to chunks in every ManifestArray in this dataset.

Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.

Parameters:

  • new (str | Callable[[str], str]) –

    New path to use for all chunks, either as a string, or as a function which accepts and returns strings.

Returns:

  • Dataset
See Also

virtualizarr.ManifestArray.rename_paths

virtualizarr.ChunkManifest.rename_paths

Examples:

Rename paths to reflect moving the referenced files from local storage to an S3 bucket.

>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> ds.virtualize.rename_paths(local_to_s3_url)

to_icechunk

to_icechunk(
    store: IcechunkStore,
    *,
    group: str | None = None,
    append_dim: str | None = None,
    last_updated_at: datetime | None = None,
) -> None

Write an xarray dataset to an Icechunk store.

Any variables backed by ManifestArray objects will be written as virtual references. Any other variables will be loaded into memory before their binary chunk data is written into the store.

If append_dim is provided, the virtual dataset will be appended to the existing IcechunkStore along the append_dim dimension.

If last_updated_at is provided, it will be used as a checksum for any virtual chunks written to the store with this operation. At read time, if any of the virtual chunks have been updated since this provided datetime, an error will be raised. This protects against reading outdated virtual chunks that have been updated since the last read. When not provided, no check is performed. This value is stored in Icechunk with seconds precision, so be sure to take that into account when providing this value.

Parameters:

  • store (IcechunkStore) –

    Store to write dataset into.

  • group (str | None, default: None ) –

    Path of the group to write the dataset into (default: the root group).

  • append_dim (str | None, default: None ) –

    Dimension along which to append the virtual dataset.

  • last_updated_at (datetime | None, default: None ) –

    Datetime to use as a checksum for any virtual chunks written to the store with this operation. When not provided, no check is performed.

Raises:

Examples:

To ensure an error is raised if the files containing referenced virtual chunks are modified at any time from now on, pass the current time to last_updated_at.

>>> from datetime import datetime
>>> vds.virtualize.to_icechunk(
...     icechunkstore,
...     last_updated_at=datetime.now(),
... )

to_kerchunk

to_kerchunk(filepath: None, format: Literal['dict']) -> KerchunkStoreRefs
to_kerchunk(filepath: str | Path, format: Literal['json']) -> None
to_kerchunk(
    filepath: str | Path,
    format: Literal["parquet"],
    record_size: int = 100000,
    categorical_threshold: int = 10,
) -> None
to_kerchunk(
    filepath: str | Path | None = None,
    format: Literal["dict", "json", "parquet"] = "dict",
    record_size: int = 100000,
    categorical_threshold: int = 10,
) -> KerchunkStoreRefs | None

Serialize all virtualized arrays in this xarray dataset into the kerchunk references format.

Parameters:

  • filepath (str | Path | None, default: None ) –

    File path to write kerchunk references into. Not required if format is 'dict'.

  • format (Literal['dict', 'json', 'parquet'], default: 'dict' ) –

    Format to serialize the kerchunk references as. If 'json' or 'parquet' then the 'filepath' argument is required.

  • record_size (int, default: 100000 ) –

    Number of references to store in each reference file (default 100,000). Bigger values mean fewer read requests but larger memory footprint. Only available when format is 'parquet'.

  • categorical_threshold (int, default: 10 ) –

    Encode urls as pandas.Categorical to reduce memory footprint if the ratio of the number of unique urls to total number of refs for each variable is greater than or equal to this number (default 10). Only available when format is 'parquet'.

References

fsspec.github.io/kerchunk/spec.html
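
Examples:

A hedged usage sketch, assuming vds is a virtual dataset and the output path is hypothetical:

# Write the references to a kerchunk JSON file on disk.
vds.virtualize.to_kerchunk("combined.json", format="json")

# Or keep them as an in-memory dict of kerchunk references.
refs = vds.virtualize.to_kerchunk(format="dict")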

virtualizarr.accessor.VirtualiZarrDataTreeAccessor

Xarray accessor for writing out virtual datatrees to disk.

Methods on this object are called via dt.virtualize.{method}.

to_icechunk

to_icechunk(
    store: IcechunkStore,
    *,
    write_inherited_coords: bool = False,
    last_updated_at: datetime | None = None,
) -> None

Write an xarray DataTree to an Icechunk store.

Any variables backed by ManifestArray objects will be written as virtual references. Any other variables will be loaded into memory before their binary chunk data is written into the store.

If last_updated_at is provided, it will be used as a checksum for any virtual chunks written to the store with this operation. At read time, if any of the virtual chunks have been updated since this provided datetime, an error will be raised. This protects against reading outdated virtual chunks that have been updated since the last read. When not provided, no check is performed. This value is stored in Icechunk with seconds precision, so be sure to take that into account when providing this value.

Parameters:

  • store (IcechunkStore) –

    Store to write dataset into.

  • write_inherited_coords (bool, default: False ) –

    If True, replicate inherited coordinates on all descendant nodes. Otherwise, only write coordinates at the level at which they are originally defined. This saves disk space, but requires opening the full tree to load inherited coordinates.

  • last_updated_at (datetime | None, default: None ) –

    Datetime to use as a checksum for any virtual chunks written to the store with this operation. When not provided, no check is performed.

Raises:

Examples:

To ensure an error is raised if the files containing referenced virtual chunks are modified at any time from now on, pass the current time to last_updated_at.

>>> from datetime import datetime
>>> vdt.virtualize.to_icechunk(
...     icechunkstore,
...     last_updated_at=datetime.now(),
... )

Information

virtualizarr.accessor.VirtualiZarrDatasetAccessor.nbytes property

nbytes: int

Size required to hold these references in memory in bytes.

Note this is not the size of the referenced chunks if they were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.

In-memory (loadable) variables are included in the total using xarray's normal .nbytes method.

Rewriting

virtualizarr.accessor.VirtualiZarrDatasetAccessor.rename_paths

rename_paths(new: str | Callable[[str], str]) -> Dataset

Rename paths to chunks in every ManifestArray in this dataset.

Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.

Parameters:

  • new (str | Callable[[str], str]) –

    New path to use for all chunks, either as a string, or as a function which accepts and returns strings.

Returns:

  • Dataset
See Also

virtualizarr.ManifestArray.rename_paths

virtualizarr.ChunkManifest.rename_paths

Examples:

Rename paths to reflect moving the referenced files from local storage to an S3 bucket.

>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> ds.virtualize.rename_paths(local_to_s3_url)

Developer API

If you want to write a new parser to create virtual references pointing to a custom file format, you will need to use VirtualiZarr's internal classes. See the page on custom parsers for more information.

Manifests

VirtualiZarr uses these classes to store virtual references internally. See the page on data structures for more information.

virtualizarr.manifests.ChunkManifest

In-memory representation of a single Zarr chunk manifest.

Stores the manifest internally as numpy arrays, so the most efficient way to create this object is via the .from_arrays constructor classmethod.

The manifest can be converted to or from a dictionary which looks like this

{
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}

using the .__init__() and .dict() methods, so users of this class can think of the manifest as if it were a dict mapping zarr chunk keys to byte ranges.

(See the chunk manifest SPEC proposal in zarr-developers/zarr-specs#287.)

Validation is done when this object is instantiated, and this class is immutable, so it's not possible to have a ChunkManifest object that does not represent a valid grid of chunks.
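
For example, a manifest describing a 2x2 chunk grid stored in a single file (the paths, offsets, and lengths are hypothetical):

from virtualizarr.manifests import ChunkManifest

manifest = ChunkManifest(
    {
        "0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
        "0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
        "1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
        "1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
    }
)
print(manifest.shape_chunk_grid)  # (2, 2)
print(manifest.dict()["0.0"])     # byte range of the first chunk, as a dict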

nbytes property

nbytes: int

Size required to hold these references in memory in bytes.

Note this is not the size of the referenced chunks if they were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.

ndim_chunk_grid property

ndim_chunk_grid: int

Number of dimensions in the chunk grid.

Not the same as the dimension of an array backed by this chunk manifest.

shape_chunk_grid property

shape_chunk_grid: tuple[int, ...]

Number of separate chunks along each dimension.

Not the same as the shape of an array backed by this chunk manifest.

__eq__

__eq__(other: Any) -> bool

Two manifests are equal if all of their entries are identical.

__init__

__init__(entries: dict, shape: tuple[int, ...] | None = None) -> None

Create a ChunkManifest from a dictionary mapping zarr chunk keys to byte ranges.

Parameters:

  • entries (dict) –

    Chunk keys and byte range information, as a dictionary of the form

    {
        "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
        "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
        "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
        "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
    }
    

dict

dict() -> ChunkDict

Convert the entire manifest to a nested dictionary.

The returned dict will be of the form

{
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}

Entries whose path is an empty string will be interpreted as missing chunks and omitted from the dictionary.

from_arrays classmethod

from_arrays(
    *,
    paths: ndarray[Any, StringDType],
    offsets: ndarray[Any, dtype[uint64]],
    lengths: ndarray[Any, dtype[uint64]],
    validate_paths: bool = True,
) -> ChunkManifest

Create manifest directly from numpy arrays containing the path and byte range information.

Useful if you want to avoid the memory overhead of creating an intermediate dictionary first, as these 3 arrays are what will be used internally to store the references.

Parameters:

  • paths (ndarray[Any, StringDType]) –

    Array containing the paths to the chunks

  • offsets (ndarray[Any, dtype[uint64]]) –

    Array containing the byte offsets of the chunks

  • lengths (ndarray[Any, dtype[uint64]]) –

    Array containing the byte lengths of the chunks

  • validate_paths (bool, default: True ) –

    Check that entries in the manifest are valid paths (e.g. that local paths are absolute not relative). Set to False to skip validation for performance reasons.
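
A minimal sketch with hypothetical byte ranges (numpy >= 2.0 is assumed for the variable-width string dtype):

import numpy as np
from virtualizarr.manifests import ChunkManifest

# Three chunks of a 1-D array, all stored in the same file.
paths = np.array(["s3://bucket/foo.nc"] * 3, dtype=np.dtypes.StringDType())
offsets = np.array([100, 200, 300], dtype=np.uint64)
lengths = np.array([100, 100, 100], dtype=np.uint64)

manifest = ChunkManifest.from_arrays(paths=paths, offsets=offsets, lengths=lengths)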

rename_paths

rename_paths(new: str | Callable[[str], str]) -> ChunkManifest

Rename paths to chunks in this manifest.

Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.

Parameters:

  • new (str | Callable[[str], str]) –

    New path to use for all chunks, either as a string, or as a function which accepts and returns strings.

Returns:

  • manifest
See Also

ManifestArray.rename_paths

Examples:

Rename paths to reflect moving the referenced files from local storage to an S3 bucket.

>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> manifest.rename_paths(local_to_s3_url)

virtualizarr.manifests.ManifestArray

Virtualized array representation of the chunk data in a single Zarr Array.

Supports concatenation / stacking, but only if the two arrays to be concatenated have the same codecs.

Cannot be directly altered.

Implements subset of the array API standard such that it can be wrapped by xarray. Doesn't store the zarr array name, zattrs or ARRAY_DIMENSIONS, as instead those can be stored on a wrapping xarray object.

chunks property

chunks: tuple[int, ...]

Individual chunk size by number of elements.

nbytes_virtual property

nbytes_virtual: int

Size required to hold these references in memory in bytes.

Note this is not the size of the referenced array if it were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.

shape property

shape: tuple[int, ...]

Array shape by number of elements along each dimension.

__array_function__

__array_function__(func, types, args, kwargs) -> Any

Hook to teach this class what to do if np.concat etc. is called on it.

Use this instead of __array_namespace__ so that we don't make promises we can't keep.

__array_ufunc__

__array_ufunc__(ufunc, method, *inputs, **kwargs) -> Any

We have to define this in order to convince xarray that this class is a duckarray, even though we will never support ufuncs.

__eq__

__eq__(other: Union[int, float, bool, ndarray, ManifestArray]) -> ndarray

Element-wise equality checking.

Returns a numpy array of booleans.

__getitem__

__getitem__(
    key: Union[
        int,
        slice,
        EllipsisType,
        None,
        tuple[Union[int, slice, EllipsisType, None, ndarray], ...],
        ndarray,
    ],
) -> ManifestArray

Only supports extremely limited indexing.

Only here because xarray will apparently attempt to index into its lazy indexing classes even if the operation would be a no-op anyway.

__init__

__init__(
    metadata: ArrayV3Metadata | dict, chunkmanifest: dict | ChunkManifest
) -> None

Create a ManifestArray directly from the metadata of a zarr array and the manifest of chunks.

Parameters:

  • metadata (ArrayV3Metadata | dict) –

    Zarr V3 metadata of the array (shape, dtype, chunk grid, codecs), either as an ArrayV3Metadata object or an equivalent dictionary.

  • chunkmanifest (dict | ChunkManifest) –

    Manifest of chunk byte ranges, either as a ChunkManifest or an equivalent dictionary.

astype

astype(dtype: dtype, /, *, copy: bool = True) -> ManifestArray

Cannot change the dtype, but needed because xarray will call this even when it's a no-op.

rename_paths

rename_paths(new: str | Callable[[str], str]) -> ManifestArray

Rename paths to chunks in this array's manifest.

Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.

Parameters:

  • new (str | Callable[[str], str]) –

    New path to use for all chunks, either as a string, or as a function which accepts and returns strings.

Returns:

  • ManifestArray

See Also

ChunkManifest.rename_paths

Examples:

Rename paths to reflect moving the referenced files from local storage to an S3 bucket.

>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> marr.rename_paths(local_to_s3_url)

to_virtual_variable

to_virtual_variable() -> Variable

Create a "virtual" xarray.Variable containing the contents of one zarr array.

The returned variable will be "virtual", i.e. it will wrap a single ManifestArray object.

virtualizarr.manifests.ManifestGroup

Bases: Mapping[str, 'ManifestArray | ManifestGroup']

Immutable representation of a single virtual zarr group.

arrays property

arrays: dict[str, 'ManifestArray']

ManifestArrays contained in this group.

groups property

groups: dict[str, 'ManifestGroup']

Subgroups contained in this group.

metadata property

metadata: GroupMetadata

Zarr group metadata.

__getitem__

__getitem__(path: str) -> 'ManifestArray | ManifestGroup'

Obtain a group member.

__init__

__init__(
    arrays: Mapping[str, ManifestArray] | None = None,
    groups: Mapping[str, "ManifestGroup"] | None = None,
    attributes: dict | None = None,
) -> None

Create a ManifestGroup containing ManifestArrays and/or sub-groups, as well as any group-level metadata.

Parameters:

  • arrays (Mapping[str, ManifestArray], default: None ) –

    ManifestArray objects to represent virtual zarr arrays.

  • groups (Mapping[str, ManifestGroup], default: None ) –

    ManifestGroup objects to represent virtual zarr subgroups.

  • attributes (dict, default: None ) –

    Zarr attributes to add as zarr group metadata.

to_virtual_dataset

to_virtual_dataset() -> Dataset

Create a "virtual" xarray.Dataset containing the contents of one zarr group.

All variables in the returned Dataset will be "virtual", i.e. they will wrap ManifestArray objects.

virtualizarr.manifests.ManifestStore

Bases: Store

A read-only Zarr store that uses obstore to read data from inside arbitrary files on AWS, GCP, Azure, or a local filesystem.

The requests from the Zarr API are redirected using the virtualizarr.manifests.ManifestGroup containing multiple virtualizarr.manifests.ManifestArray objects, allowing for virtually interfacing with underlying data in other file formats.

Parameters:

  • group (ManifestGroup) –

    Root group of the store. Contains group metadata, ManifestArrays, and any subgroups.

  • store_registry (ObjectStoreRegistry, default: None ) –

    ObjectStoreRegistry that maps the URL scheme and netloc to ObjectStore instances, allowing ManifestStores to read from different ObjectStore instances.

Warnings

ManifestStore is experimental and subject to API changes without notice. Please raise an issue with any comments/concerns about the store.

Notes

Modified from zarr-developers/zarr-python#1661

__init__

__init__(
    group: ManifestGroup, *, store_registry: ObjectStoreRegistry | None = None
) -> None

Instantiate a new ManifestStore.

Parameters:

  • group (ManifestGroup) –

ManifestGroup containing group metadata and a mapping from variable names to ManifestArrays.

  • store_registry (ObjectStoreRegistry | None, default: None ) –

    A registry mapping the URL scheme and netloc to ObjectStore instances, allowing ManifestStores to read from different ObjectStore instances.

to_virtual_dataset

to_virtual_dataset(
    group="",
    loadable_variables: Iterable[str] | None = None,
    decode_times: bool | None = None,
    indexes: Mapping[str, Index] | None = None,
) -> "xr.Dataset"

Create a "virtual" xarray dataset containing the contents of one zarr group.

Will ignore the contents of any other groups in the store.

Requires xarray.

Parameters:

  • group (str, default: '' ) –
  • loadable_variables (Iterable[str], default: None ) –

Returns:

  • vds ( Dataset ) –
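
Together with a parser, this offers a lower-level route to the same virtual dataset that open_virtual_dataset produces. A hedged sketch (bucket and file path are hypothetical):

from obstore.store import S3Store
from virtualizarr.parsers import HDFParser

store = S3Store("my-bucket", skip_signature=True)  # hypothetical bucket
manifest_store = HDFParser()("s3://my-bucket/file.nc", object_store=store)

vds = manifest_store.to_virtual_dataset(loadable_variables=["time"])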

Array API

VirtualiZarr's virtualizarr.manifests.ManifestArray objects support a limited subset of the Python Array API standard in virtualizarr.manifests.array_api.
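
These functions are rarely called directly; xarray dispatches to them when virtual datasets are combined. A hedged sketch, assuming vds1 and vds2 are virtual datasets of consecutive time slices with identical codecs:

import xarray as xr

# xarray.concat dispatches to virtualizarr.manifests.array_api.concatenate
# for every ManifestArray-backed variable.
combined = xr.concat(
    [vds1, vds2],
    dim="time",
    coords="minimal",
    compat="override",
    combine_attrs="override",
)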

virtualizarr.manifests.array_api.concatenate

concatenate(
    arrays: tuple[ManifestArray, ...] | list[ManifestArray],
    /,
    *,
    axis: int | None = 0,
) -> ManifestArray

Concatenate ManifestArrays by merging their chunk manifests.

The signature of this function is array API compliant, so that it can be called by xarray.concat.

virtualizarr.manifests.array_api.stack

stack(
    arrays: tuple[ManifestArray, ...] | list[ManifestArray], /, *, axis: int = 0
) -> ManifestArray

Stack ManifestArrays by merging their chunk manifests.

The signature of this function is array API compliant, so that it can be called by xarray.stack.

virtualizarr.manifests.array_api.expand_dims

expand_dims(x: ManifestArray, /, *, axis: int = 0) -> ManifestArray

Expands the shape of an array by inserting a new axis (dimension) of size one at the position specified by axis.

virtualizarr.manifests.array_api.broadcast_to

broadcast_to(x: ManifestArray, /, shape: tuple[int, ...]) -> ManifestArray

Broadcasts a ManifestArray to a specified shape, by either adjusting chunk keys or copying chunk manifest entries.

Parser typing protocol

All custom parsers must follow the virtualizarr.parsers.typing.Parser typing protocol.

virtualizarr.parsers.typing.Parser

Bases: Protocol
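
A hedged skeleton of a custom parser satisfying the protocol, assuming the ObjectStore type is importable from obstore.store as the signatures above suggest; the format-specific metadata and chunk-manifest construction are elided, since they depend entirely on the file format being parsed:

from collections.abc import Iterable

from obstore.store import ObjectStore

from virtualizarr.manifests import ManifestGroup, ManifestStore


class MyFormatParser:
    """Skeleton parser for a hypothetical custom file format."""

    def __init__(self, skip_variables: Iterable[str] | None = None):
        self.skip_variables = skip_variables

    def __call__(self, file_url: str, object_store: ObjectStore) -> ManifestStore:
        # A real parser would read the file's metadata here and build
        # ManifestArrays recording the byte range of every chunk.
        group = ManifestGroup(arrays={}, attributes={})
        return ManifestStore(group)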

Parallelization

Parallelizing virtual reference generation can be done using a number of parallel execution frameworks. Advanced users may want to call one of these executors directly. See the docs page on Scaling.
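
For example, a hedged sketch passing one of these executors (or its string shortcut) to open_virtual_mfdataset; the bucket and file names are hypothetical:

from obstore.store import S3Store
from virtualizarr import open_virtual_mfdataset
from virtualizarr.parallel import DaskDelayedExecutor
from virtualizarr.parsers import HDFParser

store = S3Store("my-bucket", skip_signature=True)  # hypothetical bucket

vds = open_virtual_mfdataset(
    [f"s3://my-bucket/file_{i:03d}.nc" for i in range(100)],
    object_store=store,
    parser=HDFParser(),
    parallel=DaskDelayedExecutor,  # or "dask", "lithops", or False for serial
)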

virtualizarr.parallel.SerialExecutor

Bases: Executor

A custom Executor that runs tasks sequentially, mimicking the concurrent.futures.Executor interface. Useful as a default and for debugging.

map

map(
    fn: Callable[..., T],
    *iterables: Iterable[Any],
    timeout: float | None = None,
    chunksize: int = 1,
) -> Iterator[T]

Execute a function over an iterable sequentially.

Parameters:

  • fn (Callable[..., T]) –

    Function to apply to each item

  • *iterables (Iterable[Any], default: () ) –

    Iterables to process

  • timeout (float | None, default: None ) –

    Optional timeout (ignored in serial execution)

Returns:

  • Generator of results

shutdown

shutdown(wait: bool = True, *, cancel_futures: bool = False) -> None

Shutdown the executor.

Parameters:

  • wait (bool, default: True ) –

    Whether to wait for pending futures (always True for serial executor)

submit

submit(fn: Callable[..., T], /, *args: Any, **kwargs: Any) -> Future[T]

Submit a callable to be executed.

Unlike parallel executors, this runs the task immediately and sequentially.

Parameters:

  • fn (Callable[..., T]) –

    The callable to execute

  • *args (Any, default: () ) –

    Positional arguments for the callable

  • **kwargs (Any, default: {} ) –

    Keyword arguments for the callable

Returns:

  • A Future representing the result of the execution

virtualizarr.parallel.DaskDelayedExecutor

Bases: Executor

An Executor that uses dask.delayed for parallel computation.

This executor mimics the concurrent.futures.Executor interface but uses Dask's delayed computation model.

__init__

__init__() -> None

Initialize the Dask Delayed Executor.

map

map(
    fn: Callable[..., T],
    *iterables: Iterable[Any],
    timeout: float | None = None,
    chunksize: int = 1,
) -> Iterator[T]

Apply a function to an iterable using dask.delayed.

Parameters:

  • fn (Callable[..., T]) –

    Function to apply to each item

  • *iterables (Iterable[Any], default: () ) –

    Iterables to process

  • timeout (float | None, default: None ) –

    Optional timeout (ignored in serial execution)

Returns:

  • Generator of results

shutdown

shutdown(wait: bool = True, *, cancel_futures: bool = False) -> None

Shutdown the executor.

Parameters:

  • wait (bool, default: True ) –

Whether to wait for pending futures (always True for serial executor)

submit

submit(fn: Callable[..., T], /, *args: Any, **kwargs: Any) -> Future[T]

Submit a task to be computed with dask.delayed.

Parameters:

  • fn (Callable[..., T]) –

    The callable to execute

  • *args (Any, default: () ) –

    Positional arguments for the callable

  • **kwargs (Any, default: {} ) –

    Keyword arguments for the callable

Returns:

  • A Future representing the result of the execution

virtualizarr.parallel.LithopsEagerFunctionExecutor

Bases: Executor

Lithops-based function executor which follows the concurrent.futures.Executor API.

Only required because lithops doesn't follow the concurrent.futures.Executor API, see lithops-cloud/lithops#1427.

map

map(
    fn: Callable[..., T],
    *iterables: Iterable[Any],
    timeout: float | None = None,
    chunksize: int = 1,
) -> Iterator[T]

Apply a function to an iterable using lithops.

Only needed because lithops.FunctionExecutor.map returns futures, unlike concurrent.futures.Executor.map.

Parameters:

  • fn (Callable[..., T]) –

    Function to apply to each item

  • *iterables (Iterable[Any], default: () ) –

    Iterables to process

  • timeout (float | None, default: None ) –

    Optional timeout (ignored in serial execution)

Returns:

  • Generator of results

shutdown

shutdown(wait: bool = True, *, cancel_futures: bool = False) -> None

Shutdown the executor.

Parameters:

  • wait (bool, default: True ) –

    Whether to wait for pending futures.

submit

submit(fn: Callable[..., T], /, *args: Any, **kwargs: Any) -> Future[T]

Submit a task to be computed using lithops.

Parameters:

  • fn (Callable[..., T]) –

    The callable to execute

  • *args (Any, default: () ) –

    Positional arguments for the callable

  • **kwargs (Any, default: {} ) –

    Keyword arguments for the callable

Returns:

  • A concurrent.futures.Future representing the result of the execution