API reference¶
VirtualiZarr has a small API surface, because most of the complexity is handled by xarray functions like xarray.concat and xarray.merge.
Users can use xarray for every step apart from reading and serializing virtual references.
User API¶
Reading¶
virtualizarr.open_virtual_dataset ¶
open_virtual_dataset(
file_url: str,
object_store: ObjectStore,
parser: Parser,
drop_variables: Iterable[str] | None = None,
loadable_variables: Iterable[str] | None = None,
decode_times: bool | None = None,
cftime_variables: Iterable[str] | None = None,
indexes: Mapping[str, Index] | None = None,
) -> Dataset
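For example, a minimal sketch of opening a single netCDF4/HDF5 file as a virtual dataset. The bucket, file path, variable names, and store configuration below are hypothetical, and the obstore S3Store keyword options shown are one common way to configure an anonymous read:
from obstore.store import S3Store
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser

# Hypothetical netCDF4/HDF5 file in an S3 bucket
bucket = "my-bucket"
store = S3Store(bucket, region="us-east-1", skip_signature=True)

vds = open_virtual_dataset(
    f"s3://{bucket}/data/air_temperature.nc",
    object_store=store,
    parser=HDFParser(),
    loadable_variables=["time", "lat", "lon"],  # load these into memory; keep the rest virtual
)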
virtualizarr.open_virtual_mfdataset ¶
open_virtual_mfdataset(
paths: str
| PathLike
| Sequence[str | PathLike]
| NestedSequence[str | PathLike],
object_store: ObjectStore,
parser: Parser,
concat_dim: str
| DataArray
| Index
| Sequence[str]
| Sequence[DataArray]
| Sequence[Index]
| None = None,
compat: "CompatOptions" = "no_conflicts",
preprocess: Callable[[Dataset], Dataset] | None = None,
data_vars: Literal["all", "minimal", "different"] | list[str] = "all",
coords="different",
combine: Literal["by_coords", "nested"] = "by_coords",
parallel: Literal["dask", "lithops", False] | type[Executor] = False,
join: "JoinOptions" = "outer",
attrs_file: str | PathLike | None = None,
combine_attrs: "CombineAttrsOptions" = "override",
**kwargs,
) -> Dataset
Open multiple files as a single virtual dataset.
This function is explicitly modelled after xarray.open_mfdataset
, and works in the same way.
If combine='by_coords' then the function combine_by_coords
is used to combine
the datasets into one before returning the result, and if combine='nested' then
combine_nested
is used. The filepaths must be structured according to which
combining function is used, the details of which are given in the documentation for
combine_by_coords
and combine_nested
. By default combine='by_coords'
will be used. Global attributes from the attrs_file
are used
for the combined dataset.
Parameters:
- paths (str | PathLike | Sequence[str | PathLike] | NestedSequence[str | PathLike]) – Same as in xarray.open_mfdataset
- concat_dim (str | DataArray | Index | Sequence[str] | Sequence[DataArray] | Sequence[Index] | None, default: None) – Same as in xarray.open_mfdataset
- compat ('CompatOptions', default: 'no_conflicts') – Same as in xarray.open_mfdataset
- preprocess (Callable[[Dataset], Dataset] | None, default: None) – Same as in xarray.open_mfdataset
- data_vars (Literal['all', 'minimal', 'different'] | list[str], default: 'all') – Same as in xarray.open_mfdataset
- coords – Same as in xarray.open_mfdataset
- combine (Literal['by_coords', 'nested'], default: 'by_coords') – Same as in xarray.open_mfdataset
- parallel ("dask", "lithops", False, or a subclass of concurrent.futures.Executor, default: False) – Specify whether the open and preprocess steps of this function will be performed in parallel using lithops, dask.delayed, or any executor compatible with the concurrent.futures interface, or in serial. Default is False, which will execute these steps in serial.
- join ('JoinOptions', default: 'outer') – Same as in xarray.open_mfdataset
- attrs_file (str | PathLike | None, default: None) – Same as in xarray.open_mfdataset
- combine_attrs ('CombineAttrsOptions', default: 'override') – Same as in xarray.open_mfdataset
- **kwargs (optional, default: {}) – Additional arguments passed on to virtualizarr.open_virtual_dataset. For an overview of some of the possible options, see the documentation of virtualizarr.open_virtual_dataset.
Returns:
- Dataset
Notes
The results of opening each virtual dataset in parallel are sent back to the client process, so must not be too large. See the docs page on Scaling.
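For example, a minimal sketch combining several files along their shared coordinates. The bucket and file paths are hypothetical:
from obstore.store import S3Store
from virtualizarr import open_virtual_mfdataset
from virtualizarr.parsers import HDFParser

bucket = "my-bucket"  # hypothetical
store = S3Store(bucket, region="us-east-1", skip_signature=True)

# One file per day, combined along their shared coordinates into a single virtual dataset
paths = [f"s3://{bucket}/daily/air.{day:02d}.nc" for day in range(1, 8)]
vds = open_virtual_mfdataset(
    paths,
    object_store=store,
    parser=HDFParser(),
    combine="by_coords",
)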
Parsers¶
Each parser understands how to read a specific file format, and a parser must be passed to virtualizarr.open_virtual_dataset.
virtualizarr.parsers.DMRPPParser ¶
__call__ ¶
__call__(file_url: str, object_store: ObjectStore) -> ManifestStore
Parse the metadata and byte offsets from a given file to produce a VirtualiZarr ManifestStore.
Parameters:
- file_url (str) – The URI or path to the input file (e.g., "s3://bucket/file.dmrpp").
- object_store (ObjectStore) – An obstore ObjectStore instance for accessing the file specified in the file_url parameter.
Returns:
- ManifestStore – A ManifestStore that provides a Zarr representation of the parsed file.
__init__ ¶
Instantiate a parser with parser-specific parameters that can be used in the __call__ method.
Parameters:
virtualizarr.parsers.FITSParser ¶
__call__ ¶
__call__(file_url: str, object_store: ObjectStore) -> ManifestStore
Parse the contents of a FITS file to produce a ManifestStore.
Parameters:
- file_url (str) – The URI or path to the input file (e.g., "s3://bucket/file.fits").
- object_store (ObjectStore) – An obstore ObjectStore instance for accessing the file specified in the file_url parameter.
Returns:
- ManifestStore – A ManifestStore which provides a Zarr representation of the parsed file.
__init__ ¶
__init__(
group: str | None = None,
skip_variables: Iterable[str] | None = None,
reader_options: Optional[dict] = None,
)
Instantiate a parser with parser-specific parameters that can be used in the __call__ method.
Parameters:
- group (str | None, default: None) – The group within the file to be used as the Zarr root group for the ManifestStore.
- skip_variables (Iterable[str] | None, default: None) – Variables in the file that will be ignored when creating the ManifestStore.
- reader_options (Optional[dict], default: None) – Configuration options used internally for kerchunk's fsspec backend.
virtualizarr.parsers.HDFParser ¶
virtualizarr.parsers.NetCDF3Parser ¶
__call__ ¶
__call__(file_url: str, object_store: ObjectStore) -> ManifestStore
Parse the metadata and byte offsets from a given file to produce a VirtualiZarr ManifestStore.
Parameters:
- file_url (str) – The URI or path to the input file (e.g., "s3://bucket/file.nc").
- object_store (ObjectStore) – An obstore ObjectStore instance for accessing the file specified in the file_url parameter.
Returns:
- ManifestStore – A ManifestStore that provides a Zarr representation of the parsed file.
__init__ ¶
__init__(
group: str | None = None,
skip_variables: Iterable[str] | None = None,
reader_options: dict | None = None,
)
Instantiate a parser with parser-specific parameters that can be used in the __call__ method.
Parameters:
- group (str | None, default: None) – The group within the file to be used as the Zarr root group for the ManifestStore.
- skip_variables (Iterable[str] | None, default: None) – Variables in the file that will be ignored when creating the ManifestStore.
- reader_options (dict | None, default: None) – Configuration options used internally for kerchunk's fsspec backend.
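For example, a sketch of passing parser-specific options when opening a netCDF-3 file. The bucket, file path, and skipped variable name are hypothetical:
from obstore.store import S3Store
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import NetCDF3Parser

parser = NetCDF3Parser(skip_variables=["history"])  # ignore a variable we don't want referenced
vds = open_virtual_dataset(
    "s3://my-bucket/classic_model_output.nc",  # hypothetical netCDF-3 file
    object_store=S3Store("my-bucket", region="us-east-1", skip_signature=True),
    parser=parser,
)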
virtualizarr.parsers.KerchunkJSONParser ¶
__call__ ¶
__call__(file_url: str, object_store: ObjectStore) -> ManifestStore
Parse the metadata and byte offsets from a given file to produce a VirtualiZarr ManifestStore.
Parameters:
- file_url (str) – The URI or path to the input file (e.g., "s3://bucket/kerchunk.json").
- object_store (ObjectStore) – An obstore ObjectStore instance for accessing the file specified in the file_url parameter.
Returns:
- ManifestStore – A ManifestStore that provides a Zarr representation of the parsed file.
__init__ ¶
__init__(
group: str | None = None,
fs_root: str | None = None,
skip_variables: Iterable[str] | None = None,
store_registry: ObjectStoreRegistry | None = None,
)
Instantiate a parser with parser-specific parameters that can be used in the __call__ method.
Parameters:
- group (str | None, default: None) – The group within the file to be used as the Zarr root group for the ManifestStore.
- fs_root (str | None, default: None) – The qualifier to be used for kerchunk references containing relative paths.
- skip_variables (Iterable[str] | None, default: None) – Variables in the file that will be ignored when creating the ManifestStore.
- store_registry (ObjectStoreRegistry | None, default: None) – A user-defined ObjectStoreRegistry to be used for reading data when kerchunk references contain paths to multiple locations.
virtualizarr.parsers.KerchunkParquetParser ¶
__call__ ¶
__call__(file_url: str, object_store: ObjectStore) -> ManifestStore
Parse the metadata and byte offsets from a given file to produce a VirtualiZarr ManifestStore.
Parameters:
- file_url (str) – The URI or path to the input parquet directory (e.g., "s3://bucket/file.parq").
- object_store (ObjectStore) – An obstore ObjectStore instance for accessing the file specified in the file_url parameter.
Returns:
- ManifestStore – A ManifestStore which provides a Zarr representation of the parsed file.
__init__ ¶
__init__(
group: str | None = None,
fs_root: str | None = None,
skip_variables: Iterable[str] | None = None,
reader_options: dict | None = None,
)
Instantiate a parser with parser-specific parameters that can be used in the __call__ method.
Parameters:
- group (str | None, default: None) – The group within the file to be used as the Zarr root group for the ManifestStore.
- fs_root (str | None, default: None) – The qualifier to be used for kerchunk references containing relative paths.
- skip_variables (Iterable[str] | None, default: None) – Variables in the file that will be ignored when creating the ManifestStore.
- reader_options (dict | None, default: None) – Configuration options used internally for the fsspec backend.
virtualizarr.parsers.ZarrParser ¶
__call__ ¶
__call__(file_url: str, object_store: ObjectStore) -> ManifestStore
Parse the metadata and byte offsets from a given Zarr store to produce a VirtualiZarr ManifestStore.
Parameters:
- file_url (str) – The URI or path to the input Zarr store (e.g., "s3://bucket/store.zarr").
- object_store (ObjectStore) – An obstore ObjectStore instance for accessing the directory specified in the file_url parameter.
Returns:
- ManifestStore – A ManifestStore which provides a Zarr representation of the parsed file.
__init__ ¶
Instantiate a parser with parser-specific parameters that can be used in the __call__ method.
Parameters:
- group (str | None, default: None) – The group within the file to be used as the Zarr root group for the ManifestStore (default: the file's root group).
- skip_variables (Iterable[str] | None, default: None) – Variables in the file that will be ignored when creating the ManifestStore (default: None, do not ignore any variables).
Serialization¶
virtualizarr.accessor.VirtualiZarrDatasetAccessor ¶
Xarray accessor for writing out virtual datasets to disk.
Methods on this object are called via ds.virtualize.{method}.
nbytes
property
¶
nbytes: int
Size required to hold these references in memory in bytes.
Note this is not the size of the referenced chunks if they were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
In-memory (loadable) variables are included in the total using xarray's normal .nbytes
method.
rename_paths ¶
Rename paths to chunks in every ManifestArray in this dataset.
Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.
Parameters:
- new (str | Callable[[str], str]) – New path to use for all chunks, either as a string, or as a function which accepts and returns strings.
Returns:
- Dataset
See Also
virtualizarr.ManifestArray.rename_paths
virtualizarr.ChunkManifest.rename_paths
Examples:
Rename paths to reflect moving the referenced files from local storage to an S3 bucket.
>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> ds.virtualize.rename_paths(local_to_s3_url)
to_icechunk ¶
to_icechunk(
store: IcechunkStore,
*,
group: str | None = None,
append_dim: str | None = None,
last_updated_at: datetime | None = None,
) -> None
Write an xarray dataset to an Icechunk store.
Any variables backed by ManifestArray objects will be written as virtual references. Any other variables will be loaded into memory before their binary chunk data is written into the store.
If append_dim
is provided, the virtual dataset will be appended to the
existing IcechunkStore along the append_dim
dimension.
If last_updated_at
is provided, it will be used as a checksum for any virtual
chunks written to the store with this operation. At read time, if any of the
virtual chunks have been updated since this provided datetime, an error will be
raised. This protects against reading outdated virtual chunks that have been
updated since the last read. When not provided, no check is performed. This
value is stored in Icechunk with seconds precision, so be sure to take that into
account when providing this value.
Parameters:
- store (IcechunkStore) – Store to write dataset into.
- group (str | None, default: None) – Path of the group to write the dataset into (default: the root group).
- append_dim (str | None, default: None) – Dimension along which to append the virtual dataset.
- last_updated_at (datetime | None, default: None) – Datetime to use as a checksum for any virtual chunks written to the store with this operation. When not provided, no check is performed.
Raises:
- ValueError – If the store is read-only.
Examples:
To ensure an error is raised if the files containing referenced virtual chunks
are modified at any time from now on, pass the current time to
last_updated_at
.
>>> from datetime import datetime
>>> vds.virtualize.to_icechunk(
... icechunkstore,
... last_updated_at=datetime.now(),
... )
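To append along an existing dimension instead, a sketch (assuming icechunkstore already holds earlier data and vds2 is a virtual dataset covering later times along "time"):
>>> vds2.virtualize.to_icechunk(icechunkstore, append_dim="time")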
to_kerchunk ¶
to_kerchunk(filepath: None, format: Literal['dict']) -> KerchunkStoreRefs
to_kerchunk(
filepath: str | Path | None = None,
format: Literal["dict", "json", "parquet"] = "dict",
record_size: int = 100000,
categorical_threshold: int = 10,
) -> KerchunkStoreRefs | None
Serialize all virtualized arrays in this xarray dataset into the kerchunk references format.
Parameters:
- filepath (str | Path | None, default: None) – File path to write kerchunk references into. Not required if format is 'dict'.
- format (Literal['dict', 'json', 'parquet'], default: 'dict') – Format to serialize the kerchunk references as. If 'json' or 'parquet' then the 'filepath' argument is required.
- record_size (int, default: 100000) – Number of references to store in each reference file (default 100,000). Bigger values mean fewer read requests but larger memory footprint. Only available when format is 'parquet'.
- categorical_threshold (int, default: 10) – Encode urls as pandas.Categorical to reduce memory footprint if the ratio of the number of unique urls to total number of refs for each variable is greater than or equal to this number (default 10). Only available when format is 'parquet'.
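For example, a minimal sketch (assuming vds is a virtual dataset opened earlier; the output filename is hypothetical):
refs = vds.virtualize.to_kerchunk(format="dict")  # in-memory references
vds.virtualize.to_kerchunk("combined.json", format="json")  # write references to a JSON file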
virtualizarr.accessor.VirtualiZarrDataTreeAccessor ¶
Xarray accessor for writing out virtual datatrees to disk.
Methods on this object are called via dt.virtualize.{method}.
to_icechunk ¶
to_icechunk(
store: IcechunkStore,
*,
write_inherited_coords: bool = False,
last_updated_at: datetime | None = None,
) -> None
Write an xarray DataTree to an Icechunk store.
Any variables backed by ManifestArray objects will be written as virtual references. Any other variables will be loaded into memory before their binary chunk data is written into the store.
If last_updated_at
is provided, it will be used as a checksum for any
virtual chunks written to the store with this operation. At read time, if any
of the virtual chunks have been updated since this provided datetime, an error
will be raised. This protects against reading outdated virtual chunks that have
been updated since the last read. When not provided, no check is performed.
This value is stored in Icechunk with seconds precision, so be sure to take that
into account when providing this value.
Parameters:
- store (IcechunkStore) – Store to write dataset into.
- write_inherited_coords (bool, default: False) – If True, replicate inherited coordinates on all descendant nodes. Otherwise, only write coordinates at the level at which they are originally defined. This saves disk space, but requires opening the full tree to load inherited coordinates.
- last_updated_at (datetime | None, default: None) – Datetime to use as a checksum for any virtual chunks written to the store with this operation. When not provided, no check is performed.
Raises:
- ValueError – If the store is read-only.
Examples:
To ensure an error is raised if the files containing referenced virtual chunks
are modified at any time from now on, pass the current time to
last_updated_at
.
>>> from datetime import datetime
>>> vdt.virtualize.to_icechunk(
... icechunkstore,
... last_updated_at=datetime.now(),
... )
Information¶
virtualizarr.accessor.VirtualiZarrDatasetAccessor.nbytes
property
¶
nbytes: int
Size required to hold these references in memory in bytes.
Note this is not the size of the referenced chunks if they were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
In-memory (loadable) variables are included in the total using xarray's normal .nbytes
method.
Rewriting¶
virtualizarr.accessor.VirtualiZarrDatasetAccessor.rename_paths ¶
Rename paths to chunks in every ManifestArray in this dataset.
Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.
Parameters:
- new (str | Callable[[str], str]) – New path to use for all chunks, either as a string, or as a function which accepts and returns strings.
Returns:
- Dataset
See Also
virtualizarr.ManifestArray.rename_paths
virtualizarr.ChunkManifest.rename_paths
Examples:
Rename paths to reflect moving the referenced files from local storage to an S3 bucket.
>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> ds.virtualize.rename_paths(local_to_s3_url)
Developer API¶
If you want to write a new parser to create virtual references pointing to a custom file format, you will need to use VirtualiZarr's internal classes. See the page on custom parsers for more information.
Manifests¶
VirtualiZarr uses these classes to store virtual references internally. See the page on data structures for more information.
virtualizarr.manifests.ChunkManifest ¶
In-memory representation of a single Zarr chunk manifest.
Stores the manifest internally as numpy arrays, so the most efficient way to create this object is via the .from_arrays
constructor classmethod.
The manifest can be converted to or from a dictionary which looks like this
{
"0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
"0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
"0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
"0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}
using the __init__() and .dict() methods, so users of this class can think of the manifest as if it were a dict mapping zarr chunk keys to byte ranges.
(See the chunk manifest SPEC proposal in zarr-developers/zarr-specs#287.)
Validation is done when this object is instantiated, and this class is immutable, so it's not possible to have a ChunkManifest object that does not represent a valid grid of chunks.
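For example, a minimal sketch of the dict round-trip (the chunk keys and byte ranges are hypothetical):
from virtualizarr.manifests import ChunkManifest

entries = {
    "0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
}
manifest = ChunkManifest(entries=entries)
print(manifest.dict()["0.1"])  # byte range for chunk "0.1"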
nbytes
property
¶
nbytes: int
Size required to hold these references in memory in bytes.
Note this is not the size of the referenced chunks if they were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
ndim_chunk_grid
property
¶
ndim_chunk_grid: int
Number of dimensions in the chunk grid.
Not the same as the dimension of an array backed by this chunk manifest.
shape_chunk_grid
property
¶
Number of separate chunks along each dimension.
Not the same as the shape of an array backed by this chunk manifest.
__init__ ¶
Create a ChunkManifest from a dictionary mapping zarr chunk keys to byte ranges.
Parameters:
- entries (dict) – Chunk keys and byte range information, as a dictionary of the form
{
"0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
"0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
"0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
"0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}
dict ¶
dict() -> ChunkDict
Convert the entire manifest to a nested dictionary.
The returned dict will be of the form
{
"0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
"0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
"0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
"0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}
Entries whose path is an empty string will be interpreted as missing chunks and omitted from the dictionary.
from_arrays
classmethod
¶
from_arrays(
*,
paths: ndarray[Any, StringDType],
offsets: ndarray[Any, dtype[uint64]],
lengths: ndarray[Any, dtype[uint64]],
validate_paths: bool = True,
) -> ChunkManifest
Create manifest directly from numpy arrays containing the path and byte range information.
Useful if you want to avoid the memory overhead of creating an intermediate dictionary first, as these 3 arrays are what will be used internally to store the references.
Parameters:
- paths (ndarray[Any, StringDType]) – Array containing the paths to the chunks
- offsets (ndarray[Any, dtype[uint64]]) – Array containing the byte offsets of the chunks
- lengths (ndarray[Any, dtype[uint64]]) – Array containing the byte lengths of the chunks
- validate_paths (bool, default: True) – Check that entries in the manifest are valid paths (e.g. that local paths are absolute not relative). Set to False to skip validation for performance reasons.
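A minimal sketch (the paths and byte ranges are hypothetical; the paths array must use numpy's variable-width StringDType, available in NumPy 2.0+):
import numpy as np
from virtualizarr.manifests import ChunkManifest

# a 2 x 1 chunk grid, with both chunks stored in the same file
paths = np.array([["s3://bucket/foo.nc"], ["s3://bucket/foo.nc"]], dtype=np.dtypes.StringDType())
offsets = np.array([[100], [200]], dtype=np.uint64)
lengths = np.array([[100], [100]], dtype=np.uint64)

manifest = ChunkManifest.from_arrays(paths=paths, offsets=offsets, lengths=lengths)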
rename_paths ¶
rename_paths(new: str | Callable[[str], str]) -> ChunkManifest
Rename paths to chunks in this manifest.
Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.
Parameters:
- new (str | Callable[[str], str]) – New path to use for all chunks, either as a string, or as a function which accepts and returns strings.
Returns:
- manifest
See Also
ManifestArray.rename_paths
Examples:
Rename paths to reflect moving the referenced files from local storage to an S3 bucket.
>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> manifest.rename_paths(local_to_s3_url)
virtualizarr.manifests.ManifestArray ¶
Virtualized array representation of the chunk data in a single Zarr Array.
Supports concatenation / stacking, but only if the two arrays to be concatenated have the same codecs.
Cannot be directly altered.
Implements a subset of the array API standard such that it can be wrapped by xarray. Doesn't store the zarr array name, zattrs or ARRAY_DIMENSIONS, as those can instead be stored on a wrapping xarray object.
nbytes_virtual
property
¶
nbytes_virtual: int
Size required to hold these references in memory in bytes.
Note this is not the size of the referenced array if it were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
__array_function__ ¶
__array_function__(func, types, args, kwargs) -> Any
Hook to teach this class what to do if np.concat etc. is called on it.
Use this instead of array_namespace so that we don't make promises we can't keep.
__array_ufunc__ ¶
__array_ufunc__(ufunc, method, *inputs, **kwargs) -> Any
We have to define this in order to convince xarray that this class is a duckarray, even though we will never support ufuncs.
__eq__ ¶
Element-wise equality checking.
Returns a numpy array of booleans.
__getitem__ ¶
__getitem__(
key: Union[
int,
slice,
EllipsisType,
None,
tuple[Union[int, slice, EllipsisType, None, ndarray], ...],
ndarray,
],
) -> ManifestArray
Only supports extremely limited indexing.
Only here because xarray will apparently attempt to index into its lazy indexing classes even if the operation would be a no-op anyway.
__init__ ¶
__init__(
metadata: ArrayV3Metadata | dict, chunkmanifest: dict | ChunkManifest
) -> None
Create a ManifestArray directly from the metadata of a zarr array and the manifest of chunks.
Parameters:
- metadata (dict or ArrayV3Metadata)
- chunkmanifest (dict or ChunkManifest)
astype ¶
astype(dtype: dtype, /, *, copy: bool = True) -> ManifestArray
Cannot change the dtype, but needed because xarray will call this even when it's a no-op.
rename_paths ¶
rename_paths(new: str | Callable[[str], str]) -> ManifestArray
Rename paths to chunks in this array's manifest.
Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.
Parameters:
- new (str | Callable[[str], str]) – New path to use for all chunks, either as a string, or as a function which accepts and returns strings.
Returns:
See Also
ChunkManifest.rename_paths
Examples:
Rename paths to reflect moving the referenced files from local storage to an S3 bucket.
>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> marr.rename_paths(local_to_s3_url)
virtualizarr.manifests.ManifestGroup ¶
Bases: Mapping[str, 'ManifestArray | ManifestGroup']
Immutable representation of a single virtual zarr group.
__init__ ¶
__init__(
arrays: Mapping[str, ManifestArray] | None = None,
groups: Mapping[str, "ManifestGroup"] | None = None,
attributes: dict | None = None,
) -> None
Create a ManifestGroup containing ManifestArrays and/or sub-groups, as well as any group-level metadata.
Parameters:
- arrays (Mapping[str, ManifestArray], default: None) – ManifestArray objects to represent virtual zarr arrays.
- groups (Mapping[str, ManifestGroup], default: None) – ManifestGroup objects to represent virtual zarr subgroups.
- attributes (dict, default: None) – Zarr attributes to add as zarr group metadata.
virtualizarr.manifests.ManifestStore ¶
Bases: Store
A read-only Zarr store that uses obstore to read data from inside arbitrary files on AWS, GCP, Azure, or a local filesystem.
The requests from the Zarr API are redirected using the virtualizarr.manifests.ManifestGroup containing multiple virtualizarr.manifests.ManifestArray, allowing for virtually interfacing with underlying data in other file formats.
Parameters:
- group (ManifestGroup) – Root group of the store. Contains group metadata, ManifestArrays, and any subgroups.
- store_registry (ObjectStoreRegistry, default: None) – ObjectStoreRegistry that maps the URL scheme and netloc to ObjectStore instances, allowing ManifestStores to read from different ObjectStore instances.
Warnings
ManifestStore is experimental and subject to API changes without notice. Please raise an issue with any comments/concerns about the store.
Notes
Modified from zarr-developers/zarr-python!1661
__init__ ¶
__init__(
group: ManifestGroup, *, store_registry: ObjectStoreRegistry | None = None
) -> None
Instantiate a new ManifestStore.
Parameters:
- group (ManifestGroup) – Manifest Group containing group metadata and mapping variable names to ManifestArrays.
- store_registry (ObjectStoreRegistry | None, default: None) – A registry mapping the URL scheme and netloc to ObjectStore instances, allowing ManifestStores to read from different ObjectStore instances.
Array API¶
VirtualiZarr's virtualizarr.manifests.ManifestArray objects support a limited subset of the Python Array API standard in virtualizarr.manifests.array_api
.
virtualizarr.manifests.array_api.concatenate ¶
concatenate(
arrays: tuple[ManifestArray, ...] | list[ManifestArray],
/,
*,
axis: int | None = 0,
) -> ManifestArray
Concatenate ManifestArrays by merging their chunk manifests.
The signature of this function is array API compliant, so that it can be called by xarray.concat
.
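In practice this means virtual datasets can be combined with ordinary xarray calls, e.g. (a sketch, assuming vds1 and vds2 were opened with open_virtual_dataset and are compatible along "time"):
import xarray as xr

combined = xr.concat([vds1, vds2], dim="time", coords="minimal", compat="override")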
virtualizarr.manifests.array_api.stack ¶
stack(
arrays: tuple[ManifestArray, ...] | list[ManifestArray], /, *, axis: int = 0
) -> ManifestArray
Stack ManifestArrays by merging their chunk manifests.
The signature of this function is array API compliant, so that it can be called by xarray.stack
.
virtualizarr.manifests.array_api.expand_dims ¶
expand_dims(x: ManifestArray, /, *, axis: int = 0) -> ManifestArray
Expands the shape of an array by inserting a new axis (dimension) of size one at the position specified by axis.
virtualizarr.manifests.array_api.broadcast_to ¶
broadcast_to(x: ManifestArray, /, shape: tuple[int, ...]) -> ManifestArray
Broadcasts a ManifestArray to a specified shape, by either adjusting chunk keys or copying chunk manifest entries.
Parser typing protocol¶
All custom parsers must follow the virtualizarr.parsers.typing.Parser
typing protocol.
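A skeleton of such a parser (a sketch only; the class name and the format-specific logic inside __call__ are hypothetical):
from obstore.store import ObjectStore

from virtualizarr.manifests import ManifestStore


class MyFormatParser:
    def __init__(self, group: str | None = None):
        # parser-specific options go here
        self.group = group

    def __call__(self, file_url: str, object_store: ObjectStore) -> ManifestStore:
        # Read the file's own metadata, build ChunkManifests/ManifestArrays describing
        # its chunks, wrap them in a ManifestGroup, and return a ManifestStore.
        raise NotImplementedError("format-specific parsing goes here")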
Parallelization¶
Parallelizing virtual reference generation can be done using a number of parallel execution frameworks. Advanced users may want to call one of these executors directly. See the docs page on Scaling.
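For example, a sketch of choosing an executor through open_virtual_mfdataset's parallel argument (paths, store, and parser are assumed to be defined as in the earlier sketches):
from virtualizarr import open_virtual_mfdataset
from virtualizarr.parallel import DaskDelayedExecutor

vds = open_virtual_mfdataset(
    paths,                         # list of file URLs, as in the earlier sketch
    object_store=store,
    parser=parser,
    parallel=DaskDelayedExecutor,  # or "dask", "lithops", or False for serial
)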
virtualizarr.parallel.SerialExecutor ¶
Bases: Executor
A custom Executor that runs tasks sequentially, mimicking the concurrent.futures.Executor interface. Useful as a default and for debugging.
map ¶
map(
fn: Callable[..., T],
*iterables: Iterable[Any],
timeout: float | None = None,
chunksize: int = 1,
) -> Iterator[T]
Execute a function over an iterable sequentially.
Parameters:
- fn (Callable[..., T]) – Function to apply to each item
- *iterables (Iterable[Any], default: ()) – Iterables to process
- timeout (float | None, default: None) – Optional timeout (ignored in serial execution)
Returns:
- Generator of results
shutdown ¶
Shutdown the executor.
Parameters:
- wait (bool, default: True) – Whether to wait for pending futures (always True for serial executor)
submit ¶
Submit a callable to be executed.
Unlike parallel executors, this runs the task immediately and sequentially.
Parameters:
- fn (Callable[..., T]) – The callable to execute
- *args (Any, default: ()) – Positional arguments for the callable
- **kwargs (Any, default: {}) – Keyword arguments for the callable
Returns:
- A Future representing the result of the execution
virtualizarr.parallel.DaskDelayedExecutor ¶
Bases: Executor
An Executor that uses dask.delayed for parallel computation.
This executor mimics the concurrent.futures.Executor interface but uses Dask's delayed computation model.
map ¶
map(
fn: Callable[..., T],
*iterables: Iterable[Any],
timeout: float | None = None,
chunksize: int = 1,
) -> Iterator[T]
Apply a function to an iterable using dask.delayed.
Parameters:
- fn (Callable[..., T]) – Function to apply to each item
- *iterables (Iterable[Any], default: ()) – Iterables to process
- timeout (float | None, default: None) – Optional timeout (ignored in serial execution)
Returns:
- Generator of results
shutdown ¶
Shutdown the executor.
Parameters:
- wait (bool, default: True) – Whether to wait for pending futures (always True for this executor)
submit ¶
Submit a task to be computed with dask.delayed.
Parameters:
- fn (Callable[..., T]) – The callable to execute
- *args (Any, default: ()) – Positional arguments for the callable
- **kwargs (Any, default: {}) – Keyword arguments for the callable
Returns:
- A Future representing the result of the execution
virtualizarr.parallel.LithopsEagerFunctionExecutor ¶
Bases: Executor
Lithops-based function executor which follows the concurrent.futures.Executor API.
Only required because lithops doesn't follow the concurrent.futures.Executor API, see lithops-cloud/lithops#1427.
map ¶
map(
fn: Callable[..., T],
*iterables: Iterable[Any],
timeout: float | None = None,
chunksize: int = 1,
) -> Iterator[T]
Apply a function to an iterable using lithops.
Only needed because lithops.FunctionExecutor.map returns futures, unlike concurrent.futures.Executor.map
.
Parameters:
- fn (Callable[..., T]) – Function to apply to each item
- *iterables (Iterable[Any], default: ()) – Iterables to process
- timeout (float | None, default: None) – Optional timeout (ignored in serial execution)
Returns:
- Generator of results
shutdown ¶
Shutdown the executor.
Parameters:
- wait (bool, default: True) – Whether to wait for pending futures.
submit ¶
Submit a task to be computed using lithops.
Parameters:
- fn (Callable[..., T]) – The callable to execute
- *args (Any, default: ()) – Positional arguments for the callable
- **kwargs (Any, default: {}) – Keyword arguments for the callable
Returns:
- A concurrent.futures.Future representing the result of the execution