API reference¶
VirtualiZarr has a small API surface, because most of the complexity is handled by xarray functions like xarray.concat and xarray.merge.
Users can use xarray for every step apart from reading and serializing virtual references.
User API¶
Reading¶
virtualizarr.open_virtual_dataset ¶
open_virtual_dataset(
file_url: str,
object_store: ObjectStore,
parser: Parser,
drop_variables: Iterable[str] | None = None,
loadable_variables: Iterable[str] | None = None,
decode_times: bool | None = None,
cftime_variables: Iterable[str] | None = None,
indexes: Mapping[str, Index] | None = None,
) -> Dataset
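For example, a minimal sketch of opening a single netCDF4/HDF5 file as a virtual dataset. The bucket, file path, variable names, and store configuration below are hypothetical, and the obstore S3Store keyword options shown are one common way to configure an anonymous read:
from obstore.store import S3Store
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser

# Hypothetical netCDF4/HDF5 file in an S3 bucket
bucket = "my-bucket"
store = S3Store(bucket, region="us-east-1", skip_signature=True)

vds = open_virtual_dataset(
    f"s3://{bucket}/data/air_temperature.nc",
    object_store=store,
    parser=HDFParser(),
    loadable_variables=["time", "lat", "lon"],  # load these into memory; keep the rest virtual
)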
virtualizarr.open_virtual_mfdataset ¶
open_virtual_mfdataset(
paths: str
| PathLike
| Sequence[str | PathLike]
| NestedSequence[str | PathLike],
object_store: ObjectStore,
parser: Parser,
concat_dim: str
| DataArray
| Index
| Sequence[str]
| Sequence[DataArray]
| Sequence[Index]
| None = None,
compat: "CompatOptions" = "no_conflicts",
preprocess: Callable[[Dataset], Dataset] | None = None,
data_vars: Literal["all", "minimal", "different"] | list[str] = "all",
coords="different",
combine: Literal["by_coords", "nested"] = "by_coords",
parallel: Literal["dask", "lithops", False] | type[Executor] = False,
join: "JoinOptions" = "outer",
attrs_file: str | PathLike | None = None,
combine_attrs: "CombineAttrsOptions" = "override",
**kwargs,
) -> Dataset
Open multiple files as a single virtual dataset.
This function is explicitly modelled after xarray.open_mfdataset
, and works in the same way.
If combine='by_coords' then the function combine_by_coords
is used to combine
the datasets into one before returning the result, and if combine='nested' then
combine_nested
is used. The filepaths must be structured according to which
combining function is used, the details of which are given in the documentation for
combine_by_coords
and combine_nested
. By default combine='by_coords'
will be used. Global attributes from the attrs_file
are used
for the combined dataset.
Parameters:
- paths (str | PathLike | Sequence[str | PathLike] | NestedSequence[str | PathLike]) – Same as in xarray.open_mfdataset
- concat_dim (str | DataArray | Index | Sequence[str] | Sequence[DataArray] | Sequence[Index] | None, default: None) – Same as in xarray.open_mfdataset
- compat ('CompatOptions', default: 'no_conflicts') – Same as in xarray.open_mfdataset
- preprocess (Callable[[Dataset], Dataset] | None, default: None) – Same as in xarray.open_mfdataset
- data_vars (Literal['all', 'minimal', 'different'] | list[str], default: 'all') – Same as in xarray.open_mfdataset
- coords – Same as in xarray.open_mfdataset
- combine (Literal['by_coords', 'nested'], default: 'by_coords') – Same as in xarray.open_mfdataset
- parallel ("dask", "lithops", False, or a subclass of concurrent.futures.Executor, default: False) – Specify whether the open and preprocess steps of this function will be performed in parallel using lithops, dask.delayed, or any executor compatible with the concurrent.futures interface, or in serial. Default is False, which will execute these steps in serial.
- join ('JoinOptions', default: 'outer') – Same as in xarray.open_mfdataset
- attrs_file (str | PathLike | None, default: None) – Same as in xarray.open_mfdataset
- combine_attrs ('CombineAttrsOptions', default: 'override') – Same as in xarray.open_mfdataset
- **kwargs (optional, default: {}) – Additional arguments passed on to virtualizarr.open_virtual_dataset. For an overview of some of the possible options, see the documentation of virtualizarr.open_virtual_dataset.
Returns:
- Dataset
Notes
The results of opening each virtual dataset in parallel are sent back to the client process, so must not be too large. See the docs page on Scaling.
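For example, a minimal sketch combining several files along their shared coordinates. The bucket and file paths are hypothetical:
from obstore.store import S3Store
from virtualizarr import open_virtual_mfdataset
from virtualizarr.parsers import HDFParser

bucket = "my-bucket"  # hypothetical
store = S3Store(bucket, region="us-east-1", skip_signature=True)

# One file per day, combined along their shared coordinates into a single virtual dataset
paths = [f"s3://{bucket}/daily/air.{day:02d}.nc" for day in range(1, 8)]
vds = open_virtual_mfdataset(
    paths,
    object_store=store,
    parser=HDFParser(),
    combine="by_coords",
)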
Parsers¶
Each parser understands how to read a specific file format, and a parser must be passed to virtualizarr.open_virtual_dataset.
virtualizarr.parsers.DMRPPParser ¶
__call__ ¶
__call__(file_url: str, object_store: ObjectStore) -> ManifestStore
Parse the metadata and byte offsets from a given file to produce a VirtualiZarr ManifestStore.
Parameters:
- file_url (str) – The URI or path to the input file (e.g., "s3://bucket/file.dmrpp").
- object_store (ObjectStore) – An obstore ObjectStore instance for accessing the file specified in the file_url parameter.
Returns:
- ManifestStore – A ManifestStore that provides a Zarr representation of the parsed file.
__init__ ¶
Instantiate a parser with parser-specific parameters that can be used in the __call__ method.
Parameters:
virtualizarr.parsers.FITSParser ¶
__call__ ¶
__call__(file_url: str, object_store: ObjectStore) -> ManifestStore
Parse the contents of a FITS file to produce a ManifestStore.
Parameters:
- file_url (str) – The URI or path to the input file (e.g., "s3://bucket/file.fits").
- object_store (ObjectStore) – An obstore ObjectStore instance for accessing the file specified in the file_url parameter.
Returns:
- ManifestStore – A ManifestStore which provides a Zarr representation of the parsed file.
__init__ ¶
__init__(
group: str | None = None,
skip_variables: Iterable[str] | None = None,
reader_options: Optional[dict] = None,
)
Instantiate a parser with parser-specific parameters that can be used in the __call__ method.
Parameters:
- group (str | None, default: None) – The group within the file to be used as the Zarr root group for the ManifestStore.
- skip_variables (Iterable[str] | None, default: None) – Variables in the file that will be ignored when creating the ManifestStore.
- reader_options (Optional[dict], default: None) – Configuration options used internally for kerchunk's fsspec backend.
virtualizarr.parsers.HDFParser ¶
virtualizarr.parsers.NetCDF3Parser ¶
__call__ ¶
__call__(file_url: str, object_store: ObjectStore) -> ManifestStore
Parse the metadata and byte offsets from a given file to produce a VirtualiZarr ManifestStore.
Parameters:
- file_url (str) – The URI or path to the input file (e.g., "s3://bucket/file.nc").
- object_store (ObjectStore) – An obstore ObjectStore instance for accessing the file specified in the file_url parameter.
Returns:
- ManifestStore – A ManifestStore that provides a Zarr representation of the parsed file.
__init__ ¶
__init__(
group: str | None = None,
skip_variables: Iterable[str] | None = None,
reader_options: dict | None = None,
)
Instantiate a parser with parser-specific parameters that can be used in the __call__ method.
Parameters:
- group (str | None, default: None) – The group within the file to be used as the Zarr root group for the ManifestStore.
- skip_variables (Iterable[str] | None, default: None) – Variables in the file that will be ignored when creating the ManifestStore.
- reader_options (dict | None, default: None) – Configuration options used internally for kerchunk's fsspec backend.
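For example, a sketch of passing parser-specific options when opening a netCDF-3 file. The bucket, file path, and skipped variable name are hypothetical:
from obstore.store import S3Store
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import NetCDF3Parser

parser = NetCDF3Parser(skip_variables=["history"])  # ignore a variable we don't want referenced
vds = open_virtual_dataset(
    "s3://my-bucket/classic_model_output.nc",  # hypothetical netCDF-3 file
    object_store=S3Store("my-bucket", region="us-east-1", skip_signature=True),
    parser=parser,
)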
virtualizarr.parsers.KerchunkJSONParser ¶
__call__ ¶
__call__(file_url: str, object_store: ObjectStore) -> ManifestStore
Parse the metadata and byte offsets from a given file to produce a VirtualiZarr ManifestStore.
Parameters:
- file_url (str) – The URI or path to the input file (e.g., "s3://bucket/kerchunk.json").
- object_store (ObjectStore) – An obstore ObjectStore instance for accessing the file specified in the file_url parameter.
Returns:
- ManifestStore – A ManifestStore that provides a Zarr representation of the parsed file.
__init__ ¶
__init__(
group: str | None = None,
fs_root: str | None = None,
skip_variables: Iterable[str] | None = None,
store_registry: ObjectStoreRegistry | None = None,
)
Instantiate a parser with parser-specific parameters that can be used in the __call__ method.
Parameters:
- group (str | None, default: None) – The group within the file to be used as the Zarr root group for the ManifestStore.
- fs_root (str | None, default: None) – The qualifier to be used for kerchunk references containing relative paths.
- skip_variables (Iterable[str] | None, default: None) – Variables in the file that will be ignored when creating the ManifestStore.
- store_registry (ObjectStoreRegistry | None, default: None) – A user-defined ObjectStoreRegistry to be used for reading data when kerchunk references contain paths to multiple locations.
virtualizarr.parsers.KerchunkParquetParser ¶
__call__ ¶
__call__(file_url: str, object_store: ObjectStore) -> ManifestStore
Parse the metadata and byte offsets from a given file to produce a VirtualiZarr ManifestStore.
Parameters:
- file_url (str) – The URI or path to the input parquet directory (e.g., "s3://bucket/file.parq").
- object_store (ObjectStore) – An obstore ObjectStore instance for accessing the file specified in the file_url parameter.
Returns:
- ManifestStore – A ManifestStore which provides a Zarr representation of the parsed file.
__init__ ¶
__init__(
group: str | None = None,
fs_root: str | None = None,
skip_variables: Iterable[str] | None = None,
reader_options: dict | None = None,
)
Instantiate a parser with parser-specific parameters that can be used in the __call__ method.
Parameters:
- group (str | None, default: None) – The group within the file to be used as the Zarr root group for the ManifestStore.
- fs_root (str | None, default: None) – The qualifier to be used for kerchunk references containing relative paths.
- skip_variables (Iterable[str] | None, default: None) – Variables in the file that will be ignored when creating the ManifestStore.
- reader_options (dict | None, default: None) – Configuration options used internally for the fsspec backend.
virtualizarr.parsers.ZarrParser ¶
__call__ ¶
__call__(file_url: str, object_store: ObjectStore) -> ManifestStore
Parse the metadata and byte offsets from a given Zarr store to produce a VirtualiZarr ManifestStore.
Parameters:
- file_url (str) – The URI or path to the input Zarr store (e.g., "s3://bucket/store.zarr").
- object_store (ObjectStore) – An obstore ObjectStore instance for accessing the directory specified in the file_url parameter.
Returns:
- ManifestStore – A ManifestStore which provides a Zarr representation of the parsed file.
__init__ ¶
Instantiate a parser with parser-specific parameters that can be used in the __call__ method.
Parameters:
- group (str | None, default: None) – The group within the file to be used as the Zarr root group for the ManifestStore (default: the file's root group).
- skip_variables (Iterable[str] | None, default: None) – Variables in the file that will be ignored when creating the ManifestStore (default: None, do not ignore any variables).
Serialization¶
virtualizarr.accessor.VirtualiZarrDatasetAccessor ¶
Xarray accessor for writing out virtual datasets to disk.
Methods on this object are called via ds.virtualize.{method}.
nbytes
property
¶
nbytes: int
Size required to hold these references in memory in bytes.
Note this is not the size of the referenced chunks if they were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
In-memory (loadable) variables are included in the total using xarray's normal .nbytes
method.
rename_paths ¶
Rename paths to chunks in every ManifestArray in this dataset.
Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.
Parameters:
- new (str | Callable[[str], str]) – New path to use for all chunks, either as a string, or as a function which accepts and returns strings.
Returns:
- Dataset
See Also
virtualizarr.ManifestArray.rename_paths
virtualizarr.ChunkManifest.rename_paths
Examples:
Rename paths to reflect moving the referenced files from local storage to an S3 bucket.
>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> ds.virtualize.rename_paths(local_to_s3_url)
to_icechunk ¶
to_icechunk(
store: IcechunkStore,
*,
group: str | None = None,
append_dim: str | None = None,
last_updated_at: datetime | None = None,
) -> None
Write an xarray dataset to an Icechunk store.
Any variables backed by ManifestArray objects will be written as virtual references. Any other variables will be loaded into memory before their binary chunk data is written into the store.
If append_dim
is provided, the virtual dataset will be appended to the
existing IcechunkStore along the append_dim
dimension.
If last_updated_at
is provided, it will be used as a checksum for any virtual
chunks written to the store with this operation. At read time, if any of the
virtual chunks have been updated since this provided datetime, an error will be
raised. This protects against reading outdated virtual chunks that have been
updated since the last read. When not provided, no check is performed. This
value is stored in Icechunk with seconds precision, so be sure to take that into
account when providing this value.
Parameters:
- store (IcechunkStore) – Store to write dataset into.
- group (str | None, default: None) – Path of the group to write the dataset into (default: the root group).
- append_dim (str | None, default: None) – Dimension along which to append the virtual dataset.
- last_updated_at (datetime | None, default: None) – Datetime to use as a checksum for any virtual chunks written to the store with this operation. When not provided, no check is performed.
Raises:
- ValueError – If the store is read-only.
Examples:
To ensure an error is raised if the files containing referenced virtual chunks
are modified at any time from now on, pass the current time to
last_updated_at
.
>>> from datetime import datetime
>>> vds.virtualize.to_icechunk(
... icechunkstore,
... last_updated_at=datetime.now(),
... )
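To append along an existing dimension instead, a sketch (assuming icechunkstore already holds earlier data and vds2 is a virtual dataset covering later times along "time"):
>>> vds2.virtualize.to_icechunk(icechunkstore, append_dim="time")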
to_kerchunk ¶
to_kerchunk(filepath: None, format: Literal['dict']) -> KerchunkStoreRefs
to_kerchunk(
filepath: str | Path | None = None,
format: Literal["dict", "json", "parquet"] = "dict",
record_size: int = 100000,
categorical_threshold: int = 10,
) -> KerchunkStoreRefs | None
Serialize all virtualized arrays in this xarray dataset into the kerchunk references format.
Parameters:
- filepath (str | Path | None, default: None) – File path to write kerchunk references into. Not required if format is 'dict'.
- format (Literal['dict', 'json', 'parquet'], default: 'dict') – Format to serialize the kerchunk references as. If 'json' or 'parquet' then the 'filepath' argument is required.
- record_size (int, default: 100000) – Number of references to store in each reference file (default 100,000). Bigger values mean fewer read requests but larger memory footprint. Only available when format is 'parquet'.
- categorical_threshold (int, default: 10) – Encode urls as pandas.Categorical to reduce memory footprint if the ratio of the number of unique urls to total number of refs for each variable is greater than or equal to this number (default 10). Only available when format is 'parquet'.
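For example, a minimal sketch (assuming vds is a virtual dataset opened earlier; the output filename is hypothetical):
refs = vds.virtualize.to_kerchunk(format="dict")  # in-memory references
vds.virtualize.to_kerchunk("combined.json", format="json")  # write references to a JSON file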
virtualizarr.accessor.VirtualiZarrDataTreeAccessor ¶
Xarray accessor for writing out virtual datatrees to disk.
Methods on this object are called via dt.virtualize.{method}.
to_icechunk ¶
to_icechunk(
store: IcechunkStore,
*,
write_inherited_coords: bool = False,
last_updated_at: datetime | None = None,
) -> None
Write an xarray DataTree to an Icechunk store.
Any variables backed by ManifestArray objects will be written as virtual references. Any other variables will be loaded into memory before their binary chunk data is written into the store.
If last_updated_at
is provided, it will be used as a checksum for any
virtual chunks written to the store with this operation. At read time, if any
of the virtual chunks have been updated since this provided datetime, an error
will be raised. This protects against reading outdated virtual chunks that have
been updated since the last read. When not provided, no check is performed.
This value is stored in Icechunk with seconds precision, so be sure to take that
into account when providing this value.
Parameters:
- store (IcechunkStore) – Store to write dataset into.
- write_inherited_coords (bool, default: False) – If True, replicate inherited coordinates on all descendant nodes. Otherwise, only write coordinates at the level at which they are originally defined. This saves disk space, but requires opening the full tree to load inherited coordinates.
- last_updated_at (datetime | None, default: None) – Datetime to use as a checksum for any virtual chunks written to the store with this operation. When not provided, no check is performed.
Raises:
- ValueError – If the store is read-only.
Examples:
To ensure an error is raised if the files containing referenced virtual chunks
are modified at any time from now on, pass the current time to
last_updated_at
.
>>> from datetime import datetime
>>> vdt.virtualize.to_icechunk(
... icechunkstore,
... last_updated_at=datetime.now(),
... )
Information¶
virtualizarr.accessor.VirtualiZarrDatasetAccessor.nbytes
property
¶
nbytes: int
Size required to hold these references in memory in bytes.
Note this is not the size of the referenced chunks if they were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
In-memory (loadable) variables are included in the total using xarray's normal .nbytes
method.
Rewriting¶
virtualizarr.accessor.VirtualiZarrDatasetAccessor.rename_paths ¶
Rename paths to chunks in every ManifestArray in this dataset.
Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.
Parameters:
- new (str | Callable[[str], str]) – New path to use for all chunks, either as a string, or as a function which accepts and returns strings.
Returns:
- Dataset
See Also
virtualizarr.ManifestArray.rename_paths
virtualizarr.ChunkManifest.rename_paths
Examples:
Rename paths to reflect moving the referenced files from local storage to an S3 bucket.
>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> ds.virtualize.rename_paths(local_to_s3_url)
Developer API¶
If you want to write a new parser to create virtual references pointing to a custom file format, you will need to use VirtualiZarr's internal classes. See the page on custom parsers for more information.
Manifests¶
VirtualiZarr uses these classes to store virtual references internally. See the page on data structures for more information.
virtualizarr.manifests.ChunkManifest ¶
In-memory representation of a single Zarr chunk manifest.
Stores the manifest internally as numpy arrays, so the most efficient way to create this object is via the .from_arrays
constructor classmethod.
The manifest can be converted to or from a dictionary which looks like this
{
"0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
"0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
"0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
"0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}
using the __init__() and .dict() methods, so users of this class can think of the manifest as if it were a dict mapping zarr chunk keys to byte ranges.
(See the chunk manifest SPEC proposal in zarr-developers/zarr-specs#287.)
Validation is done when this object is instantiated, and this class is immutable, so it's not possible to have a ChunkManifest object that does not represent a valid grid of chunks.
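For example, a minimal sketch of the dict round-trip (the chunk keys and byte ranges are hypothetical):
from virtualizarr.manifests import ChunkManifest

entries = {
    "0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
}
manifest = ChunkManifest(entries=entries)
print(manifest.dict()["0.1"])  # byte range for chunk "0.1"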
nbytes
property
¶
nbytes: int
Size required to hold these references in memory in bytes.
Note this is not the size of the referenced chunks if they were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
ndim_chunk_grid
property
¶
ndim_chunk_grid: int
Number of dimensions in the chunk grid.
Not the same as the dimension of an array backed by this chunk manifest.
shape_chunk_grid
property
¶
Number of separate chunks along each dimension.
Not the same as the shape of an array backed by this chunk manifest.
__init__ ¶
Create a ChunkManifest from a dictionary mapping zarr chunk keys to byte ranges.
Parameters:
- entries (dict) – Chunk keys and byte range information, as a dictionary of the form
{
"0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
"0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
"0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
"0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}
dict ¶
dict() -> ChunkDict
Convert the entire manifest to a nested dictionary.
The returned dict will be of the form
{
"0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
"0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
"0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
"0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}
Entries whose path is an empty string will be interpreted as missing chunks and omitted from the dictionary.
from_arrays
classmethod
¶
from_arrays(
*,
paths: ndarray[Any, StringDType],
offsets: ndarray[Any, dtype[uint64]],
lengths: ndarray[Any, dtype[uint64]],
validate_paths: bool = True,
) -> ChunkManifest
Create manifest directly from numpy arrays containing the path and byte range information.
Useful if you want to avoid the memory overhead of creating an intermediate dictionary first, as these 3 arrays are what will be used internally to store the references.
Parameters:
- paths (ndarray[Any, StringDType]) – Array containing the paths to the chunks
- offsets (ndarray[Any, dtype[uint64]]) – Array containing the byte offsets of the chunks
- lengths (ndarray[Any, dtype[uint64]]) – Array containing the byte lengths of the chunks
- validate_paths (bool, default: True) – Check that entries in the manifest are valid paths (e.g. that local paths are absolute not relative). Set to False to skip validation for performance reasons.
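A minimal sketch (the paths and byte ranges are hypothetical; the paths array must use numpy's variable-width StringDType, available in NumPy 2.0+):
import numpy as np
from virtualizarr.manifests import ChunkManifest

# a 2 x 1 chunk grid, with both chunks stored in the same file
paths = np.array([["s3://bucket/foo.nc"], ["s3://bucket/foo.nc"]], dtype=np.dtypes.StringDType())
offsets = np.array([[100], [200]], dtype=np.uint64)
lengths = np.array([[100], [100]], dtype=np.uint64)

manifest = ChunkManifest.from_arrays(paths=paths, offsets=offsets, lengths=lengths)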
rename_paths ¶
rename_paths(new: str | Callable[[str], str]) -> ChunkManifest
Rename paths to chunks in this manifest.
Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.
Parameters:
- new (str | Callable[[str], str]) – New path to use for all chunks, either as a string, or as a function which accepts and returns strings.
Returns:
- manifest
See Also
ManifestArray.rename_paths
Examples:
Rename paths to reflect moving the referenced files from local storage to an S3 bucket.
>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> manifest.rename_paths(local_to_s3_url)
virtualizarr.manifests.ManifestArray ¶
Virtualized array representation of the chunk data in a single Zarr Array.
Supports concatenation / stacking, but only if the two arrays to be concatenated have the same codecs.
Cannot be directly altered.
Implements a subset of the array API standard such that it can be wrapped by xarray. Doesn't store the zarr array name, zattrs or ARRAY_DIMENSIONS, as those can instead be stored on a wrapping xarray object.
nbytes_virtual
property
¶
nbytes_virtual: int
Size required to hold these references in memory in bytes.
Note this is not the size of the referenced array if it were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
__array_function__ ¶
__array_function__(func, types, args, kwargs) -> Any
Hook to teach this class what to do if np.concat etc. is called on it.
Use this instead of array_namespace so that we don't make promises we can't keep.
__array_ufunc__ ¶
__array_ufunc__(ufunc, method, *inputs, **kwargs) -> Any
We have to define this in order to convince xarray that this class is a duckarray, even though we will never support ufuncs.
__eq__ ¶
Element-wise equality checking.
Returns a numpy array of booleans.
__getitem__ ¶
__getitem__(
key: Union[
int,
slice,
EllipsisType,
None,
tuple[Union[int, slice, EllipsisType, None, ndarray], ...],
ndarray,
],
) -> ManifestArray
Only supports extremely limited indexing.
Only here because xarray will apparently attempt to index into its lazy indexing classes even if the operation would be a no-op anyway.
__init__ ¶
__init__(
metadata: ArrayV3Metadata | dict, chunkmanifest: dict | ChunkManifest
) -> None
Create a ManifestArray directly from the metadata of a zarr array and the manifest of chunks.
Parameters:
- metadata (dict or ArrayV3Metadata)
- chunkmanifest (dict or ChunkManifest)
astype ¶
astype(dtype: dtype, /, *, copy: bool = True) -> ManifestArray
Cannot change the dtype, but needed because xarray will call this even when it's a no-op.
rename_paths ¶
rename_paths(new: str | Callable[[str], str]) -> ManifestArray
Rename paths to chunks in this array's manifest.
Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.
Parameters:
- new (str | Callable[[str], str]) – New path to use for all chunks, either as a string, or as a function which accepts and returns strings.
Returns:
See Also
ChunkManifest.rename_paths
Examples:
Rename paths to reflect moving the referenced files from local storage to an S3 bucket.
>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> marr.rename_paths(local_to_s3_url)
virtualizarr.manifests.ManifestGroup ¶
Bases: Mapping[str, 'ManifestArray | ManifestGroup']
Immutable representation of a single virtual zarr group.
__init__ ¶
__init__(
arrays: Mapping[str, ManifestArray] | None = None,
groups: Mapping[str, "ManifestGroup"] | None = None,
attributes: dict | None = None,
) -> None
Create a ManifestGroup containing ManifestArrays and/or sub-groups, as well as any group-level metadata.
Parameters:
- arrays (Mapping[str, ManifestArray], default: None) – ManifestArray objects to represent virtual zarr arrays.
- groups (Mapping[str, ManifestGroup], default: None) – ManifestGroup objects to represent virtual zarr subgroups.
- attributes (dict, default: None) – Zarr attributes to add as zarr group metadata.
virtualizarr.manifests.ManifestStore ¶
Bases: Store
A read-only Zarr store that uses obstore to read data from inside arbitrary files on AWS, GCP, Azure, or a local filesystem.
The requests from the Zarr API are redirected using the virtualizarr.manifests.ManifestGroup containing multiple virtualizarr.manifests.ManifestArray, allowing for virtually interfacing with underlying data in other file formats.
Parameters:
- group (ManifestGroup) – Root group of the store. Contains group metadata, ManifestArrays, and any subgroups.
- store_registry (ObjectStoreRegistry, default: None) – ObjectStoreRegistry that maps the URL scheme and netloc to ObjectStore instances, allowing ManifestStores to read from different ObjectStore instances.
Warnings
ManifestStore is experimental and subject to API changes without notice. Please raise an issue with any comments/concerns about the store.
Notes
Modified from zarr-developers/zarr-python!1661
__init__ ¶
__init__(
group: ManifestGroup, *, store_registry: ObjectStoreRegistry | None = None
) -> None
Instantiate a new ManifestStore.
Parameters:
- group (ManifestGroup) – Manifest Group containing group metadata and mapping variable names to ManifestArrays.
- store_registry (ObjectStoreRegistry | None, default: None) – A registry mapping the URL scheme and netloc to ObjectStore instances, allowing ManifestStores to read from different ObjectStore instances.
Array API¶
VirtualiZarr's virtualizarr.manifests.ManifestArray objects support a limited subset of the Python Array API standard in virtualizarr.manifests.array_api
.
virtualizarr.manifests.array_api.concatenate ¶
concatenate(
arrays: tuple[ManifestArray, ...] | list[ManifestArray],
/,
*,
axis: int | None = 0,
) -> ManifestArray
Concatenate ManifestArrays by merging their chunk manifests.
The signature of this function is array API compliant, so that it can be called by xarray.concat
.
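In practice this means virtual datasets can be combined with ordinary xarray calls, e.g. (a sketch, assuming vds1 and vds2 were opened with open_virtual_dataset and are compatible along "time"):
import xarray as xr

combined = xr.concat([vds1, vds2], dim="time", coords="minimal", compat="override")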
virtualizarr.manifests.array_api.stack ¶
stack(
arrays: tuple[ManifestArray, ...] | list[ManifestArray], /, *, axis: int = 0
) -> ManifestArray
Stack ManifestArrays by merging their chunk manifests.
The signature of this function is array API compliant, so that it can be called by xarray.stack
.
virtualizarr.manifests.array_api.expand_dims ¶
expand_dims(x: ManifestArray, /, *, axis: int = 0) -> ManifestArray
Expands the shape of an array by inserting a new axis (dimension) of size one at the position specified by axis.
virtualizarr.manifests.array_api.broadcast_to ¶
broadcast_to(x: ManifestArray, /, shape: tuple[int, ...]) -> ManifestArray
Broadcasts a ManifestArray to a specified shape, by either adjusting chunk keys or copying chunk manifest entries.
Parser typing protocol¶
All custom parsers must follow the virtualizarr.parsers.typing.Parser
typing protocol.
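A skeleton of such a parser (a sketch only; the class name and the format-specific logic inside __call__ are hypothetical):
from obstore.store import ObjectStore

from virtualizarr.manifests import ManifestStore


class MyFormatParser:
    def __init__(self, group: str | None = None):
        # parser-specific options go here
        self.group = group

    def __call__(self, file_url: str, object_store: ObjectStore) -> ManifestStore:
        # Read the file's own metadata, build ChunkManifests/ManifestArrays describing
        # its chunks, wrap them in a ManifestGroup, and return a ManifestStore.
        raise NotImplementedError("format-specific parsing goes here")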
Parallelization¶
Parallelizing virtual reference generation can be done using a number of parallel execution frameworks. Advanced users may want to call one of these executors directly. See the docs page on Scaling.
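For example, a sketch of choosing an executor through open_virtual_mfdataset's parallel argument (paths, store, and parser are assumed to be defined as in the earlier sketches):
from virtualizarr import open_virtual_mfdataset
from virtualizarr.parallel import DaskDelayedExecutor

vds = open_virtual_mfdataset(
    paths,                         # list of file URLs, as in the earlier sketch
    object_store=store,
    parser=parser,
    parallel=DaskDelayedExecutor,  # or "dask", "lithops", or False for serial
)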
virtualizarr.parallel.SerialExecutor ¶
Bases: Executor
A custom Executor that runs tasks sequentially, mimicking the concurrent.futures.Executor interface. Useful as a default and for debugging.
map ¶
map(
fn: Callable[..., T],
*iterables: Iterable[Any],
timeout: float | None = None,
chunksize: int = 1,
) -> Iterator[T]
Execute a function over an iterable sequentially.
Parameters:
- fn (Callable[..., T]) – Function to apply to each item
- *iterables (Iterable[Any], default: ()) – Iterables to process
- timeout (float | None, default: None) – Optional timeout (ignored in serial execution)
Returns:
- Generator of results
shutdown ¶
Shutdown the executor.
Parameters:
- wait (bool, default: True) – Whether to wait for pending futures (always True for serial executor)
submit ¶
Submit a callable to be executed.
Unlike parallel executors, this runs the task immediately and sequentially.
Parameters:
- fn (Callable[..., T]) – The callable to execute
- *args (Any, default: ()) – Positional arguments for the callable
- **kwargs (Any, default: {}) – Keyword arguments for the callable
Returns:
- A Future representing the result of the execution
virtualizarr.parallel.DaskDelayedExecutor ¶
Bases: Executor
An Executor that uses dask.delayed for parallel computation.
This executor mimics the concurrent.futures.Executor interface but uses Dask's delayed computation model.
map ¶
map(
fn: Callable[..., T],
*iterables: Iterable[Any],
timeout: float | None = None,
chunksize: int = 1,
) -> Iterator[T]
Apply a function to an iterable using dask.delayed.
Parameters:
- fn (Callable[..., T]) – Function to apply to each item
- *iterables (Iterable[Any], default: ()) – Iterables to process
- timeout (float | None, default: None) – Optional timeout (ignored in serial execution)
Returns:
- Generator of results
shutdown ¶
Shutdown the executor.
Parameters:
- wait (bool, default: True) – Whether to wait for pending futures (always True for this executor)
submit ¶
Submit a task to be computed with dask.delayed.
Parameters:
- fn (Callable[..., T]) – The callable to execute
- *args (Any, default: ()) – Positional arguments for the callable
- **kwargs (Any, default: {}) – Keyword arguments for the callable
Returns:
- A Future representing the result of the execution
virtualizarr.parallel.LithopsEagerFunctionExecutor ¶
Bases: Executor
Lithops-based function executor which follows the concurrent.futures.Executor API.
Only required because lithops doesn't follow the concurrent.futures.Executor API, see lithops-cloud/lithops#1427.
map ¶
map(
fn: Callable[..., T],
*iterables: Iterable[Any],
timeout: float | None = None,
chunksize: int = 1,
) -> Iterator[T]
Apply a function to an iterable using lithops.
Only needed because lithops.FunctionExecutor.map returns futures, unlike concurrent.futures.Executor.map
.
Parameters:
- fn (Callable[..., T]) – Function to apply to each item
- *iterables (Iterable[Any], default: ()) – Iterables to process
- timeout (float | None, default: None) – Optional timeout (ignored in serial execution)
Returns:
- Generator of results
shutdown ¶
Shutdown the executor.
Parameters:
- wait (bool, default: True) – Whether to wait for pending futures.
submit ¶
Submit a task to be computed using lithops.
Parameters:
- fn (Callable[..., T]) – The callable to execute
- *args (Any, default: ()) – Positional arguments for the callable
- **kwargs (Any, default: {}) – Keyword arguments for the callable
Returns:
- A concurrent.futures.Future representing the result of the execution