API

Basic Usage

Parquet2.Dataset - Type
Dataset <: ParquetTable

A complete parquet dataset created from top-level parquet metadata. Each Dataset is an indexable collection of RowGroups, each of which is a Tables.jl compatible columnar table in its own right. The Dataset itself is a Tables.jl compatible columnar table consisting of (lazily, by default) concatenated RowGroups. A Dataset can consist of data in any number of files depending on the directory structure of the referenced parquet.

Constructors

Dataset(fm::FileManager; kw...)
Dataset(p::AbstractPath; kw...)
Dataset(v::AbstractVector{UInt8}; kw...)
Dataset(io::IO; kw...)
Dataset(str::AbstractString; kw...)

Arguments

  • fm: A FileManager object describing a set of files to be loaded.
  • p: Path to main metadata file or directory containing a _metadata file. Loading behavior will depend on the type of path provided.
  • v: An in-memory (or memory mapped) byte buffer.
  • io: An IO object from which data can be loaded.
  • str: File or directory path as a string. Converted to AbstractPath with Path(str).

Keyword Arguments

The following keyword arguments are applicable for the dataset as a whole:

  • support_legacy (true): Some parquet writers take bizarre liberties with the metadata, in particular many JVM-based writers use a specialized UInt96 encoding of timestamps even though this is not described by the metadata. When this option is false the metadata will be interpreted strictly.
  • use_mmap (true): Whether to use memory mapping for reading the file. Only applicable for files on the local file system. In some cases enabling this can drastically increase read performance.
  • mmap_shared (true): Whether memory mapped buffers can be shared with other processes. See documentation for Mmap.mmap.
  • preload (false): Whether all data should be fetched on constructing the Dataset regardless of the above options.
  • load_initial (nothing): Whether the RowGroups should be eagerly loaded into memory. If nothing, this will be done only for parquets consisting of a single file.
  • parallel_column_loading (nothing): Whether columns should be loaded using thread-based parallelism. If nothing, this is true as long as Julia has multiple threads available to it.

The following keyword arguments are applicable to specific columns. These can be passed either as a single value, a NamedTuple, AbstractDict or ColumnOption. See ColumnOption for details.

  • allow_string_copying (false): Whether strings will be copied. If false, a reference to the underlying data buffer needs to be maintained, meaning it can't be garbage collected to free up memory. Note also that there will potentially be a large number of references stored in the output column if this is false, so setting this to true reduces garbage collector overhead.
  • lazy_dictionary (true): Whether output columns will use a Julia categorical array representation which in some cases can elide a large number of allocations.
  • parallel_page_loading (false): Whether data pages in the column should be loaded in parallel. This comes with some additional overhead, including an extra iteration over the entire page buffer, so it is of dubious benefit to turn this on, but it may be helpful when a column has a large number of pages.
  • use_statistics (false): Whether statistics included in the metadata will be used in the loaded column AbstractVectors so that statistics can be efficiently retrieved rather than being re-computed. Note that even if this is true, this is only done for columns for which statistics are available. Otherwise, statistics can be retrieved with ColumnStatistics(col).
  • eager_page_scanning (true): It is not in general possible to infer all page metadata without iterating over the column's entire data buffer. This can be elided, but doing so limits what can be done to accommodate data loaded from the column. Turning this option off will reduce the overhead of loading metadata for the column but may increase the cost of allocating the output. If false, specialized string and dictionary outputs will not be used (loading the column will be maximally allocating).
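
Dataset-level and column-level options can be combined in a single call. The following is a minimal sketch (the column name col1 is hypothetical):

ds = Dataset("/path/to/parquet";
             use_mmap=false,                      # do not memory map the file
             allow_string_copying=true,           # applied to all columns
             parallel_page_loading=(col1=true,))  # applied only to column `col1`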

Usage

ds = Dataset("/path/to/parquet")
ds = Dataset(p"s3://path/to/parquet")  # understands different path types

length(ds)  # gives number of row groups

rg = ds[1]  # index to get row groups

for rg ∈ ds  # is an indexable, iterable collection of row groups
    println(rg)
end

df = DataFrame(ds)  # Tables.jl compatible, is concatenation of all row groups

# use TableOperations.jl to load only selected columns
df = ds |> TableOperations.select(:col1, :col2) |> DataFrame
Parquet2.readfile - Function
readfile(filename; kw...)
readfile(io::IO; kw...)

An alias for Dataset. All arguments are the same, so see those docs.

This function is provided for consistency with the writefile function.

Parquet2.FileWriter - Type
FileWriter

Data structure holding metadata inferred during the process of writing a parquet file.

A full table can be written with writetable!, for a more detailed example, see below.

Constructors

FileWriter(io, path; kw...)
FileWriter(path; kw...)

Arguments

  • io: the IO object to which data will be written.
  • path: the path of the file being written. This is used in parquet metadata which is why it is possible to specify the path separately from the IO-stream.

Keyword Arguments

The following arguments are relevant for the entire file:

  • metadata (Dict()): Additional metadata to append at the file level. Must be an AbstractDict; the keys and values must both be strings. This can be accessed from a written file with Parquet2.metadata.
  • propagate_table_metadata (true): Whether to propagate table metadata provided by the DataAPI.jl metadata interface for tables written to this file. If true and multiple tables are written, the metadata will be merged. If this is undesirable users should set this to false and set via metadata instead. The metadata argument above will be merged with table metadata (with metadata from the option taking precedence).

The following arguments apply to specific columns and can be provided as a single value, NamedTuple, AbstractDict or ColumnOption. See ColumnOption for details.

  • npages (1): The number of pages to write. Some parquet readers are more efficient at reading multiple pages for large numbers of columns, but for the most part there's no reason to change this.
  • compression_codec (:snappy): Compression codec to use. Available options are :uncompressed, :snappy, :gzip and :zstd.
  • column_metadata (Dict()): Additional metadata for specific columns. This works the same way as file-level metadata and must be a dictionary with string keys and values. Can be accessed from a written file by calling Parquet2.metadata on column objects.
  • compute_statistics (false): Whether column statistics (minimum, maximum, number of nulls) should be computed when the file is written and stored in metadata. When read back with Dataset, the loaded columns will be wrapped in a struct allowing these statistics to be efficiently retrieved, see VectorWithStatistics.
  • json_columns (false): Columns which should be JSON encoded. Columns with types which can naturally be encoded as JSON but which have no other supported encoding, that is AbstractVector and AbstractDict columns, will be JSON encoded regardless of the value of this argument.
  • bson_columns (false): Columns which should be BSON encoded. By default, columns which need special encoding are JSON encoded, so they must be specified here to force them to be BSON.
  • propagate_col_metadata (true): Whether to propagate column metadata provided by the DataAPI.jl metadata interface. Metadata set with the column_metadata argument will be merged with this with the former taking precedence.

Examples

open(filename, write=true) do io
    fw = Parquet2.FileWriter(io, filename)
    Parquet2.writeiterable!(fw, tbls)  # `tbls` is an iterable of tables; each is written as a separate row group, finalization is done automatically
end

df = DataFrame(A=1:5, B=randn(5))

# use `writefile` to write in a single call
writefile(filename, df)

# write to `IO` object
io = IOBuffer()
writefile(io, df)

# write to an `AbstractVector` buffer.
v = writefile(Vector{UInt8}, df)
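
The keyword arguments described above can be mixed freely; a short sketch using the df defined above (metadata keys and values are illustrative):

writefile(filename, df;
          metadata=Dict("created_by"=>"example"),           # file-level key-value metadata
          compression_codec=Dict("A"=>:zstd, "B"=>:snappy),  # per-column codecs
          compute_statistics=(A=true,))                      # store min/max/null counts for `A`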
Parquet2.writefile - Function
writefile(io::IO, path, tbl; kw...)
writefile(path, tbl; kw...)

Write the Tables.jl compatible table tbl to the IO or the file at path. Note that the path is used in parquet metadata, which is why it is possible to specify the path separately from the io stream. See FileWriter for a description of all possible arguments.

This function writes a file all in one call. Files will be written as one parquet row group per table partition. An intermediate FileWriter object is used.
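
For example, a partitioned input table is written as multiple row groups. A minimal sketch, assuming Tables.partitioner from Tables.jl is used to wrap an iterator of tables (file name hypothetical):

using DataFrames, Tables
df1 = DataFrame(A=1:3, B=rand(3))
df2 = DataFrame(A=4:6, B=rand(3))
writefile("partitioned.parquet", Tables.partitioner((df1, df2)))  # written as two row groups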

Parquet2.filelist - Function
filelist(ds::Dataset)
filelist(fm::FileManager)

Returns an AbstractVector containing the paths of all files associated with the dataset.

Parquet2.showtree - Function
showtree([io=stdout,] ds::Dataset)

Show the "hive/drill" directory tree of the dataset. The pairs printed in this tree can be passed as arguments to append! to append the corresponding row group to the dataset.

Base.append! - Method
append!(ds::Parquet2.Dataset, col=>val...; check=true)
append!(ds::Parquet2.Dataset; check=true, kw...)

Append row groups for which the columns specified by col have the value val. This applies only to "hive/drill" partition columns in file trees, therefore col and val must both be strings. The selected row groups must satisfy all passed pairs.

Alternatively, these can be passed as keyword arguments with the column names as the keys and the (string) values as the value constraints.

Examples

◖◗ showtree(ds)
Root()
├─ "A" => "1"
│  └─ "B" => "alpha"
├─ "A" => "2"
│  └─ "B" => "alpha"
└─ "A" => "3"
   └─ "B" => "beta"

◖◗ append!(ds, "A"=>"2", "B"=>"alpha", verbose=true);
[ Info: appended row group from file $HOME/data/hive_fastparquet.parq/A=2/B=alpha/part.0.parquet

◖◗ append!(ds, A="3", B="alpha");  # in this case nothing is appended since no such row group exists
Base.append! - Method
append!(ds::Dataset, i::Integer; check=true)

Append row group number i to the dataset. The index i is the index of the array returned by filelist, that is, this is equivalent to append!(ds, filelist(ds)[i]).

Base.append! - Method
append!(ds::Parquet2.Dataset, p; check=true, verbose=false)

Append all row groups from the file p to the dataset row group metadata. If check=true, p will first be checked to be a valid parquet file. p must be a path that was discovered during the initial construction of the dataset.

If verbose=true an INFO level logging message will be printed for each appended row group.

Parquet2.appendall! - Function
appendall!(ds::Dataset; check=true)

Append all row groups to the dataset.

WARNING: Some parquet directory trees can be huge. This function does nothing to check that what you are about to do is a good idea, so use it judiciously.
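For example, a minimal sketch of loading an entire multi-file dataset in one go (path hypothetical):

ds = Dataset("/path/to/parquet")  # row groups of a multi-file dataset are not loaded eagerly by default
appendall!(ds)                    # append every row group found in the directory tree
df = DataFrame(ds)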

Parquet2.ColumnOption - Type
ColumnOption{𝒯}

A container for a column-specific read or write option with value type 𝒯. Contains sets of names and types for determining what option to apply to a column. Column-specific keyword arguments passed to Dataset and FileWriter will be converted to ColumnOptions.

The provided argument must be one of the following:

  • A single value of the appropriate type, in which case this option will be applied to all columns.
  • A NamedTuple the keys of which are column names and the values of which are the value to be applied to the corresponding column. Columns not listed will use the default option for that keyword argument.
  • An AbstractDict the keys of which are the column names as strings. This works analogously to NamedTuple.
  • An AbstractDict the keys of which are types and the values of which are options to be applied to all columns with element types which are subtypes of the provided type.
  • A Pair will be treated as a dictionary with a single entry.

Constructors

ColumnOption(dict_value_or_namedtuple, default)

Users may wish to construct a ColumnOption and pass it as an argument to set their own default.

Examples

# enable parallel page loading for *all* columns
Dataset(filename; parallel_page_loading=true)

# enable parallel page loading for column `col1`
Dataset(filename; parallel_page_loading=(col1=true,))

# columns `col1` and `col2` will be written with 2 and 3 pages respectively, else 1 page
writefile(filename; npages=Dict("col1"=>2, "col2"=>3))

# `col1` will use snappy compression, all other columns will use zstd
writefile(filename; compression_codec=Parquet2.ColumnOption((col1=:snappy,), :zstd))

# All dictionary columns will be encoded as BSON
writefile(filename; bson_columns=Dict(AbstractDict=>true))
Parquet2.RowGroup - Type
RowGroup <: ParquetTable

A piece of a parquet table. All parquet files are organized into 1 or more RowGroups, each of which is a table in and of itself. RowGroup satisfies the Tables.jl columnar interface. Therefore, all row groups can be used as tables just like full Datasets. Typically different RowGroups are stored in different files and each file constitutes an entire RowGroup, though this is not enforced by the specification or by Parquet2.jl. Users are not expected to construct these objects themselves, as their schema is determined by the parquet metadata.

Datasets are indexable collections of RowGroups.

Usage

ds = Dataset("/path/to/parquet")

length(ds)  # gives the number of row groups

rg = ds[1]  # get first row group

c = rg[1]  # get first column
c = rg["column_name"]  # or by name

for c ∈ rg  # RowGroups are indexable collections of columns
    println(name(c))
end

df = DataFrame(rg)  # RowGroups are bona fide columnar tables themselves

# use TableOperations.jl to load only selected columns
df1 = rg |> TableOperations.select(:col1, :col2) |> DataFrame
Parquet2.Column - Type
Column

Data structure for organizing metadata and loading data of a parquet column object. These columns are the segments of columns referred to by individual row groups, not necessarily the entire columns of the master table schema. As such, these will have the same element type as the columns of the full table, but not necessarily the same number of values.

Usage

c = rg[n]  # returns nth `Column` from row group
c = rg["column_name"]  # retrieve by name

Parquet2.pages!(c)  # infer page schema of columns

Parquet2.name(c)  # get the name of c

Parquet2.filepath(c)  # get the path of the file containing c

v = Parquet2.load(c)  # load column values as a lazy AbstractVector

v[:]  # fully load values into memory
Parquet2.load - Function
load(ds::Dataset, n)

Load the (complete, all RowGroups) column n (integer or string) from the dataset.

load(c::Column)
load(rg::RowGroup, column_name)
load(ds::Dataset, column_name)

Deserialize values from a parquet column as an AbstractVector object. Options for this are defined when the file containing the column is first initialized.

The column can be specified either by its string name or by its integer index.
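
For example (column name hypothetical):

ds = Dataset("/path/to/parquet")
v = Parquet2.load(ds, "col1")   # full column concatenated across all row groups
v1 = Parquet2.load(ds[1], "col1")  # only the first row group's piece of the column
v2 = Parquet2.load(ds, 1)       # columns can also be selected by index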

DataAPI.metadata - Function
metadata(col::Column; style=false)

Get the key-value metadata for the column.

metadata(col::Column, k::AbstractString[, default]; style=false)

Get the key k from the key-value metadata for column col. If default is provided it will be returned if k is not present.

metadata(ds::Dataset; style=false)

Get the auxiliary key-value metadata for the dataset.

Note that Dataset does not support DataAPI.colmetadata because it contains one instance of each column per row group. To access column metadata either call metadata on Column objects or colmetadata on RowGroup objects.

metadata(ds::Dataset, k::AbstractString[, default]; style=false)

Get the key k from the key-value metadata for the dataset. If default is provided it will be returned if k is not present.
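
A short sketch (the key and column names are hypothetical):

Parquet2.metadata(ds)                        # all file-level key-value metadata
Parquet2.metadata(ds, "origin", "unknown")   # a single key with a default
rg = ds[1]
Parquet2.metadata(rg["col1"])                # key-value metadata of a column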

Parquet2.writeiterable! - Function
writeiterable!(fw::FileWriter, tbls)

Write each table returned by the iterable over Tables.jl compatible tables tbls to the parquet file. The file will then be finalized so that no further data can be written to it.
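
A minimal sketch, where tbl1 and tbl2 stand for any Tables.jl compatible tables and the output path is hypothetical:

fw = Parquet2.FileWriter("out.parquet")
Parquet2.writeiterable!(fw, (tbl1, tbl2))  # one row group per table; the file is then finalized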

Base.close - Method
close(ds::Dataset)

Close the Dataset, deleting all file buffers and row groups and freeing the memory. If the buffers are memory-mapped, this will free associated file handles. Note that memory and handles are only freed once garbage collection is executed (can be forced with GC.gc()).


Schema and Introspection

Parquet2.Page - Type
Page

Object containing metadata for parquet pages. These are essentially subsets of the data of a column. The raw data contained in the page can be accessed with view(page).

Parquet2.ColumnStatistics - Type
ColumnStatistics

A data structure for storing the statistics for a parquet column. The following functions are available for accessing statistics. In all cases, these will return nothing if the statistic was not included in the parquet metadata.

  • minimum(stats): The minimum value.
  • maximum(stats): The maximum value.
  • count(ismissing, stats): The number of missing values.
  • ndistinct(stats): The number of distinct values.

Can be obtained from a Column object with ColumnStatistics(col).
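
For example (column name hypothetical; each accessor returns nothing when the statistic is absent from the metadata):

c = ds[1]["col1"]
stats = Parquet2.ColumnStatistics(c)
minimum(stats), maximum(stats)
count(ismissing, stats)
Parquet2.ndistinct(stats)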

Parquet2.VectorWithStatistics - Type
VectorWithStatistics{𝒯,𝒮,𝒱<:AbstractVector{𝒯}} <: AbstractVector{𝒯}

A wrapper for an AbstractVector object which can store the following statistics:

  • minimum value, accessible with minimum(v)
  • maximum value, accessible with maximum(v)
  • number of missings, accessible with count(ismissing, v)
  • number of distinct elements, accessible with ndistinct(v).

Methods are provided so that the stored values are returned rather than re-computing the values when these functions are called. Note that a method is also provided for count(!ismissing, v) so this should also be efficient.

The use_statistics option for Dataset controls whether columns are loaded with statistics.
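
For example, a sketch assuming the file was written with statistics (column name hypothetical):

ds = Dataset("/path/to/parquet"; use_statistics=true)
v = Parquet2.load(ds, "col1")  # a `VectorWithStatistics` when statistics are present
minimum(v)                     # retrieved from the stored statistics, not re-computed
count(ismissing, v)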

Parquet2.ndistinct - Function
ndistinct(s::ColumnStatistics)

Returns the number of distinct elements in the column. nothing if not available.

ndistinct(v::AbstractVector)

Get the number of distinct elements in v. If v is a VectorWithStatistics, as returned from parquet columns when metadata is available, computation will be elided and the stored value will be used instead.

Parquet2.PageHeader - Type
PageHeader

Abstract type for parquet format page headers.

See the description of pages in the parquet format specification.

Parquet2.DataPageHeader - Type
DataPageHeader <: PageHeader

Header for a page of data. This type stores metadata for either the newer DataHeaderV2 or legacy DataHeader.

Parquet2.parqtype - Function
parqtype(t::Type; kw...)

Return the parquet type object corresponding to the provided Julia type.

The following keyword arguments should be provided for context only where appropriate:

  • decimal_scale=0: base 10 scale of a decimal number
  • decimal_precision=3: precision of a decimal number.
  • bson=false: whether serialization of dictionaries should prefer BSON to JSON.

Only one method with the signature ::Type is defined so as to avoid excessive run-time dispatch.

parqtype(s)

Gets the ParquetType for elements of the object s, e.g. a Column or SchemaNode. See the parquet specification for a description of these types.

Parquet2.juliatype - Function
juliatype(col::Column)

Get the element type of the AbstractVector the column is loaded into ignoring missings. For example, if the eltype is Union{Int,Missing} this will return Int.

See juliamissingtype for the exact type.
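
For example (column name hypothetical):

c = ds[1]["col1"]
Parquet2.juliatype(c)       # e.g. Float64, even if the eltype is Union{Float64,Missing}
Parquet2.parqtype(c)        # the parquet type of the column
Parquet2.parqtype(Float64)  # the parquet type corresponding to a Julia type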

Parquet2.nvalues - Function
nvalues(col::Column)

Returns the number of values in the column (i.e. number of rows).

Parquet2.isdictencoded - Function
isdictencoded(col::Column)

Returns true if all data in the column is dictionary encoded.

This will force the scanning of pages.

Parquet2.pages - Function
pages(col::Column)

Accesses the pages of the column, loading them if they are not already loaded. See pages! which is called by this in cases where pages are not already discovered.

Parquet2.pages! - Function
pages!(col::Column)

Infer the binary schema of the column pages and store Page objects that store references to data page locations. This function should typically be called only once as the objects discovered by this store all needed metadata. Calling this may invoke calls to retrieve data from the source. After calling this all data for the column is guaranteed to be stored in memory.
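
A short sketch of page-level introspection (the column name is hypothetical, and the result of pages is assumed to be an iterable collection of Page objects):

c = ds[1]["col1"]
Parquet2.pages!(c)          # scan the column buffer and record `Page` objects
pgs = Parquet2.pages(c)     # retrieve the pages (calls `pages!` if not yet discovered)
Parquet2.isdictencoded(c)   # whether all pages are dictionary encoded
view(first(pgs))            # raw data of the first page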


Internals

See Internals.