API

Basic Usage

Parquet2.Dataset - Type
Dataset <: ParquetTable

A complete parquet dataset created from top-level parquet metadata. Each Dataset is an indexable collection of RowGroups, each of which is a Tables.jl compatible columnar table in its own right. The Dataset itself is a Tables.jl compatible columnar table consisting of (lazily, by default) concatenated RowGroups. A Dataset can consist of data in any number of files depending on the directory structure of the referenced parquet.

Constructors

Dataset(fm::FileManager; kw...)
Dataset(p::AbstractPath; kw...)
Dataset(v::AbstractVector{UInt8}; kw...)
Dataset(io::IO; kw...)
Dataset(str::AbstractString; kw...)

Arguments

  • fm: A FileManager object describing a set of files to be loaded.
  • p: Path to main metadata file or directory containing a _metadata file. Loading behavior will depend on the type of path provided.
  • v: An in-memory (or memory mapped) byte buffer.
  • io: An IO object from which data can be loaded.
  • str: File or directory path as a string. Converted to AbstractPath with Path(str).

Keyword Arguments

The following keyword arguments are applicable for the dataset as a whole:

  • support_legacy (true): Some parquet writers take bizarre liberties with the metadata, in particular many JVM-based writers use a specialized UInt96 encoding of timestamps even though this is not described by the metadata. When this option is false the metadata will be interpreted strictly.
  • use_mmap (true): Whether to use memory mapping for reading the file. Only applicable for files on the local file system. In some cases enabling this can drastically increase read performance.
  • mmap_shared (true): Whether memory mapped buffers can be shared with other processes. See documentation for Mmap.mmap.
  • preload (false): Whether all data should be fetched on constructing the Dataset regardless of the above options.
  • load_initial (nothing): Whether the RowGroups should be eagerly loaded into memory. If nothing, this will be done only for parquets consisting of a single file.
  • parallel_column_loading (nothing): Whether columns should be loaded using thread-based parallelism. If nothing, this is true as long as Julia has multiple threads available to it.

The following keyword arguments are applicable to specific columns. These can be passed either as a single value, a NamedTuple, AbstractDict or ColumnOption. See ColumnOption for details.

  • allow_string_copying (false): Whether strings will be copied. If false, a reference to the underlying data buffer needs to be maintained, meaning it can't be garbage collected to free up memory. Note also that there will potentially be a large number of references stored in the output column if this is false, so setting this to true reduces garbage collector overhead.
  • lazy_dictionary (true): Whether output columns will use a Julia categorical array representation which in some cases can elide a large number of allocations.
  • parallel_page_loading (false): Whether data pages in the column should be loaded in parallel. This comes with some additional overhead, including an extra iteration over the entire page buffer, so it is of dubious benefit to turn this on, but it may be helpful when a column has a large number of pages.
  • use_statistics (false): Whether statistics included in the metadata will be used in the loaded column AbstractVectors so that statistics can be efficiently retrieved rather than being re-computed. Note that even if this is true, this is only done for columns for which statistics are available. Otherwise, statistics can be retrieved with ColumnStatistics(col).
  • eager_page_scanning (true): It is not in general possible to infer all page metadata without iterating over the column's entire data buffer. This can be elided, but doing so limits what can be done to accommodate data loaded from the column. Turning this option off will reduce the overhead of loading metadata for the column but may increase the cost of allocating the output. If false, specialized string and dictionary outputs will not be used (loading the column will be maximally allocating).
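
Dataset-level and column-level options can be combined in a single call. The following is a minimal sketch (the column name col1 is hypothetical):

ds = Dataset("/path/to/parquet";
             use_mmap=false,                      # do not memory map the file
             allow_string_copying=true,           # applied to all columns
             parallel_page_loading=(col1=true,))  # applied only to column `col1`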

Usage

ds = Dataset("/path/to/parquet")
ds = Dataset(p"s3://path/to/parquet")  # understands different path types

length(ds)  # gives number of row groups

rg = ds[1]  # index to get row groups

for rg ∈ ds  # is an indexable, iterable collection of row groups
    println(rg)
end

df = DataFrame(ds)  # Tables.jl compatible, is concatenation of all row groups

# use TableOperations.jl to load only selected columns
df = ds |> TableOperations.select(:col1, :col2) |> DataFrame
Parquet2.readfile - Function
readfile(filename; kw...)
readfile(io::IO; kw...)

An alias for Dataset. All arguments are the same, so see those docs.

This function is provided for consistency with the writefile function.

Parquet2.FileWriter - Type
FileWriter

Data structure holding metadata inferred during the process of writing a parquet file.

A full table can be written with writetable!, for a more detailed example, see below.

Constructors

FileWriter(io, path; kw...)
FileWriter(path; kw...)

Arguments

  • io: the IO object to which data will be written.
  • path: the path of the file being written. This is used in parquet metadata which is why it is possible to specify the path separately from the IO-stream.

Keyword Arguments

The following arguments are relevant for the entire file:

  • metadata (Dict()): Additional metadata to append at the file level. Must be an AbstractDict; the keys and values must both be strings. This can be accessed from a written file with Parquet2.metadata.
  • propagate_table_metadata (true): Whether to propagate table metadata provided by the DataAPI.jl metadata interface for tables written to this file. If true and multiple tables are written, the metadata will be merged. If this is undesirable users should set this to false and set via metadata instead. The metadata argument above will be merged with table metadata (with metadata from the option taking precedence).

The following arguments apply to specific columns and can be provided as a single value, NamedTuple, AbstractDict or ColumnOption. See ColumnOption for details.

  • npages (1): The number of pages to write. Some parquet readers are more efficient at reading multiple pages for large numbers of columns, but for the most part there's no reason to change this.
  • compression_codec (:snappy): Compression codec to use. Available options are :uncompressed, :snappy, :gzip and :zstd.
  • column_metadata (Dict()): Additional metadata for specific columns. This works the same way as file-level metadata and must be a dictionary with string keys and values. Can be accessed from a written file by calling Parquet2.metadata on column objects.
  • compute_statistics (false): Whether column statistics (minimum, maximum, number of nulls) should be computed when the file is written and stored in metadata. When read back with Dataset, the loaded columns will be wrapped in a struct allowing these statistics to be efficiently retrieved, see VectorWithStatistics.
  • json_columns (false): Columns which should be JSON encoded. Columns with types which can naturally be encoded as JSON but which have no other supported encoding, that is AbstractVector and AbstractDict columns, will be JSON encoded regardless of the value of this argument.
  • bson_columns (false): Columns which should be BSON encoded. By default, columns which need special encoding are JSON encoded, so they must be specified here to force them to be BSON.
  • propagate_col_metadata (true): Whether to propagate column metadata provided by the DataAPI.jl metadata interface. Metadata set with the column_metadata argument will be merged with this with the former taking precedence.

Examples

open(filename, write=true) do io
    fw = Parquet2.FileWriter(io, filename)
    Parquet2.writeiterable!(fw, tbls)  # `tbls` is an iterable of tables; each is written as a separate row group, finalization is done automatically
end

df = DataFrame(A=1:5, B=randn(5))

# use `writefile` to write in a single call
writefile(filename, df)

# write to `IO` object
io = IOBuffer()
writefile(io, df)

# write to an `AbstractVector` buffer.
v = writefile(Vector{UInt8}, df)
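
The keyword arguments described above can be mixed freely; a short sketch using the df defined above (metadata keys and values are illustrative):

writefile(filename, df;
          metadata=Dict("created_by"=>"example"),           # file-level key-value metadata
          compression_codec=Dict("A"=>:zstd, "B"=>:snappy),  # per-column codecs
          compute_statistics=(A=true,))                      # store min/max/null counts for `A`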
Parquet2.writefile - Function
writefile(io::IO, path, tbl; kw...)
writefile(path, tbl; kw...)

Write the Tables.jl compatible table tbl to the IO or the file at path. Note that the path is used in parquet metadata, which is why it is possible to specify the path separately from the io stream. See FileWriter for a description of all possible arguments.

This function writes a file all in one call. Files will be written as one parquet row group per table partition. An intermediate FileWriter object is used.
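
For example, a partitioned input table is written as multiple row groups. A minimal sketch, assuming Tables.partitioner from Tables.jl is used to wrap an iterator of tables (file name hypothetical):

using DataFrames, Tables
df1 = DataFrame(A=1:3, B=rand(3))
df2 = DataFrame(A=4:6, B=rand(3))
writefile("partitioned.parquet", Tables.partitioner((df1, df2)))  # written as two row groups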

Parquet2.filelist - Function
filelist(ds::Dataset)
filelist(fm::FileManager)

Returns an AbstractVector containing the paths of all files associated with the dataset.

Parquet2.showtree - Function
showtree([io=stdout,] ds::Dataset)

Show the "hive/drill" directory tree of the dataset. The pairs printed in this tree can be passed as arguments to append! to append the corresponding row group to the dataset.

Base.append! - Method
append!(ds::Parquet2.Dataset, col=>val...; check=true)
append!(ds::Parquet2.Dataset; check=true, kw...)

Append row groups for which the columns specified by col have the value val. This applies only to "hive/drill" partition columns in file trees, therefore col and val must both be strings. The selected row groups must satisfy all passed pairs.

Alternatively, these can be passed as keyword arguments with the column names as the keys and the (string) values as the value constraints.

Examples

◖◗ showtree(ds)
Root()
├─ "A" => "1"
│  └─ "B" => "alpha"
├─ "A" => "2"
│  └─ "B" => "alpha"
└─ "A" => "3"
   └─ "B" => "beta"

◖◗ append!(ds, "A"=>"2", "B"=>"alpha", verbose=true);
[ Info: appended row group from file $HOME/data/hive_fastparquet.parq/A=2/B=alpha/part.0.parquet

◖◗ append!(ds, A="3", B="alpha");  # in this case nothing is appended since no such row group exists
Base.append! - Method
append!(ds::Dataset, i::Integer; check=true)

Append row group number i to the dataset. The index i is the index of the array returned by filelist, that is, this is equivalent to append!(ds, filelist(ds)[i]).

Base.append! - Method
append!(ds::Parquet2.Dataset, p; check=true, verbose=false)

Append all row groups from the file p to the dataset row group metadata. If check=true, p will first be checked to be a valid parquet file. p must be a path that was discovered during the initial construction of the dataset.

If verbose=true an INFO level logging message will be printed for each appended row group.

Parquet2.appendall! - Function
appendall!(ds::Dataset; check=true)

Append all row groups to the dataset.

WARNING: Some parquet directory trees can be huge. This function does nothing to check that what you are about to do is a good idea, so use it judiciously.
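For example, a minimal sketch of loading an entire multi-file dataset in one go (path hypothetical):

ds = Dataset("/path/to/parquet")  # row groups of a multi-file dataset are not loaded eagerly by default
appendall!(ds)                    # append every row group found in the directory tree
df = DataFrame(ds)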

Parquet2.ColumnOption - Type
ColumnOption{𝒯}

A container for a column-specific read or write option with value type 𝒯. Contains sets of names and types for determining what option to apply to a column. Column-specific keyword arguments passed to Dataset and FileWriter will be converted to ColumnOptions.

The provided argument must be one of the following:

  • A single value of the appropriate type, in which case this option will be applied to all columns.
  • A NamedTuple the keys of which are column names and the values of which are the value to be applied to the corresponding column. Columns not listed will use the default option for that keyword argument.
  • An AbstractDict the keys of which are the column names as strings. This works analogously to NamedTuple.
  • An AbstractDict the keys of which are types and the values of which are options to be applied to all columns with element types which are subtypes of the provided type.
  • A Pair will be treated as a dictionary with a single entry.

Constructors

ColumnOption(dict_value_or_namedtuple, default)

Users may wish to construct a ColumnOption and pass it as an argument to set their own default.

Examples

# enable parallel page loading for *all* columns
Dataset(filename; parallel_page_loading=true)

# enable parallel page loading for column `col1`
Dataset(filename; parallel_page_loading=(col1=true,))

# columns `col1` and `col2` will be written with 2 and 3 pages respectively, else 1 page
writefile(filename; npages=Dict("col1"=>2, "col2"=>3))

# `col1` will use snappy compression, all other columns will use zstd
writefile(filename; compression_codec=Parquet2.ColumnOption((col1=:snappy,), :zstd))

# All dictionary columns will be encoded as BSON
writefile(filename; bson_columns=Dict(AbstractDict=>true))
Parquet2.RowGroup - Type
RowGroup <: ParquetTable

A piece of a parquet table. All parquet files are organized into 1 or more RowGroups, each of which is a table in and of itself. RowGroup satisfies the Tables.jl columnar interface. Therefore, all row groups can be used as tables just like full Datasets. Typically different RowGroups are stored in different files and each file constitutes an entire RowGroup, though this is not enforced by the specification or by Parquet2.jl. Users are not expected to construct these objects themselves, as their schema is determined by the parquet metadata.

Datasets are indexable collections of RowGroups.

Usage

ds = Dataset("/path/to/parquet")

length(ds)  # gives the number of row groups

rg = ds[1]  # get first row group

c = rg[1]  # get first column
c = rg["column_name"]  # or by name

for c ∈ rg  # RowGroups are indexable collections of columns
    println(name(c))
end

df = DataFrame(rg)  # RowGroups are bona fide columnar tables themselves

# use TableOperations.jl to load only selected columns
df1 = rg |> TableOperations.select(:col1, :col2) |> DataFrame
Parquet2.Column - Type
Column

Data structure for organizing metadata and loading data of a parquet column object. These columns are the segments of columns referred to by individual row groups, not necessarily the entire columns of the master table schema. As such, these will have the same element type as the columns of the full table, but not necessarily the same number of values.

Usage

c = rg[n]  # returns nth `Column` from row group
c = rg["column_name"]  # retrieve by name

Parquet2.pages!(c)  # infer page schema of columns

Parquet2.name(c)  # get the name of c

Parquet2.filepath(c)  # get the path of the file containing c

v = Parquet2.load(c)  # load column values as a lazy AbstractVector

v[:]  # fully load values into memory
Parquet2.load - Function
load(ds::Dataset, n)

Load the (complete, all RowGroups) column n (integer or string) from the dataset.

load(c::Column)
load(rg::RowGroup, column_name)
load(ds::Dataset, column_name)

Deserialize values from a parquet column as an AbstractVector object. Options for this are defined when the file containing the column is first initialized.

The column can be specified either by its string name or by its integer index.
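
For example (column name hypothetical):

ds = Dataset("/path/to/parquet")
v = Parquet2.load(ds, "col1")   # full column concatenated across all row groups
v1 = Parquet2.load(ds[1], "col1")  # only the first row group's piece of the column
v2 = Parquet2.load(ds, 1)       # columns can also be selected by index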

DataAPI.metadata - Function
metadata(col::Column; style=false)

Get the key-value metadata for the column.

metadata(col::Column, k::AbstractString[, default]; style=false)

Get the key k from the key-value metadata for column col. If default is provided it will be returned if k is not present.

metadata(ds::Dataset; style=false)

Get the auxiliary key-value metadata for the dataset.

Note that Dataset does not support DataAPI.colmetadata because it contains one instance of each column per row group. To access column metadata either call metadata on Column objects or colmetadata on RowGroup objects.

metadata(ds::Dataset, k::AbstractString[, default]; style=false)

Get the key k from the key-value metadata for the dataset. If default is provided it will be returned if k is not present.
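
A short sketch (the key and column names are hypothetical):

Parquet2.metadata(ds)                        # all file-level key-value metadata
Parquet2.metadata(ds, "origin", "unknown")   # a single key with a default
rg = ds[1]
Parquet2.metadata(rg["col1"])                # key-value metadata of a column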

Parquet2.writeiterable! - Function
writeiterable!(fw::FileWriter, tbls)

Write each table returned by the iterable over Tables.jl compatible tables tbls to the parquet file. The file will then be finalized so that no further data can be written to it.
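
A minimal sketch, where tbl1 and tbl2 stand for any Tables.jl compatible tables and the output path is hypothetical:

fw = Parquet2.FileWriter("out.parquet")
Parquet2.writeiterable!(fw, (tbl1, tbl2))  # one row group per table; the file is then finalized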

Base.close - Method
close(ds::Dataset)

Close the Dataset, deleting all file buffers and row groups and freeing the memory. If the buffers are memory-mapped, this will free associated file handles. Note that memory and handles are only freed once garbage collection is executed (can be forced with GC.gc()).


Schema and Introspection

Parquet2.Page - Type
Page

Object containing metadata for parquet pages. These are essentially subsets of the data of a column. The raw data contained in the page can be accessed with view(page).

Parquet2.ColumnStatistics - Type
ColumnStatistics

A data structure for storing the statistics for a parquet column. The following functions are available for accessing statistics. In all cases, these will return nothing if the statistic was not included in the parquet metadata.

  • minimum(stats): The minimum value.
  • maximum(stats): The maximum value.
  • count(ismissing, stats): The number of missing values.
  • ndistinct(stats): The number of distinct values.

Can be obtained from a Column object with ColumnStatistics(col).
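
For example (column name hypothetical; each accessor returns nothing when the statistic is absent from the metadata):

c = ds[1]["col1"]
stats = Parquet2.ColumnStatistics(c)
minimum(stats), maximum(stats)
count(ismissing, stats)
Parquet2.ndistinct(stats)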

Parquet2.VectorWithStatistics - Type
VectorWithStatistics{𝒯,𝒮,𝒱<:AbstractVector{𝒯}} <: AbstractVector{𝒯}

A wrapper for an AbstractVector object which can store the following statistics:

  • minimum value, accessible with minimum(v)
  • maximum value, accessible with maximum(v)
  • number of missings, accessible with count(ismissing, v)
  • number of distinct elements, accessible with ndistinct(v).

Methods are provided so that the stored values are returned rather than re-computing the values when these functions are called. Note that a method is also provided for count(!ismissing, v) so this should also be efficient.

The use_statistics option for Dataset controls whether columns are loaded with statistics.
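
For example, a sketch assuming the file was written with statistics (column name hypothetical):

ds = Dataset("/path/to/parquet"; use_statistics=true)
v = Parquet2.load(ds, "col1")  # a `VectorWithStatistics` when statistics are present
minimum(v)                     # retrieved from the stored statistics, not re-computed
count(ismissing, v)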

Parquet2.ndistinct - Function
ndistinct(s::ColumnStatistics)

Returns the number of distinct elements in the column. nothing if not available.

ndistinct(v::AbstractVector)

Get the number of distinct elements in v. If v is a VectorWithStatistics, as returned from parquet columns when metadata is available, computation will be elided and the stored value will be used instead.

Parquet2.PageHeader - Type
PageHeader

Abstract type for parquet format page headers.

See the description of pages in the parquet format specification.

Parquet2.DataPageHeader - Type
DataPageHeader <: PageHeader

Header for a page of data. This type stores metadata for either the newer DataHeaderV2 or legacy DataHeader.

Parquet2.parqtype - Function
parqtype(t::Type; kw...)

Return the parquet type object corresponding to the provided Julia type.

The following keyword arguments should be provided for context only where appropriate:

  • decimal_scale=0: base 10 scale of a decimal number
  • decimal_precision=3: precision of a decimal number.
  • bson=false: whether serialization of dictionaries should prefer BSON to JSON.

Only one method with the signature ::Type is defined so as to avoid excessive run-time dispatch.

parqtype(s)

Gets the ParquetType for elements of the object s, e.g. a Column or SchemaNode. See the parquet specification for a description of these types.

Parquet2.juliatype - Function
juliatype(col::Column)

Get the element type of the AbstractVector the column is loaded into ignoring missings. For example, if the eltype is Union{Int,Missing} this will return Int.

See juliamissingtype for the exact type.
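
For example (column name hypothetical):

c = ds[1]["col1"]
Parquet2.juliatype(c)       # e.g. Float64, even if the eltype is Union{Float64,Missing}
Parquet2.parqtype(c)        # the parquet type of the column
Parquet2.parqtype(Float64)  # the parquet type corresponding to a Julia type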

Parquet2.nvalues - Function
nvalues(col::Column)

Returns the number of values in the column (i.e. number of rows).

Parquet2.isdictencoded - Function
isdictencoded(col::Column)

Returns true if all data in the column is dictionary encoded.

This will force the scanning of pages.

Parquet2.pages - Function
pages(col::Column)

Accesses the pages of the column, loading them if they are not already loaded. See pages! which is called by this in cases where pages are not already discovered.

Parquet2.pages! - Function
pages!(col::Column)

Infer the binary schema of the column pages and store Page objects that store references to data page locations. This function should typically be called only once as the objects discovered by this store all needed metadata. Calling this may invoke calls to retrieve data from the source. After calling this all data for the column is guaranteed to be stored in memory.
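
A short sketch of page-level introspection (the column name is hypothetical, and the result of pages is assumed to be an iterable collection of Page objects):

c = ds[1]["col1"]
Parquet2.pages!(c)          # scan the column buffer and record `Page` objects
pgs = Parquet2.pages(c)     # retrieve the pages (calls `pages!` if not yet discovered)
Parquet2.isdictencoded(c)   # whether all pages are dictionary encoded
view(first(pgs))            # raw data of the first page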


Internals

See Internals.