API
Parquet2.BitUnpackVector
Parquet2.Column
Parquet2.ColumnOption
Parquet2.ColumnStatistics
Parquet2.DataPageHeader
Parquet2.Dataset
Parquet2.DictionaryPageHeader
Parquet2.FileWriter
Parquet2.HybridIterator
Parquet2.OptionSet
Parquet2.Page
Parquet2.PageBuffer
Parquet2.PageHeader
Parquet2.PageIterator
Parquet2.PageLoader
Parquet2.ParqRefVector
Parquet2.ParquetType
Parquet2.PartitionNode
Parquet2.PooledVector
Parquet2.ReadOptions
Parquet2.RowGroup
Parquet2.SchemaNode
Parquet2.VectorWithStatistics
Parquet2.WriteOptions
Base.append!
Base.append!
Base.append!
Base.append!
Base.close
DataAPI.metadata
Parquet2.appendall!
Parquet2.bitjustify
Parquet2.bitmask
Parquet2.bitpack
Parquet2.bitpack!
Parquet2.bitwidth
Parquet2.bytewidth
Parquet2.decompressedpageview
Parquet2.encodehybrid_bitpacked
Parquet2.encodehybrid_rle
Parquet2.filelist
Parquet2.iscompressed
Parquet2.isdictencoded
Parquet2.juliamissingtype
Parquet2.juliatype
Parquet2.leb128decode
Parquet2.leb128encode
Parquet2.load
Parquet2.maxdeflevel
Parquet2.maxreplevel
Parquet2.ndistinct
Parquet2.nvalues
Parquet2.pages
Parquet2.pages!
Parquet2.parqtype
Parquet2.readfile
Parquet2.readfixed
Parquet2.showtree
Parquet2.writefile
Parquet2.writefixed
Parquet2.writeiterable!
Basic Usage
Parquet2.Dataset
— Type
Dataset <: ParquetTable
A complete parquet dataset created from top-level parquet metadata. Each Dataset is an indexable collection of RowGroups, each of which is a Tables.jl-compatible columnar table in its own right. The Dataset itself is a Tables.jl-compatible columnar table consisting of (lazily, by default) concatenated RowGroups. A Dataset can consist of data in any number of files depending on the directory structure of the referenced parquet.
Constructors
Dataset(fm::FileManager; kw...)
Dataset(p::AbstractPath; kw...)
Dataset(v::AbstractVector{UInt8}; kw...)
Dataset(io::IO; kw...)
Dataset(str::AbstractString; kw...)
Arguments
fm
: AFileManager
object describing a set of files to be loaded.p
: Path to main metadata file or directory containing a_metadata
file. Loading behavior will depend on the type of path provided.v
: An in-memory (or memory mapped) byte buffer.io
: AnIO
object from which data can be loaded.str
: File or directory path as a string. Converted toAbstractPath
withPath(str)
.
Keyword Arguments
The following keyword arguments are applicable to the dataset as a whole:
support_legacy (true): Some parquet writers take bizarre liberties with the metadata; in particular, many JVM-based writers use a specialized UInt96 encoding of timestamps even though this is not described by the metadata. When this option is false the metadata will be interpreted strictly.
use_mmap (true): Whether to use memory mapping when reading the file. Only applicable for files on the local file system. In some cases enabling this can drastically increase read performance.
mmap_shared (true): Whether memory-mapped buffers can be shared with other processes. See the documentation for Mmap.mmap.
preload (false): Whether all data should be fetched on construction of the Dataset regardless of the above options.
load_initial (nothing): Whether the RowGroups should be eagerly loaded into memory. If nothing, this will be done only for parquets consisting of a single file.
parallel_column_loading: Whether columns should be loaded using thread-based parallelism. If nothing, this is true as long as Julia has multiple threads available to it.
The following keyword arguments are applicable to specific columns. These can be passed either as a single value, a NamedTuple, an AbstractDict, or a ColumnOption. See ColumnOption for details.
allow_string_copying (false): Whether strings will be copied. If false, a reference to the underlying data buffer needs to be maintained, meaning it can't be garbage collected to free up memory. Note also that there will potentially be a large number of references stored in the output column if this is false, so setting this to true reduces garbage-collector overhead.
lazy_dictionary (true): Whether output columns will use a Julia categorical array representation, which in some cases can elide a large number of allocations.
parallel_page_loading (false): Whether data pages in the column should be loaded in parallel. This comes with some additional overhead, including an extra iteration over the entire page buffer, so it is of dubious benefit to turn this on, but it may be helpful when there is a large number of pages.
use_statistics (false): Whether statistics included in the metadata will be used in the loaded column AbstractVectors so that statistics can be efficiently retrieved rather than re-computed. Note that if this is true it will only be done for columns for which statistics are available. Otherwise, statistics can be retrieved with ColumnStatistics(col).
eager_page_scanning (true): It is not in general possible to infer all page metadata without iterating over the column's entire data buffer. This can be elided, but doing so limits what can be done to accommodate data loaded from the column. Turning this option off will reduce the overhead of loading metadata for the column but may increase the cost of allocating the output. If false, specialized string and dictionary outputs will not be used (loading the column will be maximally allocating).
Usage
ds = Dataset("/path/to/parquet")
ds = Dataset(p"s3://path/to/parquet") # understands different path types
length(ds) # gives number of row groups
rg = ds[1] # index to get row groups
for rg ∈ ds # is an indexable, iterable collection of row groups
println(rg)
end
df = DataFrame(ds) # Tables.jl compatible, is concatenation of all row groups
# use TableOperations.jl to load only selected columns
df = ds |> TableOperations.select(:col1, :col2) |> DataFrame
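As a sketch of how the dataset-level and column-level keyword arguments described above combine (the file path and column name here are hypothetical):

```julia
using Parquet2

# disable memory mapping for the whole dataset, and copy strings only for
# the (hypothetical) `name` column via a NamedTuple column option
ds = Parquet2.Dataset("/path/to/data.parquet";
                      use_mmap=false,
                      allow_string_copying=(name=true,))
```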
Parquet2.readfile
— Function
readfile(filename; kw...)
readfile(io::IO; kw...)
An alias for Dataset. All arguments are the same, so see those docs.
This function is provided for consistency with the writefile function.
Parquet2.FileWriter
— Type
FileWriter
Data structure holding metadata inferred during the process of writing a parquet file.
A full table can be written with writetable!; for a more detailed example, see below.
Constructors
FileWriter(io, path; kw...)
FileWriter(path; kw...)
Arguments
io: the IO object to which data will be written.
path: the path of the file being written. This is used in parquet metadata, which is why it is possible to specify the path separately from the IO stream.
Keyword Arguments
The following arguments are relevant for the entire file:
metadata (Dict()): Additional metadata to append at file level. Must be an AbstractDict; the keys and values must both be strings. This can be accessed from a written file with Parquet2.metadata.
propagate_table_metadata (true): Whether to propagate table metadata provided by the DataAPI.jl metadata interface for tables written to this file. If true and multiple tables are written, the metadata will be merged. If this is undesirable, users should set this to false and set the metadata via the metadata argument instead. The metadata argument above will be merged with table metadata (with metadata from the option taking precedence).
The following arguments apply to specific columns and can be provided as a single value, NamedTuple, AbstractDict, or ColumnOption. See ColumnOption for details.
npages (1): The number of pages to write. Some parquet readers are more efficient at reading multiple pages for large numbers of columns, but for the most part there's no reason to change this.
compression_codec (:snappy): Compression codec to use. Available options are :uncompressed, :snappy, :gzip, and :zstd.
column_metadata (Dict()): Additional metadata for specific columns. This works the same way as file-level metadata and must be a dictionary with string keys and values. Can be accessed from a written file by calling Parquet2.metadata on column objects.
compute_statistics (false): Whether column statistics (minimum, maximum, number of nulls) should be computed when the file is written and stored in the metadata. When read back with Dataset, the loaded columns will be wrapped in a struct allowing these statistics to be efficiently retrieved; see VectorWithStatistics.
json_columns (false): Columns which should be JSON encoded. Columns with types which can be naturally encoded as JSON but which have no other supported encoding, that is AbstractVector and AbstractDict columns, will be JSON encoded regardless of the value of this argument.
bson_columns (false): Columns which should be BSON encoded. By default, columns which need special encoding are JSON encoded, so they must be specified here to force them to be BSON.
propagate_col_metadata (true): Whether to propagate column metadata provided by the DataAPI.jl metadata interface. Metadata set with the column_metadata argument will be merged with this, with the former taking precedence.
Examples
open(filename, write=true) do io
    fw = Parquet2.FileWriter(io)
    Parquet2.writeiterable!(fw, tbls)  # write tables as separate row groups; finalization is done automatically
end
df = DataFrame(A=1:5, B=randn(5))
# use `writefile` to write in a single call
writefile(filename, df)
# write to `IO` object
io = IOBuffer()
writefile(io, df)
# write to an `AbstractVector` buffer.
v = writefile(Vector{UInt8}, df)
Parquet2.writefile
— Function
writefile(io::IO, path, tbl; kw...)
writefile(path, tbl; kw...)
Write the Tables.jl-compatible table tbl to the IO object or to the file at path. Note that the path is used in parquet metadata, which is why it is possible to specify the path separately from the io stream. See FileWriter for a description of all possible arguments.
This function writes a file all in one call. Files will be written as one parquet row group per table partition. An intermediate FileWriter object is used.
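As a sketch of how several of the FileWriter keyword arguments can be combined in a single writefile call (the file name and metadata keys here are hypothetical):

```julia
using Parquet2, DataFrames

df = DataFrame(a=1:3, b=["x", "y", "z"])

# one-call write with zstd compression, file-level metadata, and stored statistics
Parquet2.writefile("example.parquet", df;
                   compression_codec=:zstd,
                   metadata=Dict("source"=>"demo"),
                   compute_statistics=true)
```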
Parquet2.filelist
— Function
filelist(ds::Dataset)
filelist(fm::FileManager)
Returns an AbstractVector containing the paths of all files associated with the dataset.
Parquet2.showtree
— Function
showtree([io=stdout,] ds::Dataset)
Show the "hive/drill" directory tree of the dataset. The pairs printed in this tree can be passed as arguments to append! to append the corresponding row group to the dataset.
Base.append!
— Method
append!(ds::Parquet2.Dataset, col=>val...; check=true)
append!(ds::Parquet2.Dataset; check=true, kw...)
Append row groups for which the columns specified by col have the value val. This applies only to "hive/drill" partition columns in file trees, therefore col and val must both be strings. The selected row groups must satisfy all passed pairs.
Alternatively, these can be passed as keyword arguments with the column names as the keys and the (string) values as the value constraints.
Examples
◖◗ showtree(ds)
Root()
├─ "A" => "1"
│  └─ "B" => "alpha"
├─ "A" => "2"
│  └─ "B" => "alpha"
└─ "A" => "3"
   └─ "B" => "beta"
◖◗ append!(ds, "A"=>"2", "B"=>"alpha", verbose=true);
[ Info: appended row group from file $HOME/data/hive_fastparquet.parq/A=2/B=alpha/part.0.parquet
◖◗ append!(ds, A="3", B="alpha");  # in this case nothing is appended since no such row group exists
Base.append!
— Method
append!(ds::Dataset, i::Integer; check=true)
Append row group number i to the dataset. The index i is the index into the array returned by filelist; that is, this is equivalent to append!(ds, filelist(ds)[i]).
Base.append!
— Method
append!(ds::Parquet2.Dataset, p; check=true, verbose=false)
Append all row groups from the file p to the dataset's row group metadata. If check, will first check that the path is a valid parquet file. p must be a path that was discovered during the initial construction of the dataset.
If verbose=true, an INFO-level logging message will be printed for each appended row group.
Parquet2.appendall!
— Function
appendall!(ds::Dataset; check=true)
Append all row groups to the dataset.
WARNING: Some parquet directory trees can be huge. This function does nothing to check that what you are about to do is a good idea, so use it judiciously.
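A sketch of the common pattern for multi-file datasets, whose row groups are not loaded eagerly by default (the directory path here is hypothetical):

```julia
using Parquet2, DataFrames

ds = Parquet2.Dataset("/path/to/hive_tree")  # multi-file: row groups are not yet appended
Parquet2.appendall!(ds)                      # append every discovered row group
df = DataFrame(ds)                           # materialize the full concatenated table
```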
Parquet2.ColumnOption
— Type
ColumnOption{𝒯}
A container for a column-specific read or write option with value type 𝒯. Contains sets of names and types for determining what option to apply to a column. Column-specific keyword arguments passed to Dataset and FileWriter will be converted to ColumnOptions.
The provided argument must be one of the following:
- A single value of the appropriate type, in which case this option will be applied to all columns.
- A NamedTuple, the keys of which are column names and the values of which are the values to be applied to the corresponding columns. Columns not listed will use the default option for that keyword argument.
- An AbstractDict, the keys of which are the column names as strings. This works analogously to NamedTuple.
- An AbstractDict, the keys of which are types and the values of which are options to be applied to all columns with element types which are subtypes of the provided type.
- A Pair, which will be treated as a dictionary with a single entry.
Constructors
ColumnOption(dict_value_or_namedtuple, default)
Users may wish to construct a ColumnOption and pass it as an argument to set their own default.
Examples
# enable parallel page loading for *all* columns
Dataset(filename; parallel_page_loading=true)
# enable parallel page loading for column `col1`
Dataset(filename; parallel_page_loading=(col1=true,))
# columns `col1` and `col2` will be written with 2 and 3 pages respectively, else 1 page
writefile(filename, tbl; npages=Dict("col1"=>2, "col2"=>3))
# `col1` will use snappy compression, all other columns will use zstd
writefile(filename, tbl; compression_codec=Parquet2.ColumnOption((col1=:snappy,), :zstd))
# All dictionary columns will be encoded as BSON
writefile(filename, tbl; bson_columns=Dict(AbstractDict=>true))
Parquet2.RowGroup
— Type
RowGroup <: ParquetTable
A piece of a parquet table. All parquet files are organized into one or more RowGroups, each of which is a table in and of itself. RowGroup satisfies the Tables.jl columnar interface, so all row groups can be used as tables just like full Datasets. Typically different RowGroups are stored in different files and each file constitutes an entire RowGroup, though this is not enforced by the specification or by Parquet2.jl. Users are not expected to construct these tables themselves, as their schema is constructed from parquet metadata.
Datasets are indexable collections of RowGroups.
Usage
ds = Dataset("/path/to/parquet")
length(ds) # gives the number of row groups
rg = ds[1] # get first row group
c = rg[1] # get first column
c = rg["column_name"] # or by name
for c ∈ rg # RowGroups are indexable collections of columns
println(name(c))
end
df = DataFrame(rg) # RowGroups are bona fide columnar tables themselves
# use TableOperations.jl to load only selected columns
df1 = rg |> TableOperations.select(:col1, :col2) |> DataFrame
Parquet2.Column
— Type
Column
Data structure for organizing metadata and loading data for a parquet column object. These columns are the segments of columns referred to by individual row groups, not necessarily the entire columns of the master table schema. As such, they will have the same types as the columns in the full table but not necessarily the same number of values.
Usage
c = rg[n] # returns nth `Column` from row group
c = rg["column_name"] # retrieve by name
Parquet2.pages!(c) # infer page schema of columns
Parquet2.name(c) # get the name of c
Parquet2.filepath(c) # get the path of the file containing c
v = Parquet2.load(c) # load column values as a lazy AbstractVector
v[:] # fully load values into memory
Parquet2.load
— Function
load(ds::Dataset, n)
Load the complete (all RowGroups) column n (integer or string) from the dataset.
load(c::Column)
load(rg::RowGroup, column_name)
load(ds::Dataset, column_name)
Deserialize values from a parquet column as an AbstractVector object. Options for this are defined when the file containing the column is first initialized.
The column can be specified either by string name or by integer column number.
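A sketch of the load methods listed above (the file path and column name are hypothetical):

```julia
using Parquet2

ds = Parquet2.Dataset("/path/to/data.parquet")
v = Parquet2.load(ds, "col1")  # full column concatenated across all row groups
w = Parquet2.load(ds, 1)       # same, by column number
rg = ds[1]
u = Parquet2.load(rg, "col1")  # column restricted to the first row group
```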
DataAPI.metadata
— Function
metadata(col::Column; style=false)
Get the key-value metadata for the column.
metadata(col::Column, k::AbstractString[, default]; style=false)
Get the key k from the key-value metadata for column col. If default is provided, it will be returned when k is not present.
metadata(ds::Dataset; style=false)
Get the auxiliary key-value metadata for the dataset.
Note that Dataset does not support DataAPI.colmetadata because it contains one instance of each column per row group. To access column metadata, either call metadata on Column objects or colmetadata on RowGroup objects.
metadata(ds::Dataset, k::AbstractString[, default]; style=false)
Get the key k from the key-value metadata for the dataset. If default is provided, it will be returned when k is not present.
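A sketch of the metadata accessors above (the file path and metadata key are hypothetical):

```julia
using Parquet2, DataAPI

ds = Parquet2.Dataset("/path/to/data.parquet")
DataAPI.metadata(ds)                       # file-level key-value metadata
DataAPI.metadata(ds, "source", "unknown")  # single key with a fallback default

c = ds[1]["col1"]        # a Column from the first row group
DataAPI.metadata(c)      # column-level key-value metadata
```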
Parquet2.writeiterable!
— Function
writeiterable!(fw::FileWriter, tbls)
Write each table from tbls, an iterable of Tables.jl-compatible tables, to the parquet file. The file will then be finalized so that no further data can be written to it.
Base.close
— Method
close(ds::Dataset)
Close the Dataset, deleting all file buffers and row groups and freeing the memory. If the buffers are memory mapped, this will free the associated file handles. Note that memory and handles are only freed once garbage collection runs (this can be forced with GC.gc()).
Schema and Introspection
Parquet2.SchemaNode
— Type
SchemaNode
Represents a single node in a parquet schema tree. Satisfies the AbstractTrees interface.
Parquet2.PartitionNode
— Type
PartitionNode
Representation of a node in a hive parquet schema partition tree. Satisfies the AbstractTrees interface.
Parquet2.Page
— Type
Page
Object containing metadata for parquet pages. These are essentially subsets of the data of a column. The raw data contained in the page can be accessed with view(page).
Parquet2.ColumnStatistics
— Type
ColumnStatistics
A data structure for storing the statistics of a parquet column. The following functions are available for accessing statistics; in all cases they return nothing if the statistic was not included in the parquet metadata.
minimum(stats): The minimum value.
maximum(stats): The maximum value.
count(ismissing, stats): The number of missing values.
ndistinct(stats): The number of distinct values.
Can be obtained from a Column object with ColumnStatistics(col).
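A sketch of the accessors above (the file path and column name are hypothetical):

```julia
using Parquet2

ds = Parquet2.Dataset("/path/to/data.parquet")
c = ds[1]["col1"]
stats = Parquet2.ColumnStatistics(c)

minimum(stats)             # minimum value, or nothing if not in the metadata
maximum(stats)             # maximum value, or nothing
count(ismissing, stats)    # number of nulls, or nothing
Parquet2.ndistinct(stats)  # number of distinct values, or nothing
```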
Parquet2.VectorWithStatistics
— Type
VectorWithStatistics{𝒯,𝒮,𝒱<:AbstractVector{𝒯}} <: AbstractVector{𝒯}
A wrapper for an AbstractVector object which can store the following statistics:
- minimum value, accessible with minimum(v)
- maximum value, accessible with maximum(v)
- number of missings, accessible with count(ismissing, v)
- number of distinct elements, accessible with ndistinct(v)
Methods are provided so that the stored values are returned rather than re-computed when these functions are called. Note that a method is also provided for count(!ismissing, v), so this should also be efficient.
The use_statistics option for Dataset controls whether columns are loaded with statistics.
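A sketch of a write-then-read roundtrip in which statistics are stored on write (compute_statistics) and exploited on load (use_statistics), using an in-memory buffer:

```julia
using Parquet2, DataFrames

df = DataFrame(x=rand(10_000))
io = IOBuffer()
Parquet2.writefile(io, df; compute_statistics=true)

# re-open the written bytes; loaded columns wrap the stored statistics
ds = Parquet2.Dataset(take!(io); use_statistics=true)
v = Parquet2.load(ds, "x")
minimum(v)  # returned from the stored statistic rather than scanning the data
```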
Parquet2.ndistinct
— Function
ndistinct(s::ColumnStatistics)
Returns the number of distinct elements in the column, or nothing if not available.
ndistinct(v::AbstractVector)
Get the number of distinct elements in v. If v is a VectorWithStatistics, as returned from parquet columns when metadata is available, computation will be elided and the stored value used instead.
Parquet2.PageHeader
— Type
PageHeader
Abstract type for parquet format page headers.
See the description of pages in the parquet format specification.
Parquet2.DataPageHeader
— Type
DataPageHeader <: PageHeader
Header for a page of data. This type stores metadata for either the newer DataHeaderV2 or the legacy DataHeader.
Parquet2.DictionaryPageHeader
— Type
DictionaryPageHeader <: PageHeader
Header for pages storing dictionary reference values.
Parquet2.parqtype
— Function
parqtype(t::Type; kw...)
Return the parquet type object corresponding to the provided Julia type.
The following keyword arguments should be provided for context only where appropriate:
decimal_scale=0: base-10 scale of a decimal number.
decimal_precision=3: precision of a decimal number.
bson=false: whether serialization of dictionaries should prefer BSON over JSON.
Only one method with the signature ::Type is defined so as to avoid excessive run-time dispatch.
parqtype(s)
Gets the ParquetType for elements of the object s, e.g. a Column or SchemaNode. See the logical types section of the parquet specification.
Parquet2.juliatype
— Function
juliatype(col::Column)
Get the element type of the AbstractVector the column is loaded into, ignoring missings. For example, if the eltype is Union{Int,Missing}, this will return Int.
See juliamissingtype for the exact type.
Parquet2.juliamissingtype
— Function
juliamissingtype(col::Column)
Returns the element type of the AbstractVector that is returned by load(col).
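A sketch contrasting the two type accessors above (the file path and column name are hypothetical):

```julia
using Parquet2

ds = Parquet2.Dataset("/path/to/data.parquet")
c = ds[1]["col1"]
Parquet2.juliamissingtype(c)  # e.g. Union{Int64, Missing} if the column is nullable
Parquet2.juliatype(c)         # e.g. Int64, the same type with Missing stripped
```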
Parquet2.nvalues
— Function
nvalues(col::Column)
Returns the number of values in the column (i.e. number of rows).
Parquet2.iscompressed
— Function
iscompressed(col::Column)
Whether the column is compressed.
Parquet2.isdictencoded
— Function
isdictencoded(col::Column)
Returns true if all data in the column is dictionary encoded.
This will force the scanning of pages.
Parquet2.pages
— Function
pages(col::Column)
Accesses the pages of the column, loading them if they are not already loaded. See pages!, which this calls in cases where pages have not already been discovered.
Parquet2.pages!
— Function
pages!(col::Column)
Infer the binary schema of the column's pages and store Page objects holding references to data page locations. This function should typically be called only once, as the objects it discovers store all needed metadata. Calling this may retrieve data from the source; after calling it, all data for the column is guaranteed to be stored in memory.
Internals
See Internals.