API
Basic Usage
Parquet2.Dataset — Type

Dataset <: ParquetTable

A complete parquet dataset created from top-level parquet metadata. Each Dataset is an indexable collection of RowGroups, each of which is a Tables.jl compatible columnar table in its own right. The Dataset is a Tables.jl compatible columnar table consisting of (lazily, by default) concatenated RowGroups. A Dataset can consist of data in any number of files depending on the directory structure of the referenced parquet.
Constructors
Dataset(fm::FileManager; kw...)
Dataset(p::AbstractPath; kw...)
Dataset(v::AbstractVector{UInt8}; kw...)
Dataset(io::IO; kw...)
Dataset(str::AbstractString; kw...)

Arguments
- `fm`: A `FileManager` object describing a set of files to be loaded.
- `p`: Path to the main metadata file or a directory containing a `_metadata` file. Loading behavior will depend on the type of path provided.
- `v`: An in-memory (or memory mapped) byte buffer.
- `io`: An `IO` object from which data can be loaded.
- `str`: File or directory path as a string. Converted to `AbstractPath` with `Path(str)`.
Keyword Arguments
The following keyword arguments are applicable for the dataset as a whole:
- `support_legacy` (`true`): Some parquet writers take bizarre liberties with the metadata; in particular, many JVM-based writers use a specialized `UInt96` encoding of timestamps even though this is not described by the metadata. When this option is `false` the metadata will be interpreted strictly.
- `use_mmap` (`true`): Whether to use memory mapping for reading the file. Only applicable for files on the local file system. In some cases enabling this can drastically increase read performance.
- `mmap_shared` (`true`): Whether memory mapped buffers can be shared with other processes. See the documentation for `Mmap.mmap`.
- `preload` (`false`): Whether all data should be fetched on constructing the `Dataset` regardless of the above options.
- `load_initial` (`nothing`): Whether the `RowGroup`s should be eagerly loaded into memory. If `nothing`, this will be done only for parquets consisting of a single file.
- `parallel_column_loading`: Whether columns should be loaded using thread-based parallelism. If `nothing`, this is true as long as Julia has multiple threads available to it.
The following keyword arguments are applicable to specific columns. These can be passed either as a single value, a NamedTuple, AbstractDict or ColumnOption. See ColumnOption for details.
- `allow_string_copying` (`false`): Whether strings will be copied. If `false`, a reference to the underlying data buffer needs to be maintained, meaning it can't be garbage collected to free up memory. Note also that there will potentially be a large number of references stored in the output column if this is `false`, so setting this to `true` reduces garbage collector overhead.
- `lazy_dictionary` (`true`): Whether output columns will use a Julia categorical array representation, which in some cases can elide a large number of allocations.
- `parallel_page_loading` (`false`): Whether data pages in the column should be loaded in parallel. This comes with some additional overhead, including an extra iteration over the entire page buffer, so it is of dubious benefit to turn this on, but it may be helpful in cases in which there is a large number of pages.
- `use_statistics` (`false`): Whether statistics included in the metadata will be used in the loaded column `AbstractVector`s so that statistics can be efficiently retrieved rather than being re-computed. Note that if this is `true` this will only be done for columns for which statistics are available. Otherwise, statistics can be retrieved with `ColumnStatistics(col)`.
- `eager_page_scanning` (`true`): It is not in general possible to infer all page metadata without iterating over the column's entire data buffer. This can be elided, but doing so limits what can be done to accommodate data loaded from the column. Turning this option off will reduce the overhead of loading metadata for the column but may increase the cost of allocating the output. If `false`, specialized string and dictionary outputs will not be used (loading the column will be maximally allocating).
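For illustration, a minimal sketch combining a dataset-wide option with column-specific ones (the file name and the column name col1 are placeholders):

ds = Dataset("data.parquet";
    use_mmap=true,                # dataset-wide: memory map the local file
    allow_string_copying=true,    # single value: applies to every column
    use_statistics=(col1=true,),  # NamedTuple: applies only to column `col1`
)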
Usage
ds = Dataset("/path/to/parquet")
ds = Dataset(p"s3://path/to/parquet") # understands different path types
length(ds) # gives number of row groups
rg = ds[1] # index to get row groups
for rg ∈ ds # is an indexable, iterable collection of row groups
println(rg)
end
df = DataFrame(ds) # Tables.jl compatible, is concatenation of all row groups
# use TableOperations.jl to load only selected columns
df = ds |> TableOperations.select(:col1, :col2) |> DataFrame

Parquet2.readfile — Function

readfile(filename; kw...)
readfile(io::IO; kw...)

An alias for Dataset. All arguments are the same, so see those docs.
This function is provided for consistency with the writefile function.
Parquet2.FileWriter — Type

FileWriter

Data structure holding metadata inferred during the process of writing a parquet file.
A full table can be written with writetable!; for a more detailed example, see below.
Constructors
FileWriter(io, path; kw...)
FileWriter(path; kw...)

Arguments
- `io`: the `IO` object to which data will be written.
- `path`: the path of the file being written. This is used in parquet metadata, which is why it is possible to specify the path separately from the `IO` stream.
Keyword Arguments
The following arguments are relevant for the entire file:
- `metadata` (`Dict()`): Additional metadata to append at file level. Must provide an `AbstractDict`; the keys and values must both be strings. This can be accessed from a written file with `Parquet2.metadata`.
- `propagate_table_metadata` (`true`): Whether to propagate table metadata provided by the DataAPI.jl metadata interface for tables written to this file. If `true` and multiple tables are written, the metadata will be merged. If this is undesirable users should set this to `false` and set via `metadata` instead. The `metadata` argument above will be merged with table metadata (with metadata from the option taking precedence).
The following arguments apply to specific columns and can be provided as a single value, NamedTuple, AbstractDict or ColumnOption. See ColumnOption for details.
- `npages` (`1`): The number of pages to write. Some parquet readers are more efficient at reading multiple pages for large numbers of columns, but for the most part there's no reason to change this.
- `compression_codec` (`:snappy`): Compression codec to use. Available options are `:uncompressed`, `:snappy`, `:gzip`, `:brotli`, and `:zstd`.
- `column_metadata` (`Dict()`): Additional metadata for specific columns. This works the same way as file-level `metadata` and must be a dictionary with string keys and values. Can be accessed from a written file by calling `Parquet2.metadata` on column objects.
- `compute_statistics` (`false`): Whether column statistics (minimum, maximum, number of nulls) should be computed when the file is written and stored in metadata. When read back with `Dataset`, the loaded columns will be wrapped in a struct allowing these statistics to be efficiently retrieved; see `VectorWithStatistics`.
- `json_columns` (`false`): Columns which should be JSON encoded. Columns with types which can be naturally encoded as JSON but which have no other supported types, that is `AbstractVector` and `AbstractDict` columns, will be JSON encoded regardless of the value of this argument.
- `bson_columns` (`false`): Columns which should be BSON encoded. By default, columns which need special encoding are JSON encoded, so they must be specified here to force them to be BSON.
- `propagate_col_metadata` (`true`): Whether to propagate column metadata provided by the DataAPI.jl metadata interface. Metadata set with the `column_metadata` argument will be merged with this, with the former taking precedence.
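A rough sketch of how these writer options might be combined; tbl stands for any Tables.jl compatible table, and the file and column names are placeholders:

writefile("example.parquet", tbl;
    metadata=Dict("source"=>"sensor A"),  # file-level key-value metadata (string keys and values)
    compression_codec=:zstd,              # single value: every column uses zstd
    npages=Dict("col1"=>2),               # only `col1` is split into 2 pages
    compute_statistics=true,              # store min/max/null counts in the metadata
)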
Examples
open(filename, write=true) do io
fw = Parquet2.FileWriter(io)
    Parquet2.writeiterable!(fw, tbls) # write tables as separate row groups, finalization is done automatically
end
df = DataFrame(A=1:5, B=randn(5))
# use `writefile` to write in a single call
writefile(filename, df)
# write to `IO` object
io = IOBuffer()
writefile(io, df)
# write to an `AbstractVector` buffer.
v = writefile(Vector{UInt8}, df)

Parquet2.writefile — Function

writefile(io::IO, path, tbl; kw...)
writefile(path, tbl; kw...)

Write the Tables.jl compatible table tbl to the IO or the file at path. Note that the path is used in parquet metadata, which is why it is possible to specify the path separately from the io stream. See FileWriter for a description of all possible arguments.
This function writes a file all in one call. Files will be written as one parquet row group per table partition. An intermediate FileWriter object is used.
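As a sketch of the row-group-per-partition behavior (assuming the partitions are supplied with Tables.partitioner from Tables.jl; the file name is a placeholder):

using Tables
tbl1 = (A=1:3, B=rand(3))
tbl2 = (A=4:6, B=rand(3))
writefile("two_groups.parquet", Tables.partitioner([tbl1, tbl2]))
ds = Dataset("two_groups.parquet")
length(ds)  # expected to be 2: one row group per partition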
Parquet2.filelist — Function

filelist(ds::Dataset)
filelist(fm::FileManager)

Returns an AbstractVector containing the paths of all files associated with the dataset.
Parquet2.showtree — Function

showtree([io=stdout,] ds::Dataset)

Show the "hive/drill" directory tree of the dataset. The pairs printed in this tree can be passed as arguments to append! to append the corresponding row group to the dataset.
Base.append! — Method

append!(ds::Parquet2.Dataset, col=>val...; check=true)
append!(ds::Parquet2.Dataset; check=true, kw...)

Append row groups for which the columns specified by col have the value val. This applies only to "hive/drill" partition columns in file trees, therefore col and val must both be strings. The selected row groups must satisfy all passed pairs.
Alternatively, these can be passed as keyword arguments with the column names as the keys and the (string) values as the value constraints.
Examples
◖◗ showtree(ds)
Root()
├─ "A" => "1"
│ └─ "B" => "alpha"
├─ "A" => "2"
│ └─ "B" => "alpha"
└─ "A" => "3"
└─ "B" => "beta"
◖◗ append!(ds, "A"=>"2", "B"=>"alpha", verbose=true);
[ Info: appended row group from file $HOME/data/hive_fastparquet.parq/A=2/B=alpha/part.0.parquet
◖◗ append!(ds, A="3", B="alpha"); # in this case nothing is appended since no such row group exists

Base.append! — Method

append!(ds::Dataset, i::Integer; check=true)

Append row group number i to the dataset. The index i is the index of the array returned by filelist, that is, this is equivalent to append!(ds, filelist(ds)[i]).
Base.append! — Method

append!(ds::Parquet2.Dataset, p; check=true, verbose=false)

Append all row groups from the file p to the dataset row group metadata. If check, will check if the path is a valid parquet file first. p must be a path that was discovered during the initial construction of the dataset.
If verbose=true an INFO level logging message will be printed for each appended row group.
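A brief sketch of the index and path forms (the indices refer to the array returned by filelist; the dataset layout is hypothetical):

fs = Parquet2.filelist(ds)
append!(ds, 2)                    # same as append!(ds, fs[2])
append!(ds, fs[3]; verbose=true)  # append by explicit path, logging each appended row group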
Parquet2.appendall! — Function

appendall!(ds::Dataset; check=true)

Append all row groups to the dataset.
WARNING: Some parquet directory trees can be huge. This function does nothing to check that what you are about to do is a good idea, so use it judiciously.
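A typical workflow for a hive-partitioned directory whose row groups have not yet been appended might look like this sketch (the directory path is a placeholder):

ds = Dataset("/path/to/parquet_dir")  # partition row groups are not included yet
Parquet2.appendall!(ds)               # add the row groups from every file in the tree
df = DataFrame(ds)                    # the concatenation of all appended row groups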
Parquet2.ColumnOption — Type

ColumnOption{𝒯}

A container for a column-specific read or write option with value type 𝒯. Contains sets of names and types for determining what option to apply to a column. Column-specific keyword arguments passed to Dataset and FileWriter will be converted to ColumnOptions.
The provided argument must be one of the following:

- A single value of the appropriate type, in which case this option will be applied to all columns.
- A `NamedTuple` the keys of which are column names and the values of which are the value to be applied to the corresponding column. Columns not listed will use the default option for that keyword argument.
- An `AbstractDict` the keys of which are the column names as strings. This works analogously to the `NamedTuple` case.
- An `AbstractDict` the keys of which are types and the values of which are options to be applied to all columns with element types which are subtypes of the provided type.
- A `Pair`, which will be treated as a dictionary with a single entry.
Constructors
ColumnOption(dict_value_or_namedtuple, default)

Users may wish to construct a ColumnOption and pass it as an argument to set their own default.
Examples
# enable parallel page loading for *all* columns
Dataset(filename; parallel_page_loading=true)
# enable parallel page loading for column `col1`
Dataset(filename; parallel_page_loading=(col1=true,))
# columns `col1` and `col2` will be written with 2 and 3 pages respectively, else 1 page
writefile(filename, tbl; npages=Dict("col1"=>2, "col2"=>3))
# `col1` will use snappy compression, all other columns will use zstd
writefile(filename, tbl; compression_codec=Parquet2.ColumnOption((col1=:snappy,), :zstd))
# All dictionary columns will be encoded as BSON
writefile(filename, tbl; bson_columns=Dict(AbstractDict=>true))

Parquet2.RowGroup — Type

RowGroup <: ParquetTable

A piece of a parquet table. All parquet files are organized into 1 or more RowGroups, each of which is a table in and of itself. RowGroup satisfies the Tables.jl columnar interface; therefore, all row groups can be used as tables just like full Datasets. Typically different RowGroups are stored in different files and each file constitutes an entire RowGroup, though this is not enforced by the specification or Parquet2.jl. Users are not expected to construct these objects themselves, as their schema is constructed from parquet metadata.
Datasets are indexable collections of RowGroups.
Usage
ds = Dataset("/path/to/parquet")
length(ds) # gives the number of row groups
rg = ds[1] # get first row group
c = rg[1] # get first column
c = rg["column_name"] # or by name
for c ∈ rg # RowGroups are indexable collections of columns
println(name(c))
end
df = DataFrame(rg) # RowGroups are bona fide columnar tables themselves
# use TableOperations.jl to load only selected columns
df1 = rg |> TableOperations.select(:col1, :col2) |> DataFrame

Parquet2.Column — Type

Column

Data structure for organizing metadata and loading data of a parquet column object. These columns are the segments of columns referred to by individual row groups, not necessarily the entire columns of the master table schema. As such, these will have the same type as the columns in the full table but not necessarily the same number of values.
Usage
c = rg[n] # returns nth `Column` from row group
c = rg["column_name"] # retrieve by name
Parquet2.pages!(c) # infer page schema of columns
Parquet2.name(c) # get the name of c
Parquet2.filepath(c) # get the path of the file containing c
v = Parquet2.load(c) # load column values as a lazy AbstractVector
v[:] # fully load values into memory

Parquet2.load — Function

load(ds::Dataset, n)

Load the (complete, all RowGroups) column n (integer or string) from the dataset.
load(c::Column)
load(rg::RowGroup, column_name)
load(ds::Dataset, column_name)

Deserialize values from a parquet column as an AbstractVector object. Options for this are defined when the file containing the column is first initialized.
Column name can be either a string column name or an integer column number.
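A short sketch of the different forms (column names are placeholders):

v = Parquet2.load(ds, "col1")  # complete column, concatenated across all row groups
v = Parquet2.load(ds, 1)       # the same, by column number
rg = ds[1]
v = Parquet2.load(rg, "col1")  # only this row group's piece of the column
v = Parquet2.load(rg["col1"])  # equivalent, going through the Column object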
DataAPI.metadata — Function

metadata(col::Column; style=false)

Get the key-value metadata for the column.
metadata(col::Column, k::AbstractString[, default]; style=false)

Get the key k from the key-value metadata for column col. If default is provided it will be returned if k is not present.
metadata(ds::Dataset; style=false)

Get the auxiliary key-value metadata for the dataset.
Note that Dataset does not support DataAPI.colmetadata because it contains one instance of each column per row group. To access column metadata either call metadata on Column objects or colmetadata on RowGroup objects.
metadata(ds::Dataset, k::AbstractString[, default]; style=false)

Get the key k from the key-value metadata for the dataset. If default is provided it will be returned if k is not present.
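A sketch of a metadata round trip (the file name, keys, and values are placeholders):

tbl = (A=1:3, B=rand(3))
writefile("meta.parquet", tbl;
    metadata=Dict("origin"=>"unit test"),          # file-level metadata
    column_metadata=(A=Dict("units"=>"meters"),),  # metadata only for column `A`
)
ds = Dataset("meta.parquet")
Parquet2.metadata(ds, "origin")         # expected to return "unit test"
Parquet2.metadata(ds[1]["A"], "units")  # expected to return "meters"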
Parquet2.writeiterable! — Function

writeiterable!(fw::FileWriter, tbls)

Write each table from tbls, an iterable of Tables.jl compatible tables, to the parquet file, one row group per table. The file will then be finalized so that no further data can be written to it.
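A self-contained sketch writing two row groups through a FileWriter (the file name is a placeholder):

tbls = [(A=1:3, B=rand(3)), (A=4:6, B=rand(3))]
open("two_rowgroups.parquet", write=true) do io
    fw = Parquet2.FileWriter(io, "two_rowgroups.parquet")
    Parquet2.writeiterable!(fw, tbls)  # one row group per table; the file is finalized afterwards
end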
Base.close — Method

close(ds::Dataset)

Close the Dataset, deleting all file buffers and row groups and freeing the memory. If the buffers are memory-mapped, this will free associated file handles. Note that memory and handles are only freed once garbage collection is executed (can be forced with GC.gc()).
Schema and Introspection
Parquet2.SchemaNode — Type

SchemaNode

Represents a single node in a parquet schema tree. Satisfies the AbstractTrees interface.
Parquet2.PartitionNode — Type

PartitionNode

Representation of a node in a hive parquet schema partition tree. Satisfies the AbstractTrees interface.
Parquet2.Page — Type

Page

Object containing metadata for parquet pages. These are essentially subsets of the data of a column. The raw data contained in the page can be accessed with view(page).
Parquet2.ColumnStatistics — Type

ColumnStatistics

A data structure for storing the statistics for a parquet column. The following functions are available for accessing statistics. In all cases, they will return nothing if the statistic was not included in the parquet metadata.
- `minimum(stats)`: The minimum value.
- `maximum(stats)`: The maximum value.
- `count(ismissing, stats)`: The number of missing values.
- `ndistinct(stats)`: The number of distinct values.
Can be obtained from a Column object with ColumnStatistics(col).
Parquet2.VectorWithStatistics — Type

VectorWithStatistics{𝒯,𝒮,𝒱<:AbstractVector{𝒯}} <: AbstractVector{𝒯}

A wrapper for an AbstractVector object which can store the following statistics:
- minimum value, accessible with `minimum(v)`
- maximum value, accessible with `maximum(v)`
- number of missings, accessible with `count(ismissing, v)`
- number of distinct elements, accessible with `ndistinct(v)`
Methods are provided so that the stored values are returned rather than re-computing the values when these functions are called. Note that a method is also provided for count(!ismissing, v) so this should also be efficient.
The use_statistics option for Dataset controls whether columns are loaded with statistics.
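A sketch of the statistics round trip (the file name is a placeholder): statistics are computed at write time and served from metadata at read time.

writefile("stats.parquet", (x=randn(100),); compute_statistics=true)
ds = Dataset("stats.parquet"; use_statistics=true)
v = Parquet2.load(ds[1], "x")  # a VectorWithStatistics wrapper
minimum(v)                     # returned from the stored statistics, not recomputed
count(ismissing, v)            # likewise served from the stored null count
Parquet2.ndistinct(v)          # may be nothing if the writer did not record distinct counts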
Parquet2.ndistinct — Function

ndistinct(s::ColumnStatistics)

Returns the number of distinct elements in the column. nothing if not available.
ndistinct(v::AbstractVector)

Get the number of distinct elements in v. If v is a VectorWithStatistics, as returned from parquet columns when metadata is available, computation will be elided and the stored value will be used instead.
Parquet2.PageHeader — Type

PageHeader

Abstract type for parquet format page headers.
See the description of pages in the parquet format specification.
Parquet2.DataPageHeader — Type

DataPageHeader <: PageHeader

Header for a page of data. This type stores metadata for either the newer DataHeaderV2 or the legacy DataHeader.
Parquet2.DictionaryPageHeader — Type

DictionaryPageHeader <: PageHeader

Header for pages storing dictionary reference values.
Parquet2.parqtype — Function

parqtype(t::Type; kw...)

Return the parquet type object corresponding to the provided Julia type.
The following keyword arguments should only be provided for context where appropriate:

- `decimal_scale=0`: base 10 scale of a decimal number.
- `decimal_precision=3`: precision of a decimal number.
- `bson=false`: whether serialization of dictionaries should prefer BSON to JSON.
Only one method with the signature ::Type is defined, so as to avoid excessive run-time dispatch.
parqtype(s)

Gets the ParquetType for elements of the object s, e.g. a Column or SchemaNode. See the corresponding section of the parquet format specification.
Parquet2.juliatype — Function

juliatype(col::Column)

Get the element type of the AbstractVector the column is loaded into, ignoring missings. For example, if the eltype is Union{Int,Missing} this will return Int.
See juliamissingtype for the exact type.
Parquet2.juliamissingtype — Function

juliamissingtype(col::Column)

Returns the element type of the AbstractVector that is returned by load(col).
Parquet2.nvalues — Function

nvalues(col::Column)

Returns the number of values in the column (i.e. the number of rows).
Parquet2.iscompressed — Function

iscompressed(col::Column)

Whether the column is compressed.
Parquet2.isdictencoded — Function

isdictencoded(col::Column)

Returns true if all data in the column is dictionary encoded.
This will force the scanning of pages.
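A short introspection sketch on a column from the first row group (the column name is a placeholder):

c = ds[1]["col1"]             # a Column object
Parquet2.juliamissingtype(c)  # e.g. Union{Float64,Missing}
Parquet2.juliatype(c)         # e.g. Float64, with Missing stripped
Parquet2.nvalues(c)           # number of rows in this row group's piece of the column
Parquet2.iscompressed(c)      # whether a compression codec was used
Parquet2.isdictencoded(c)     # whether every page is dictionary encoded (forces page scanning)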
Parquet2.pages — Function

pages(col::Column)

Accesses the pages of the column, loading them if they are not already loaded. See pages!, which is called by this in cases where pages are not already discovered.
Parquet2.pages! — Function

pages!(col::Column)

Infer the binary schema of the column pages and store Page objects that store references to data page locations. This function should typically be called only once, as the objects discovered by this store all needed metadata. Calling this may invoke calls to retrieve data from the source. After calling this, all data for the column is guaranteed to be stored in memory.
Internals
See Internals.