Internals
API
Parquet2.ParquetType — Type
ParquetType
Describes a type specified by the parquet standard metadata.
Parquet2.PageBuffer — Type
PageBuffer
Represents a view into a byte buffer that guarantees the underlying data is a Vector{UInt8}.
Parquet2.PageIterator — Type
PageIterator
Object for iterating through the pages of a column. Executing the iteration is essentially binary schema discovery and may invoke reading from the data source. Normally, once a full iteration has been performed, the Page objects are stored by the Column, making future access cheaper, and this object can be discarded.
Parquet2.PageLoader — Type
PageLoader
Object which wraps a Column and Page for loading data. This is the object from which all parquet data beneath the metadata is ultimately loaded.
Development Notes
We badly want to get rid of this object. The main reason this is not possible is that in the original DataPageHeader the lengths of the repetition and definition levels are not knowable a priori. Consequently, reading from the page is stateful: one needs to know where the data starts, and in the legacy format this is only possible after reading the levels. Since it will presumably never be possible to drop support for DataPageHeader, it will presumably never be possible to eliminate this frustration.
Parquet2.BitUnpackVector — Type
BitUnpackVector{𝒯}
A vector type that unpacks underlying data into values of type 𝒯 when indexed.
Parquet2.PooledVector — Type
PooledVector <: AbstractVector
A simple implementation of a "pooled" (or "dictionary encoded") rank-1 array, providing read-only access. The underlying references and value pool are required to have the form naturally returned when reading from a parquet file.
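The pooled layout described above can be illustrated with a minimal read-only wrapper. This is only a sketch under stated assumptions: the name SketchPooledVector and its fields are hypothetical, the references are assumed to be 0-based indices into the pool (as in parquet dictionary pages), and nothing here is taken from the actual Parquet2 implementation.

```julia
# Hypothetical sketch of a read-only "pooled" vector; not Parquet2's PooledVector.
# Assumption: refs holds 0-based indices into pool.
struct SketchPooledVector{T,R<:AbstractVector{<:Integer},P<:AbstractVector{T}} <: AbstractVector{T}
    refs::R   # one reference per element
    pool::P   # the distinct values
end

# Minimal AbstractVector interface: size and scalar getindex.
Base.size(v::SketchPooledVector) = size(v.refs)
Base.getindex(v::SketchPooledVector, i::Int) = v.pool[v.refs[i] + 1]

# Usage: four elements drawn from a pool of three values.
v = SketchPooledVector(UInt32[0, 2, 1, 0], ["a", "b", "c"])
collect(v)   # materializes the values as an ordinary Vector{String}
```

Because no setindex! method is defined, the wrapper is read-only, matching the access pattern the docstring describes.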
Parquet2.ParqRefVector — Type
ParqRefVector <: AbstractVector
An array wrapper for an AbstractVector which acts as a reference array for the wrapped vector for its dictionary encoding.
Indexing this returns a UInt32 reference, unless the underlying vector is missing at that index, in which case it returns missing.
Parquet2.decompressedpageview — Function
decompressedpageview
Creates a view of the page data, handling decompression appropriately. For DataPageHeaderV2 pages this must be handled carefully, since the levels bytes are not compressed. For the old data page format, this simply decompresses the entire buffer.
Parquet2.bitpack — Function
bitpack(v::AbstractVector, w)
bitpack(io::IO, w)
Pack the first w bits of each value of v into the bytes of a new Vector{UInt8} buffer.
Parquet2.bitpack! — Function
bitpack!(o::AbstractVector{UInt8}, a, v::AbstractVector, w::Integer)
bitpack!(io::IO, v::AbstractVector, w::Integer)
Pack the first w bits of each value of v into bytes in the vector o starting from index a. If the values of v have any non-zero bits beyond the first w they will be truncated.
WARNING the bytes of o to be written to must be initialized to zero or the result may be corrupt.
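To make the packing and the zero-initialization warning concrete, here is a minimal sketch of bit-packing into a fresh buffer. The function name sketch_bitpack is hypothetical, and the bit order (values packed starting from the least significant bit of each byte, as in the Parquet format's hybrid encoding) is an assumption of this sketch rather than a statement about the Parquet2 internals.

```julia
# Hypothetical sketch; not Parquet2's bitpack.
# Assumption: values are packed LSB-first, back to back, with no padding between values.
function sketch_bitpack(v::AbstractVector{<:Integer}, w::Integer)
    out = zeros(UInt8, cld(length(v) * w, 8))  # must start zeroed: bits are only ever OR-ed in
    bit = 0                                    # 0-based position of the next output bit
    for x in v
        for b in 0:(w - 1)
            if (x >>> b) & 1 == 1              # bits of x above the first w are ignored
                out[bit ÷ 8 + 1] |= 0x01 << (bit % 8)
            end
            bit += 1
        end
    end
    out
end

packed = sketch_bitpack(0:7, 3)   # eight 3-bit values occupy exactly 3 bytes
```

Since set bits are written with `|=` and cleared bits are never written at all, stale non-zero bytes in the output would leak into the result, which is the kind of corruption the warning refers to.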
Parquet2.bitmask — Function
bitmask(𝒯, α, β)
bitmask(𝒯, β)
Create a bit mask of type 𝒯 <: Integer where bits α to β (inclusive) are 1 and the rest are 0, where bit 1 is the least significant bit. If only one argument is given it is taken as the end position β.
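The mask described above can be sketched in a few lines. The name sketch_bitmask is hypothetical and this is not necessarily how Parquet2 computes it; it only follows the documented convention that bit 1 is the least significant bit and both ends are inclusive.

```julia
# Hypothetical sketch of the documented behavior; not Parquet2's bitmask.
function sketch_bitmask(::Type{T}, α::Integer, β::Integer) where {T<:Integer}
    nbits = 8 * sizeof(T)
    # ones in the lowest β bits; the full-width case is handled explicitly
    low = β >= nbits ? ~zero(T) : (one(T) << β) - one(T)
    # clear everything below bit α
    low & ~((one(T) << (α - 1)) - one(T))
end

# With a single position argument the mask starts at bit 1.
sketch_bitmask(::Type{T}, β::Integer) where {T<:Integer} = sketch_bitmask(T, 1, β)

sketch_bitmask(UInt8, 3, 5)   # 0x1c, i.e. 0b00011100
```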
Parquet2.bitjustify — Function
bitjustify(k, α, β)
Move bits α through β (inclusive) to the least significant bits of an integer of type k.
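As an illustration of the operation described above, the following sketch shifts the selected bits down and masks off the rest. The name sketch_bitjustify is hypothetical, and it assumes that k is the integer being operated on and that bit 1 is the least significant bit.

```julia
# Hypothetical sketch; not Parquet2's bitjustify.
# Extracts bits α..β (1-based, inclusive) of k and right-justifies them.
function sketch_bitjustify(k::Integer, α::Integer, β::Integer)
    width = β - α + 1
    (k >>> (α - 1)) & ((one(k) << width) - one(k))
end

sketch_bitjustify(0b10110100, 3, 6)   # bits 3..6 are 1101, so the result is 0x0d
```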
Parquet2.bitwidth — Function
bitwidth(n::Integer)
Compute the width in bits needed to encode integer n, truncating leading zeros. For example, 1 has a width of 1, 3 has a width of 2, 8 has a width of 4, et cetera.
The minimum value this returns for positive inputs is 1 for safety reasons.
Parquet2.bytewidth — Function
bytewidth(n::Integer)
Compute the width in bytes needed to encode integer n, truncating leading zeros beyond the nearest byte boundary. For example, anything expressible as a UInt8 has a byte width of 1, anything expressible as a UInt16 has a byte width of 2, et cetera.
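The two width computations documented for bitwidth and bytewidth can be sketched with Base's leading_zeros. The names sketch_bitwidth and sketch_bytewidth are hypothetical, and clamping to a minimum of 1 for a zero input is an assumption modelled on the note about the minimum return value.

```julia
# Hypothetical sketches; not Parquet2's implementations.
# Bits needed for n once leading zeros are dropped (at least 1).
sketch_bitwidth(n::Integer) = max(1, 8 * sizeof(n) - leading_zeros(n))

# The same quantity rounded up to whole bytes (at least 1).
sketch_bytewidth(n::Integer) = max(1, cld(8 * sizeof(n) - leading_zeros(n), 8))

sketch_bitwidth(8)        # 4, since 8 is 0b1000
sketch_bytewidth(0x0100)  # 2, since the value needs 9 bits
```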
Parquet2.readfixed — Function
readfixed(io, 𝒯, N, v=zero(𝒯))
readfixed(w::AbstractVector{UInt8}, 𝒯, N, i=1, v=zero(𝒯))
Read a 𝒯 <: Integer from the first N bytes of io. This is for reading integers which have had their leading zeros truncated.
Parquet2.writefixed — Function
writefixed(io::IO, x::Integer)
Write the integer x using the minimal number of bytes needed to accurately represent x, i.e. by writing bytewidth(x) bytes.
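A round trip through a truncated-width integer, as described for readfixed and writefixed, might look like the sketch below. The helper names are hypothetical, and little-endian byte order is an assumption of this sketch (it is the order the Parquet format uses for fixed-width values), not something confirmed from the Parquet2 source.

```julia
# Hypothetical sketches; not Parquet2's readfixed/writefixed.
# Assumption: bytes are written least significant first.
function sketch_writefixed(io::IO, x::Unsigned)
    nbytes = max(1, cld(8 * sizeof(x) - leading_zeros(x), 8))
    for i in 0:(nbytes - 1)
        write(io, UInt8((x >>> (8i)) & 0xff))
    end
    nbytes   # number of bytes written
end

function sketch_readfixed(io::IO, ::Type{T}, N::Integer) where {T<:Integer}
    v = zero(T)
    for i in 0:(N - 1)
        v |= T(read(io, UInt8)) << (8i)   # rebuild the value byte by byte
    end
    v
end

io = IOBuffer()
n = sketch_writefixed(io, 0x00012345)     # a UInt32 that needs only 3 bytes
seekstart(io)
x = sketch_readfixed(io, UInt32, n)       # recovers the original value
```

The reader has to be told N separately because the truncated width is not stored alongside the bytes.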
Parquet2.HybridIterator — Type
HybridIterator
An iterable object for iterating over the parquet "hybrid encoding" described in the Parquet format specification.
Each item in the collection is an AbstractVector with decoded values.
Parquet2.encodehybrid_bitpacked — Function
encodehybrid_bitpacked(io::IO, v::AbstractVector, w=bitwidth(maximum(v)); write_preface=true, additional_bytes=0)
Bit-pack v and encode it to io such that it can be read with decodehybrid. This encodes all data in v as a single bitpacked run.
If write_preface is true, the Int32 indicating the number of payload bytes will be written, counting additional_bytes additional payload bytes.
WARNING Parquet's horribly confusing encoding format does not appear to support arbitrary combinations of bitpacked encoding with run-length encoding, because the number of bitpacked-values cannot in general be uniquely determined... yeah...
Parquet2.encodehybrid_rle — Function
encodehybrid_rle(io::IO, x::Integer, n::Integer; write_preface=false, additional_bytes=0)
Run-length encode a sequence of n copies of x to io.
If write_preface is true, the Int32 indicating the number of payload bytes will be written, counting additional_bytes additional payload bytes.
WARNING This cannot be combined arbitrarily with encodehybrid_bitpacked, see that function's documentation.
encodehybrid_rle(io::IO, v::AbstractVector{<:Integer})
Write the vector v to io using the parquet run-length encoding.
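For orientation, each run in the hybrid encoding starts with a varint header whose lowest bit selects the run type. The sketch below interprets an already-decoded header according to the Parquet format specification; the function name sketch_hybrid_header is hypothetical and is not part of Parquet2.

```julia
# Hypothetical sketch based on the Parquet format's RLE/bit-packing hybrid layout.
# Takes a decoded varint header and returns the run kind and its number of values.
function sketch_hybrid_header(h::Unsigned)
    if isodd(h)
        # bit-packed run: the remaining bits count groups of 8 values
        (:bitpacked, Int(h >>> 1) * 8)
    else
        # RLE run: the remaining bits are the repeat count
        (:rle, Int(h >>> 1))
    end
end

sketch_hybrid_header(0x03)        # (:bitpacked, 8)
sketch_hybrid_header(UInt32(20))  # (:rle, 10)
```

Note that a bit-packed header can only express a multiple of 8 values, whereas an RLE header carries an exact count.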
Parquet2.maxdeflevel — Function
maxdeflevel(r::SchemaNode, p)
Compute the maximum definition level for the node at path p from the root node r.
Parquet2.maxreplevel — Function
maxreplevel(r::SchemaNode, p)
Compute the maximum repetition level for the node at path p from the root node r.
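In the Parquet data model, the maximum definition level of a column is the number of non-required (optional or repeated) fields along its path, and the maximum repetition level is the number of repeated fields along its path. The sketch below computes both from a plain vector of repetition types; this representation and the function names are hypothetical and stand in for walking SchemaNode objects.

```julia
# Hypothetical sketch: the path is given as the repetition type of each field
# from the root's child down to the leaf, using :required, :optional or :repeated.
sketch_maxdeflevel(path) = count(r -> r != :required, path)
sketch_maxreplevel(path) = count(==(:repeated), path)

# Example path: an optional group containing a repeated group with a required leaf.
path = [:optional, :repeated, :required]
sketch_maxdeflevel(path)   # 2
sketch_maxreplevel(path)   # 1
```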
Parquet2.leb128encode — Function
leb128encode(n::Unsigned)
leb128encode(io::IO, n::Unsigned)
Encode the integer n as a byte array according to the LEB128 encoding scheme.
Parquet2.leb128decode — Function
leb128decode(𝒯, v, k)
Decode v (from index k) to an integer of type 𝒯 <: Unsigned according to the LEB128 encoding scheme.
Returns o, j where o is the decoded value and j is the index of v after reading (i.e. the encoded bytes are contained in data from k to j-1 inclusive).
leb128decode(𝒯, io)
Decode the next bytes read from io to an integer of type 𝒯 <: Unsigned according to the LEB128 encoding scheme.
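The unsigned LEB128 scheme used by the encode and decode functions above stores 7 bits per byte, least significant group first, with the high bit of each byte flagging that more bytes follow. The following is a minimal sketch with hypothetical names; it mirrors the documented (value, next index) return convention but is not the Parquet2 implementation.

```julia
# Hypothetical sketches of unsigned LEB128; not Parquet2's functions.
function sketch_leb128encode(n::Unsigned)
    out = UInt8[]
    while true
        byte = UInt8(n & 0x7f)        # take the low 7 bits
        n >>>= 7
        if n == 0
            push!(out, byte)          # last byte: continuation bit clear
            return out
        end
        push!(out, byte | 0x80)       # more bytes follow
    end
end

function sketch_leb128decode(::Type{T}, v::AbstractVector{UInt8}, k::Integer=1) where {T<:Unsigned}
    o = zero(T)
    shift = 0
    j = k
    while true
        byte = v[j]
        j += 1
        o |= T(byte & 0x7f) << shift
        (byte & 0x80) == 0 && return o, j   # j is the index just past the encoded bytes
        shift += 7
    end
end

bytes = sketch_leb128encode(UInt32(624485))   # UInt8[0xe5, 0x8e, 0x26]
value, next = sketch_leb128decode(UInt32, bytes, 1)
```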
Parquet2.OptionSet — Type
OptionSet
Abstract type for storing options for reading or writing parquet data.
See ReadOptions and WriteOptions.
Parquet2.ReadOptions — Type
ReadOptions <: OptionSet
A struct containing all options relevant for reading parquet files. Specific options are documented in Dataset.
Parquet2.WriteOptions — Type
WriteOptions <: OptionSet
A struct containing all options relevant for writing parquet files. Specific options are documented in FileWriter.