BGZF

Blocked GNU Zip Format (BGZF) is a variant of gzip file format that uses block compression, a method that compresses data in independent blocks of content—each of which is a valid gzip file. This design is utilized widely in bioinformatics for genomic data compression. The block-based design provides efficient storage, random access with indexed queries,^[1]^[2] and parallel processing; allowing large-scale data processing.^[3]

Filename extension

.gz

Internet mediatype

application/gzip

Magic number`\x1f\x8b\x08\x04` (initial bytes, standard Gzip magic. BGZF adds specific extra fields in the header of each block.)

DevelopedbySAMtools project / HTSlib

Quick facts Filename extension, Internet media type ...

BGZF (Blocked GNU Zip Format)
Filename extension	.gz
Internet media type	application/gzip
Magic number	`\x1f\x8b\x08\x04` (initial bytes, standard Gzip magic. BGZF adds specific extra fields in the header of each block.)
Developed by	SAMtools project / HTSlib
Initial release	c. 2009 (along with SAM/BAM format specification)
Type of format	Compressed file format, Indexed file format
Container for	Commonly used for bioinformatics data like SAM, BAM, VCF records
Extended from	Gzip
Standard	https://samtools.github.io/hts-specs/SAMv1.pdf#page=13.12
Website	www.htslib.org/doc/bgzip.html

Close

The format was developed as part of SAM/BAM specification and SAMtools.^[4] It is a core component of the common BAM format (the binary version of the Sequence Alignment Map format) and is also used to compress and index Variant Call Format (VCF), FASTA, and BED files.^[5] Because each block is a standard gzip block, a BGZF file can be decompressed by any standard gzip-compatible tool, ensuring backward compatibility.^[6] A general purpose compression utility for producing BGZF files bgzip is distributed with HTSlib software library.^[5]

Uses

BGZF is widely utilized in bioinformatics for the compression of large datasets where efficient random access is a crucial requirement. Due to large sizes of next-generation sequencing data formats like SAM files,^[7] they are compressed into binary BAM format utilizing BGZF compression.^[3]^[8]

For random access, an index file is created for a BGZF-compressed file, typically using Tabix.^[9] This index stores the file offsets of the compressed blocks alongside the corresponding genomic coordinates, thus allowing a program to seek directly to the block containing the data queried, decompress only them, and retrieve the requested information, avoiding the need to process the entire file.^[9]

The format is also extensively employed for compressing variant call files (VCF) along with their associated Tabix indexes,^[9] and similarly for other substantial genomic data files such as BED, GFF/GTF, and occasionally FASTQ when indexed access is necessary.^[5] A broad range of bioinformatics software packages are equipped to read and write BGZF-compressed files; these include well-known tools like SAMtools, HTSlib, BCF/VCFtools,^[10] Picard tools, the GATK, and libraries such as Biopython.^[11]^[12] The standard command-line utility for creating BGZF-compressed files and their corresponding .gzi indexes is bgzip, which is distributed as part of HTSlib.^[6]

BGZF has been adapted for development of more efficient data-specific compression methods and algorithms leveraging its block based design.^[13]

Design schema

A BGZF file consists of a series of concatenated BGZF blocks. Each block, whether in its compressed or uncompressed state, is limited to a maximum size of 64 kilobytes. Each BGZF block is itself a fully compliant gzip archive, adhering to the specifications outlined in RFC 1952.^[14]

Each BGZF block contains a standard gzip file header with the following standard-compliant extensions:

The F.EXTRA bit in the header is set to indicate that extra fields are present.
The extra field used by BGZF uses the two subfield ID values 66 and 67 (ASCII 'BC').
The length of the BGZF extra field payload (field LEN in the gzip specification) is 2 (two bytes of payload).
The payload of the BGZF extra field is a 16-bit unsigned integer in little-endian format. This integer gives the size of the containing BGZF block minus one.

This block design allows use of an associated index file (storing offsets of each BGZF block) to fetch and decompress only the block of data that pertains to the query, thus avoiding the computational overhead of reading and decompressing all BGZF blocks.^[9]

Random access

EOF marker

End-of-file marker for BGZF enables detection of erroneously truncated files and generate warnings or errors for the user. The EOF marker block is an empty (data block of length zero) BGZF block encoded with the default zlib compression level settings, and consists of the following 28 hexadecimal bytes: 1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00 03 00 00 00 00 00 00 00 00 00 The presence of an EOF marker by itself does not signal an end of the file, however, an EOF marker present at the end of a BGZF file indicates that the immediately following physical EOF is the end of the file as intended by the program that wrote it.^[14]

Uses

Design schema

Random access

EOF marker

See also

References

Related Articles

Related Articles

"7bgzf: Replacing samtools bgzip deflation for archiving and real-time compression"

"Twelve years of SAMtools and BCFtools"

"The Sequence Alignment/Map format and SAMtools"

"HTSlib: C library for reading/writing high-throughput sequencing data"

"Tabix: fast retrieval of sequence features from generic TAB-delimited files"

"Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework"