
MZPAQ: a FASTQ data compression tool | Source Code for Biology and Medicine


In this section, we present the compression results for the different streams using state-of-the-art and general-purpose tools. We then compare the performance of our approach against the other tools. Performance is reported in terms of compression ratio, compression speed and memory usage. We also evaluate the ability of each tool to correctly compress the benchmark datasets.

Compression of FASTQ streams

Compression of identifiers and sequences

Read identifiers are typically platform specific. In many cases, read identifiers contain instrument-specific information in addition to their unique information, which makes identifiers more compressible than sequences and quality scores. FASTQ sequences are strings over the alphabet A, C, T and G, with an occasional N for unknown bases. To select the best technique for these two streams, we used general-purpose and FASTQ compression tools to compress the identifier and sequence streams. Moreover, we applied FASTA tools, namely Deliminate and MFCompress, to these streams. Since FASTA compression tools do not output individual compressed streams, we considered the compression ratios for the identifier and sequence fields collectively. Table 3 shows a comparison of identifier and sequence compression on the benchmark datasets.

Table 3 Compression of identifiers and sequences: Blue color represents original file size
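For illustration, the sketch below shows one way a FASTQ file can be split into separate identifier, sequence and quality-score streams before each stream is handed to a stream-specific compressor. The function and file names are placeholders, not part of the tools evaluated here.

    # Minimal sketch: split a FASTQ file into identifier, sequence and
    # quality-score streams so each can be compressed independently.
    # File names below are illustrative placeholders.
    def split_fastq(fastq_path, id_path, seq_path, qual_path):
        with open(fastq_path) as fq, \
             open(id_path, "w") as ids, \
             open(seq_path, "w") as seqs, \
             open(qual_path, "w") as quals:
            while True:
                header = fq.readline()
                if not header:
                    break                  # end of file
                sequence = fq.readline()
                fq.readline()              # '+' separator line, discarded
                quality = fq.readline()
                ids.write(header)
                seqs.write(sequence)
                quals.write(quality)

    split_fastq("reads.fastq", "reads.ids", "reads.seq", "reads.qual")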

From the results, we observe that the compression ratios for the identifier and sequence streams are highly variable (from 4:1 to 16:1). Gzip, bzip2, LZMA and Slimfastq did not give the best or second-best result for any dataset. Leon and SCALCE each performed best on two of the datasets. Deliminate gave the best compression ratio for one dataset, and LFQC gave the second-best ratio for one dataset. Most importantly, we notice that MFCompress has the best ratio for the first dataset and the second-best ratio for all other benchmark datasets.

Gzip, bzip2, LZMA, Leon, Deliminate and MFCompress are able to compress all the datasets, whereas SCALCE and Slimfastq did not work on the PacBio dataset and LFQC did not produce results in two cases. Since the main goal of our study is to develop a compression scheme that works and performs best for all data types, and based on the above findings, we select MFCompress, as it works for all datasets while producing the best or second-best compression ratios.

Compression of quality scores

Quality scores are ASCII characters with a larger alphabet size than read sequences, which makes them more difficult to compress. Each quality score has a strong correlation with a number of preceding quality scores, and this correlation decreases as the distance between two quality scores increases. Furthermore, the rate at which the correlation changes varies randomly from one FASTQ file to another [9]. These characteristics make it challenging to code quality scores efficiently across all datasets. Therefore, the compression ratios for quality-score streams are lower than those of the read identifiers and sequences. Table 4 shows the performance comparison of the different algorithms on quality scores. The compression ratios for quality scores are between 2:1 and 4:1. Slimfastq gives the second-best ratio for all datasets except the PacBio dataset, on which it does not work. The results clearly indicate that LFQC is the most suitable candidate for compressing quality scores, as it gives the best compression ratios for all datasets.

Table 4 Compression of Quality Scores: Blue color represents original file size
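To make the lag dependence described above concrete, the sketch below estimates the Pearson correlation between each quality score and the score a given number of positions earlier, for increasing lags. It assumes Phred+33 encoding and a plain text file with one quality string per line; both the encoding and the file format are assumptions for illustration only.

    # Rough sketch: how strongly does each quality score correlate with the
    # score `lag` positions before it, for lag = 1..max_lag?
    import math

    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = math.sqrt(sum((a - mx) ** 2 for a in x))
        sy = math.sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    def lag_correlation(qual_path, max_lag=10):
        reads = [line.strip() for line in open(qual_path) if line.strip()]
        for lag in range(1, max_lag + 1):
            xs, ys = [], []
            for q in reads:
                vals = [ord(c) - 33 for c in q]   # Phred+33 decoding (assumed)
                xs.extend(vals[:-lag])
                ys.extend(vals[lag:])
            print(lag, round(pearson(xs, ys), 3))

    lag_correlation("reads.qual")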

MZPAQ compression performance

In this section, we compare the performance of MZPAQ against several state-of-the-art FASTQ compression tools as well as general-purpose compression tools. The methods are compared based on compression ratio, compression speed and memory usage during compression. The comparison also includes the ability of each tool to produce an exact replica of the original file after decompression.
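One simple way to check this last property is to compare a checksum of the original file with a checksum of the decompressed output, as in the sketch below; the file paths are placeholders.

    # Sketch: verify that decompression reproduces the original file exactly
    # by comparing SHA-256 digests. Paths are placeholders.
    import hashlib

    def sha256(path, chunk_size=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    original = sha256("dataset.fastq")
    restored = sha256("dataset.decompressed.fastq")
    print("lossless" if original == restored else "MISMATCH")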

Compression ratio

The ratio between the sizes of the original and compressed files is calculated for each dataset using all the compression tools. Table 5 shows the performance of MZPAQ relative to the other evaluated tools in terms of compression ratio. The results clearly indicate that MZPAQ achieves the highest compression ratios compared to all the other tools for all datasets. LFQC achieves the second-best compression ratios for the smaller file sizes; however, it does not work for the larger datasets. All domain-specific tools performed better than general-purpose tools, except for LZMA, which did not work on the PacBio data.

Table 5 Compression ratios of evaluated tools
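The ratio reported in Table 5 is straightforward to compute, as the sketch below illustrates; the file names and the .mzpaq extension are placeholders.

    # Sketch of the compression-ratio calculation:
    # ratio = original size / compressed size.
    import os

    def compression_ratio(original_path, compressed_path):
        return os.path.getsize(original_path) / os.path.getsize(compressed_path)

    print(round(compression_ratio("dataset.fastq", "dataset.mzpaq"), 2))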

Compression speed

Compression speed is the number of MB compressed per second; decompression speed is computed similarly. To allow a direct comparison, we ran all the tools in single-threaded mode, as some of them do not support multi-threading. Table 6 shows the compression speed of the compared algorithms in MB/s. Slimfastq is the fastest tool and provides the maximum compression speed in all cases except for the PacBio data, which it does not support. LFQC is the slowest for all the datasets it supports. For decompression speed, the results shown in Table 7 indicate that gzip outperforms all the evaluated tools, decompressing at over 45 MB per second for all datasets. We further notice that general-purpose tools decompress faster than they compress, particularly LZMA. While faster compression/decompression is favorable, the speed may be achieved at the cost of the compression ratio.

Table 6 Compression Speed of evaluated tools
Table 7 Decompression speed of evaluated tools
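Both speed figures can be obtained in the same way: time a single-threaded run of the tool and divide the input size in MB by the elapsed time, as in the sketch below. The gzip invocation is only a stand-in for the tools evaluated here.

    # Sketch: measure compression speed in MB/s by timing a compressor run.
    # The gzip command is only an example; any single-threaded tool works.
    import os, subprocess, time

    def compression_speed(input_path, command):
        size_mb = os.path.getsize(input_path) / (1024 * 1024)
        start = time.time()
        subprocess.run(command, check=True)
        return size_mb / (time.time() - start)

    speed = compression_speed("dataset.fastq", ["gzip", "-k", "dataset.fastq"])
    print(f"{speed:.2f} MB/s")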

Memory usage

Memory usage refers to the maximum number of memory bytes required by an algorithm during compression or decompression; it represents the minimum memory that should be available for successful execution of a program. In general, memory usage varies with the type of dataset. Tables 8 and 9 show the maximum memory requirements for compression and decompression, respectively. The results show that LZMA requires ten times more memory for compression than for decompression. Leon uses almost twice as much memory for compression as for decompression. In all cases, gzip requires the least amount of memory.

Table 8 Compression memory usage of evaluated tools
Table 9 Decompression memory usage of evaluated tools
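Peak memory figures such as these can be collected by reading the maximum resident set size reported by GNU time, as sketched below; this assumes a Linux system with /usr/bin/time available, and the gzip command is again only an example.

    # Sketch: record the peak resident memory (in kB) of a compression run
    # using GNU time's verbose output. Assumes /usr/bin/time is GNU time.
    import subprocess

    def peak_memory_kb(command):
        result = subprocess.run(["/usr/bin/time", "-v"] + command,
                                capture_output=True, text=True, check=True)
        for line in result.stderr.splitlines():
            if "Maximum resident set size" in line:
                return int(line.split(":")[1])
        return None

    print(peak_memory_kb(["gzip", "-k", "dataset.fastq"]))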


