Cost Efficiency @ Scale in Big Data File Format

2022-01-25 UberBlog
Figure 1: Apache Parquet File Format Structure
Figure 2: Space savings when translating to ZSTD
Query
Q1    546,077    652,543    562,616
Q2    870,472    639,184    213,914
Q3    240,781    353,926    191,614
Q4    132,490    271,814     93,082
Q5    337,208    380,638    109,012
Figure 3: Query Performance comparison among ZSTD/SNAPPY/GZIP
Figure 4: Size reduction (in percent) vs. compression level after translating the compression from GZIP to ZSTD
Figure 5: Write time vs. compression level after translating the compression from GZIP to ZSTD
Figure 6: Read time vs. compression level after translating the compression from GZIP to ZSTD
Figure 7: Using Column Pruning Tool to translate tables
Figure 8: Size reduction comparison for different column sort orders