Adaptive Single-Pass Compression of Unbounded Integers
Date
2017-12
Authors
Hyatt, Christopher
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Due to webpage creation, data gathering and storage, and the increasing data needed for machine learning, the amount of data in the world is rapidly growing. Organizations such as Google, Wikipedia, and the National Security Agency (NSA) use techniques to convert all of this data into searchable indexes of references. These indexes are growing as quickly as the data used to create them, and are an equal subject for compression.
This research explores the ability to dynamically compress data in one pass. This, effectively improves latency and throughput. To achieve dynamic compression in one pass, the combination of Tunstall and two other compression algorithms, Elias-Delta code and a variation of Group VarInt, are used. The two resulting pairs are Variable length nibbles with Tunstall (VLNT) and Delta-Tunstall (δ-T). This is the first known attempt at compressing data using these algorithm pairs.
These compression algorithms are applied to a number of different datasets. A synthetic dataset and a Wikipedia dataset are used to test the algorithms’ ability to compress integers. The synthetic dataset is created using several probability distribution functions (PDF), such as geometric and Poisson. Meanwhile, the Wikipedia dataset is acquired from actual inverted indexes for a number of different Wikipedia search terms. VLNT and d-T are also used to compress the members of the Silesia benchmarks dataset.
VLNT and δ-T show promise as good platforms for data compression and are recommended for future research focusing on single-pass compression of unbounded datasets.
Description
Keywords
single-pass compression, unbounded integers, compression
Citation
Hyatt, C. R. (2017). Adaptive single-pass compression of unbounded integers (Unpublished thesis). Texas State University, San Marcos, Texas.