Reducing Training Cost and Improving Inference Speed Through Neural Network Compression
As AI models have become integral to everyday software applications, the need to run these computationally intensive models on mobile and edge devices has grown. In response, a new research area, neural network compression, has emerged, and techniques such as quantization, pruning, and model distillation have become standard. These methods, however, have several drawbacks: many require specialized hardware for inference, reduce robustness to adversarial examples, amplify existing model biases, and demand significant retraining in a time-consuming, iterative process. This dissertation examines several shortcomings of model compression, shows how to address them, and ultimately provides a simple, repeatable recipe for producing high-quality neural network models for inference. It shows that model pruning is not a true compression process: pruning changes a model's internal representations until they differ from the original's as much as those of a newly trained, randomly initialized model. It explores the unwanted side effects pruning can cause and how knowledge distillation can mitigate them. It demonstrates how compression with higher fidelity to the original model can be achieved while decomposing the process into a highly efficient, parallelizable one that replaces sections of the model block-wise. Finally, it examines how knowledge distillation applied during training improves training efficiency, amortizes the cost of hyper-parameter searches, and can deliver state-of-the-art compression results.
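Knowledge distillation, the technique the abstract returns to repeatedly, is commonly implemented as a loss that blends a softened teacher-student divergence with the ordinary hard-label loss. The sketch below is a minimal NumPy illustration of that standard formulation, not code from the dissertation itself; the temperature `T` and mixing weight `alpha` are conventional hyper-parameters, and all function names here are illustrative.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution.
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend KL(teacher || student) on temperature-softened logits with
    cross-entropy against the hard labels (the standard KD recipe)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    # KL divergence between softened distributions, scaled by T^2 so the
    # gradient magnitude stays comparable across temperatures.
    kd = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean() * T**2
    # Ordinary cross-entropy on the true labels (temperature 1).
    p_hard = softmax(student_logits, 1.0)
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * kd + (1 - alpha) * ce
```

When the student's logits match the teacher's exactly, the KL term vanishes and only the hard-label loss remains; as the student drifts from the teacher, the loss grows, which is the pressure the dissertation exploits both to repair pruned models and to guide block-wise replacement.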
machine learning, deep learning, neural network, compression, pruning, distillation, knowledge distillation
Blakeney, C. (2023). Reducing training cost and improving inference speed through neural network compression (Unpublished dissertation). Texas State University, San Marcos, Texas.