Performance
In this documentation, we present evaluation results of applying various model compression methods, including channel pruning, weight sparsification, and uniform quantization, to ResNet and MobileNet models on the ImageNet classification task.
We adopt ChannelPrunedLearner to shrink the number of channels in convolutional layers and thereby reduce computational complexity.
Instead of using the same pruning ratio for all layers, we use the DDPG algorithm as the RL agent to iteratively search for the optimal pruning ratio of each layer.
After the optimal pruning ratios have been obtained, group fine-tuning is applied to further improve the compressed model's top-1 accuracy, as shown below:
Model | Pruning Ratio | Uniform | RL-based | RL-based + Group Fine-tuning |
---|---|---|---|---|
MobileNet-v1 | 50% | 66.5% | 67.8% (+1.3%) | 67.9% (+1.4%) |
MobileNet-v1 | 60% | 66.2% | 66.9% (+0.7%) | 67.0% (+0.8%) |
MobileNet-v1 | 70% | 64.4% | 64.5% (+0.1%) | 64.8% (+0.4%) |
MobileNet-v1 | 80% | 61.4% | 61.4% (+0.0%) | 62.2% (+0.8%) |
Note: The original uncompressed MobileNet-v1's top-1 accuracy is 70.89%.
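To make the per-layer setup concrete, here is a minimal NumPy sketch of how one convolutional layer could be shrunk once a pruning ratio has been chosen for it (e.g. by the DDPG agent). The function name, the `layer_ratios` dictionary, and the filter-norm saliency criterion are purely illustrative; ChannelPrunedLearner's actual channel-selection procedure differs in detail.

```python
import numpy as np

def prune_conv_channels(weights, pruning_ratio):
    """Keep the output channels of a conv kernel with the largest L1 norm.

    weights: conv kernel of shape [kh, kw, c_in, c_out]
    pruning_ratio: fraction of output channels to remove (e.g. 0.5)
    """
    c_out = weights.shape[-1]
    n_keep = max(1, int(round(c_out * (1.0 - pruning_ratio))))
    # Per-output-channel L1 norm as a simple saliency score.
    scores = np.abs(weights).reshape(-1, c_out).sum(axis=0)
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return weights[..., keep], keep

# Hypothetical per-layer ratios, e.g. as returned by the RL search.
layer_ratios = {"conv_1": 0.3, "conv_2": 0.6}
kernel = np.random.randn(3, 3, 64, 128)
pruned, kept = prune_conv_channels(kernel, layer_ratios["conv_2"])
print(pruned.shape)  # (3, 3, 64, 51)
```

In the full pipeline, the input channels of the following layer are pruned to match the kept indices, and the whole network is then fine-tuned (the group fine-tuning step in the table above).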
We adopt WeightSparseLearner to introduce a sparsity constraint so that a large portion of model weights can be removed, which leads to a smaller model and fewer FLOPs at inference time.
Compared with the original algorithm proposed in (Zhu & Gupta, 2017), we also incorporate network distillation and reinforcement learning to further improve the compressed model's top-1 accuracy, as shown in the table below:
Model | Sparsity | (Zhu & Gupta, 2017) | RL-based |
---|---|---|---|
MobileNet-v1 | 50% | 69.5% | 70.5% (+1.0%) |
MobileNet-v1 | 75% | 67.7% | 68.5% (+0.8%) |
MobileNet-v1 | 90% | 61.8% | 63.4% (+1.6%) |
MobileNet-v1 | 95% | 53.6% | 56.8% (+3.2%) |
Note: The original uncompressed MobileNet-v1's top-1 accuracy is 70.89%.
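The baseline in the table follows the gradual pruning schedule of Zhu & Gupta (2017), which raises the sparsity from an initial value to a final value over a number of pruning steps and, at each step, masks out the smallest-magnitude weights. Below is a minimal NumPy sketch of that idea; the function and argument names are ours, not WeightSparseLearner's.

```python
import numpy as np

def current_sparsity(step, s_init, s_final, begin_step, n_steps, freq):
    """Polynomial schedule from Zhu & Gupta (2017):
    s_t = s_f + (s_i - s_f) * (1 - (t - t0) / (n * dt))^3
    """
    progress = np.clip((step - begin_step) / float(n_steps * freq), 0.0, 1.0)
    return s_final + (s_init - s_final) * (1.0 - progress) ** 3

def magnitude_mask(weights, sparsity):
    """Zero out the smallest-magnitude fraction of the weights."""
    k = int(np.floor(weights.size * sparsity))
    if k == 0:
        return np.ones_like(weights)
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

weights = np.random.randn(512, 512)
s_t = current_sparsity(step=4000, s_init=0.0, s_final=0.75,
                       begin_step=0, n_steps=100, freq=100)
pruned_weights = weights * magnitude_mask(weights, s_t)
print(s_t)  # current sparsity target at step 4000
```

The schedule only controls how quickly the overall sparsity target is reached; the gains in the RL-based column come from the additional distillation and reinforcement learning components mentioned above.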
We adopt UniformQuantTFLearner to uniformly quantize model weights from 32-bit floating-point numbers to 8-bit fixed-point numbers.
The resulting model can be converted into the TensorFlow Lite format for deployment on mobile devices.
In the following two tables, we show that 8-bit quantized models can be as accurate as (or even slightly more accurate than) the original 32-bit ones, and that inference time can be significantly reduced after quantization.
Model | Top-1 Acc. (32-bit) | Top-5 Acc. (32-bit) | Top-1 Acc. (8-bit) | Top-5 Acc. (8-bit) |
---|---|---|---|---|
ResNet-18 | 70.28% | 89.38% | 70.31% (+0.03%) | 89.40% (+0.02%) |
ResNet-50 | 75.97% | 92.88% | 76.01% (+0.04%) | 92.87% (-0.01%) |
MobileNet-v1 | 70.89% | 89.56% | 71.29% (+0.40%) | 89.79% (+0.23%) |
MobileNet-v2 | 71.84% | 90.60% | 72.26% (+0.42%) | 90.77% (+0.17%) |
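The near-identical accuracies above are consistent with how little information the 8-bit round trip discards: with a per-tensor scale and zero-point, each weight is reconstructed to within roughly one quantization step. Below is a minimal NumPy sketch of asymmetric uniform quantization; the exact scheme inside UniformQuantTFLearner (e.g. rounding mode or per-channel scales) may differ.

```python
import numpy as np

def quantize_uint8(x):
    """Map a float tensor to 8-bit fixed-point values with a scale and zero-point."""
    x_min, x_max = float(x.min()), float(x.max())
    span = x_max - x_min
    scale = span / 255.0 if span > 0 else 1.0
    zero_point = int(np.clip(round(-x_min / scale), 0, 255))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(3, 3, 32, 64).astype(np.float32)
q, scale, zp = quantize_uint8(w)
w_hat = dequantize(q, scale, zp)
print(np.abs(w - w_hat).max(), scale)  # reconstruction error vs. one quantization step
```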
Model | Hardware | CPU | Time (32-bit) | Time (8-bit) | Speed-up |
---|---|---|---|---|---|
MobileNet-v1 | XiaoMi 8 SE | Snapdragon 710 | 156.33 | 62.60 | 2.50× |
MobileNet-v1 | XiaoMi 8 | Snapdragon 845 | 124.53 | 56.12 | 2.22× |
MobileNet-v1 | Huawei P20 | Kirin 970 | 152.54 | 68.43 | 2.23× |
MobileNet-v2 | XiaoMi 8 SE | Snapdragon 710 | 153.18 | 57.55 | 2.66× |
MobileNet-v2 | XiaoMi 8 | Snapdragon 845 | 120.59 | 49.04 | 2.46× |
MobileNet-v2 | Huawei P20 | Kirin 970 | 226.61 | 61.38 | 3.69× |
Note: All reported inference times are in milliseconds.
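For on-device deployment, the 8-bit model is exported in the TensorFlow Lite format, as mentioned above. As a point of reference, the snippet below shows the standard TensorFlow Lite post-training quantization path; it is a generic sketch with placeholder paths, not the exact export procedure used by UniformQuantTFLearner.

```python
import tensorflow as tf

# Convert a trained model to TensorFlow Lite with 8-bit weight quantization.
# "saved_model_dir" and "mobilenet_v1_quant.tflite" are placeholder paths.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # quantize weights to 8 bits
tflite_model = converter.convert()

with open("mobilenet_v1_quant.tflite", "wb") as f:
    f.write(tflite_model)
```

The resulting .tflite file can then be benchmarked on-device, which is how latency numbers such as those in the table above are typically collected.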