Non-Uniform Quantization Learner

Non-uniform quantization is a generalization of uniform quantization. In non-uniform quantization, the quantization points are not evenly spaced and can be optimized via back-propagation of the network gradients. Consequently, with the same number of bits, non-uniform quantization is more expressive in approximating the original full-precision network than uniform quantization. Nevertheless, a non-uniformly quantized model cannot be accelerated directly with current deep learning frameworks, since low-precision multiplication requires the intervals between quantization points to be equal. Therefore, NonUniformQuantLearner helps compress the model more effectively, but does not speed up inference.

Algorithm

NonUniformQuantLearner adopts a training and evaluation procedure similar to that of UniformQuantLearner. In the forward pass, the quantized weights are used, while in the backward pass, the full-precision weights are updated via the straight-through estimator (STE). The major difference from uniform quantization is that the quantization points are not evenly spaced; their locations can be initialized in different ways and optimized during training. In the following, we introduce the schemes for updating and initializing the quantization points.
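
To make the forward and backward passes concrete, below is a minimal TensorFlow-style sketch of non-uniform quantization with the STE; the helper names (quantize_nonuniform, ste_quantize) are illustrative and not PocketFlow's actual API.

# STE sketch: forward with the quantized weights, backward as if quantization were identity
import tensorflow as tf

def quantize_nonuniform(w, quant_points):
    # assign each weight to its nearest quantization point (illustrative helper)
    dist = tf.abs(tf.expand_dims(w, -1) - quant_points)  # |w_ij - c_k| for every point
    idx = tf.argmin(dist, axis=-1)                       # I_ij: index of the nearest point
    return tf.gather(quant_points, idx)

def ste_quantize(w, quant_points):
    w_q = quantize_nonuniform(w, quant_points)
    # forward pass uses w_q; the gradient flows straight through to the full-precision w
    return w + tf.stop_gradient(w_q - w)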

Optimization of the quantization points

Unlike uniform quantization, non-uniform quantization can optimize the locations of the quantization points dynamically during training, which leads to less quantization loss. The location of a quantization point can be updated by summing the gradients of the weights assigned to that point (Han et al., 2015), i.e.: $$ \frac{\partial \mathcal{L}}{\partial c_k} = \sum_{i,j}\frac{\partial\mathcal{L}}{\partial w_{ij}}\frac{\partial{w_{ij}}}{\partial c_k}=\sum_{i,j}\frac{\partial\mathcal{L}}{\partial{w_{ij}}}\mathbb{1}(I_{ij}=k) $$ where $c_k$ is the $k$-th quantization point and $I_{ij}$ is the index of the quantization point that weight $w_{ij}$ is assigned to. The following figure, taken from Han et al. (2015), shows the above process of updating the clusters:

[Figure: weight clustering and centroid update in Deep Compression (Han et al., 2015)]
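
In NumPy terms, the gradient accumulation above amounts to scatter-adding the weight gradients by cluster index; the snippet below is a sketch of the formula, not PocketFlow's implementation.

# dL/dc_k = sum of dL/dw_ij over all weights assigned to quantization point k
import numpy as np

def cluster_gradients(grad_w, assign_idx, num_clusters):
    # grad_w:     gradients dL/dw_ij, arbitrary shape
    # assign_idx: I_ij, index of the quantization point each weight is assigned to
    grad_c = np.zeros(num_clusters)
    np.add.at(grad_c, assign_idx.ravel(), grad_w.ravel())
    return grad_c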

Initialization of quantization points

Aside from optimizing the quantization points, another helpful strategy is to properly initialize them according to the distribution of the weights. PocketFlow currently supports two kinds of initialization (a minimal sketch of both is given after the list):

  • Uniform initialization: the quantization points are initialized to be evenly distributed over the range [w_{min}, w_{max}] of that layer/bucket.
  • Quantile initialization: the quantization points are initialized to the quantiles of the full-precision weights. Compared with uniform initialization, quantile initialization generally leads to better performance.
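
The two strategies can be sketched as follows (a NumPy illustration assuming 2^b quantization points per layer/bucket; PocketFlow's internal routines may differ):

# initialize the quantization points from the full-precision weights w
import numpy as np

def init_quant_points(w, num_bits=4, style='quantile'):
    k = 2 ** num_bits
    if style == 'uniform':
        # evenly spaced points over [w_min, w_max]
        return np.linspace(w.min(), w.max(), k)
    # quantile: points follow the empirical distribution of the weights
    return np.quantile(w, np.linspace(0.0, 1.0, k))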

Hyper-parameters

To configure NonUniformQuantLearner, users can pass the options via the TensorFlow flag interface. The available options are as follows:

Options Description
nuql_opt_mode the fine-tuning mode: [weights, clusters, both]. Default: weights.
nuql_init_style the initialization of quantization point: [quantile, uniform]. Default: quantile.
nuql_weight_bits the number of bits for weight. Default: 4.
nuql_activation_bits the number of bits for activation. Default: 32.
nuql_save_quant_model_path the path to save the quantized model. Default: ./nuql_quant_models/model.ckpt
nuql_use_buckets the switch to enable bucketing for quantization. Default: False.
nuql_bucket_type the type of bucketing, two types available: ['split', 'channel']. Default: channel.
nuql_bucket_size the size of each bucket for bucket type 'split'. Default: 256.
nuql_enbl_rl_agent the switch to enable RL to learn optimal bit strategy. Default: False.
nuql_quantize_all_layers the switch to quantize first and last layers of network. Default: False.
nuql_quant_epoch the number of epochs for fine-tuning. Default: 60.

Here, we provide detailed descriptions (and some analysis) for some of the above hyper-parameters:

  • nuql_opt_mode: the mode for fine-tuning the non-uniformly quantized network, chosen among [weights, clusters, both]. weights refers to updating only the network weights, clusters refers to updating only the quantization points, and both means updating the weights and the quantization points simultaneously. Experimentally, we found that weights and both achieve similar performance, and both of them outperform clusters.
  • nuql_init_style: the initialization style for the quantization points; currently [quantile, uniform] are supported. The differences between the two strategies have been discussed earlier.
  • nuql_weight_bits: the number of bits for weight quantization. Generally, for lower-bit quantization (e.g., 2 bits on CIFAR-10 and 4 bits on ILSVRC_12), NonUniformQuantLearner performs much better than UniformQuantLearner. The gap narrows as more bits are used.
  • nuql_activation_bits: the number of bits for activation quantization. Since non-uniformly quantized models cannot be accelerated directly, we leave it at 32 bits by default.
  • nuql_save_quant_model_path: the path to save the quantized model. Quantization nodes have already been inserted into the graph.
  • nuql_use_buckets: the switch to enable bucketing. With bucketing, the weights are split into multiple pieces, and the statistics \alpha and \beta are calculated individually for each piece. Therefore, enabling bucketing can lead to more fine-grained quantization.
  • nuql_bucket_type: the type of bucketing. Currently two types are supported: [split, channel]. With split, the weights of a layer are first flattened into a long vector and then cut into pieces of length nuql_bucket_size; the remaining tail, if any, forms its own piece. After each piece is quantized, the pieces are folded back into the original shape as the quantized weights. With channel, the weights of a convolutional layer with shape [k, k, cin, cout] are cut into cout buckets, each of size k * k * cin, and the weights of a fully-connected layer with shape [m, n] are cut into n buckets, each of size m. In practice, bucketing with type channel is faster to compute than type split, since there are fewer buckets to iterate over. A sketch of the two bucket layouts is given after this list.
  • nuql_bucket_size: the size of each bucket when using bucket type split. Generally, a smaller bucket size leads to more fine-grained quantization, but requires more storage, since the full-precision statistics (\alpha and \beta) of each bucket need to be kept.
  • nuql_quantize_all_layers: the switch to quantize the first and last layers. The first and last layers of the network are connected directly to the input and output, and are arguably more sensitive to quantization. Keeping them un-quantized can slightly improve performance; nevertheless, if you want to accelerate inference, all layers should be quantized.
  • nuql_quant_epoch: the number of epochs for fine-tuning the quantized network.
  • nuql_enbl_rl_agent: the switch to turn on the RL agent as a hyper-parameter optimizer. Details about the RL agent and its configuration are described below.
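
As a reference for the bucketing options above, here is an illustrative NumPy sketch of how the weights could be partitioned into buckets (layout only; PocketFlow's actual code may differ):

# partition a weight tensor into buckets; per-bucket statistics are computed afterwards
import numpy as np

def make_buckets(w, bucket_type='channel', bucket_size=256):
    if bucket_type == 'channel':
        # conv weights [k, k, c_in, c_out] -> c_out buckets of size k * k * c_in
        # fc weights   [m, n]              -> n buckets of size m
        return np.reshape(w, (-1, w.shape[-1])).T
    # 'split': flatten the weights, then cut into pieces of length bucket_size;
    # the last piece may be shorter than bucket_size and forms its own bucket
    flat = w.ravel()
    return [flat[i:i + bucket_size] for i in range(0, flat.size, bucket_size)]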

Configure the RL Agent

Similar to uniform quantization, once nuql_enbl_rl_agent==True, the RL agent automatically searches for the optimal bit allocation strategy for each layer. To search efficiently, the agent needs to be configured properly. Here we list all the configurable hyper-parameters for the agent; users can keep the default values for most of them and modify only a few if necessary.

Options Description
nuql_equivalent_bits the bit budget for re-allocation: the searched allocation is constrained to be equivalent to non-uniform quantization with this many bits without the RL agent. Default: 4.
nuql_nb_rlouts the number of roll outs for training the RL agent. Default: 200.
nuql_w_bit_min the minimal number of bits for each layer. Default: 2.
nuql_w_bit_max the maximal number of bits for each layer. Default: 8.
nuql_enbl_rl_global_tune the switch to fine-tune all layers of the network. Default: True.
nuql_enbl_rl_layerwise_tune the switch to fine-tune the network layer by layer. Default: False.
nuql_tune_layerwise_steps the number of steps for layer-wise fine-tuning. Default: 300.
nuql_tune_global_steps the number of steps for global fine-tuning. Default: 2000.
nuql_tune_disp_steps the display steps to show the fine-tuning progress. Default: 100.
nuql_enbl_random_layers the switch to randomly permute layers during RL agent training. Default: True.

Detailed descriptions can be found in Uniform Quantization; the only difference is that the prefix is changed to nuql_.
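
For intuition, nuql_equivalent_bits can be read as an overall bit budget; one plausible interpretation (the exact constraint used by the RL agent may differ) is that the parameter-weighted average bit-width should not exceed this budget, as in the sketch below. The function within_budget is hypothetical.

# check a per-layer bit allocation against the overall budget (illustrative only)
def within_budget(layer_sizes, layer_bits, equivalent_bits=4):
    # layer_sizes: number of weights in each layer; layer_bits: bits assigned per layer
    total_quant_bits = sum(n * b for n, b in zip(layer_sizes, layer_bits))
    budget_bits = sum(layer_sizes) * equivalent_bits
    return total_quant_bits <= budget_bits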

Usage Examples

Again, users should first get the model prepared. They can either use the pre-built models in PocketFlow, or develop their own customized nets following the model definitions in PocketFlow (for example, resnet_at_cifar10.py). Once the model is built, the Non-Uniform Quantization Learner can be easily triggered as follows:

To quantize a ResNet-20 model for the CIFAR-10 classification task with 4 bits in the local mode, use:

# quantize resnet-20 on CIFAR-10
sh ./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
--learner=non-uniform \
--nuql_weight_bits=4 \
--nuql_activation_bits=4

To quantize a ResNet-18 model for the ILSVRC_12 classification task with 8 bits in the docker mode with 4 GPUs, and enable channel-wise bucketing, use:

# quantize the resnet-18 on ILSVRC-12
sh ./scripts/run_docker.sh nets/resnet_at_ilsvrc12_run.py \
-n=4 \
--learner=non-uniform \
--nuql_weight_bits=8 \
--nuql_activation_bits=8 \
--nuql_use_buckets=True \
--nuql_bucket_type=channel

To quantize a MobileNet-v1 model for the ILSVRC_12 classification task with 4 bits in the seven mode with 8 GPUs, and let the RL agent search for the optimal bit allocation strategy, use:

# quantize mobilenet-v1 on ILSVRC-12
sh ./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py \
-n=8 \
--learner=non-uniform \
--nuql_enbl_rl_agent=True \
--nuql_equivalent_bits=4

References

Han, S., Mao, H., and Dally, W. J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv preprint arXiv:1510.00149, 2015.