Image Segmentation II
This is the second and final part of our series on image segmentation. In the first part, we focused on different encoder-decoder models for medical and biomedical image segmentation. Now we want to draw your attention to convolutional models with graphical models, dilated (atrous) convolutional models, and the DeepLab family.
2. Convolutional Models With Graphical Models
2.a) Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs
DeepLab systems provide some advantages in terms of speed, accuracy, and simplicity. Chen et al. [10] re-designed the ImageNet-pretrained 16-layer classification network VGG-16 [11] into an efficient dense feature extractor for a dense semantic image segmentation system. They convert the fully connected layers into convolutional layers and skip sub-sampling after the last two max-pooling layers. Lastly, they replace the 1000-way classifier in the last layer of the VGG-16 network with a 21-way one. The loss function is the sum of cross-entropy terms over the spatial positions in the output map, optimized with respect to the weights of all layers by standard SGD [12]. Moreover, dense scores are computed efficiently by controlling the network's receptive field size. The VGG-16 net has a receptive field of 224×224 (with zero-padding) or 404×404 pixels (in convolutional mode), and its first fully connected layer has 4096 filters of large 7×7 spatial size. By spatially subsampling this first layer to 4×4 or 3×3, they reduce the receptive field of the network down to 128×128 or 308×308, which also cuts the computation time for that layer by a factor of 2–3. Their Caffe-based implementation additionally reduces the number of channels at the fully connected layers from 4096 to 1024.
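For intuition, here is a minimal PyTorch sketch (not the authors' Caffe implementation) of converting the first fully connected layer into a convolution and spatially subsampling its kernels; the layer shapes come from the description above, but the slicing scheme and the choice of which 1024 filters to keep are illustrative assumptions:

```python
import torch
import torch.nn as nn

# VGG-16's fc6 acts on 7x7 patches of the 512-channel conv5 feature map,
# so it is equivalent to a 7x7 convolution with 4096 filters.
fc6_as_conv = nn.Conv2d(512, 4096, kernel_size=7)

# Subsample the 7x7 kernels to 3x3 taps and keep 1024 of the 4096 filters,
# shrinking the receptive field and cutting computation as described above.
# (The particular slicing used here is an assumption for illustration.)
fc6_small = nn.Conv2d(512, 1024, kernel_size=3)
with torch.no_grad():
    sub = fc6_as_conv.weight[:, :, ::3, ::3]   # taps at positions 0, 3, 6
    fc6_small.weight.copy_(sub[:1024])
    fc6_small.bias.copy_(fc6_as_conv.bias[:1024])
```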
In deeper models with multiple max-pooling layers, the increased invariance and large receptive fields that boost classification accuracy endanger localization accuracy. The method in this paper produces accurate semantic segmentations by handling this localization challenge, and it explores a multi-scale prediction method to increase boundary localization accuracy.
A two-layer MLP is attached to the input image and to the output of each of the first four max-pooling layers; the first layer of each MLP has 128 3×3 convolutional filters and the second layer has 128 1×1 convolutional filters. The resulting feature maps are concatenated with the feature map of the main network, and the aggregate is then fed into the SoftMax layer. While the newly added weights are adjusted, the other parameters are kept at the values learned by the method described above.
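A minimal sketch of one such branch in PyTorch; the branch's input channel count and the main network's channel count are hypothetical, and the real method aligns feature-map resolutions before concatenation:

```python
import torch
import torch.nn as nn

class MSBranch(nn.Module):
    """Two-layer MLP-style branch as described above:
    128 filters of 3x3 followed by 128 filters of 1x1."""
    def __init__(self, in_channels):
        super().__init__()
        self.conv3x3 = nn.Conv2d(in_channels, 128, kernel_size=3, padding=1)
        self.conv1x1 = nn.Conv2d(128, 128, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv1x1(self.relu(self.conv3x3(x))))

# Concatenate the branch output with the main network's feature map
# before the SoftMax layer (channel counts here are hypothetical).
branch_out = MSBranch(64)(torch.randn(1, 64, 56, 56))
main_out = torch.randn(1, 1024, 56, 56)
fused = torch.cat([main_out, branch_out], dim=1)
```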
The dataset consists of 20 foreground object classes and one background class from the PASCAL VOC 2012 segmentation benchmark [13], with 1464 images for training, 1449 for validation, and 1456 for testing. Additionally, it is augmented with extra annotations to extend it to 10582 training images, and performance is measured with pixel intersection-over-union (IoU). Comparing the results with other models shows the benefit of employing a fully connected CRF for accurately capturing intricate object boundaries.
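As a reminder of what the metric computes, here is a small sketch of per-class pixel IoU; the random label maps are placeholders:

```python
import torch

def per_class_iou(pred, target, num_classes):
    """Pixel intersection-over-union for each class; pred and target
    are integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = (p | t).sum().item()
        ious.append((p & t).sum().item() / union if union else float('nan'))
    return ious

# 21 classes: 20 foreground objects plus background (dummy data).
pred = torch.randint(0, 21, (500, 375))
target = torch.randint(0, 21, (500, 375))
print(per_class_iou(pred, target, 21))
```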
2.b) SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
SegNet, designed by Badrinarayanan et al. [14], is an efficient model for pixel-wise semantic segmentation.
As seen in Figure 4, the SegNet architecture is made of convolutional layers rather than fully connected layers. It consists of an encoder network and a decoder network, followed by a pixel-wise classification layer at the end. The encoder network has 13 convolutional layers, which are identical to the first 13 convolutional layers of the VGG-16 network [11]. The fully connected layers of VGG-16 are removed to make the SegNet encoder significantly smaller, which makes it easier to train and retains higher-resolution feature maps at the output of the deepest encoder. Each encoder produces a set of feature maps by performing convolution with a filter bank. After batch normalization and a rectified-linear non-linearity, max-pooling with a 2×2 window and stride 2 is applied, sub-sampling the output by a factor of 2.
For each encoder layer, there is a corresponding decoder layer. The decoder network consists of 13 layers, and the final decoder output is fed to a multi-class SoftMax classifier, which produces the class probabilities for each pixel. Each decoder up-samples its input feature map using the memorized max-pooling indices from the corresponding encoder feature map, producing a sparse feature map. These feature maps are then convolved with a decoder filter bank to produce dense feature maps. After batch normalization is applied to each of the maps, the high-dimensional feature representation is fed to the trainable SoftMax classifier, which classifies each pixel independently. The result is a K-channel image of probabilities, where K is the number of classes.
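The distinctive step is up-sampling with the memorized pooling indices. PyTorch exposes exactly this pairing as MaxPool2d(return_indices=True) and MaxUnpool2d; the sketch below shows one encoder-decoder stage with illustrative channel sizes, not the full 13-layer network:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 360, 480)            # an encoder feature map
enc_out, indices = pool(x)                   # sub-sample by 2, remember indices

# Decoder stage: place values back at the memorized max locations
# (a sparse map), then densify with a trainable decoder filter bank.
sparse = unpool(enc_out, indices, output_size=x.size())
dense = nn.Conv2d(64, 64, kernel_size=3, padding=1)(sparse)
```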
Although DeconvNet [15] and U-Net [16] share a similar architecture with SegNet, there are some differences. DeconvNet has a much larger parametrization than SegNet, which demands more computational resources and makes it difficult to train end-to-end due to its fully connected layers. Further, U-Net does not reuse pooling indices but instead transfers entire feature maps to the decoders, whereas SegNet reuses the pooling indices and uses the pre-trained convolutional layer weights from VGG-16 to initialize its encoder.
They used the CamVid road-scenes dataset, which consists of 367 training and 233 test RGB images of day and dusk scenes at a resolution of 360×480. There are 11 classes, such as road, building, cars, pedestrians, signs, poles, and sidewalk. They applied local contrast normalization [17] to the RGB input images, initialized the encoder and decoder weights using the technique from [18], and used stochastic gradient descent (SGD) with a fixed learning rate of 0.1 and momentum of 0.9 [19]. The cross-entropy loss [20] is used as the objective function for training.
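With those reported hyper-parameters, the training objective and optimizer might be set up as follows; the one-layer stand-in model and batch shapes are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Stand-in for the SegNet encoder-decoder; 11 output classes as in CamVid.
model = nn.Conv2d(3, 11, kernel_size=3, padding=1)

criterion = nn.CrossEntropyLoss()                        # per-pixel cross-entropy
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

images = torch.randn(4, 3, 360, 480)                     # RGB batch
labels = torch.randint(0, 11, (4, 360, 480))             # per-pixel class ids
loss = criterion(model(images), labels)                  # logits: (4, 11, 360, 480)
loss.backward()
optimizer.step()
```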
3. Dilated (Atrous) Convolutional Models and DeepLab Family
Atrous convolution (also known as dilated convolution) lets the network control the resolution at which responses are computed by DCNNs without requiring extra parameters. A dilation rate is introduced as an additional parameter of the convolutional layer; it defines a spacing between the kernel weights, enlarging the receptive field without any increase in computational cost. This method is popular in real-time segmentation studies such as the DeepLab family [22] and multi-scale context aggregation [21].
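In PyTorch, the dilation rate is literally an extra argument of the convolution. The sketch below contrasts a standard and a dilated 3×3 convolution: same nine weights and the same cost, but the dilated kernel covers a 5×5 field:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 64, 64)

# Standard 3x3 convolution: 3x3 receptive field, 9 weights.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, dilation=1)

# Dilated 3x3 convolution with rate 2: taps spaced one pixel apart,
# covering a 5x5 field with the same 9 weights.
atrous = nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2)

assert conv(x).shape == atrous(x).shape == x.shape
print(sum(p.numel() for p in atrous.parameters()))  # 10: 9 weights + 1 bias
```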
DeepLabv1 [23] and DeepLabv2 [22] are image segmentation approaches with some advantages over other methods. First, atrous convolution counters the resolution reduction caused by max-pooling and striding in the network. Second, Atrous Spatial Pyramid Pooling (ASPP) probes an incoming convolutional feature layer at multiple sampling rates, capturing objects and image context at multiple scales together, and thus effectively segments objects at multiple scales. Additionally, methods from deep CNNs and probabilistic graphical models are combined to improve the localization of object boundaries.
In [25], DeepLabv3 is introduced, which combines cascaded and parallel modules of atrous convolutions. The parallel convolution modules are grouped in the ASPP, and all outputs are processed by a 1×1 convolution to create the final output with logits for each pixel. DeepLabv3+ [26] extends DeepLabv3 [25] by adding a decoder to the DeepLabv3 architecture in order to recover the object boundaries. DeepLabv3+ consists of an encoder-decoder with atrous separable convolution, i.e., a spatial (depthwise) convolution followed by a pointwise convolution. The rich semantic information is encoded in the output of DeepLabv3, with atrous convolution allowing one to control the density of the encoder features depending on the budget of computation resources. Furthermore, the decoder module allows detailed object boundary recovery.
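To make "atrous separable convolution" concrete: it factors a convolution into a depthwise (spatial, per-channel) step with a dilation rate, followed by a pointwise (1×1) step. A minimal sketch with assumed channel sizes:

```python
import torch
import torch.nn as nn

class AtrousSeparableConv(nn.Module):
    """Depthwise atrous convolution followed by a pointwise convolution;
    a sketch of the building block, with illustrative sizes."""
    def __init__(self, in_ch, out_ch, rate):
        super().__init__()
        # groups=in_ch makes the 3x3 convolution depthwise (per channel).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=rate, dilation=rate, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

y = AtrousSeparableConv(256, 256, rate=2)(torch.randn(1, 256, 32, 32))
```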
3.a) Rethinking Atrous Convolution for Semantic Image Segmentation
DCNNs reduce feature resolution through consecutive pooling operations and convolution striding, which lets the model learn increasingly abstract feature representations. In [28], some advantages of dilated convolutional models over plain DCNNs are pointed out. Atrous convolution can extract dense feature maps by removing the down-sampling operations from the last layers of a DCNN and up-sampling the corresponding filter kernels. In addition, with atrous convolution, the resolution of the feature responses computed by DCNNs can be controlled without extra parameters.
Chen et al. [28] propose atrous convolution to enlarge the field-of-view of filters and incorporate multi-scale context in cascaded modules and spatial pyramid pooling; thus they implement an image segmentation model for objects existing at multiple scales. A cascaded module is introduced to encode multi-scale information; it doubles the atrous rates from block to block. The atrous spatial pyramid pooling module is augmented with image-level features to probe the features at multiple sampling rates and effective fields-of-view.
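Putting these pieces together, an ASPP head in the spirit of DeepLabv3 runs parallel atrous branches plus a global image-pooling branch and fuses everything with a 1×1 convolution. The sketch below uses the commonly cited rates (6, 12, 18) and channel widths, but treat the exact configuration as an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branch1x1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates)
        self.image_pool = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [self.branch1x1(x)] + [b(x) for b in self.branches]
        # Image-level features: global average pool, 1x1 conv, upsample back.
        img = F.interpolate(self.image_pool(x.mean((2, 3), keepdim=True)),
                            size=(h, w), mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [img], dim=1))

out = ASPP()(torch.randn(1, 2048, 33, 33))   # -> (1, 256, 33, 33)
```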
They evaluate the model on the PASCAL VOC 2012 semantic segmentation benchmark [29] with the same augmented dataset as in [30]. When a 3×3 atrous convolution with a very high rate is applied, it fails to capture long-range information because of image boundary effects; in the extreme, the filter degenerates to a 1×1 convolution since only the center weight falls on valid image regions. To overcome this issue, they propose to incorporate image-level features into the ASPP module. They report that their model DeepLabv3 improves on previous works [31], [32], reaching 85.7% on the test set.
3.b) Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
Chen et al. [33] build a model that combines two types of modules, spatial pyramid pooling and an encoder-decoder structure, to capture rich contextual information at different resolutions as well as to obtain sharp object boundaries. While the parallel atrous convolutions coming from DeepLabv3 [34] capture the contextual information at multiple scales, PSPNet [35] performs pooling operations at different grid scales. To address the issue of missing detailed information related to object boundaries, they apply atrous convolution to obtain a denser feature map, and in DeepLabv3+ they add a decoder module to recover the detailed object boundaries.
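A simplified sketch of that decoder idea: up-sample the encoder (ASPP) output, fuse it with reduced low-level features, refine, and up-sample again. Channel sizes and the two 4× up-sampling steps follow the paper's description, but this is an illustration, not the reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """DeepLabv3+-style decoder sketch: fuse up-sampled encoder output
    with reduced low-level features, refine, then up-sample to image size."""
    def __init__(self, low_ch=256, aspp_ch=256, num_classes=21):
        super().__init__()
        self.reduce = nn.Conv2d(low_ch, 48, kernel_size=1)  # shrink low-level features
        self.refine = nn.Conv2d(aspp_ch + 48, 256, kernel_size=3, padding=1)
        self.classify = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, aspp_out, low_level):
        x = F.interpolate(aspp_out, size=low_level.shape[2:],
                          mode='bilinear', align_corners=False)   # up-sample 4x
        x = torch.cat([x, self.reduce(low_level)], dim=1)
        x = self.classify(F.relu(self.refine(x)))
        # Final 4x up-sampling back to the input resolution.
        return F.interpolate(x, scale_factor=4, mode='bilinear',
                             align_corners=False)

logits = Decoder()(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 128, 128))
```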
They contribute by proposing an encoder-decoder model that employs DeepLabv3 as the encoder and can arbitrarily control the resolution of the extracted encoder features via atrous convolution. They also adapt the Xception model [26], applying depthwise separable convolution to both the ASPP module and the decoder module to make the network faster and stronger. The proposed method reaches a performance of 89.0% and 82.1% on the PASCAL VOC 2012 and Cityscapes datasets, respectively.
About the author
Ayşenur Gilik
I am a researcher at Pro2Future and working on sustainability with explainable AI. I am currently pursuing a Ph.D. at the Institute of Pervasive Computing at JKU in Linz, Austria. I finished my M.Sc. in Electronics Engineering at Kadir Has University in İstanbul, Türkiye, where I worked as a teaching assistant in Electrical-Electronics Engineering Department for four years. I have primarily worked on machine learning and Artificial Intelligence and their applications on different engineering problems. My professional interests are machine learning, artificial intelligence, computer vision, sustainability, transparent and trustworthy systems, and teaching; my personal interests are literature, cinema, writing, and photography.
References
[1] Zhou, Zongwei, et al. “Unet++: Redesigning skip connections to exploit multiscale features in image segmentation.” IEEE transactions on medical imaging 39.6 (2019): 1856-1867. https://arxiv.org/abs/1912.05074
[2] Jia, Yangqing, et al. “Caffe: Convolutional architecture for fast feature embedding.” Proceedings of the 22nd ACM international conference on Multimedia. 2014. https://arxiv.org/abs/1408.5093
[3] WWW: Web page of the EM segmentation challenge, http://brainiac2.mit.edu/isbi_challenge/
[4] U-net implementation, trained networks and supplementary material available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net
[5] Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation.” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015. 3431-3440. doi: 10.1109/CVPR.2015.7298965.
[6] Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation.” IEEE CVPR. 2015. 3431-3440. https://arxiv.org/abs/1411.4038
[7] Oktay, Ozan, et al. “Attention u-net: Learning where to look for the pancreas.” arXiv preprint arXiv:1804.03999 (2018). https://arxiv.org/abs/1804.03999
[8] Zhou, Zongwei, et al. “Unet++: A nested u-net architecture for medical image segmentation.” Deep learning in medical image analysis and multimodal learning for clinical decision support. Springer, Cham, 2018. 3-11. https://arxiv.org/abs/1807.10165
[9] Milletari, Fausto, Nassir Navab, and Seyed-Ahmad Ahmadi. “V-net: Fully convolutional neural networks for volumetric medical image segmentation.” 2016 fourth international conference on 3D vision (3DV). IEEE, 2016. https://arxiv.org/abs/1606.04797
[10] Chen, Liang-Chieh, et al. “Semantic image segmentation with deep convolutional nets and fully connected crfs.” arXiv preprint arXiv:1412.7062 (2014). https://arxiv.org/abs/1412.7062
[11] Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014). https://arxiv.org/abs/1409.1556
[12] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012. 1097-1105.
[13] Everingham, M., Eslami, S.M.A., Van Gool, L. et al. The PASCAL Visual Object Classes Challenge: A Retrospective. Int J Comput Vis 111, 98–136 (2015). https://doi.org/10.1007/s11263-014-0733-5.
[14] Badrinarayanan, Vijay, Alex Kendall, and Roberto Cipolla. “Segnet: A deep convolutional encoder-decoder architecture for image segmentation.” IEEE transactions on pattern analysis and machine intelligence 39.12 (2017): 2481-2495. https://arxiv.org/abs/1511.00561
[15] Noh, Hyeonwoo, Seunghoon Hong, and Bohyung Han. “Learning deconvolution network for semantic segmentation.” Proceedings of the IEEE international conference on computer vision. 2015. https://arxiv.org/abs/1505.04366
[16] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation.” International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015. https://arxiv.org/abs/1505.04597
[17] Jarrett, Kevin, et al. “What is the best multi-stage architecture for object recognition?” Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2009. 2146-2153.
[18] He, Kaiming, et al. “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.” Proceedings of the IEEE international conference on computer vision. 2015. https://arxiv.org/abs/1502.01852
[19] Bottou, L. (2010). Large-Scale Machine Learning with Stochastic Gradient Descent. In: Lechevallier, Y., Saporta, G. (eds) Proceedings of COMPSTAT’2010. Physica-Verlag HD. https://doi.org/10.1007/978-3-7908-2604-3_16
[20] Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. https://arxiv.org/abs/1411.4038
[21] Yu, Fisher, and Vladlen Koltun. “Multi-scale context aggregation by dilated convolutions.” arXiv preprint arXiv:1511.07122 (2015). https://arxiv.org/abs/1511.07122
[22] Chen, Liang-Chieh, et al. “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.” IEEE transactions on pattern analysis and machine intelligence 40.4 (2017): 834-848. https://arxiv.org/abs/1606.00915
[23] Chen, Liang-Chieh, et al. “Semantic image segmentation with deep convolutional nets and fully connected crfs.” arXiv preprint arXiv:1412.7062 (2014). https://arxiv.org/abs/1412.7062
[24] Yu, Fisher, and Vladlen Koltun. “Multi-scale context aggregation by dilated convolutions.” arXiv preprint arXiv:1511.07122 (2015). https://arxiv.org/abs/1511.07122
[25] Chen, Liang-Chieh, et al. “Rethinking atrous convolution for semantic image segmentation.” arXiv preprint arXiv:1706.05587 (2017). https://arxiv.org/abs/1706.05587
[26] Chen, Liang-Chieh, et al. “Encoder-decoder with atrous separable convolution for semantic image segmentation.” Proceedings of the European conference on computer vision (ECCV). 2018. https://arxiv.org/abs/1802.02611
[27] Howard, Andrew G., et al. “Mobilenets: Efficient convolutional neural networks for mobile vision applications.” arXiv preprint arXiv:1704.04861 (2017). https://arxiv.org/abs/1704.04861
[28] Chen, Liang-Chieh, et al. “Rethinking atrous convolution for semantic image segmentation.” arXiv preprint arXiv:1706.05587 (2017). https://arxiv.org/abs/1706.05587
[29] Everingham, Mark et al. “The Pascal Visual Object Classes Challenge: A Retrospective.” International Journal of Computer Vision 111 (2014): 98-136.
[30] Chen, Liang-Chieh, et al. “Semantic image segmentation with deep convolutional nets and fully connected crfs.” arXiv preprint arXiv:1412.7062 (2014). https://arxiv.org/abs/1412.7062
[31] Chen, Liang-Chieh, et al. “Semantic image segmentation with deep convolutional nets and fully connected crfs.” arXiv preprint arXiv:1412.7062 (2014). https://arxiv.org/abs/1412.7062
[32] Chen, Liang-Chieh, et al. “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.” IEEE transactions on pattern analysis and machine intelligence 40.4 (2017): 834-848. https://arxiv.org/abs/1606.00915
[33] Chen, Liang-Chieh, et al. “Encoder-decoder with atrous separable convolution for semantic image segmentation.” Proceedings of the European conference on computer vision (ECCV). 2018. https://arxiv.org/abs/1802.02611
[34] Chen, Liang-Chieh, et al. “Rethinking atrous convolution for semantic image segmentation.” arXiv preprint arXiv:1706.05587 (2017). https://arxiv.org/abs/1706.05587
[35] Zhao, Hengshuang, et al. “Pyramid scene parsing network.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. https://arxiv.org/abs/1612.01105
[36] Huang, Huimin, et al. “Unet 3+: A full-scale connected unet for medical image segmentation.” ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020. https://arxiv.org/abs/2004.08790
[37] Zhou, Zongwei, et al. “Unet++: Redesigning skip connections to exploit multiscale features in image segmentation.” IEEE transactions on medical imaging 39.6 (2019): 1856-1867. https://arxiv.org/abs/1912.05074