neuron.ai

Image Segmentation I

Image segmentation is used to identify objects of different classes in an image in the form of pixel-level segmentation masks rather than bounding boxes, as is the case in object detection. It has many practical applications, including medical image analysis, computer vision for autonomous vehicles, face recognition and detection, video surveillance, and satellite image analysis.

Image segmentation, a preferred step when machines read and process images, makes it possible to determine the exact borders of objects at the pixel level after detection. In other words, it separates all the distinct features in the image to be processed into coherent pixel-level regions and delineates these regions in a meaningful way.

This article is the first of a series on image segmentation, and it focuses on encoder-decoder models for medical and biomedical image segmentation. In the upcoming weeks, we are going to post more articles on convolutional models with graphical models, dilated convolutional models, and the DeepLab family.

1. Encoder-Decoder Models for Medical and Biomedical Image Segmentation

1.a) U-Net: Convolutional Networks for Biomedical Image Segmentation

There are two popular architectures among the encoder-decoder models for image segmentation: (1) U-Net and (2) V-Net. Ronneberger et al. [1] propose U-Net for the segmentation of biological microscopy images.

The U-Net architecture, with an example for 32×32 pixels in the lowest resolution, is given in Figure 1. U-Net basically consists of two parts: a contracting path (left side) and an expansive path (right side). The contracting path follows a typical convolutional network architecture: the repeated application of two 3×3 convolutions, each followed by a rectified linear unit (ReLU), and a 2×2 max-pooling operation with stride 2 for down-sampling. In the expansive path, each step consists of up-sampling of the feature map followed by a 2×2 convolution, a concatenation with the cropped feature map from the contracting path, and two 3×3 convolutions, each followed by a ReLU. The expansive path is symmetric to the contracting path, yielding a U-shaped architecture. In the up-sampling part, the large number of feature channels allows the network to propagate context information to higher-resolution layers. At the end of the network, a 1×1 convolution maps the 64-component feature vectors to the desired number of classes. In total, the network has 23 convolutional layers.
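To make the contracting and expansive steps concrete, here is a minimal PyTorch sketch of the repeated building block described above. The name DoubleConv and the example tensor sizes are our assumptions, not from the paper; the unpadded ("valid") 3×3 convolutions follow the original setup.

import torch
import torch.nn as nn

# Two unpadded 3x3 convolutions, each followed by a ReLU -- the repeated
# block on both the contracting and expansive paths of U-Net.
class DoubleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# One contracting step: DoubleConv, then 2x2 max-pooling with stride 2.
down = nn.Sequential(DoubleConv(1, 64), nn.MaxPool2d(kernel_size=2, stride=2))

# One expansive step starts with a 2x2 up-convolution that halves the
# channels; its output is concatenated with the cropped feature map from
# the contracting path before the next DoubleConv (crop/concat omitted).
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)

x = torch.randn(1, 1, 572, 572)   # the paper's example input size
print(down(x).shape)              # torch.Size([1, 64, 284, 284])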

They train the network on input images and their corresponding segmentation maps with the stochastic gradient descent implementation of Caffe [2]. They demonstrate the application of U-Net on three different segmentation tasks. The first task is the segmentation of neuronal structures in electron microscopic recordings, using the data set provided by the EM segmentation challenge [3]. The training data is a set of 30 images of 512×512 pixels, each with a corresponding fully annotated ground-truth segmentation map.

U-Net achieves good performance on various biomedical segmentation applications, and the full Caffe-based implementation and the trained networks are provided in [4].

Figure 1: U-Net architecture with example for 32×32 pixels in the lowest resolution.

1.b) Attention U-Net: Learning Where to Look for the Pancreas

Segmentation frameworks based on fully convolutional networks [5] and U-Net [6] typically rely on multi-stage cascaded CNNs, which results in highly redundant use of computational resources and model parameters. Oktay et al. [7] propose attention gates (AGs) that can be used when training CNN models from scratch, in a way similar to the training of FCNs, and that learn to focus on target structures automatically without additional supervision. They integrate AGs into the U-Net architecture (named Attention U-Net) and apply it to a multi-label segmentation problem on abdominal CT images. They report that the proposed AGs improve model sensitivity and accuracy for dense label predictions by suppressing feature activations in irrelevant regions. According to the results, AGs consistently improve prediction accuracy across different datasets and training sizes. They use two different datasets: (i) 150 abdominal 3D CT scans acquired from patients diagnosed with gastric cancer (CT-150) and (ii) 82 contrast-enhanced 3D CT scans with pancreas annotations performed manually slice-by-slice (CT-82).
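The following is a simplified PyTorch sketch of an additive attention gate in the spirit of Oktay et al. [7]; the class name and channel handling are our assumptions, and the paper's exact resampling and normalisation details are omitted.

import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    # A gating signal g (coarse, from the decoder) and a skip connection x
    # (fine, from the encoder) are projected to a common channel count,
    # added, and passed through ReLU and a 1x1 convolution with a sigmoid
    # to produce attention coefficients that rescale the skip features.
    def __init__(self, g_ch, x_ch, inter_ch):
        super().__init__()
        self.w_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1)
        self.w_x = nn.Conv2d(x_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)

    def forward(self, g, x):
        # assumes g was already resampled to the spatial size of x
        alpha = torch.sigmoid(self.psi(torch.relu(self.w_g(g) + self.w_x(x))))
        return x * alpha  # suppress activations in irrelevant regions

gate = AttentionGate(g_ch=128, x_ch=64, inter_ch=64)
g = torch.randn(1, 128, 32, 32)   # decoder feature map, resampled to 32x32
x = torch.randn(1, 64, 32, 32)    # encoder skip connection
print(gate(g, x).shape)           # torch.Size([1, 64, 32, 32])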

1.c) UNet++: A Nested U-Net Architecture for Medical Image Segmentation

The skip connections in FCN and U-Net models are effective in recovering fine-grained details of the targets, generating segmentation masks with fine details even on complex backgrounds. They are also successful at instance-level segmentation of occluded objects.

Zhou et al. [8] present the UNet++ architecture for image segmentation, based on nested and dense skip connections, to build a more effective segmentation model that can recover the fine details of target objects in medical images.

Figure 2 presents an overview of the proposed model, which starts with an encoder sub-network followed by a decoder sub-network. The skip pathways (shown in green and blue) are re-designed to transform the connectivity of the sub-networks and reduce the semantic gap between the feature maps of the encoder and decoder, while deep supervision (shown in red) enables the model to operate in two modes, (1) an accurate mode and (2) a fast mode, and yields more accurate segmentation, particularly for targets that appear at multiple scales. They use four medical image datasets of lesions/organs for model evaluation. The results are compared with the baseline models U-Net and wide U-Net: UNet++ with deep supervision achieves an average IoU gain of 3.9 and 3.4 points over U-Net and wide U-Net, respectively. For more applications of U-Net-based architectures to image segmentation problems, see [36, 37].

Figure 2: UNet++ with an encoder and decoder that are connected through a series of nested dense convolutional blocks.
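As a rough illustration of the nested skip pathways, the sketch below computes one node of a UNet++ pathway in PyTorch: it concatenates all previous outputs at the same resolution with the up-sampled output of the node below, then applies a convolution block. The function name and the bilinear up-sampling are our choices, not the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

# One node X(i, j) on a UNet++ skip pathway: dense inputs from the same
# resolution level plus an up-sampled input from the level below.
def unetpp_node(conv_block, same_level_outputs, below_output):
    up = F.interpolate(below_output, scale_factor=2, mode='bilinear',
                       align_corners=False)
    dense_input = torch.cat(same_level_outputs + [up], dim=1)
    return conv_block(dense_input)

# Example for X(0, 2): inputs X(0, 0) and X(0, 1) at 64 channels, X(1, 1)
# at 128 channels; the conv block maps 64 + 64 + 128 channels back to 64.
conv = nn.Sequential(nn.Conv2d(256, 64, kernel_size=3, padding=1), nn.ReLU())
x00 = torch.randn(1, 64, 64, 64)
x01 = torch.randn(1, 64, 64, 64)
x11 = torch.randn(1, 128, 32, 32)
print(unetpp_node(conv, [x00, x01], x11).shape)  # torch.Size([1, 64, 64, 64])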

1.d) V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation

Milletari et al. [9] propose a method for 3D medical image segmentation based on a fully convolutional neural network with volumetric convolutions. A new objective function based on Dice coefficient maximisation is presented and optimized during training. The model is trained end-to-end on MRI volumes of the prostate and predicts the segmentation for the whole volume at once. The architecture is designed to extract features from the data and, at the same time, to reduce its resolution by using strided convolutions.

Figure 3 shows a schematic representation of the V-Net architecture with volumetric convolutions to process 3D data. The first part of the model compresses the signal, while the second part decompresses it back to its original size. The first part is divided into stages that operate at different resolutions and consist of one to three convolutional layers each. The input of each stage is processed through the non-linearities and added to the output of the last convolutional layer of the stage, so that each stage learns a residual function. The convolutions use volumetric kernels of size 5×5×5 voxels, and the resolution of the data is reduced along the compression path by convolutions with 2×2×2 kernels and stride 2. They apply PReLU non-linearities throughout the network and replace pooling with convolutional operations for a smaller memory footprint during training. Since down-sampling halves the size of the input signal, the number of feature channels is doubled relative to the previous layer. The second part extracts features and expands the spatial support of the lower-resolution feature maps, so that the extracted information is assembled into a two-channel volumetric segmentation. The output, which has the same size as the input volume, is generated from two feature maps by the last convolutional layer with a 1×1×1 kernel. Each stage of the second part consists of one to three convolutional layers followed by a de-convolution that increases the size of the input.

Figure 3: Schematic representation of V-Net architecture.
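A PyTorch sketch of one compression stage of V-Net may help; the padding choices and the class name are our assumptions, while the 5×5×5 kernels, PReLU non-linearities, residual addition, and strided 2×2×2 down-sampling follow the description above.

import torch
import torch.nn as nn

class VNetStage(nn.Module):
    # n_convs volumetric 5x5x5 convolutions with PReLU, a residual addition
    # of the stage input, and a strided 2x2x2 convolution that halves the
    # resolution and doubles the channels (replacing pooling).
    def __init__(self, channels, n_convs):
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv3d(channels, channels, 5, padding=2),
                          nn.PReLU(channels))
            for _ in range(n_convs)
        ])
        self.down = nn.Sequential(
            nn.Conv3d(channels, channels * 2, kernel_size=2, stride=2),
            nn.PReLU(channels * 2),
        )

    def forward(self, x):
        x = x + self.convs(x)   # learn a residual function
        return self.down(x)     # 2x2x2 conv with stride 2

stage = VNetStage(channels=16, n_convs=2)
x = torch.randn(1, 16, 64, 128, 128)
print(stage(x).shape)           # torch.Size([1, 32, 32, 64, 64])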

They define a new objective function based on Dice coefficient maximisation, which computes the probability of each voxel belonging to the foreground or the background without requiring sample re-weighting. The network is trained on a dataset of prostate MRI scans with an input size of 128×128×64 and a spatial resolution of 1×1×1.5 millimetres, and tested on 30 MRI volumes.
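The Dice objective can be written as D = 2 Σ p_i g_i / (Σ p_i² + Σ g_i²), where p_i are the predicted foreground probabilities and g_i the binary ground truth. Below is a minimal sketch of this soft-Dice loss in PyTorch; the eps smoothing term is our addition for numerical stability and is not part of the paper's formulation.

import torch

# Soft Dice loss as in V-Net: maximising D is implemented as minimising 1 - D.
def dice_loss(pred, target, eps=1e-6):
    p = pred.flatten()    # predicted foreground probabilities
    g = target.flatten()  # binary ground-truth mask
    dice = (2 * (p * g).sum() + eps) / ((p * p).sum() + (g * g).sum() + eps)
    return 1 - dice

pred = torch.rand(1, 64, 128, 128)                    # example probabilities
target = (torch.rand(1, 64, 128, 128) > 0.5).float()  # example ground truth
print(dice_loss(pred, target))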

About the author

Ayşenur Gilik

I am a researcher at Pro2Future working on sustainability with explainable AI. I am currently pursuing a Ph.D. at the Institute of Pervasive Computing at JKU in Linz, Austria. I finished my M.Sc. in Electronics Engineering at Kadir Has University in İstanbul, Türkiye, where I worked as a teaching assistant in the Electrical-Electronics Engineering Department for four years. I have primarily worked on machine learning and artificial intelligence and their applications to different engineering problems. My professional interests are machine learning, artificial intelligence, computer vision, sustainability, transparent and trustworthy systems, and teaching; my personal interests are literature, cinema, writing, and photography.

References

[1] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation.” International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015. https://arxiv.org/abs/1505.04597

[2] Jia, Yangqing, et al. “Caffe: Convolutional architecture for fast feature embedding.” Proceedings of the 22nd ACM international conference on Multimedia. 2014. https://arxiv.org/abs/1408.5093

[3] Web page of the EM segmentation challenge, http://brainiac2.mit.edu/isbi_challenge/

[4] U-net implementation, trained networks and supplementary material available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net

[5] J. Long, E. Shelhamer and T. Darrell, “Fully convolutional networks for semantic segmentation,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431-3440, doi: 10.1109/CVPR.2015.7298965.

[6] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE CVPR. pp. 3431–3440 (2015) https://arxiv.org/abs/1411.4038

[7] Oktay, Ozan, et al. “Attention u-net: Learning where to look for the pancreas.” arXiv preprint arXiv:1804.03999 (2018). https://arxiv.org/abs/1804.03999

[8] Zhou, Zongwei, et al. “Unet++: A nested u-net architecture for medical image segmentation.” Deep learning in medical image analysis and multimodal learning for clinical decision support. Springer, Cham, 2018. 3-11. https://arxiv.org/abs/1807.10165

[9] Milletari, Fausto, Nassir Navab, and Seyed-Ahmad Ahmadi. “V-net: Fully convolutional neural networks for volumetric medical image segmentation.” 2016 fourth international conference on 3D vision (3DV). IEEE, 2016. https://arxiv.org/abs/1606.04797

[10] Chen, Liang-Chieh, et al. “Semantic image segmentation with deep convolutional nets and fully connected crfs.” arXiv preprint arXiv:1412.7062 (2014). https://arxiv.org/abs/1412.7062

[11] Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014). https://arxiv.org/abs/1409.1556

[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.

[13] Everingham, M., Eslami, S.M.A., Van Gool, L. et al. The PASCAL Visual Object Classes Challenge: A Retrospective. Int J Comput Vis 111, 98–136 (2015). https://doi.org/10.1007/s11263-014-0733-5.

[14] Badrinarayanan, Vijay, Alex Kendall, and Roberto Cipolla. “Segnet: A deep convolutional encoder-decoder architecture for image segmentation.” IEEE transactions on pattern analysis and machine intelligence 39.12 (2017): 2481-2495. https://arxiv.org/abs/1511.00561

[15] Noh, Hyeonwoo, Seunghoon Hong, and Bohyung Han. “Learning deconvolution network for semantic segmentation.” Proceedings of the IEEE international conference on computer vision. 2015. https://arxiv.org/abs/1505.04366

[16] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation.” International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015. https://arxiv.org/abs/1505.04597

[17] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?,” in ICCV, pp. 2146– 2153, 2009.

[18] He, Kaiming, et al. “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.” Proceedings of the IEEE international conference on computer vision. 2015. https://arxiv.org/abs/1502.01852

[19] Bottou, L. (2010). Large-Scale Machine Learning with Stochastic Gradient Descent. In: Lechevallier, Y., Saporta, G. (eds) Proceedings of COMPSTAT’2010. Physica-Verlag HD. https://doi.org/10.1007/978-3-7908-2604-3_16

[20] Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. https://arxiv.org/abs/1411.4038

[21] Yu, Fisher, and Vladlen Koltun. “Multi-scale context aggregation by dilated convolutions.” arXiv preprint arXiv:1511.07122 (2015). https://arxiv.org/abs/1511.07122

[22] Chen, Liang-Chieh, et al. “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.” IEEE transactions on pattern analysis and machine intelligence 40.4 (2017): 834-848. https://arxiv.org/abs/1606.00915

[23] Chen, Liang-Chieh, et al. “Semantic image segmentation with deep convolutional nets and fully connected crfs.” arXiv preprint arXiv:1412.7062 (2014). https://arxiv.org/abs/1412.7062

[24] Yu, Fisher, and Vladlen Koltun. “Multi-scale context aggregation by dilated convolutions.” arXiv preprint arXiv:1511.07122 (2015). https://arxiv.org/abs/1511.07122

[25] Chen, Liang-Chieh, et al. “Rethinking atrous convolution for semantic image segmentation.” arXiv preprint arXiv:1706.05587 (2017). https://arxiv.org/abs/1706.05587

[26] Chen, Liang-Chieh, et al. “Encoder-decoder with atrous separable convolution for semantic image segmentation.” Proceedings of the European conference on computer vision (ECCV). 2018. https://arxiv.org/abs/1802.02611

[27] Howard, Andrew G., et al. “Mobilenets: Efficient convolutional neural networks for mobile vision applications.” arXiv preprint arXiv:1704.04861 (2017). https://arxiv.org/abs/1704.04861

[28] Chen, Liang-Chieh, et al. “Rethinking atrous convolution for semantic image segmentation.” arXiv preprint arXiv:1706.05587 (2017). https://arxiv.org/abs/1706.05587

[29] Everingham, Mark et al. “The Pascal Visual Object Classes Challenge: A Retrospective.” International Journal of Computer Vision 111 (2014): 98-136.

[30] Chen, Liang-Chieh, et al. “Semantic image segmentation with deep convolutional nets and fully connected crfs.” arXiv preprint arXiv:1412.7062 (2014). https://arxiv.org/abs/1412.7062

[31] Chen, Liang-Chieh, et al. “Semantic image segmentation with deep convolutional nets and fully connected crfs.” arXiv preprint arXiv:1412.7062 (2014). https://arxiv.org/abs/1412.7062

[32] Chen, Liang-Chieh, et al. “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.” IEEE transactions on pattern analysis and machine intelligence 40.4 (2017): 834-848. https://arxiv.org/abs/1606.00915

[33] Chen, Liang-Chieh, et al. “Encoder-decoder with atrous separable convolution for semantic image segmentation.” Proceedings of the European conference on computer vision (ECCV). 2018. https://arxiv.org/abs/1802.02611

[34] Chen, Liang-Chieh, et al. “Rethinking atrous convolution for semantic image segmentation.” arXiv preprint arXiv:1706.05587 (2017). https://arxiv.org/abs/1706.05587

[35] Zhao, Hengshuang, et al. “Pyramid scene parsing network.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. https://arxiv.org/abs/1612.01105

[36] Huang, Huimin, et al. “Unet 3+: A full-scale connected unet for medical image segmentation.” ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020. https://arxiv.org/abs/2004.08790

[37] Zhou, Zongwei, et al. “Unet++: Redesigning skip connections to exploit multiscale features in image segmentation.” IEEE transactions on medical imaging 39.6 (2019): 1856-1867. https://arxiv.org/abs/1912.05074