Looking to speed up data labeling with just one click? Learn about Interactive Segmentation.
Unlock the power of Interactive Segmentation and revolutionize your data labeling process with ease.
In this article, we'll explore click-based interactive segmentation and its benefits for speeding up the data labeling process in semantic segmentation. Segmentation models have proven useful in numerous fields, including medical image analysis, video understanding, and even self-driving cars. We also recently covered background replacement and face beautification in our article, where we presented a new large open-source dataset for portrait segmentation and face parsing along with a set of pre-trained models.
What is image segmentation?
How about we dive into image segmentation and explore the different types? Whether you want to refresh your memory or learn something new, let's start!
Image segmentation is a computer vision task in which every pixel of an image is classified, thereby separating the objects in it into distinct segments (masks). Image segmentation has three subtasks: semantic segmentation, instance segmentation, and panoptic segmentation. Each subtask has unique features and subcategories to consider when working on it.
Semantic segmentation is an image segmentation task that assigns each pixel to a given class or set of classes. With a single target class, such a task is called binary segmentation. Face parsing, portrait segmentation, and scene understanding are subfields of semantic segmentation.
Instance segmentation aims to localize the pixels of each target object into a separate segment (instance). Such segmentation is useful if we need to track objects in video or if the segmented objects are too small and too close to each other for accurate class-level pixel-by-pixel segmentation. Instance segmentation is most often encountered in geospatial data analysis, medical imaging, and environment perception for self-driving cars and robotics.
Panoptic segmentation combines semantic and instance segmentation by returning the union of their masks. Keep in mind that with this type of segmentation each image pixel gets exactly one label, so if objects in the image overlap, panoptic segmentation handles such occlusions worse than the instance segmentation described above.
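To make the difference between these three annotation formats concrete, here is a minimal sketch with toy values (the sizes, class ids, and layout are made up for illustration) of how such labels are commonly stored as arrays:

```python
import numpy as np

H, W = 4, 6  # toy image size

# Semantic segmentation: one class id per pixel (0 = background, 1 = person, 2 = car).
semantic = np.zeros((H, W), dtype=np.int64)
semantic[1:3, 1:3] = 1   # a "person" region
semantic[1:3, 4:6] = 2   # a "car" region

# Instance segmentation: one binary mask per object, together with its class id.
instances = [
    {"class_id": 1, "mask": semantic == 1},   # person #1
    {"class_id": 2, "mask": semantic == 2},   # car #1
]

# Panoptic segmentation: every pixel gets exactly one (class_id, instance_id) pair,
# so overlapping objects cannot share a pixel.
instance_id = np.zeros((H, W), dtype=np.int64)
instance_id[semantic == 1] = 1   # first person instance
instance_id[semantic == 2] = 1   # first car instance
panoptic = np.stack([semantic, instance_id], axis=-1)   # shape (H, W, 2)
```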
All three types of image segmentation have their own characteristics, but their main shared drawback is the data labeling process when no existing dataset meets the criteria for training a model.
Neural network models for image segmentation are trained to approximate ground-truth segmentation masks. If no suitable dataset exists, such ground-truth masks can be obtained through manual labeling on data annotation platforms (Toloka.AI, Amazon Mechanical Turk, Supervisely, Elementary) or through synthetic data generation. Models trained on synthetic images may be of lower quality than those trained on real data, and using synthetic data requires developing an additional model to generate high-quality images and annotations. Despite its high cost, long duration, and the likelihood of human error, the manual labeling approach is therefore often chosen.
Using interactive segmentation, you can simplify manual labeling and gain additional benefits for model training, namely:
- speed up the labeling process and potentially reduce the number of crowd workers, since segmentation mask drawing is partially automated;
- improve model quality thanks to a larger set of well-labeled data;
- reduce the cost of the final annotation.
Below we will discuss in more detail how Interactive Segmentation works.
Interactive Segmentation
One approach to interactive segmentation is based on clicks, with which the user marks out masks. There are two types of clicks: positive ones, which add the clicked region to the mask of a given class, and negative ones, which correct the resulting mask. For example, if the predicted segmentation mask doesn't completely cover the desired area, you can make another positive click; likewise, if the mask contains extra pixels, you can place a negative click to refine it.
All approaches to this problem aim to reach the highest possible Intersection over Union (IoU) with the minimum number of clicks. Usually a target IoU of 90% is chosen, and the metric is written as NoC@90, where NoC is the number of clicks needed to reach that target.
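As a rough illustration of how such a metric can be measured, here is a minimal sketch. It assumes a hypothetical model.predict(image, clicks) interface that returns a binary mask, and it places each next click simply at some pixel of the largest error region; real benchmarks use a more careful click-placement protocol:

```python
import numpy as np


def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union between two binary masks."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 1.0


def noc(model, image, gt_mask, target_iou=0.9, max_clicks=20):
    """Number of clicks needed to reach the target IoU (NoC@90 for target_iou=0.9)."""
    clicks = []                                   # list of (y, x, is_positive)
    pred = np.zeros_like(gt_mask, dtype=bool)
    for n in range(1, max_clicks + 1):
        fn = np.logical_and(gt_mask, ~pred)       # object pixels the model missed
        fp = np.logical_and(~gt_mask, pred)       # background pixels the model grabbed
        region, positive = (fn, True) if fn.sum() >= fp.sum() else (fp, False)
        ys, xs = np.nonzero(region)               # pick some pixel of the larger error region
        clicks.append((int(ys[len(ys) // 2]), int(xs[len(xs) // 2]), positive))

        pred = model.predict(image, clicks)       # hypothetical interactive model interface
        if iou(pred, gt_mask) >= target_iou:
            return n
    return max_clicks                             # target IoU was never reached
```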
Thus, in one click or a few, you can get the desired segmentation mask for the target object and cut the time spent placing polygon points, which is how annotation is usually implemented on crowdsourcing platforms. It is also worth noting that interactive segmentation produces masks with smooth edges that follow the shape of the segmented object as closely as possible. To achieve the same effect classically, by drawing the mask as a polygon, you would have to place far more points, which significantly increases annotation time.
For example, the Google team also took advantage of interactive segmentation and added single-click segmentation of target-class objects to their MediaPipe Studio. Their solution, Magic Touch, is based on an architecture similar to MobileNetV3 but with a custom decoder. This choice of architecture yields a model smaller than 10 MB with an inference time of about 130 milliseconds on the CPU.
It's also worth mentioning that interactive segmentation doesn't have to rely on clicks alone. Some approaches let the user interact with the model through specified regions, such as a rectangle defined by four coordinates, but picking the coordinates of such a region is more cumbersome than placing a click, and other objects falling inside the selected area may reduce segmentation accuracy. There are also scribble-based approaches that accept a set of lines drawn by the user over the object to be segmented, but this idea is complicated by the need to realistically simulate such "doodles" during training. Finally, some interactive segmentation models combine several input modalities at once for a better result, for example, a text description of the object together with clicks.
The general scheme of how interactive segmentation works is as follows:
Interactive Segmentation Methods
Let’s look at the most popular approaches to solving the problem of interactive segmentation.
Reviving Iterative Training with Mask Guidance for Interactive Segmentation (RITM)
This work introduced the idea of correcting segmentation masks from other models and interacting with them through negative and positive user clicks. This idea formed the basis of the latest state-of-the-art approaches to interactive segmentation, so it deserves special attention.
The authors of RITM chose an iterative approach to train their model using information about the segmentation result in the previous step. A convolutional neural network with the HRNet + OCR architecture was selected as the base pipeline of the model.
The user interacts with the model through positive and negative clicks, each represented by its image coordinates. For the model to return a correct result from the given clicks, they must be translated into a spatial representation. To do this, RITM encodes clicks as circular disks of small radius; the radius is usually chosen in the range of 3 to 5 pixels.
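Here is a minimal sketch of this disk encoding, assuming clicks are given as (y, x) pixel coordinates and the radius is a free parameter (the exact encoding in RITM may differ in details):

```python
import numpy as np


def clicks_to_disk_maps(pos_clicks, neg_clicks, height, width, radius=5):
    """Encode positive/negative clicks as two binary channels of small disks."""
    maps = np.zeros((2, height, width), dtype=np.float32)
    ys, xs = np.mgrid[:height, :width]
    for channel, clicks in enumerate((pos_clicks, neg_clicks)):
        for cy, cx in clicks:
            disk = (ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2
            maps[channel][disk] = 1.0
    return maps  # shape (2, H, W), ready to be concatenated with the RGB image


click_maps = clicks_to_disk_maps(pos_clicks=[(120, 200)], neg_clicks=[], height=480, width=640)
```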
The authors also note that segmentation models are usually pre-trained on the ImageNet dataset and accept only an RGB image as input. Therefore, the first convolutional layer must be changed so that the model can take the RGB image plus the spatially encoded user clicks as additional channels. The authors call this modification of the first convolutional layer Conv1E; its architectural features are shown in the figure below.
DFM (Distance Maps Fusion) is a module that transforms the input image and the channels with additional information into a three-channel input. Conv1S is a modified convolution block that takes the RGB image and the user clicks as input while preserving the dimensions of all tensors. Based on their experiments, the authors concluded that the combination HRNet-18 + ResNet-34 + Conv1S gives the best results.
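The general idea of extending the first convolution to accept extra click channels can be sketched as follows. This is not the RITM code; it is an illustration on a torchvision ResNet-34, keeping the pretrained RGB filters and zero-initializing the weights for the two new click channels:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34, ResNet34_Weights

backbone = resnet34(weights=ResNet34_Weights.IMAGENET1K_V1)

old_conv = backbone.conv1                      # Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
new_conv = nn.Conv2d(3 + 2, old_conv.out_channels,
                     kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride,
                     padding=old_conv.padding,
                     bias=old_conv.bias is not None)

with torch.no_grad():
    new_conv.weight[:, :3] = old_conv.weight   # keep the pretrained RGB filters
    new_conv.weight[:, 3:] = 0.0               # start the click channels from zero
backbone.conv1 = new_conv

x = torch.cat([torch.rand(1, 3, 224, 224),     # RGB image
               torch.zeros(1, 2, 224, 224)],   # positive/negative click maps
              dim=1)
out = backbone(x)                              # forward pass now works with 5-channel input
```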
RITM also uses an updated approach to simulating user clicks: at first, clicks are generated randomly without taking their relationship into account, and then an iterative process is applied, which consists of the following steps:
- clicks are placed not only at the center of incorrectly segmented regions but also within those regions after their area has been reduced by a factor of four through morphological erosion. Placing clicks only at the center leads to overfitting and poorly simulates where users actually click;
- during training, clicks are generated separately for each batch, so the number of click-sampling iterations is limited (a rough sketch of this sampling follows the list).
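Below is a rough sketch of such erosion-based click placement, assuming binary NumPy masks and using OpenCV for morphological erosion; the actual RITM sampling procedure is more involved:

```python
import cv2
import numpy as np


def sample_next_click(gt_mask, pred_mask, shrink_ratio=0.25):
    """Pick the next simulated click inside the largest error region,
    after shrinking that region with morphological erosion."""
    fn = np.logical_and(gt_mask, ~pred_mask)   # object pixels the model missed
    fp = np.logical_and(~gt_mask, pred_mask)   # background pixels the model grabbed
    region, is_positive = (fn, True) if fn.sum() >= fp.sum() else (fp, False)

    # Erode until the region shrinks to roughly `shrink_ratio` of its original area,
    # so clicks are not always placed exactly at the region center.
    target_area = max(1, int(region.sum() * shrink_ratio))
    eroded = region.astype(np.uint8)
    kernel = np.ones((3, 3), np.uint8)
    while eroded.sum() > target_area:
        shrunk = cv2.erode(eroded, kernel)
        if shrunk.sum() == 0:                  # don't erode the region away completely
            break
        eroded = shrunk

    ys, xs = np.nonzero(eroded)
    idx = np.random.randint(len(ys))           # random point inside the eroded region
    return (int(ys[idx]), int(xs[idx])), is_positive
```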
According to the reported metrics, this approach yields a model that produces a reasonably accurate segmentation mask in 2–4 clicks on average.
SimpleClick
The authors of this work focused on developing a solution with the following properties:
- a simple module architecture for extracting image features;
- a fast model response combined with high accuracy of the results.
Previously proposed solutions for interactive segmentation are based on hierarchical convolutional models with pyramidal processing of feature maps (Feature Pyramid Network) or on complex transformer models (Swin Transformer). Such hierarchical architectures are motivated by the need to extract as much information as possible about the global context of the image; they do this by processing feature maps produced by convolutional layers at different levels of the network.
The SimpleClick model builds on an idea from recent successful object detection work: using a Vision Transformer (ViT) pre-trained with Masked Autoencoders (MAE) together with window attention. This combination removes the need for complex hierarchical layer stacks to obtain global context, because all feature maps in a ViT are the same size and the global context of the image can be captured through the self-attention layers. In SimpleClick, only the last feature map is used to predict the segmentation mask: it is fed into a simple feature pyramid that implements the segmentation mechanism, and then a lightweight multilayer perceptron (MLP) is applied. The solution has been tested on ten public benchmarks, including images from common domains and medical data. On average, SimpleClick needs 2–3 clicks to produce an accurate mask.
The SimpleClick solution can be divided into three parts (a toy sketch of the last two follows the list):
- a ViT image feature extraction module that processes feature maps of the same size. Patch embedding layers split the input image and the click mask into non-overlapping patches of a fixed size (16×16 for ViT-B), each of which is projected into a vector of a specified length (768 for ViT-B). This sequence of patches is passed through the stack of transformer blocks with self-attention. The authors of SimpleClick present three variants of the feature extraction module: ViT-B, ViT-L, and ViT-H, all pre-trained with Masked Autoencoders on ImageNet-1k. The stack of attention blocks returns a feature map, from which the segmentation mask is then predicted;
- because only the last feature map, which contains the most detailed representation of the image, is used, four convolutional layers with different strides are enough to build the segmentation mechanism;
- a lightweight MLP that processes the feature map from each convolutional branch and upscales them to the same size for subsequent concatenation. The resulting tensor is then converted into a single-channel feature map to obtain the predicted segmentation mask.
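To make this structure more tangible, here is a toy PyTorch sketch (not the authors' code; the channel sizes and branch layout are illustrative assumptions) of the last two parts: a simple feature pyramid built from a single ViT feature map with four convolutions of different strides, followed by a lightweight per-pixel MLP that fuses the scales into a one-channel mask:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleFeaturePyramidHead(nn.Module):
    """Toy segmentation head: one ViT feature map -> four scales -> fused 1-channel mask."""

    def __init__(self, vit_dim=768, head_dim=128):
        super().__init__()
        # Four branches at strides 1/4, 1/2, 1, and 2 relative to the ViT feature map.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.ConvTranspose2d(vit_dim, head_dim, 2, stride=2),
                          nn.ConvTranspose2d(head_dim, head_dim, 2, stride=2)),
            nn.ConvTranspose2d(vit_dim, head_dim, 2, stride=2),
            nn.Conv2d(vit_dim, head_dim, 1),
            nn.Conv2d(vit_dim, head_dim, 3, stride=2, padding=1),
        ])
        # Lightweight per-pixel MLP (1x1 convolutions) fusing the concatenated scales.
        self.fuse = nn.Sequential(nn.Conv2d(4 * head_dim, head_dim, 1), nn.GELU(),
                                  nn.Conv2d(head_dim, 1, 1))

    def forward(self, feat):                        # feat: (B, vit_dim, H/16, W/16)
        target = (feat.shape[-2] * 4, feat.shape[-1] * 4)
        scales = [F.interpolate(b(feat), size=target, mode="bilinear", align_corners=False)
                  for b in self.branches]
        return self.fuse(torch.cat(scales, dim=1))  # (B, 1, H/4, W/4) mask logits


head = SimpleFeaturePyramidHead()
mask_logits = head(torch.rand(1, 768, 14, 14))      # e.g. a 224x224 image with 16x16 patches
print(mask_logits.shape)                            # torch.Size([1, 1, 56, 56])
```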
It's important to clarify that user clicks are also processed at the patch embedding stage of the image feature extraction module, and the segmentation result from the previous clicks is used as input, which allows the mask to be refined at the current annotation step. The user's clicks are encoded as two-channel disk masks and combined with the previous segmentation as input. In their diagram, the authors omitted the positional encoding to keep it easy to understand.
The model was trained with automatic click simulation based on the current segmentation result and the ground-truth mask. Clicks were placed both randomly and iteratively, an idea taken from RITM, which we analyzed above. Random clicks were generated in parallel without considering their order, while the iterative click strategy took the peculiarities of human behavior into account: the next click was placed in the area that needed refinement after the previous step.
Segment Anything
Currently, the Segment Anything Model (SAM) can be called the state-of-the-art solution for interactive segmentation. It was used to annotate the new SA-1B dataset, in which 11 million images are paired with over one billion masks. The authors call SAM a foundation model: a model with a very large number of parameters trained on a massive dataset (more than a million data instances). Thanks to the generalization ability it acquires, a foundation model can return results not only for one specific task but also for similar related tasks. This concept came from NLP models.
The user can interact with this model through clicks, a text description of the objects to find in the image, a rough mask of the object that will be completed automatically, or the coordinates of a box containing the objects to annotate. With these interaction formats, SAM operates in few-shot and one-shot modes, i.e., it needs only a small amount of data to return an accurate result.
The Segment Anything model works like this:
- the input image is fed into the image encoder, which returns embeddings, a low-dimensional representation of the image. A ViT pre-trained with MAE was chosen as the image encoder for better learning of the internal image representation, and the encoder was minimally adapted to work with higher-resolution images;
- the model can work not only with points but also with boxes, text, and masks. In the architecture diagram, these inputs are handled by the prompt encoder together with a small convolutional encoder for masks: positional encoding is used for points and boxes, the CLIP model for text, and three convolutional layers with GELU activations and a normalization layer for masks. The encoded prompts are then combined with the already computed image embeddings;
- the mask decoder converts the combined image and prompt embeddings into masks. This module consists of two layers, each performing four steps: self-attention over the tokens (prompt embeddings), cross-attention from the tokens to the image embedding, a multilayer perceptron (MLP) that updates each token, and, in the last step, cross-attention from the image embedding to the prompt tokens. Each attention block contains residual connections, normalization layers, and dropout. The image embedding is then upscaled back toward the original resolution with transposed convolutional layers, and the masks are predicted along with their IoU estimates.
Note that not one but three masks are predicted at once. This number was found empirically by the authors: to segment the desired object, it is enough to decompose the predicted masks "in depth" into a whole, a part, and a subpart. An example of such a decomposition can be seen in the picture below:
The architecture of the Segment Anything solution is driven by the need to run efficiently. The authors note that the model can run on the CPU in a browser at a speed of about 50 milliseconds per prompt. It should be clarified that this speed is achieved with a pre-computed image embedding. This response time is what makes SAM usable for interactive segmentation.
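In practice, this two-stage workflow (a heavy one-time image embedding plus cheap per-click decoding) is exposed in the official segment_anything Python package. Here is a small usage sketch; the checkpoint file name and click coordinates are just example values:

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Assumes a downloaded ViT-B checkpoint; the file name/path here is only an example.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)            # the expensive step: the image embedding is computed once

# One positive click at (x, y); label 1 = positive click, 0 = negative click.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[350, 200]]),
    point_labels=np.array([1]),
    multimask_output=True,            # returns the three "whole / part / subpart" masks
)
best_mask = masks[np.argmax(scores)]  # pick the mask with the highest predicted quality score
```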
Many data annotation platforms have appreciated the capabilities of the Segment Anything model and have begun integrating it to make it easier for users to obtain segmentation masks. For example, the ABC Elementary team has already added this model to their annotation platform:
The main problem with Segment Anything is the need to fine-tune the model on your own data if it differs from the data in the SA-1B dataset. For example, SAM users note that it doesn't do well with aerial or medical images. We also tested SAM on images from our EasyPortrait dataset and found that the model is unsuitable for annotating data for the face parsing task: in the image below, for example, it's impossible to obtain a segmentation mask for the facial skin.
Interactive Segmentation Datasets
In any machine learning task, data plays an essential role in the final quality of the models. Neural network approaches to interactive segmentation are no exception; quite the opposite, it's crucial to train them on datasets with high-quality annotation, because their output is itself meant to be used to train models for other tasks. To achieve high-quality automatic data annotation, the interactive segmentation model should therefore be "familiar" with as many target classes and their visual representations as possible.
Interactive segmentation models are trained on the following datasets: Semantic Boundaries Dataset (SBD), Pascal VOC, LVIS, COCO, and DAVIS. These datasets are also often combined so that models are trained on a large amount of data at once.
Recently the SA-1B dataset joined this list, containing 11 million images with 1.1 billion corresponding high-quality segmentation masks, all of them produced by the SAM model. To check their correctness, 500 images with about 50 thousand segmentation masks were selected, and professional annotators were hired to refine them with "brush" and "eraser" tools to pixel-level accuracy. The IoU metric was then used to estimate how much the automatically obtained masks differed from their corrected versions. The authors of SAM and SA-1B also published a human rating of annotation quality, comparing original ground-truth masks from several datasets with masks predicted by the SAM and RITM models. As we can see, on some datasets the masks produced by the models are not inferior to the manually annotated originals.
The listed datasets are usually used for semantic or instance segmentation tasks. The data about user and model interaction needed for interactive segmentation is generated during training and doesn't require additional manual labeling.
Conclusions
Today we introduced the interactive segmentation problem and the latest popular approaches to solving it. These approaches can produce a segmentation mask with pixel-level accuracy in a couple of clicks, which is not inferior to manual data labeling. Furthermore, if you need segmentation masks for a narrowly focused domain, such as medical images, you can achieve the same high results on your data simply by fine-tuning RITM or SimpleClick on a dataset from a similar domain. So manually annotated data may still be needed for interactive segmentation, but it may be enough to label only part of the data and use a model trained on that domain for the remaining unlabeled samples.
Active adoption of interactive segmentation in annotation platforms can significantly improve the quality of current models, because annotating large amounts of diverse data will become better, cheaper, and more accessible. In addition, interactive segmentation can be used not only for data labeling but also in photo and video editors for processing an area selected with a click, as well as in medical software to ease doctors' routine work.