Incremental Few-shot Instance Segmentation (CVPR 2021)

- Paper Link: https://arxiv.org/abs/2105.05312

 


 

 

 

Abstract

- Current Few-Shot Instance Segmentation (FSIS) approaches do not facilitate flexible addition of novel classes

- present the first incremental approach to FSIS: iMTFA

- learn discriminative embeddings for object instances that are merged into class representatives

- storing embedding vectors rather than images solves the memory overhead problem

- match these class embeddings at the RoI-level using cosine similarity
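The RoI-level matching described above can be sketched in plain NumPy; `cosine_scores` and the array shapes are illustrative assumptions, not the paper's code:

```python
import numpy as np

def cosine_scores(roi_embeddings, class_weights):
    """Cosine similarity between each RoI embedding (rows of an (n, e)
    array) and each class representative (rows of a (c, e) array).
    Returns an (n, c) score matrix."""
    e = roi_embeddings / np.linalg.norm(roi_embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    return e @ w.T

# Two RoI embeddings scored against two class representatives:
scores = cosine_scores(np.array([[1.0, 0.0], [0.0, 2.0]]),
                       np.array([[2.0, 0.0], [0.0, 1.0]]))
```

Because both sides are L2-normalized, the score depends only on direction, not magnitude, which is what makes the learned embeddings comparable across classes.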

 

 

1. Introduction

- Few-shot learning addresses the problem of learning with limited available data

- Few-shot learning: base classes (numerous training samples) + novel classes (scarce, K samples each). The goal is to train a system that correctly classifies N classes: either only the novel classes, or both novel and base classes jointly.

 

- Previous few-shot object detection (FSOD) and few-shot instance segmentation (FSIS) methods require long training procedures with both novel and base class samples

- This is impractical when we want to add novel classes to an already-trained network.

- Incremental few-shot learning: the addition of novel classes is independent of previous data.

 

Figure 1. Incremental few-shot instance segmentation.

- the first incremental few-shot instance segmentation method: iMTFA

- a two-stage training and fine-tuning approach based on Mask R-CNN

- first stage: Mask R-CNN

- second stage: the FC layers at the RoI level are re-purposed, transforming the fixed feature extractor into an Instance Feature Extractor (IFE) that produces discriminative embeddings aligned with the per-class representatives. These embeddings are subsequently used as weights inside a cosine-similarity classifier
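A minimal sketch of how per-instance embeddings could be merged into a single class representative, assuming the standard recipe of averaging L2-normalized embeddings (function name and shapes are illustrative, not the paper's code):

```python
import numpy as np

def class_representative(instance_embeddings):
    """Merge K instance embeddings (a (K, e) array) into one unit-norm
    class representative: L2-normalize each embedding, average them,
    then renormalize the mean."""
    unit = instance_embeddings / np.linalg.norm(instance_embeddings,
                                                axis=1, keepdims=True)
    mean = unit.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Two instance embeddings collapse into one classifier weight vector:
rep = class_representative(np.array([[2.0, 0.0], [0.0, 2.0]]))
```

The resulting unit vector can be dropped directly into the cosine-similarity classifier as the weight vector for that class.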

 

- Advantages

1) Eliminates the need for extensive retraining procedures for new classes because these can be added incrementally

2) Mask predictor is class-agnostic. No mask labels are needed for the addition of novel classes

3) No performance drawbacks at test time.

 

- Contributions

1) iMTFA, the first incremental few-shot instance segmentation method, which outperforms the state of the art for FSIS and incremental FSOD.

2) To compare incremental and non-incremental methods, an existing FSOD approach is extended to the instance segmentation task (MTFA), which also achieves state-of-the-art results.

 

 

2. Related Work

< Instance Segmentation >

- Grouping-based: per-pixel information is post-processed to obtain instance segmentations
- Proposal-based: e.g., Mask R-CNN (RPN-based). These do not perform well with a small training dataset.

 

< Few-shot Learning >

- enables models to accommodate new classes for which little training data is available

- episodic methodology: providing query items to be classified into N classes and a support set containing training examples of the N classes

- Optimization-based: meta-learner

- Metric-learning: learns a feature embedding such that objects from the same class are close in the embedding space and objects of different classes are far apart

 

< Few-shot object detection >

- TFA(ICML2020): first trains Faster R-CNN on the base classes and then only fine-tunes the predictor heads

 

< Few-shot instance segmentation >

- few works

- most approaches provide guidance to certain parts of the Mask R-CNN architecture to ensure the network is better informed of the novel classes

- Meta R-CNN and Siamese Mask R-CNN: compute the embeddings of the support set and combine these with the feature map produced by the network backbone to provide additional information at a certain stage
- FGN: guides the RPN, RoI detector and mask upsampling layers with the support set feature embeddings

 

< Incremental few-shot object detection >

- ONCE(CVPR2020): CenterNet as a backbone to learn a class-agnostic feature extractor and a per-class code generator network for novel classes

 

< Incremental few-shot instance segmentation >

- FGN and Siamese Mask R-CNN: depend on being passed examples of every class at test time (requiring a large amount of memory)

- Meta R-CNN can precompute per-class attention vectors, but requires retraining to handle a different number of classes

- iMTFA can incrementally add classes without retraining or requiring examples of base classes
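The incremental property can be illustrated in a few lines: adding a class only appends its representative as a new column of the classifier's weight matrix (a sketch under assumed naming; `add_novel_class` is hypothetical):

```python
import numpy as np

def add_novel_class(W, new_rep):
    """Append a novel class representative as an extra column of the
    (e, c) classifier weight matrix. The base-class columns are left
    untouched, so no retraining and no base-class examples are needed."""
    return np.concatenate([W, new_rep[:, None]], axis=1)

# A 2-class classifier grows to 3 classes with one concatenation:
W = add_novel_class(np.eye(2), np.array([1.0, 0.0]))
```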

 

 

 

 

3. Methodology

 

< 3.1 Formulation of few-shot instance segmentation >

- a set of base classes $C_{base}$ and a disjoint set of novel classes $C_{novel}$

- Goal: train a model that does well on the novel classes $C_{test}  = C_{novel}$ or $C_{test} = C_{base} \cup C_{novel}$

- Episodic-training methodology: a series of episodes $E_i=(I^q, S_i)$ where $S_i$ is a support set containing N classes from $C_{train} = C_{novel} \cup C_{base}$, along with K examples per class (N-way K-shot)

- A network is then tasked to classify an image $I^q$, termed query, out of the classes in $S_i$.

- Solving a different classification task each episode leads to better generalization and better results on $C_{novel}$

- extended to FSOD and FSIS by considering all objects in an image as queries and having a single support set per image instead of per query.

- The challenge of FSIS: not only to classify the query objects, but also to determine their localization and segmentation
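The N-way K-shot episodic setup can be sketched with a toy sampler (the `dataset` layout as a class-to-examples dict is an assumption for illustration):

```python
import random

def sample_episode(dataset, n_way, k_shot, seed=None):
    """Draw one N-way K-shot support set from a {class: [examples]}
    mapping: pick N classes, then K examples for each of them."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), n_way)
    return {c: rng.sample(dataset[c], k_shot) for c in classes}

# A 2-way 2-shot support set drawn from three classes:
support = sample_episode({"cat": [1, 2, 3], "dog": [4, 5, 6],
                          "bird": [7, 8, 9]}, n_way=2, k_shot=2, seed=0)
```

In FSOD/FSIS the "examples" would be annotated object instances rather than whole-image labels, and the query is an image whose objects must all be classified against the sampled classes.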

< 3.2 MTFA: A non-incremental baseline approach >

Figure 2. Architecture of TFA and MTFA

- Extends the Two-Stage Fine-tuning (TFA) object detection method

- TFA: Faster R-CNN with a two-stage training scheme.
1. First stage: the network is trained on the base classes $C_{base}$
2. Second stage: feature extractor $F$ is frozen and only the prediction heads are trained on a dataset containing an equal number of $C_{base}$ and $C_{novel}$ classes.

- MTFA: extends TFA similarly to how Mask R-CNN extends Faster R-CNN: by adding a mask prediction branch at the RoI level.

- Cosine-similarity classifier: a cosine-similarity classifier $C$ is used as the RoI classifier to learn more discriminative per-class representatives. $C$ is a fully-connected layer (embeddings from $F$ -> classification scores $S$) parameterized by a weight matrix $W = [w_1, w_2, ..., w_c] \in \mathbb{R}^{e\times c}$ ($e$: the size of an embedding vector, $c$: the number of classes). The classification score is the cosine similarity $S_{i,j}=\frac{F(X)_i^T \cdot w_j}{\lVert F(X)_i \rVert \, \lVert w_j \rVert}$
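A minimal NumPy sketch of this cosine-similarity classifier; the scaling factor `alpha` is a common temperature term in cosine classifiers and is included here as an assumption (setting it to 1.0 recovers the plain formula above):

```python
import numpy as np

class CosineClassifier:
    """Scores (n, e) RoI embeddings against an (e, c) weight matrix W
    by cosine similarity, optionally scaled by a temperature alpha."""
    def __init__(self, W, alpha=1.0):
        self.W = W
        self.alpha = alpha

    def scores(self, X):
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)            # unit rows
        Wn = self.W / np.linalg.norm(self.W, axis=0, keepdims=True)  # unit columns
        return self.alpha * (Xn @ Wn)

# One RoI embedding scored against two classes (W = identity):
clf = CosineClassifier(np.eye(2))
s = clf.scores(np.array([[3.0, 0.0]]))
```

Normalizing both the embeddings and the weight columns bounds every score to [-1, 1], which is what pushes the per-class weight vectors toward discriminative directions rather than large magnitudes.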

 

< 3.3 iMTFA: Incremental MTFA >