[Paper Review] Incremental Few-shot Instance Segmentation (CVPR 2021)
- Paper Link: https://arxiv.org/abs/2105.05312
Abstract
- Current Few-Shot Instance Segmentation (FSIS) approaches do not facilitate flexible addition of novel classes
- present the first incremental approach to FSIS: iMTFA
- learn discriminative embeddings for object instances that are merged into class representatives
- storing embedding vectors rather than images solves the memory overhead problem
- match these class embeddings at the RoI-level using cosine similarity
1. Introduction
- Few-shot learning addresses the problem of learning with limited available data
- Few-shot learning: base classes (numerous training samples) + novel classes (scarce K samples). The goal is to train a system to correctly classify N classes: only the novel classes, or both novel and base classes jointly.
- Previous few-shot object detection (FSOD) and few-shot instance segmentation (FSIS) methods require long training procedures with both novel and base class samples
- This is impractical when we want to add novel classes to a trained network.
- Incremental few-shot learning: the addition of novel classes is independent of previous data.
- the first incremental few-shot instance segmentation method: iMTFA
- a two-stage training and fine-tuning approach based on Mask R-CNN
- first stage: train Mask R-CNN on the base classes
- second stage: the FC layers at the RoI level are re-purposed, transforming the fixed feature extractor into an Instance Feature Extractor (IFE) that produces discriminative embeddings aligned with the per-class representatives. These embeddings are subsequently used as weights inside a cosine-similarity classifier (see the sketch below)
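A minimal sketch of the resulting incremental step, assuming a hypothetical `ife` Instance Feature Extractor and the weight-matrix convention $W \in \mathbb{R}^{e\times c}$ defined in Section 3.2 below: the K embeddings of a novel class are normalized, averaged into a class representative, and appended as a new classifier weight.

```python
import torch
import torch.nn.functional as F_nn

def add_novel_class(ife, shots, W):
    """Append one novel class to the cosine classifier from K shots.

    ife   -- hypothetical Instance Feature Extractor: (K, 3, h, w) -> (K, e)
    shots -- tensor of K image crops of the novel class
    W     -- weight matrix of shape (e, c); column j represents class j
    """
    with torch.no_grad():
        emb = F_nn.normalize(ife(shots), dim=1)       # unit-length instance embeddings
        rep = F_nn.normalize(emb.mean(dim=0), dim=0)  # mean embedding = class representative
    return torch.cat([W, rep.unsqueeze(1)], dim=1)    # now (e, c + 1); no retraining needed
```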
- Advantages
1) Eliminates the need for extensive retraining procedures for new classes because these can be added incrementally
2) Mask predictor is class-agnostic. No mask labels are needed for the addition of novel classes
3) No performance drawbacks at test time.
- Contributions
1) the first incremental few-shot instance segmentation method: iMTFA, which outperforms the S.O.T.A for FSIS and incremental FSOD.
2) To compare incremental and non-incremental methods, an existing FSOD approach is extended to the instance segmentation task (MTFA), which also achieves S.O.T.A results.
2. Related Work
< Instance Segmentation >
- Grouping-based: per-pixel information that is post-processed to obtain instance segmentations
- Proposal-based: e.g., Mask R-CNN (RPN-based); these do not perform well with small training datasets.
< Few-shot Learning >
- enables models to accommodate new classes for which little training data is available
- episodic methodology: providing query items to be classified into N classes and a support set containing training examples of the N classes
- Optimization-based: a meta-learner learns how to quickly adapt a model to new tasks from few examples
- Metric-learning: learns a feature embedding such that objects from the same class are close in the embedding space and objects of different classes are far apart
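As a concrete illustration (not from the paper), a minimal PyTorch sketch of a metric-learning objective: a triplet loss enforces exactly this property on a stand-in embedding, pulling same-class pairs together and pushing different-class pairs apart.

```python
import torch
import torch.nn as nn

# Stand-in feature embedding; in practice this is the network backbone.
embed = nn.Linear(128, 64)
anchor, positive, negative = (embed(torch.randn(8, 128)) for _ in range(3))

# anchor/positive share a class; negative must stay at least `margin` farther away.
loss = nn.TripletMarginLoss(margin=1.0)(anchor, positive, negative)
loss.backward()
```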
< Few-shot object detection >
- TFA (ICML 2020): first trains Faster R-CNN on the base classes and then only fine-tunes the predictor heads
< Few-shot instance segmentation >
- few works
- most approaches provide guidance to certain parts of the Mask R-CNN architecture to ensure the network is better informed of the novel classes
- Meta R-CNN and Siamese Mask R-CNN: compute the embeddings of the support set and combine these with the feature map produced by the network backbone to provide additional information at a certain stage
- FGN: guides the RPN, RoI detector and mask upsampling layers with the support set feature embeddings
< Incremental few-shot object detection >
- ONCE (CVPR 2020): uses CenterNet as a backbone to learn a class-agnostic feature extractor and a per-class code generator network for novel classes
< Incremental few-shot instance segmentation >
- FGN and Siamese Mask R-CNN: depend on being passed examples of every class at test time (requiring a large amount of memory)
- Meta R-CNN can pre-compute per-class attention vectors, but requires retraining to handle a different number of classes
- iMTFA can incrementally add classes without retraining or requiring examples of base classes
3. Methodology
< 3.1 Formulation of few-shot instance segmentation >
- a set of base classes $C_{base}$ and a disjoint set of novel classes $C_{novel}$
- Goal: train a model that performs well on $C_{test} = C_{novel}$ or on $C_{test} = C_{base} \cup C_{novel}$
- Episodic-training methodology: a series of episodes $E_i=(I^q, S_i)$ where $S_i$ is a support set containing N classes from $C_{train} = C_{novel} \cup C_{base}$ along with K examples per class (N-way K-shot; see the sampling sketch at the end of this subsection)
- A network is then tasked to classify an image $I^q$, termed the query, into one of the classes in $S_i$.
- Solving a different classification task each episode leads to better generalization and results on $C_{novel}$
- extend to FSOD and FSIS: consider all objects in an image as queries and use a single support set per image instead of per query.
- The challenge of FSIS: not only to classify the query objects, but also to determine their localization and segmentation
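A minimal sketch of episode sampling under this formulation, assuming a hypothetical `dataset` mapping each class label to its list of images (all names here are illustrative, not from the paper):

```python
import random

def sample_episode(dataset, classes, n_way, k_shot):
    """Build one episode E_i = (I_q, S_i) for N-way K-shot training.

    dataset -- hypothetical dict: class label -> list of images
    classes -- list of labels in C_train = C_novel ∪ C_base
    """
    episode_classes = random.sample(classes, n_way)      # N classes for this episode
    support = {c: random.sample(dataset[c], k_shot)      # S_i: K examples per class
               for c in episode_classes}
    query_class = random.choice(episode_classes)
    query = random.choice([img for img in dataset[query_class]
                           if img not in support[query_class]])  # I_q, held out of S_i
    return query, support, query_class
```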
< 3.2 MTFA: A non-incremental baseline approach >
- Extends the Two-stage Fine-tuning Approach (TFA) for object detection
- TFA: Faster R-CNN with a two-stage training scheme.
1. First stage: the network is trained on the base classes $C_{base}$
2. Second stage: the feature extractor $F$ is frozen and only the prediction heads are trained on a small balanced dataset containing an equal number of samples from the $C_{base}$ and $C_{novel}$ classes (see the sketch below).
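A minimal sketch of the second stage, assuming a Mask R-CNN-style `model` with `backbone` and `roi_heads` attributes (as in torchvision's Mask R-CNN; the optimizer settings are assumptions):

```python
import torch

def setup_fine_tuning(model, lr=0.001):
    """Stage 2: freeze the feature extractor F; train only the prediction heads."""
    for p in model.backbone.parameters():
        p.requires_grad = False                      # F stays fixed
    head_params = [p for p in model.roi_heads.parameters() if p.requires_grad]
    return torch.optim.SGD(head_params, lr=lr, momentum=0.9)
```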
- MTFA: extends TFA similarly to how Mask R-CNN extends Faster R-CNN: by adding a mask prediction branch at the RoI level.
- Cosine-similarity classifier: a cosine-similarity classifier $C$ is used as the RoI classifier to learn more discriminative per-class representatives. $C$ is a fully-connected layer mapping embeddings from $F$ to classification scores $S$, parameterized by a weight matrix $W \in \mathbb{R}^{e\times c} = [w_1, w_2, ..., w_c]$ ($e$: the size of an embedding vector, $c$: the number of classes). The score for RoI $i$ and class $j$ is the cosine similarity $S_{i,j} = \dfrac{F(x)_i^{\top} w_j}{\lVert F(x)_i \rVert \, \lVert w_j \rVert}$
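A minimal PyTorch sketch of such a cosine-similarity head; the scaling factor `alpha` is an assumption borrowed from common cosine-classifier practice, not stated in the notes above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class CosineSimilarityClassifier(nn.Module):
    """RoI classifier: S[i, j] = alpha * cos(F(x)_i, w_j)."""

    def __init__(self, embed_dim, num_classes, alpha=20.0):
        super().__init__()
        # W in R^{e x c}: column j is the representative of class j
        self.W = nn.Parameter(torch.randn(embed_dim, num_classes))
        self.alpha = alpha  # assumed scaling factor for sharper softmax scores

    def forward(self, x):                  # x: (num_rois, e) embeddings from F
        x = F_nn.normalize(x, dim=1)       # unit-normalize RoI embeddings
        w = F_nn.normalize(self.W, dim=0)  # unit-normalize class representatives
        return self.alpha * x @ w          # (num_rois, c) classification scores
```

Because each column of `W` is just a unit vector in the embedding space, adding a class at test time reduces to appending a column, as in the sketch after Section 1.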
< 3.3 iMTFA: Incremental MTFA >