TY - JOUR
T1 - Online unsupervised video object segmentation via contrastive motion clustering
AU - Xi, Lin
AU - Chen, Weihai
AU - Wu, Xingming
AU - Liu, Zhong
AU - Li, Zhengguo
N1 - Funding Information: This work was supported in part by the National Natural Science Foundation of China under Grants 51975029 and U1909215, the Key Research and Development Program of Zhejiang Province under Grant 2021C03050, the Scientific Research Project of Agriculture and Social Development of Hangzhou under Grant 2020ZDSJ0881, and in part by the National Natural Science Foundation of China under Grants 61620106012 and 61573048.
PY - 2024/2/6
Y1 - 2024/2/6
N2 - Online unsupervised video object segmentation (UVOS) uses the previous frames as its input to automatically separate the primary object(s) from a streaming video without any further manual annotation. A major challenge is that the model has no access to future frames and must rely solely on the history, i.e., the segmentation mask is predicted from the current frame as soon as it is captured. In this work, a novel contrastive motion clustering algorithm with optical flow as its input is proposed for online UVOS by exploiting the common fate principle that visual elements tend to be perceived as a group if they share the same motion pattern. We build a simple and effective auto-encoder to iteratively summarize non-learnable prototypical bases for the motion patterns, while the bases in turn help the embedding network learn its representation. Further, a contrastive learning strategy based on a boundary prior is developed to improve the discrimination between foreground and background features in the representation learning stage. The proposed algorithm can be optimized on data of arbitrary scale (i.e., frame, clip, or dataset) and performed in an online fashion. Experiments on the DAVIS16, FBMS, and SegTrackV2 datasets show that the accuracy of our method surpasses the previous state-of-the-art (SoTA) online UVOS method by margins of 0.8%, 2.9%, and 1.1%, respectively. Furthermore, by using an online deep subspace clustering to tackle the motion grouping, our method achieves higher accuracy at a 3× faster inference speed than the SoTA online UVOS method, striking a good trade-off between effectiveness and efficiency. Our code is available at https://github.com/xilin1991/CluterNet.
AB - Online unsupervised video object segmentation (UVOS) uses the previous frames as its input to automatically separate the primary object(s) from a streaming video without any further manual annotation. A major challenge is that the model has no access to future frames and must rely solely on the history, i.e., the segmentation mask is predicted from the current frame as soon as it is captured. In this work, a novel contrastive motion clustering algorithm with optical flow as its input is proposed for online UVOS by exploiting the common fate principle that visual elements tend to be perceived as a group if they share the same motion pattern. We build a simple and effective auto-encoder to iteratively summarize non-learnable prototypical bases for the motion patterns, while the bases in turn help the embedding network learn its representation. Further, a contrastive learning strategy based on a boundary prior is developed to improve the discrimination between foreground and background features in the representation learning stage. The proposed algorithm can be optimized on data of arbitrary scale (i.e., frame, clip, or dataset) and performed in an online fashion. Experiments on the DAVIS16, FBMS, and SegTrackV2 datasets show that the accuracy of our method surpasses the previous state-of-the-art (SoTA) online UVOS method by margins of 0.8%, 2.9%, and 1.1%, respectively. Furthermore, by using an online deep subspace clustering to tackle the motion grouping, our method achieves higher accuracy at a 3× faster inference speed than the SoTA online UVOS method, striking a good trade-off between effectiveness and efficiency. Our code is available at https://github.com/xilin1991/CluterNet.
KW - clustering methods
KW - image motion analysis
KW - object segmentation
KW - optical flow
KW - self-supervised learning
KW - unsupervised learning
UR - http://www.scopus.com/inward/record.url?scp=85163518869&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2023.3288878
DO - 10.1109/TCSVT.2023.3288878
M3 - Article
AN - SCOPUS:85163518869
VL - 34
SP - 995
EP - 1006
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
SN - 1051-8215
IS - 2
ER -