MoCo: Momentum Contrast for Unsupervised Visual Representation Learning (CVPR 2020)

This post covers only the core concepts of the paper.

MoCo is an unsupervised visual representation learning method (self-supervised learning, SSL)
- Without depending on image labels, it can train an encoder (i.e. a feature extractor) that extracts good features from a large number of images
- The trained encoder can then be fine-tuned for the target downstream task
Among the many SSL approaches MoCo uses contrastive learning, which it views as a dictionary look-up
Compared with earlier contrastive SSL methods, MoCo enlarges the dictionary while minimizing the gap between the examples stored in it (positive and negative), i.e. it keeps the keys consistent with each other

q : an encoded query, the encoding vector of the sample being trained on
$\{k_0, k_1, k_2, ...\}$ : key dictionary, the set of encoding vectors to compare against (in dictionary form)
- Exactly one key in the dictionary matches q ($k_+$): the positive key
- All other keys are negative keys (one vs all)
- Therefore the loss function (contrastive loss, InfoNCE) is low when q is similar to the positive key and dissimilar to all other keys
- This formulation can be seen as a general definition of contrastive SSL: the inputs $x^q$ and $x^k$ (which are encoded into q and k) can be images, patches, etc.
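
To make the loss concrete, here is a minimal sketch of InfoNCE for a single query, $\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_i \exp(q \cdot k_i / \tau)}$, written with the standard library only. The function name `info_nce`, the toy vectors, and the temperature default `tau=0.07` are assumptions for illustration and are not taken from the notes above; the vectors are assumed to be L2-normalized.

```python
import math

def info_nce(q, k_pos, k_negs, tau=0.07):
    """InfoNCE loss for one query; all vectors are assumed L2-normalized."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # Logit 0 belongs to the positive key, the rest to the negative keys.
    logits = [dot(q, k_pos) / tau] + [dot(q, k) / tau for k in k_negs]
    # Numerically stable log-sum-exp over all logits.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    # Cross-entropy with the positive key as the "correct class".
    return log_sum - logits[0]

q = [1.0, 0.0]
# Positive key aligned with q: the loss is close to zero.
loss_good = info_nce(q, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# A negative key is the one aligned with q: the loss is large.
loss_bad = info_nce(q, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

The loss is simply a (K+1)-way softmax cross-entropy in which the positive key plays the role of the correct class, which is why it falls when q moves toward $k_+$ and away from the other keys.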
Pretext task actually used : instance discrimination
- A (query, key) pair encoded from the same image is a positive pair, a pair from different images is a negative pair, and the two cases are told apart as 0 / 1
- Two views of the same image are generated by random data augmentation, and each view is then encoded
Encoder architecture : ResNet (the last FC layer outputs a 128-d vector)
Data augmentation (each applied with randomness)
- 224 × 224 crop taken from a randomly resized image
- color jittering (randomly changes brightness, saturation, etc.)
- random horizontal flip
- random grayscale conversion
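
As a toy illustration of how two views of one image arise, the sketch below applies only the flip and grayscale steps to a tiny image stored as rows of (r, g, b) tuples. The helper `random_view`, the probabilities 0.5 and 0.2, and the 2 × 2 image are assumptions for illustration; cropping and color jittering are left out, and a real pipeline would use an image library instead.

```python
import random

def random_view(image, rng):
    """Return one randomly augmented copy of `image` (rows of (r, g, b) tuples)."""
    view = [list(row) for row in image]           # copy, so the input stays untouched
    if rng.random() < 0.5:                        # random horizontal flip
        view = [row[::-1] for row in view]
    if rng.random() < 0.2:                        # random grayscale conversion
        view = [[(sum(p) // 3,) * 3 for p in row] for row in view]
    return view

rng = random.Random(0)
image = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (10, 20, 30)]]
x_q = random_view(image, rng)   # view fed to the query encoder
x_k = random_view(image, rng)   # view fed to the key encoder (positive pair with x_q)
```

Both views come from the same source image, so their encodings form the positive pair; views of any other image serve as negatives.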

: Method detail

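The core of this method is to keep a dictionary and pick contrastive learning examples through a query-key structure
The key assumption is that good features can be learned from a large dictionary containing a rich set of negative samples
- But does this assumption also hold for binary classification?
- In a multi-class image dataset with many classes, negative samples from diverse classes help pin down the true class, much like a process of elimination
- In the binary case it is doubtful that this works as well as in the multi-class case. In theory a smaller dictionary should be enough, since the answer is either one or the other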

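The dictionary, i.e. the set of samples used for contrastive learning, is maintained as a queue (FIFO)
The dictionary size is kept larger than the mini-batch size, and the dictionary is updated at every mini-batch
The oldest mini-batch is removed first, since it is the least consistent with the newly added examples
- Are keys added to the dictionary one mini-batch at a time? Then (dictionary size) = (mini-batch size) × N? (In the paper the current mini-batch is enqueued and the oldest one is dequeued, so a full queue does hold the keys of N consecutive mini-batches.)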
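
The queue behaviour described above can be sketched with `collections.deque`. The class name `KeyQueue`, the tiny sizes, and the `(step, index)` tuples standing in for key vectors are assumptions for illustration only.

```python
from collections import deque

class KeyQueue:
    """FIFO dictionary of keys that holds at most `max_keys` entries."""

    def __init__(self, max_keys):
        # A deque with maxlen silently drops the oldest items once it is full.
        self.keys = deque(maxlen=max_keys)

    def enqueue(self, batch_keys):
        # Adding the newest mini-batch pushes the oldest keys out.
        self.keys.extend(batch_keys)

queue = KeyQueue(max_keys=6)                       # room for 3 mini-batches of size 2
for step in range(5):
    queue.enqueue([(step, i) for i in range(2)])   # (step, index) stands in for a key vector

print(list(queue.keys))
```

After five steps only the keys from steps 2, 3, and 4 remain, so the dictionary size equals (mini-batch size) × 3 here and is independent of how many steps have run.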

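The queue lets the dictionary be large (why?), but it makes training the key encoder by back-prop impractical, because the gradient would have to propagate to every sample in the queue
To resolve this issue, the concept of a momentum update is introduced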

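The only parameters updated by back-prop are those of the query encoder ($\theta_q$)
The key encoder is updated as a linear combination of its previous parameters and the query encoder's, $\theta_k \leftarrow m\theta_k + (1 - m)\theta_q$, which has a smoothing effect compared with updating the key encoder directly by back-prop
New key encoding vectors are pushed into the queue at every mini-batch, and the key encoder is also updated in between, so each stored mini-batch was encoded by a different encoder
As a result, even keys that actually belong to the same class may be encoded differently
The momentum update reduces this difference. The paper reports that m = 0.999 gives better results than m = 0.9 (that is, the smaller each update the better, although the key encoder still has to evolve little by little)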
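
A minimal sketch of the momentum update on plain lists of parameters, to show how much less the key encoder moves per step with m = 0.999 than with m = 0.9. The function name `momentum_update` and the two-element parameter vectors are assumptions for illustration.

```python
def momentum_update(theta_k, theta_q, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]

theta_q = [1.0, -2.0]   # pretend the query encoder was just updated by back-prop
theta_k = [0.0, 0.0]    # current key encoder parameters

slow = momentum_update(theta_k, theta_q, m=0.999)   # moves 0.1% of the way toward theta_q
fast = momentum_update(theta_k, theta_q, m=0.9)     # moves 10% of the way toward theta_q
```

With the larger m, key encoders at neighbouring steps are nearly identical, so keys enqueued at different steps stay comparable.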

: The answer to why the queue structure allows a large dictionary, and the contribution of this paper

(a) end-to-end

current mini-batch = dictionary (so the key samples always come from the same encoder)
However, the dictionary size depends heavily on the mini-batch size, which in turn is limited by GPU memory
This is a serious problem for SSL tasks that call for a larger dictionary, such as pretext tasks that rely on local positions

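(b) memory bank

The representations of the entire dataset are stored in a memory bank, and keys are sampled from it at every training step (w/o back-prop)
However, an entry is updated only when its sample is drawn, so the keys in the memory bank were produced by encoders from many different steps and are therefore less consistent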

(c) MoCo

It essentially adopts the memory bank approach, but uses the momentum update to narrow the encoder gap and allow a larger dictionary
Narrowing the encoder gap is the key factor that made it possible to grow the dictionary
