Exploiting the Intrinsic Neighborhood Structure for Source-free Domain Adaptation (NeurIPS 2021)

숨니야 2024. 1. 24. 08:36

This post reviews the paper Exploiting the Intrinsic Neighborhood Structure for Source-free Domain Adaptation.

My personal commentary is written in blue. Discussion of these points is welcome :)

Motivation

Figure 1. t-SNE visualization of target features by source model

The paper addresses closed-set source-free domain adaptation.
The direct motivation is that, as the figure above shows, target-domain data cannot be cleanly separated by the source-domain classifier's decision boundary, yet the target features extracted by the source model still form class clusters.

Method

Overview: Neighborhood Reciprocal Clustering (NRC)

The most direct way to exploit the observation from the Motivation section is to find nearest neighbors in feature space and map them to the same class.
However, this does not hold for all data:
- The graph below shows, as a function of k in k-NN, the fraction of cases in which the predicted label matches the correct label.
- The blue curve shows that even when only the closest samples (k=1) are selected, only 75% have classifier predictions that match the correct label, which suggests that the nearest neighbors themselves need to be filtered.
- To address this, the paper introduces reciprocal nearest neighbors (RNN). These turn out to be higher-quality samples whose predictions match the correct label more often than plain nearest neighbors (NN) (black curve).
  - k-reciprocal nearest neighbor: an extension of k-nearest neighbor in which two samples each have the other among their k-NN
- The core idea of the paper is therefore to separate RNN samples from non-RNN samples (nRNN) and to apply differently weighted supervision to each.
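The reciprocal check described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the helper names `knn_indices` and `reciprocal_mask` are hypothetical, and cosine similarity is assumed as the neighbor metric.

```python
import numpy as np

def knn_indices(features, k):
    """Return (n, k) indices of each row's k nearest neighbors (cosine similarity, self excluded)."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a sample is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def reciprocal_mask(nn_idx):
    """mask[i, j] is True when neighbor nn_idx[i, j] also has i among its own k-NN."""
    n, _ = nn_idx.shape
    rows = np.arange(n)[:, None]
    is_nn = np.zeros((n, n), dtype=bool)
    is_nn[rows, nn_idx] = True        # is_nn[a, b]: b is a k-NN of a
    return is_nn.T[rows, nn_idx]      # is i a k-NN of nn_idx[i, j]?
```

With such a mask, the per-neighbor weights would be `np.where(mask, 1.0, r)`, i.e. full weight for RNNs and a small weight `r` for nRNNs.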

Learning Objective

Assuming the source model is well trained, the model is adapted with the following learning objective:

$n_t$ : number of target samples
$D_t$ : target data distribution (target dataset)
$Neigh(x_i)$ : nearest neighbors of $x_i$
$p_i$ : softmax probability of $x_i$
$D_{sim}(\cdot, \cdot)$ : similarity between data
$D_{dis}(\cdot, \cdot)$ : dissimilarity between data

In short, the objective maximizes the similarity between the softmax outputs of nearest neighbors, with each pair weighted by the inverse of its dissimilarity (a constant measuring the semantic distance).
The objective above is only an overview for conceptual understanding. To realize the notions of $D_{sim}$ and $D_{dis}$, four losses are defined as below, and their sum is used as the final loss function.
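For reference, the overview objective can be written out from the symbols defined above. This is a reconstruction under assumptions rather than a copy of the original equation: the exact normalization (for example a $1/n_t$ factor) is not recoverable from the text here, and the unweighted sum in the second line simply follows the statement that the four losses are added.

$$\mathcal{L} = -\sum_{x_i \in D_t} \sum_{x_j \in Neigh(x_i)} \frac{D_{sim}(p_i, p_j)}{D_{dis}(x_i, x_j)}$$

$$\mathcal{L}_{final} = \mathcal{L}_{\mathcal{N}} + \mathcal{L}_{self} + \mathcal{L}_{div} + \mathcal{L}_{E}$$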

Actual learning objective (Final learning objective)

1. Neighbor Affinity for Class-Consistency ($\mathcal{L}_{\mathcal{N}}$) → a loss that accounts for similarity

Affinity is defined as the concept corresponding to 1 / $D_{dis}$ in the overview above.
Two memory banks
- $F$ : target feature bank
- $S$ : corresponding prediction score bank
- At every training mini-batch, the memory banks are updated with the current mini-batch before the training step

$\mathcal{L}_{\mathcal{N}}$ function
- $A_{ik}$ : the affinity value of the k-th nearest neighbor of feature $z_i$
  - 1 for an RNN, r (=0.1, a hyperparameter) for an nRNN
- $N^{i}_K$ : index set of the K nearest neighbors of feature $z_i$ (the indices refer to fixed positions in the memory bank, so each sample keeps the same index throughout training)
- $S_k$ : the k-th item in memory bank $S$ (a softmax probability)
- In other words, this function weights the similarity between predictions ($D_{sim}$ in the overview) by the affinity
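A minimal NumPy sketch of this loss, assuming it takes the form $\mathcal{L}_{\mathcal{N}} = -\frac{1}{n}\sum_i \sum_{k \in N^i_K} A_{ik}\, S_k^\top p_i$ (affinity-weighted dot products between stored neighbor predictions and the current prediction). The function name and argument layout are hypothetical, and no gradients are computed here.

```python
import numpy as np

def neighbor_affinity_loss(probs, score_bank, nn_idx, recip_mask, r=0.1):
    """Assumed form: -(1/n) * sum_i sum_k A_ik * (S_k . p_i).

    probs:      (n, C) current softmax outputs p_i
    score_bank: (N, C) stored predictions S, treated as constants
    nn_idx:     (n, K) bank indices of each sample's K nearest neighbors
    recip_mask: (n, K) True where the neighbor is reciprocal (RNN)
    """
    affinity = np.where(recip_mask, 1.0, r)                 # A_ik: 1 for RNN, r for nRNN
    neighbor_scores = score_bank[nn_idx]                    # (n, K, C)
    dots = np.einsum("nkc,nc->nk", neighbor_scores, probs)  # S_k . p_i for every neighbor
    return -(affinity * dots).sum(axis=1).mean()
```

In an autograd framework the bank entries would be detached, so only `probs` receives gradients.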

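2. Self-regularization ($\mathcal{L}_{self}$) → a loss that reinforces each sample's own prediction (its "self", so to speak)

This is the neighbor affinity loss ($\mathcal{L}_{\mathcal{N}}$) above with the affinity removed → that is, the dot product between predictions
It is designed to reduce the impact of noisy neighbors (samples that are RNNs but belong to a different class)
Because the negative of this term is minimized, it can be read as giving more weight to the current prediction
Here $S_i$ is a constant vector and $p_i$ is a variable, and the two have the same value (∵ the memory bank is updated before training on the mini-batch)
- Therefore, the loss back-propagates only through $p_i$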
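The constant-versus-variable distinction can be made concrete with a small sketch. It assumes the form $\mathcal{L}_{self} = -\frac{1}{n}\sum_i S_i^\top p_i$ and writes the gradient by hand; the function name is hypothetical.

```python
import numpy as np

def self_regularization(probs, bank_scores):
    """Return (loss, gradient w.r.t. probs) for the assumed L_self = -(1/n) * sum_i S_i . p_i."""
    n = probs.shape[0]
    loss = -(bank_scores * probs).sum(axis=1).mean()
    # S_i is a constant copy of p_i, so the gradient is -S_i / n,
    # not the -2 * p_i / n that differentiating p_i . p_i would give.
    grad = -bank_scores / n
    return loss, grad
```

In a deep learning framework the same effect would come from detaching the bank entry before taking the dot product.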

3. Prediction Diversity Loss ($\mathcal{L}_{div}$)

A loss that prevents the degenerate solution in which the model predicts every sample as one particular class, or fails to predict some classes at all
$C$ = # of source classes = # of target classes (closed-set problem)
$p^{(c)}_i$ : score of the $c^{th}$ class
$\bar{p}_c$ : empirical label distribution, i.e., the mean of $p^{(c)}_i$ over the target samples
In terms of the formula, it pushes the averaged softmax distribution toward uniform (1/C)
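As a rough sketch, assuming the loss is the KL divergence between the empirical label distribution and the uniform distribution, $\mathcal{L}_{div} = \sum_{c=1}^{C} \bar{p}_c \log(\bar{p}_c \cdot C)$, it could be computed as follows. The function name and the `eps` stabilizer are assumptions for illustration.

```python
import numpy as np

def diversity_loss(probs, eps=1e-8):
    """KL(mean prediction || uniform): near zero when classes are used evenly, log(C) on full collapse."""
    mean_prob = probs.mean(axis=0)      # empirical label distribution \bar{p}_c
    num_classes = probs.shape[1]        # C
    return float(np.sum(mean_prob * np.log(mean_prob * num_classes + eps)))
```

A batch that spreads its predictions evenly over classes gives a value near 0, while a batch collapsed onto a single class gives roughly $\log C$.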

4. Expanded Neighborhood Affinity ($\mathcal{L}_E$)

Considering more nearest neighbors yields more information, but there is a trade-off: the more neighbor data points, the more likely the information becomes mixed
Neighbors belonging to several classes get included, which makes it hard to decide which class to map to (class consistency is harmed)
The paper therefore proposes expanding the neighborhood with M-nearest neighbors
This does not mean replacing the k-nearest neighbors with k'-nearest neighbors (k'=k+M); it means considering "the M-NNs of the k-NNs"
As for the difference (calling the sample being classified the ego-sample): by also considering the M-NNs of each neighbor in the k-NN, more examples can be taken into account

$E_{M}(z_i)$ : the expanded neighbors of feature $z_i$, defined as below

Here $E_{M}(z_i)$ is an index set (like $N^i_K$), and the ego-sample $i$ must not belong to it
In the neighbor affinity loss ($\mathcal{L}_{\mathcal{N}}$), neighbors were weighted by affinity; since these neighbors are farther away than the k-NNs, they are all given the same value r (kept small; the Experiment section reports that 0.1 worked best)
- With the affinity fixed to r in this formula, one might think the importance of expanded neighbors (a graded weight favoring closer samples) is not reflected. However, in the process of finding the M-NNs of the k-NNs, the same sample can be an M-neighbor of several different k-neighbors
- Therefore $E_{M}(z_i)$ contains duplicate indices (samples), and even with identical r values, the frequency within the index set provides this graded weighting
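The frequency-based weighting can be seen in a short sketch. It assumes the loss has the form $\mathcal{L}_E = -\frac{1}{n}\sum_i \sum_{m \in E_M(z_i)} r\, S_m^\top p_i$ and, for simplicity, that row `i` of `probs` corresponds to bank index `i`. Both function names are hypothetical.

```python
import numpy as np

def expanded_neighbors(nn_idx_k, nn_idx_m, i):
    """Indices of the M-NNs of each K-NN of sample i; duplicates kept, ego-sample i dropped."""
    expanded = nn_idx_m[nn_idx_k[i]].ravel()   # (K*M,) indices, possibly repeated
    return expanded[expanded != i]

def expanded_affinity_loss(probs, score_bank, nn_idx_k, nn_idx_m, r=0.1):
    """Assumed form: -(1/n) * sum_i sum_{m in E_M(z_i)} r * (S_m . p_i)."""
    total = 0.0
    for i in range(probs.shape[0]):
        idx = expanded_neighbors(nn_idx_k, nn_idx_m, i)
        # A repeated index contributes once per occurrence, so frequent
        # expanded neighbors effectively receive a larger weight.
        total += r * (score_bank[idx] @ probs[i]).sum()
    return -total / probs.shape[0]
```

Passing the index array through `np.unique` before the sum would correspond to the duplicate-free variant.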

Algorithm

Experiment

Datasets
- 2D image benchmark dataset
- 3D point cloud recognition dataset
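  - A 3D point cloud represents a space as a large collection of points carrying positional information, gathered by a 3D object-sensing sensor, in order to visualize 3D spatial information

Evaluation: comparison of source-present and source-free methods (SF: Source-free)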

2D image

Office-31

Performance is lower than that of source-present methods, but not far behind considering that the method is source-free
Among source-free methods, it is roughly on par with 3C-GAN

Office-Home

It outperforms other methods on most adaptation tasks

VisDA

It outperforms the source-free method SHOT and the source-present method RWOT by 3% and 1.9% on average, respectively

3D point cloud

PointDA-10

It outperforms even source-present methods by more than 4%

Ablation study

(left, middle) The effectiveness of each loss is verified by removing the losses one at a time
The notation $\mathcal{L}_{\hat{E}}$ denotes $\mathcal{L}_{E}$ computed after removing duplicate samples from $E_{M}(z_i)$ (left, middle)
- Comparing the last row with the row above it, keeping duplicates so that frequency-based importance is reflected is more effective ($\mathcal{L}_{E}$ > $\mathcal{L}_{\hat{E}}$)
- However, performance improves when the high affinity (A) is also considered, compared with using the expanded M neighbors alone
- Not considering affinity amounts to weighting all M neighbors equally, which supports the earlier point that more neighbors can turn into noise
(right) Comparison between a k-NN with k simply enlarged and the expanded M neighbors

(left, middle) Performance difference with and without self-regularization on each dataset
(right) Performance as a function of r, the affinity weight for nRNNs

License: Attribution, NonCommercial, ShareAlike