[Paper Review] Self-training with Noisy Student improves ImageNet classification

Feb 08, 2025

Contents

1. Introduction
2. NoisyStudent: Iterative Self-training with Noise
3. Experiments
4. Ablation Study: The Importance of Noise in Self-training
5. Related works
6. Conclusion

Train an EfficientNet → use it as the teacher to generate pseudo labels → train a larger EfficientNet as the student on the original labels + pseudo labels

Repeat the process above, using the student as the new teacher

The teacher is not noised → because the pseudo labels should be as close as possible to the original labels

The student must generalize better than the teacher → so dropout, stochastic depth, data augmentation and similar noise are applied to it

💡

[Dropout vs. stochastic depth]

Dropout

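Randomly removes individual nodes (neurons)
Prevents overfitting

stochastic depth

Randomly removes an entire residual block (the whole group of layers)
Removal happens block by block → the network's depth is adjusted dynamically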
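
To make the contrast concrete, here is a minimal pure-Python sketch of both kinds of noise. The function names, the toy `double` branch, and the inverted-dropout scaling are my own assumptions for illustration, not code from the paper.

```python
import random

def dropout(activations, drop_prob, rng):
    """Zero each unit independently with probability drop_prob (inverted dropout)."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

def residual_block(x, block_fn, survival_prob, rng, training=True):
    """Stochastic depth: drop the whole residual branch with prob 1 - survival_prob."""
    if training:
        if rng.random() < survival_prob:
            return [xi + bi for xi, bi in zip(x, block_fn(x))]
        return list(x)  # branch dropped: only the identity path remains
    # At inference every block is kept; the branch is scaled by its survival probability.
    return [xi + survival_prob * bi for xi, bi in zip(x, block_fn(x))]

rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]
double = lambda v: [2.0 * vi for vi in v]  # hypothetical residual branch

dropped = dropout(x, 0.5, rng)                                    # unit-level noise
out_train = residual_block(x, double, 0.8, rng)                   # block-level noise
out_eval = residual_block(x, double, 0.8, rng, training=False)    # no noise at inference
```

Dropout perturbs individual entries of one layer's output, while stochastic depth removes or keeps the whole branch, so the effective depth changes from step to step.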

1. Introduction

Existing models are trained only on labeled images → the much larger pool of unlabeled images could not be used to raise accuracy and robustness

This work introduces a method that uses unlabeled images to improve SOTA accuracy

self-training framework

Train a teacher model on labeled images
Use the teacher to generate pseudo labels for the unlabeled images
Train a student model on the labeled images + pseudo-labeled images

The teacher model is not noised, while the student model has to be noised to perform well

The pseudo labels should be as accurate as possible, and the student should learn as hard as it can from them

NoisyStudent: Dropout, stochastic depth, and data augmentation are used to noise the student

2. NoisyStudent: Iterative Self-training with Noise

Function: the model's prediction

Function: the loss function

Input: labeled images & unlabeled images

labeled images: n of them
unlabeled images: m of them

Train the teacher model: uses the labeled images with a standard cross-entropy loss

with noise

The teacher model generates pseudo labels for the unlabeled images

soft: a continuous distribution
hard: a one-hot distribution (one entry is 1, all others are 0)
without noise
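
A small sketch of the two label types, assuming the teacher outputs raw logits. The helper names and the example logits are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_pseudo_label(logits):
    """Soft label: the teacher's full (continuous) class distribution."""
    return softmax(logits)

def hard_pseudo_label(logits):
    """Hard label: one-hot vector at the teacher's most likely class."""
    probs = softmax(logits)
    top = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == top else 0.0 for i in range(len(probs))]

teacher_logits = [2.0, 0.5, -1.0]  # hypothetical un-noised teacher output
soft = soft_pseudo_label(teacher_logits)
hard = hard_pseudo_label(teacher_logits)
```

The soft label keeps the teacher's relative confidence across classes; the hard label throws that away and keeps only the argmax.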

Train the student model: minimize the cross-entropy loss on the labeled images and the unlabeled images

with noise
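
Written out, the student objective described above looks roughly like this. The symbols are my assumption, since the notes do not preserve the original notation: \(f\) is the model, \(f^{\mathrm{noised}}\) the model with noise applied, \(\ell\) the cross-entropy loss, \(\theta_t\) and \(\theta_s\) the teacher and student parameters, and \(\tilde{x}_j\), \(\tilde{y}_j\) the unlabeled images and their pseudo labels.

```latex
\frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i,\; f^{\mathrm{noised}}(x_i;\theta_s)\big)
\;+\;
\frac{1}{m}\sum_{j=1}^{m} \ell\big(\tilde{y}_j,\; f^{\mathrm{noised}}(\tilde{x}_j;\theta_s)\big),
\qquad
\tilde{y}_j = f(\tilde{x}_j;\theta_t)
```

The first sum runs over the n labeled images and the second over the m unlabeled images; only the pseudo labels \(\tilde{y}_j\) come from the un-noised teacher.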

To generate new pseudo labels and train a new student, put the current student back in as the teacher → repeat
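
The loop above can be sketched on a toy problem. This is only an illustration of the control flow: the "model" here is a 1-D nearest-centroid classifier and Gaussian input jitter stands in for RandAugment, dropout, and stochastic depth. All names and data are hypothetical, and teacher and student have the same capacity, unlike the larger student used in the paper.

```python
import random

def train_centroids(xs, soft_labels):
    """Fit a toy 1-D two-class model: one label-weighted centroid per class."""
    centroids = []
    for c in (0, 1):
        weight = sum(p[c] for p in soft_labels)
        centroids.append(sum(x * p[c] for x, p in zip(xs, soft_labels)) / weight)
    return centroids

def pseudo_label(centroids, x):
    """Hard (one-hot) pseudo label from the un-noised teacher."""
    nearest = 0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1
    return [1.0, 0.0] if nearest == 0 else [0.0, 1.0]

def add_noise(xs, noise_std, rng):
    """Input jitter standing in for the paper's noise sources."""
    return [x + rng.gauss(0.0, noise_std) for x in xs]

def noisy_student(labeled, unlabeled, iterations, noise_std, rng):
    xs = [x for x, _ in labeled]
    ys = [y for _, y in labeled]
    teacher = train_centroids(add_noise(xs, noise_std, rng), ys)   # 1. teacher on labeled data
    for _ in range(iterations):
        pseudo = [pseudo_label(teacher, x) for x in unlabeled]      # 2. teacher labels, no noise
        noised = add_noise(xs + list(unlabeled), noise_std, rng)    # 3. noise on the student side
        student = train_centroids(noised, ys + pseudo)
        teacher = student                                           # 4. student becomes teacher
    return teacher

rng = random.Random(0)
labeled = [(-1.0, [1.0, 0.0]), (1.0, [0.0, 1.0])]
unlabeled = [rng.gauss(-2.0, 0.5) for _ in range(50)] + [rng.gauss(2.0, 0.5) for _ in range(50)]
final_model = noisy_student(labeled, unlabeled, iterations=3, noise_std=0.1, rng=rng)
```

Note where the noise goes: the teacher labels clean inputs, and only the training inputs are jittered.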

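This resembles the approach used in semi-supervised learning, BUT there are differences

More noise is added to the student
The student model is as large as or larger than the teacher

These differences are also what separates the method in this paper from Knowledge Distillation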

Knowledge Distillation

Adding noise is not a central concern
Uses a model smaller than the teacher, for faster inference
Train a large teacher model → define a small student model → the teacher's predicted probability distribution serves as the soft (continuous) label → train the student model to mimic it

The NoisyStudent method instead puts the student model in a harder environment, in terms of noise, so that it ends up better than the teacher

Noising Student

When the student is deliberately noised, it is trained to stay consistent with the stronger, un-noised teacher
input noise

RandAugment

Learns an invariance constraint (invariance: a translated image must still be classified into the same category)
Pushes the student model beyond simply imitating the teacher model, so that it can also predict harder images

model noise

Dropout
Stochastic depth
Makes the teacher model behave like an ensemble when it generates pseudo labels
The student model, in contrast, behaves like a single model
In other words, the student model is made to mimic a more powerful ensemble → the student can generalize better and perform better

Applying noise to the unlabeled data → has the benefit of enforcing local smoothness of the decision function on both labeled and unlabeled data (i.e. it helps the model generalize)

Other Techniques

data filtering

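Images on which the teacher model has low confidence are filtered out

balancing

So that the unlabeled images match the training set, the number of unlabeled images is adjusted to be similar across all classes, as it is for the labeled images
Classes without enough images therefore have their images duplicated

Both soft & hard pseudo labels worked well, but soft pseudo labels performed slightly better on out-of-domain unlabeled data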

3. Experiments

The method performs well on robustness datasets and under adversarial attack

💡

[Robustness and adversarial attack]

3.1 Experiment Details

Labeled dataset

Unlabeled dataset

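JFT dataset: it has labels, but they are ignored and the images are treated as unlabeled data
Data filtering and balancing are applied
Train EfficientNet-B0 → keep only images whose confidence is above 0.3 (data filtering) → for each class, select at most 130K images in descending order of confidence → classes with fewer than 130K images are padded to 130K by duplication (balancing)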
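
The filtering and balancing pipeline can be sketched as follows. The function name and the input format are my own assumptions. The demo uses a per-class target of 2 instead of 130K, and duplicates are taken cyclically for simplicity since the notes do not say how they are chosen.

```python
from collections import defaultdict

def filter_and_balance(predictions, min_confidence, per_class):
    """predictions: (image_id, class_id, confidence) triples produced by the teacher."""
    by_class = defaultdict(list)
    for image_id, class_id, confidence in predictions:
        if confidence > min_confidence:              # data filtering
            by_class[class_id].append((confidence, image_id))

    balanced = {}
    for class_id, items in by_class.items():
        items.sort(reverse=True)                     # most confident first
        kept = [image_id for _, image_id in items[:per_class]]
        available = len(kept)
        i = 0
        while len(kept) < per_class:                 # balancing: duplicate existing images
            kept.append(kept[i % available])
            i += 1
        balanced[class_id] = kept
    return balanced

teacher_predictions = [
    ("img_a", 0, 0.9), ("img_b", 0, 0.5), ("img_c", 0, 0.2),
    ("img_d", 1, 0.4), ("img_e", 0, 0.7),
]
selected = filter_and_balance(teacher_predictions, min_confidence=0.3, per_class=2)
```

Here class 0 keeps its two most confident images, and class 1 has only one image above the threshold, so that image is duplicated. A class with no image above the threshold simply does not appear in the result.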

Architecture

EfficientNets
EfficientNet-L2 is obtained by scaling up EfficientNet-B7 (wider & deeper)
Training does take about 5 times longer than before scaling up, though…

Training details

batch size = 2048
Student models larger than EfficientNet-B4 are trained for 350 epochs
Smaller student models are trained for 700 epochs
A large batch size is used for unlabeled images (to use as many of them as possible)
The losses on labeled images and unlabeled images are combined into an average cross-entropy loss
train-test resolution discrepancy

A problem that arises when the image resolution used at train time differs from the one used at test time
Training usually uses a small resolution for speed, while testing uses a higher resolution to maximize accuracy → the mismatch between the resolution seen in training and the one at test time can degrade performance (the discrepancy)
This work does normal training at a small resolution (350 epochs) & fine-tuning at a larger resolution (1.5 epochs)

Noise

stochastic depth, dropout, RandAugment
stochastic depth

survival probability: 0.8 for the final layer, with the linear decay rule for the remaining layers
RandAugment: applies two random operations with the magnitude set to 27
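
For reference, the linear decay rule is usually stated as p_l = 1 - (l / L) * (1 - p_L), where l is the block index, L the number of blocks, and p_L the survival probability of the last block. A minimal sketch under that assumption, with 1-based block indexing as my own choice:

```python
def survival_probabilities(num_blocks, final_survival=0.8):
    """Linear decay: early blocks almost always survive, the last survives with final_survival."""
    return [
        1.0 - (block / num_blocks) * (1.0 - final_survival)
        for block in range(1, num_blocks + 1)
    ]

probs = survival_probabilities(4)  # roughly 0.95, 0.90, 0.85, 0.80
```

The probabilities fall off evenly with depth, so the deepest blocks are the ones dropped most often.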

Iterative training

The best model came from running the NoisyStudent algorithm for 3 iterations
EfficientNet-B7 as teacher → EfficientNet-L2 as student (unlabeled batch size set to 14 times the labeled batch size) → train a new EfficientNet-L2 model with that EfficientNet-L2 model as teacher (2nd) → repeat once more (unlabeled batch size set to 28 times the labeled batch size) (3rd)

3.2 ImageNet Results

High accuracy

Compared with other models, it needs fewer labeled images and fewer unlabeled images as well → the data is easier to collect

It also has far fewer parameters than other models (FixRes ResNeXt-101 WSL)

Model size study: NoisyStudent for EfficientNet B0-B7 without Iterative Training

To test whether the NoisyStudent method is effective across EfficientNet models, it was run on every size from B0 to B7 (without iteration)
RandAugment applied
The unlabeled batch size is set to 3 times the labeled batch size
Without iteration, running the NoisyStudent algorithm once on EfficientNet improved performance at every size, as the table below shows

3.3 Robustness Results on ImageNet-A, ImageNet-C and ImageNet-P

C, P: contain corruptions and perturbations such as blurring, fogging, rotation, and scaling

Evaluated on images at two resolutions, (224, 224) and (299, 299)
P: mean flip rate = 14.2 at (224, 224) // mean flip rate = 12.2 at (299, 299)
The model was not deliberately optimized for robustness yet performed well, which makes this a surprising result
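
As a rough idea of what a flip rate measures: for a sequence of gradually perturbed frames of the same image, count how often the predicted label changes between adjacent frames. The sketch below computes only this raw flip probability. As far as I understand, the reported mean flip rate additionally normalizes by a baseline model and averages over perturbation types, which is not reproduced here. The function name and the data are made up.

```python
def flip_probability(sequences):
    """sequences: one list of per-frame predicted labels per perturbation sequence."""
    flips = 0
    pairs = 0
    for predictions in sequences:
        for previous, current in zip(predictions, predictions[1:]):
            flips += previous != current   # prediction changed between adjacent frames
            pairs += 1
    return flips / pairs

predicted = [
    ["cat", "cat", "dog", "dog"],   # one flip in three adjacent pairs
    ["car", "car", "car", "car"],   # stable prediction, no flips
]
fp = flip_probability(predicted)    # 1 flip out of 6 pairs
```

Lower is better: a model whose prediction does not change under small perturbations has a flip probability near 0.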

A: a dataset made up of difficult images that other SOTA models tend to get wrong; its images are of a different kind from those in C and P

Performs better on the robustness datasets above than other SOTA models

3.4 Adversarial Robustness Results

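The EfficientNet-L2 model is evaluated against the FGSM attack with and without NoisyStudent

💡

[FGSM attack]

Fast Gradient Sign Method

Adds a small distortion to the input image

Compute the gradient of the loss function with respect to the input image → use the sign of this gradient to add a small change to the input image → the change leads the model to misrecognize important features of the image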
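
The FGSM step is x_adv = x + ε · sign(∇_x loss). Here is a minimal sketch on a toy logistic-regression "model", where the input gradient has the closed form (p - y) · w. The weights, input, and ε are made-up values, not the paper's setup.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(w, b, x, y):
    """Binary cross-entropy of a logistic-regression model on one example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_gradient(w, b, x, y):
    """d(loss)/dx for logistic regression: (p - y) * w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def fgsm(w, b, x, y, epsilon):
    """One step of size epsilon along the sign of the input gradient."""
    grad = input_gradient(w, b, x, y)
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

w, b = [1.5, -2.0, 0.5], 0.1      # hypothetical model parameters
x, y = [0.2, -0.4, 0.3], 1        # hypothetical input and true label
x_adv = fgsm(w, b, x, y, epsilon=0.1)

clean_loss = bce_loss(w, b, x, y)
adv_loss = bce_loss(w, b, x_adv, y)
```

Each input coordinate moves by exactly ε in the direction that increases the loss, so the perturbation is bounded but the loss goes up.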

As the table below shows, accuracy improves noticeably even though the model was not optimized for adversarial robustness

Under the strong attack, accuracy rises from 1.1% → 4.4%

4. Ablation Study: The Importance of Noise in Self-training

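The importance of noise

ablation: an experiment that removes or changes a component or method to evaluate its effect

Using soft pseudo labels generated by the teacher model → if the student model is trained to be exactly the same as the teacher model, the cross-entropy loss on unlabeled data becomes 0 and the training signal vanishes

So a way is needed for the student model to surpass the teacher model while using soft pseudo labels

That way is precisely to add noise to the student model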
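
A small numeric check of the "training signal vanishes" argument. The gradient of cross-entropy with respect to the student's logits is softmax(logits) - target, so it is exactly zero when the student reproduces the teacher's soft label. Strictly speaking, the loss value with a soft target equals the target's entropy rather than 0, and it is the gradient that disappears. The logits below are made up.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, logits):
    """Cross-entropy between a (soft) target distribution and softmax(logits)."""
    return -sum(t * math.log(p) for t, p in zip(target, softmax(logits)) if t > 0)

def grad_wrt_logits(target, logits):
    """Gradient of the cross-entropy above w.r.t. the logits: softmax - target."""
    return [p - t for p, t in zip(softmax(logits), target)]

teacher_logits = [2.0, 0.5, -1.0]
soft_label = softmax(teacher_logits)                       # teacher's soft pseudo label

same = grad_wrt_logits(soft_label, teacher_logits)         # student identical to teacher
noised = grad_wrt_logits(soft_label, [1.2, 0.9, -0.4])     # hypothetical noised student output
loss_when_same = cross_entropy(soft_label, teacher_logits)
```

With a student that matches the teacher, `same` is all zeros and nothing is learned. Once noise pushes the student's output away from the teacher's, `noised` is non-zero and training has a signal again.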

Experimental settings

Amount of unlabeled images

With more unlabeled images, the student model can learn more of the teacher model's predictions → but if the teacher predicts wrongly the student can learn wrong information, so adding noise matters
With more unlabeled images, the student model ends up following the teacher model too closely → higher chance of overfitting

Accuracy of the teacher model

High-accuracy teacher model → the student model learns the teacher well → noise may not have a large effect
Low-accuracy teacher model → the student model can learn the teacher's wrong predictions → noise should be added to steer the student toward better generalization

Many unlabeled images and a low-accuracy teacher model → noise matters more
Few unlabeled images and a high-accuracy teacher model → noise matters less

Table rows

NoisyStudent (student → Aug, SD, Dropout)
student → SD, Dropout
All noise removed from the student
Noise added to the teacher as well as the student

Noise such as stochastic depth, dropout, and data augmentation plays an important role in letting the student model outperform the teacher model

labeled images → removing noise leads to a lower training loss

Noise has a major effect on how the model trains

unlabeled images → smaller drop in training loss

Overfitting happens less on a large set of unlabeled images → removing noise does not lower the training loss by much

→ A large set of unlabeled images is harder to overfit

Noise steers the student model toward learning more general patterns instead of copying the teacher model's predictions as they are

Many unlabeled images → it becomes nearly impossible for the model to memorize all the data → the model has to generalize instead → i.e. overfitting is unlikely on a large set of unlabeled images

Adding noise to the teacher model → lower accuracy → the un-noised teacher model performs better

5. Related works

self-training

Use labeled data to train a good teacher model → use the teacher model to label unlabeled data → train the student model on labeled data + unlabeled data
In ordinary self-training frameworks, adding noise was not the default, and the role of noise was not justified either
Difference from prior work: this work demonstrates the importance of noise and uses noise aggressively
Earlier studies did not examine robustness on ImageNet-A, C, P (they introduced the methodology of "train on unlabeled images → fine-tune on labeled images in the final stage")
Data Distillation ensembles predictions on images under different transformations to strengthen the teacher → the opposite of this work, which makes things harder for the student instead of strengthening the teacher
Adding only a little noise to the student also makes it hard for the student to become stronger than the teacher
Ensembling predictions from multiple teachers costs more than the NoisyStudent method of this work
Co-training: splits the features into two disjoint parts and trains two models, one per feature set, on labeled data → in this work, the added noise can likewise make the teacher and student produce different predictions

Semi-supervised Learning

Makes predictions consistent, i.e. invariant, under noise added to the input image
Consistency Regularization: a method that encourages predictions on unlabeled data to stay consistent
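
One common way to express a consistency term is the KL divergence between the prediction on the clean input and the prediction on the noised input. Some methods use squared error instead. A minimal sketch with made-up logits and my own function names:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(logits_clean, logits_noised):
    """Penalize disagreement between predictions on clean and noised inputs."""
    return kl_divergence(softmax(logits_clean), softmax(logits_noised))

clean = [2.0, 0.5, -1.0]       # hypothetical prediction on the original image
noised = [1.2, 0.9, -0.4]      # hypothetical prediction on the augmented image

agree = consistency_loss(clean, clean)
disagree = consistency_loss(clean, noised)
```

The term is 0 when the two predictions agree and grows as the noised prediction drifts away, which is what pushes the model toward invariance.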

On large-scale datasets such as ImageNet, predictions are uncertain early in training, so the model produces high-entropy predictions (entropy: a measure of prediction uncertainty) → performance drops → addressed with Entropy Minimization → but that approach is complicated

So, to maintain consistency, the self-training / teacher-student framework is used instead of Entropy Minimization

Train a teacher model on labeled data, then train the student model based on it → well suited to large-scale datasets like ImageNet

Pseudo-label-based learning resembles self-training & shares the same problem in terms of consistency (if the teacher model forms wrong pseudo labels, the student model learns wrong patterns from them)
Various approaches to semi-supervised learning

graph-based methods: model the relationships between data points as a graph → improve predictions on unlabeled data
using latent variables: latent variables serve as target variables to explain and predict the data
low-density separation: classifies along regions where the data density is low → useful for extracting important patterns from unlabeled data

→ These various semi-supervised learning techniques offer complementary benefits that can offset the limitations of pseudo-label-based learning

Knowledge Distillation

Relation to this work: both use soft targets; the main use of Knowledge Distillation is model compression, making the student model smaller
Difference: Knowledge Distillation does not consider unlabeled data and does not aim to improve the student model either

Robustness

This work shows that using unlabeled data improves accuracy and general robustness
This agrees with prior work showing that unlabeled data improves adversarial robustness
The prior work directly optimized for adversarial robustness on unlabeled data, whereas the NoisyStudent method of this work does not directly optimize for robustness

💡

[Direct optimization]

Uses methods designed specifically to make the model strong against adversarial attacks

adversarial training: generate adversarial examples in advance and train the model on them
FGSM (Fast Gradient Sign Method): uses the gradient to generate an adversarial perturbation → feeding it into training improves adversarial robustness
PGD (Projected Gradient Descent): a method that generates stronger attacks than FGSM
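
PGD can be read as FGSM applied in several small steps, with a projection back into the ε-ball around the original input after each step. A minimal sketch on a toy logistic-regression model with made-up numbers. On such a linear toy model the extra steps add little over a single FGSM step, so the point here is only the loop structure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_gradient(w, b, x, y):
    """d(loss)/dx for logistic regression: (p - y) * w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def pgd(w, b, x, y, epsilon, step_size, steps):
    x_adv = list(x)
    for _ in range(steps):
        grad = input_gradient(w, b, x_adv, y)
        # Small signed-gradient step, as in FGSM.
        stepped = [xa + step_size * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        # Projection: clip back into the L-infinity ball of radius epsilon around x.
        x_adv = [min(max(s, xi - epsilon), xi + epsilon) for s, xi in zip(stepped, x)]
    return x_adv

w, b = [1.5, -2.0, 0.5], 0.1      # hypothetical model parameters
x, y = [0.2, -0.4, 0.3], 1        # hypothetical input and true label
x_pgd = pgd(w, b, x, y, epsilon=0.1, step_size=0.03, steps=10)
```

Adversarial training would then mix examples like `x_pgd` into the training batches, which is exactly the direct optimization that NoisyStudent does not do.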

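The NoisyStudent method uses none of the methods above and makes the model robust indirectly by adding noise → i.e. rather than inserting attacks into the training data so that the model acquires resistance to adversarial attacks, it uses noise to make the model robust to a variety of inputs (no direct optimization such as using gradients on adversarial examples)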

6. Conclusion

Unlabeled images can be used to improve SOTA accuracy and robustness

NoisyStudent is proposed: noise is added to the student so that the student model surpasses the teacher model