BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

This post is a review of the BERT paper.

My personal thoughts are written in blue. Discussion of them is welcome :)

Introduction

Language model pre-training has proven effective on many NLP tasks
- Sentence-level tasks: NLI (Natural Language Inference)
- Token-level tasks: NER (Named Entity Recognition), QA (Question Answering)
There are two main approaches to applying pre-trained language representations to downstream tasks
- Feature-based: add the pre-trained representations to a task-specific architecture as additional features (e.g. ELMo)
- Fine-tuning: keep task-specific parameters to a minimum and fine-tune the pre-trained parameters on the task (e.g. OpenAI GPT)
Issue: these approaches do not fully exploit the power of pre-trained representations (especially the fine-tuning approach)
- The models are shallowly bi-directional or uni-directional
- The choice of architectures usable for pre-training is limited
- As a result, they are sub-optimal on NLP tasks or fail to take context into account properly

To address this, BERT, the model proposed in this paper, learns pre-trained language representations through new pre-training methods
- Masked language model (MLM)
  - A uni-directional model structure is one problem, and ELMo tried to address it with a bi-directional LSTM
  - However, the uni-directionality of the pre-training method itself remains
  - This is because, as in n-gram style language modeling, the objective predicts the next word from the preceding n words (left-to-right)
  - The MLM task proposed in BERT masks randomly chosen input tokens and trains the language representation by predicting the vocab id of each masked token
  - This lets the model use context from both directions, left-to-right and right-to-left (and it aims for deep bidirectionality rather than shallow bidirectionality)
- Next sentence prediction (NSP)
  - A task that predicts whether the second text of a pair follows the first
  - Takes two input sentences and decides whether they are in an IsNext relation or not (binary classification)
- MLM and NSP together show that BERT treats contextual representation as central to language representation

Method

The BERT framework consists of two steps, pre-training and fine-tuning
- Pre-training: MLM, NSP (with unlabeled data)
- Fine-tuning: downstream tasks (with labeled data)

Model architecture

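The core idea of BERT is to handle a wide range of tasks with a single architecture
There is almost no structural difference between the pre-training architecture and the one used for a downstream task
However, each task needs its own fine-tuned model → not 1 model for multi-task, but 1 architecture for multi-task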

Fig 2. Training scheme of BERT: transfer learning (Pre-training → Fine-tuning)

Example) Question answering task
- To run MLM and NSP in pre-training, the dataset is built from sentence pairs, and random tokens within each sentence are masked
- [CLS] is a special token marking the start of the input, and [SEP] is a special token separating the sentences
- MNLI, NER, and SQuAD are datasets for individual downstream tasks (so "1 architecture for each dataset in multi-task" would be a more accurate description than "1 architecture for multi-task")
The BERT model structure is identical to the Encoder part of the Transformer; in short, it is a multi-layer bidirectional Transformer encoder
- The number of layers (L) : 12
- The hidden size (H) : 768
- The number of self-attention heads (A) : 12
- The number of total parameters : 110M
- The settings above are for BERT-base, the base model
  - For ease of comparison, it was designed to have the same model size as OpenAI GPT
  - The difference is that BERT uses bi-directional self-attention
- BERT-Large: L=24, H=1024, A=16, total parameters=340M
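
As a sanity check on the 110M / 340M figures, the parameter count can be estimated from L and H alone. This is only a rough sketch: the vocabulary size (30,522, the size of the released English uncased vocab), the maximum of 512 positions, the feed-forward size of 4H, and the extra pooler layer on [CLS] are assumptions that are not stated in the notes above. The number of heads A does not change the count, since the heads just split the same H x H projections.

```python
def bert_param_count(num_layers, hidden, vocab_size=30_522, max_positions=512):
    """Rough parameter count (weights + biases) for a BERT-style encoder."""
    ffn = 4 * hidden          # feed-forward inner size, assumed to be 4H
    layer_norm = 2 * hidden   # one scale and one bias vector

    embeddings = (
        vocab_size * hidden        # token embeddings
        + max_positions * hidden   # position embeddings
        + 2 * hidden               # segment embeddings (sentence A / B)
        + layer_norm               # LayerNorm applied to the summed embeddings
    )
    attention = 4 * (hidden * hidden + hidden)   # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden            # dense layer on top of [CLS]

    return embeddings + num_layers * per_layer + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT-base : {base / 1e6:.1f}M parameters")
print(f"BERT-large: {large / 1e6:.1f}M parameters")
```

Under these assumptions both estimates land close to the rounded figures quoted above, and most of the parameters sit in the Transformer layers rather than in the embeddings.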

Input/output representation

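To support a variety of downstream tasks, the input representation is designed to cover every task with one form of embedding
- Because only the encoder of the Transformer is adopted, BERT can be seen as a model that produces embeddings for good language representation
- In BERT, a "sentence" is not a linguistic sentence but simply an arbitrary span of contiguous tokens
- So a sequence may consist of one or more linguistic sentences
  - Several linguistic sentences = a single sequence of tokens
- Linguistic sentence boundaries still matter semantically, so the input representation provides two devices for separating the sentences within a token sequence
  - Separation by the [SEP] token
  - An embedding that differs per sentence, added to indicate which sentence a token belongs to (segment embedding)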

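Fig 3) Example of BERT's input representation, made up of 3 components
- Token embedding: the most basic embedding; natural language is tokenized and each token is mapped to a vocab id so that the model can take it as input
  - BERT adopts WordPiece embeddings (vocab_size: 30k)
- Segment embedding: the embedding introduced, as mentioned above, to indicate which sentence a token belongs to (EA for the first sentence, EB for the second; each is a single learned vector shared by every token of that sentence)
- Position embedding: the Transformer is built from attention alone and does not consume its input sequentially as the RNN family does, so this embedding gives the token sequence its order information (simply, it tells the model the position of each token)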
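
The three components above can be sketched in a few lines. This is a toy illustration, not BERT's actual code: `build_input` and `embed` are hypothetical helper names, the embedding tables are plain nested lists, and real models use learned tables of hidden size H.

```python
def build_input(tokens_a, tokens_b=None):
    """Pack one or two 'sentences' into a single sequence with segment and position ids."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)              # sentence A (including [CLS] and its [SEP])
    if tokens_b:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and its [SEP]
    position_ids = list(range(len(tokens)))       # 0, 1, 2, ... in input order
    return tokens, segment_ids, position_ids


def embed(token_ids, segment_ids, position_ids, tok_table, seg_table, pos_table):
    """Input embedding of each token = token + segment + position embedding (element-wise sum)."""
    return [
        [t + s + p for t, s, p in zip(tok_table[ti], seg_table[si], pos_table[pi])]
        for ti, si, pi in zip(token_ids, segment_ids, position_ids)
    ]


tokens, segment_ids, position_ids = build_input(["my", "dog"], ["he", "barks"])
print(tokens)
print(segment_ids)
print(position_ids)
```

For a single-sentence task `tokens_b` is simply left out, and every token then gets segment id 0.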

Pre-training of BERT

Dataset: BookCorpus (800M words) and English Wikipedia (2,500M words)
For Wikipedia, only the text passages are extracted; lists, tables, and headers are excluded
A document-level corpus is chosen over a sentence-level corpus in order to extract long contiguous sequences

Masked Language Modeling (MLM)

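Some tokens of the sequence are masked (replaced with the [MASK] token), and the model is trained to predict the original token at each masked position
- The MLM task aims at deep bi-directionality, replacing both the uni-directionality of left-to-right or right-to-left (n-gram style) objectives and shallow bi-directionality
15% of the WordPiece tokens in each input sequence are selected for masking
- Because the [MASK] token appears only in pre-training, an intrinsic mismatch arises between pre-training and fine-tuning
- To reduce it, the selected 15% of tokens are not all replaced with [MASK]:
  - 80% of them are replaced with [MASK]
  - 10% are replaced with a random token
  - the remaining 10% are left unchanged
In the MLM task
- The model is not told which tokens were randomly replaced or which were left unchanged, and it still has to predict the token behind each [MASK], so it is forced to keep a distributional contextual representation of every input token
- Only 15% of the tokens are predicted in each batch, so more pre-training steps may be needed for the model to converge
  - MLM converges somewhat more slowly than left-to-right training → a procedure that predicts every token gets more training signal per batch than MLM, which predicts only 15%
- But what matters more is that the empirical improvements far outweigh the extra training cost
- The model learns to predict the masked tokens with a cross entropy loss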
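
The 15% selection and the 80/10/10 replacement rule described above can be sketched as follows. This is a minimal sketch under my own assumptions: `mask_tokens` is a hypothetical helper, the tokens are plain strings, special tokens are never selected, and the random replacement is drawn uniformly from a toy vocabulary.

```python
import random

MASK = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked_tokens, labels); labels[i] is the original token at
    every selected position and None everywhere else."""
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL_TOKENS]
    num_to_predict = min(len(candidates), max(1, round(len(candidates) * mask_prob)))
    chosen = rng.sample(candidates, num_to_predict)

    masked = list(tokens)
    labels = [None] * len(tokens)
    for i in chosen:
        labels[i] = tokens[i]              # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            masked[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


rng = random.Random(0)
sentence = ["[CLS]"] + [f"w{i}" for i in range(20)] + ["[SEP]"]
masked, labels = mask_tokens(sentence, vocab=["apple", "river", "blue"], rng=rng)
print(masked)
print(labels)
```

The cross entropy loss is then computed only at the positions where `labels` is not None, which is why each batch provides a training signal for just 15% of its tokens.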

Next Sentence Prediction (NSP)

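NLP tasks such as QA and NLI depend on understanding the relationship between sentences
So that the model learns such relationships, a binary classification task is proposed for pre-training: decide whether a pair of two sentences is in a next-sentence relation

When sentence pairs are built, 50% are actual next sentences (IsNext) and the other 50% pair the first sentence with a random sentence from the corpus (NotNext)
The output C, the final-layer vector for the [CLS] token, is used for the prediction (C is also used when the fine-tuning task is classification)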
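
The 50/50 pair construction can be sketched like this. It is a simplified illustration under my own assumptions: `make_nsp_pairs` is a hypothetical helper, a document is just a list of sentences, and the random sentence is always drawn from a different document so that a NotNext pair is never accidentally a true continuation.

```python
import random


def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, label) triples for next sentence prediction."""
    pairs = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # 50%: the sentence that actually follows sentence A
                pairs.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # 50%: a random sentence taken from another document
                others = [d for j, d in enumerate(documents) if j != doc_index]
                random_sentence = rng.choice(rng.choice(others))
                pairs.append((doc[i], random_sentence, "NotNext"))
    return pairs


documents = [
    ["a1", "a2", "a3"],
    ["b1", "b2"],
    ["c1", "c2", "c3"],
]
pairs = make_nsp_pairs(documents, random.Random(0))
for sentence_a, sentence_b, label in pairs:
    print(sentence_a, sentence_b, label)
```

Each pair is then packed as `[CLS] A [SEP] B [SEP]`, and the label is predicted from C.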

Fine-tuning of BERT

Because BERT adopts the Transformer architecture, it extends easily to many downstream tasks, whether the input is a single text or a text pair (thanks to the self-attention mechanism)
- A common way to handle text pairs was to encode each text independently before applying bidirectional cross attention
- BERT uses self-attention, so this two-stage encode-then-attend process becomes a single unified process
- Text pair → building the input as a concatenated text (a single sequence) and applying self-attention effectively includes bidirectional cross attention between the two texts
  - Input : four forms of "sentence A" - "sentence B"
    - Sentence pairs (for paraphrasing)
    - Hypothesis-premise pairs (for entailment)
    - Question-passage pairs (for question answering)
    - Text-∅ pairs (for text classification or sequence tagging)
For each task, the task-specific input/output is plugged into BERT and the parameters are fine-tuned
Compared with pre-training, fine-tuning has a low training cost
- At most 1 hour on a single Cloud TPU, or a few hours on a GPU

Experiments

In total 11 NLP tasks (11 datasets) are run, falling into the 4 categories below

Task category	Input	Output	Detailed tasks (datasets)
Sentence pair classification	Sentence pair	Label	MNLI, QQP, QNLI, STS-B, MRPC, RTE, SWAG
Single sentence classification	Single sentence	Label	SST-2, CoLA
Question answering	Question, Paragraph	[Start], [End] token of the answer span	SQuAD v1.1
Single sentence tagging	Single sentence	Tag	CoNLL-2003 NER

As mentioned for NSP, classification tasks use C (the Transformer output for [CLS])
- The final hidden vector C is an H-dimensional vector
- A classification layer is added for fine-tuning (for K-class classification, the layer weight W has dimension K x H)
- Training uses a standard classification loss → $\log(\mathrm{softmax}(CW^T))$
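
To make the shapes concrete, here is a small pure-Python sketch of the classification layer: C has H entries, W has K rows of H entries, and the loss is the negative log-probability of the correct class. The function names and the toy numbers are mine, not from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log(softmax(logits))."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def classify(c, w):
    """c: H-dim [CLS] vector, w: K x H weight matrix. Returns K log-probabilities."""
    logits = [sum(ci * wi for ci, wi in zip(c, row)) for row in w]  # C W^T
    return log_softmax(logits)


def cross_entropy(log_probs, label):
    """Standard classification loss for a single example."""
    return -log_probs[label]


c = [1.0, 2.0]                              # H = 2
w = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]    # K = 3 classes
log_probs = classify(c, w)
predicted = max(range(len(log_probs)), key=lambda k: log_probs[k])
print(predicted, cross_entropy(log_probs, label=1))
```

During fine-tuning both W and all of BERT's parameters are updated to minimize this loss.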

The General Language Understanding Evaluation (GLUE) dataset

Dataset	Input	Task
MNLI (Multi-genre Natural Language Inference)	sentence pair	Classify whether the second sentence is [entailment, contradiction, neutral] with respect to the first - multiclass classification
QQP (Quora Question Pairs)	sentence pair (question sentences)	Classify whether two questions are semantically equivalent - binary classification
QNLI (Question Natural Language Inference)	question-sentence pair	Classify whether the sentence contains the answer to the question (binary version of SQuAD) - binary classification
SST-2 (Stanford Sentiment Treebank)	a single sentence	Classify the sentiment of a movie review (positive/negative) - binary classification
CoLA (The Corpus of Linguistic Acceptability)	a single sentence	Classify whether an English sentence is grammatically "acceptable" - binary classification
STS-B (The Semantic Textual Similarity Benchmark)	sentence pair	Score how semantically similar two sentences are (1-5) - regression on the similarity score
MRPC (Microsoft Research Paraphrase Corpus)	sentence pair	Classify whether two sentences are semantically equivalent - binary classification
RTE (Recognizing Textual Entailment)	sentence pair	Same kind of task as MNLI with much less training data, but with two labels - binary classification
WNLI (Winograd NLI)	sentence pair	Given a sentence with a pronoun and a copy with the pronoun replaced by a candidate referent, classify whether the copy is entailed by the original - binary classification

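To keep the comparison with GPT fair, WNLI is excluded from the GLUE evaluation
- Performance on the WNLI dataset is low overall (the GLUE webpage notes construction issues with the dataset itself)
- No model submitted for this task exceeded the 65.1 accuracy of the majority-class baseline
Result

BERT-LARGE achieves SOTA on every task
In particular, BERT-LARGE outperforms BERT-BASE even on small datasets
- Fine-tuning BERT-LARGE is sometimes unstable on small datasets, so the reported result comes from running several random restarts and selecting the best model on the Dev set
- The random restarts use the same pre-trained checkpoint, with different fine-tuning data shuffling and classification layer initialization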
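SQuAD v1.1
- A question answering dataset: given a question and a passage, the task is to find the span of the passage that contains the answer
- As covered in the input representation, the question uses the sentence A embedding and the passage uses the B embedding
- To learn the answer span, a start vector $S$ and an end vector $E$ are introduced during fine-tuning (each of dimension $H$)
- Training computes, for the token of word $i$, the probability that it is the start of the answer and the probability that it is the end
  - With the start vector $S$, the probability that token $i$ is the start of the answer is a softmax over the dot products with every token in the passage: $P_i = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$, where $T_i$ is the final hidden vector of token $i$
  - Computing the same for the end vector $E$, the score of a candidate span (word $i$ to word $j$) is $S \cdot T_i + E \cdot T_j$
  - The span with the highest score among $i \leq j$ is the prediction, and the training objective is the sum of the log-likelihoods of the correct start and end positions
- Result
  - Again outperforms the existing baseline models by a wide margin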
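
The span selection above can be sketched as a brute-force search over all pairs with i ≤ j. The inputs are assumed to be the precomputed dot products (`start_scores[i]` = S·T_i and `end_scores[j]` = E·T_j); the `max_answer_len` cap is my own addition to keep the search small and is not part of the description above.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (i, j, score) maximizing start_scores[i] + end_scores[j] with i <= j."""
    best = (None, None, float("-inf"))
    for i, start in enumerate(start_scores):
        last = min(i + max_answer_len, len(end_scores))
        for j in range(i, last):
            score = start + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Toy scores for a 3-token passage.
start_scores = [0.1, 2.0, 0.3]
end_scores = [3.0, 0.2, 1.5]
print(best_span(start_scores, end_scores))
```

In this toy example the highest end score sits at token 0, but pairing it with the best start (token 1) would violate i ≤ j, so the search settles on the span from token 1 to token 2.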

SQuAD v2.0
- An extension of SQuAD v1.1 → also considers cases where the passage does not contain an answer to the question (no answer), which is more realistic
- For no-answer questions, both the start and the end of the span are set to the [CLS] token
- At prediction time the two scores below are compared
  - The score of the no-answer span : $s_{null} = S \cdot C + E \cdot C$
  - The score of the best non-null span : $\hat{s}_{i,j} = \max_{j \geq i} S \cdot T_i + E \cdot T_j$
  - A non-null answer is predicted when $\hat{s}_{i,j} > s_{null} + \tau$, where the threshold $\tau$ is chosen empirically as the value that maximizes F1 on the validation set
  - Among all candidates the null span is an extreme case, so $s_{null}$ tends to be small → a non-null span could end up being picked as best even when no answer is the ground truth, and $\tau$ compensates for this
- Result
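
The no-answer decision rule can be sketched as below. The assumptions are mine: index 0 of both score lists is taken to be the [CLS] position (so the null score is the sum of the two entries at index 0), the remaining indices are passage tokens, and `tau` is passed in rather than tuned on a validation set.

```python
def predict_with_no_answer(start_scores, end_scores, tau=0.0):
    """Return the best (i, j) span, or None when the no-answer span wins.

    start_scores[k] / end_scores[k] are the dot products of the start / end
    vector with the final hidden vector at position k; position 0 is [CLS].
    """
    s_null = start_scores[0] + end_scores[0]

    best_score, best = float("-inf"), None
    for i in range(1, len(start_scores)):
        for j in range(i, len(end_scores)):
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best_score, best = score, (i, j)

    # Predict a real span only if it beats the null score by more than tau.
    if best is not None and best_score > s_null + tau:
        return best
    return None


start_scores = [1.0, 0.5, 2.0]
end_scores = [1.0, 0.0, 1.5]
print(predict_with_no_answer(start_scores, end_scores, tau=0.0))
print(predict_with_no_answer(start_scores, end_scores, tau=2.0))
```

Raising `tau` makes the model more willing to answer "no answer", which is why it is tuned against F1 on held-out data.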

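SWAG (Situations With Adversarial Generations)
- A task of choosing, among 4 choices, the sentence most likely to continue the given sentence
- Dataset : sentence pairs (113k)
  - Four input sequences are built per example, one for each choice
  - Sentence A : the given sentence
  - Sentence B : one possible continuation
- The only task-specific parameter is a vector whose dot product with the C vector of [CLS] gives a score for each choice; as in other classification tasks, the scores are normalized with a softmax layer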
- Result
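
A small sketch of how a multiple-choice example of this kind can be handled: one packed sequence per choice, one score per sequence, and a softmax over the four scores. The helper names are hypothetical, and the C vectors are toy numbers standing in for what BERT would output for each sequence.

```python
import math


def build_choice_inputs(context, choices):
    """One '[CLS] context [SEP] choice [SEP]' sequence per candidate continuation."""
    return [["[CLS]"] + context + ["[SEP]"] + choice + ["[SEP]"] for choice in choices]


def choice_probabilities(cls_vectors, v):
    """Score each choice as the dot product C . v, then normalize with softmax."""
    scores = [sum(ci * vi for ci, vi in zip(c, v)) for c in cls_vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


context = ["she", "opens", "the", "door"]
choices = [["and", "walks", "in"], ["and", "flies"], ["then", "sleeps"], ["the", "end"]]
sequences = build_choice_inputs(context, choices)

# Toy stand-ins for the four [CLS] outputs (H = 2) and the task-specific vector.
cls_vectors = [[2.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
v = [1.0, 1.0]
probs = choice_probabilities(cls_vectors, v)
print(probs.index(max(probs)), probs)
```

The four sequences share the same BERT weights; only the vector `v` is new for this task.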

Ablation Studies

Effect of pre-training Task
- Overview: ablation means removal; BERT's pre-training tasks are removed one at a time and performance is measured to see how much each task contributes
- No NSP (MLM only): uses MLM only, without the NSP task
  - BERT vs No NSP shows the effect of NSP pre-training
  - Result : BERT (MLM+NSP) performs better
- LTR & No NSP (LTR only $\leftrightarrow$ MLM only) : MLM (bidirectional) is replaced with LTR (Left-To-Right, unidirectional), and NSP is not used either $\rightarrow$ comparable to OpenAI GPT, but trained on a larger training dataset
  - No NSP vs LTR & No NSP shows the effect of MLM pre-training (that is, the benefit of bi-directionality)
  - Result : performs worse than both BERT (MLM+NSP) and No NSP
- Overall comparison : LTR & No NSP < No NSP < BERT (MLM+NSP)
  - Without NSP, performance drops noticeably on NLI tasks such as MNLI and QNLI $\rightarrow$ the NSP task helps with logical reasoning such as entailment and causal relations between sentences
  - LTR & No NSP scores lower than No NSP, with large drops on sentence similarity (MRPC) and QA (SQuAD) : these tasks need both left and right context

Effect of model size (BASE vs LARGE)
- It was already known that on large-scale NLP tasks such as translation and LM (language modeling), performance improves as the model gets larger
- The notable point is that, with BERT, scaling up the model also helps on small-scale tasks $\rightarrow$ an effect of pre-training

Feature-based approach with BERT
- The BERT results so far cover only the fine-tuning approach, in which a classification layer is attached to pre-trained BERT and everything is fine-tuned
- As mentioned earlier, there is also the feature-based approach, and BERT can be used this way too
- The feature-based approach has the following two advantages
  - A task-specific model architecture can be added
    - The feature-based approach extracts fixed features from the pre-trained model and feeds them to another model
    - Not every task is easily represented by a Transformer encoder (having to add a softmax layer even for classification hints at this), so some tasks require a task-specific model
  - Computational benefit
    - Once the representation (which is very expensive) has been computed (fixed features), experiments can be repeated cheaply on top of it
- Result
  - Evaluated on the CoNLL-2003 NER task
  - To rule out the effect of fine-tuning, hidden state activations are extracted without fine-tuning any BERT parameters and used as embeddings, which are fed into a randomly-initialized BiLSTM (768-dimension, 2 layers)
  - The best feature-based result (=96.1) is only 0.3 behind the fine-tuning result (=96.4) $\rightarrow$ BERT also works well in the feature-based approach
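
As an illustration of "extracting hidden state activations as embeddings": in the paper the best feature-based variant concatenates each token's vectors from the top four hidden layers. The sketch below shows only that concatenation step on toy data; `concat_last_layers` is a hypothetical helper, and `hidden_states[l][t]` is assumed to hold the vector of token t at layer l, computed once with BERT's weights frozen.

```python
def concat_last_layers(hidden_states, num_layers=4):
    """Build one fixed feature vector per token by concatenating its vectors
    from the last `num_layers` layers (lowest of those layers first)."""
    selected = hidden_states[-num_layers:]
    num_tokens = len(selected[0])
    return [
        [x for layer in selected for x in layer[t]]
        for t in range(num_tokens)
    ]


# Toy activations: 5 layers, 2 tokens, hidden size 3.
hidden_states = [
    [[float(10 * layer + token)] * 3 for token in range(2)]
    for layer in range(5)
]
features = concat_last_layers(hidden_states)
print(len(features), len(features[0]))
```

These per-token features are computed once and cached, and only the model on top (the BiLSTM and the classification layer in the experiment above) is trained.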

