
BERT - PyTorch Implementation

needmorecaffeine 2023. 2. 21. 20:27

This post was written with reference to the GitHub repository and the post below.

 

GitHub - codertimo/BERT-pytorch: Google AI 2018 BERT pytorch implementation
https://github.com/codertimo/BERT-pytorch

 

BERT - Theory
needmorecaffeine.tistory.com

 


 

1. Transformer Structure

 

Since the model stacks transformer encoder layers, we first write the code for the transformer structure and its operations.

The theoretical details can be found in the following post.

 

https://needmorecaffeine.tistory.com/29

 

Attention is All You Need (Transformer)


 

1-2. Multi-Head Attention

 

"""
single.py
"""

import torch
import torch.nn.functional as F
import torch
import math


# Scaled Dot Product Attention

class Attention(nn.Module) :
    
    def forward(self, query, key, value, mask = None, dropout = None) :
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(query.size(-1))
        
        if mask is not None :
            scores = scores.masked_fill(mask == 0, -1e9)
            
        p_attn = F.softmax(scores, dim = -1)
        
        if dropout is not None :
            p_attn = dropout(p_attn)
            
        return torch.matmul(p_attn, value), p_attn
"""
multi_head.py
"""

import torch.nn as nn

from .single import Attention

class MultiHeadAttention(nn.Module) :
    
    def __init__(self, h, d_model, dropout = 0.1) :
        super().__init__()
        assert d_model % h == 0
        
        self.d_k = d_model // h
        self.h = h
        
        self.linear_layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(3)])
        self.output_linear = nn.Linear(d_model, d_model)
        self.attention = Attention()
        self.dropout = nn.Dropout(p = dropout)
        
    def forward(self, query, key, value, mask = None) :
        batch_size = query.size(0)

        # 1) linear projections : d_model => h heads x d_k
        query, key, value = [l(x).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
                             for l, x in zip(self.linear_layers, (query, key, value))]

        # 2) apply scaled dot product attention to all heads in batch
        x, attn = self.attention(query, key, value, mask = mask, dropout = self.dropout)

        # 3) concatenate heads and apply the final linear layer
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)

        return self.output_linear(x)

 

Besides these, operations such as the feed-forward network, GELU activation, layer normalization, and residual connection were written in advance as separate py files in the utils folder; a minimal sketch of what they might look like is shown below.
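
The utils files themselves are not reproduced in this post. The block below is a minimal sketch of what they might contain, assuming the class names SublayerConnection and PositionwiseFeedForward that transformer.py imports; the actual repo splits these across several files and defines its own LayerNorm, so details may differ.

"""
utils.py (sketch)
"""

import math

import torch
import torch.nn as nn


class GELU(nn.Module):
    # tanh approximation of the GELU activation used in the BERT paper
    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))


class SublayerConnection(nn.Module):
    # residual connection around a sublayer : x + dropout(sublayer(norm(x)))
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))


class PositionwiseFeedForward(nn.Module):
    # two linear layers with GELU in between : d_model -> d_ff -> d_model
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = GELU()

    def forward(self, x):
        return self.w_2(self.dropout(self.activation(self.w_1(x))))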

 

The final transformer encoder block is as follows.

 

import torch.nn as nn

from .attention import MultiHeadAttention
from .utils import SublayerConnection, PositionwiseFeedForward

"""
transformer.py
"""

class TransformerBlock(nn.Module):
    """
    Bidirectional Encoder = Transformer (self-attention)
    Transformer = MultiHead_Attention + Feed_Forward with sublayer connection
    """

    def __init__(self, hidden, attn_heads, feed_forward_hidden, dropout):
        """
        :param hidden: hidden size of transformer
        :param attn_heads: number of attention heads
        :param feed_forward_hidden: feed_forward_hidden, usually 4*hidden_size
        :param dropout: dropout rate
        """

        super().__init__()
        self.attention = MultiHeadAttention(h=attn_heads, d_model=hidden)
        self.feed_forward = PositionwiseFeedForward(d_model=hidden, d_ff=feed_forward_hidden, dropout=dropout)
        self.input_sublayer = SublayerConnection(size=hidden, dropout=dropout)
        self.output_sublayer = SublayerConnection(size=hidden, dropout=dropout)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, mask):
        x = self.input_sublayer(x, lambda _x: self.attention.forward(_x, _x, _x, mask=mask))
        x = self.output_sublayer(x, self.feed_forward)
        return self.dropout(x)
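
As a quick sanity check, a single block can be run on random tensors to confirm the shapes (a hypothetical snippet, not part of the repo):

import torch

# hypothetical shape check : batch of 2 sequences, length 10, hidden size 768, 12 heads
block = TransformerBlock(hidden=768, attn_heads=12, feed_forward_hidden=768 * 4, dropout=0.1)

x = torch.randn(2, 10, 768)          # (batch, seq_len, hidden)
mask = torch.ones(2, 1, 10, 10)      # (batch, 1, seq_len, seq_len), 1 = attend
out = block(x, mask)
print(out.shape)                     # torch.Size([2, 10, 768])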

 


2. Three Embeddings

 

BERT performs three embeddings: token embedding, segment embedding, and position embedding.

 

 

Each one is implemented as a class and saved in its own py file.

 

All of them use nn.Embedding, so let's first look at how it works and what its arguments mean.

 

train_data = 'you need to know how to code'

# build the vocabulary : the set of unique words
word_set = set(train_data.split())

# map each word in the vocabulary to a unique integer
vocab = {tkn: i+2 for i, tkn in enumerate(word_set)}
vocab['<unk>'] = 0
vocab['<pad>'] = 1

import torch.nn as nn
embedding_layer = nn.Embedding(num_embeddings=len(vocab), 
                               embedding_dim=3,
                               padding_idx=1)
                               
print(embedding_layer.weight)

  • num_embeddings = the number of words to embed = the size of the vocabulary
  • embedding_dim = the dimension of the embedding vectors (chosen by the user)
  • padding_idx = (optional) the index of the padding token; its embedding vector is not updated during training

 

The difference made by setting padding_idx is shown below.
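
For example (a hypothetical snippet), with padding_idx=1 the row for '<pad>' is initialized to zeros and receives no gradient updates, while without it that row is a trainable random vector like any other:

import torch.nn as nn

# without padding_idx : row 1 ('<pad>') is a trainable random vector
no_pad = nn.Embedding(num_embeddings=len(vocab), embedding_dim=3)
print(no_pad.weight[1])      # e.g. tensor([ 0.1234, -0.5678,  0.9012], ...)

# with padding_idx=1 : row 1 is all zeros and is never updated during training
with_pad = nn.Embedding(num_embeddings=len(vocab), embedding_dim=3, padding_idx=1)
print(with_pad.weight[1])    # tensor([0., 0., 0.], ...)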

 

2-1. Token Embedding

import torch.nn as nn

"""
token.py
"""

class TokenEmbedding(nn.Embedding) :
    def __init__(self, vocab_size, embed_size = 512) :
        super().__init__(vocab_size, embed_size, padding_idx = 0)

 

2-2. Segment Embedding

import torch.nn as nn

"""
segment.py
"""

class SegmentEmbedding(nn.Embedding) :
    
    def __init__(self, embed_size = 512) :
        # 3 embedding rows : index 0 for padding, 1 and 2 for the two sentence segments
        super().__init__(3, embed_size, padding_idx = 0)

 

2-3. Position Embedding

import torch.nn as nn
import torch
import math

"""
position.py
"""

class PositionalEmbedding(nn.Module) :
    
    def __init__(self, d_model, max_len = 512) :
        super().__init__()
        
        # compute positional encoding in log space
        pe = torch.zeros(max_len, d_model).float()
        pe.requires_grad = False
        
        position = torch.arange(0, max_len).float().unsqueeze(1)
        div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
        
    def forward(self, x) :
        # return the positional encodings for the first x.size(1) positions, shape (1, seq_len, d_model)
        return self.pe[:, :x.size(1)]

 

 

The class that combines the three embeddings above into the final form fed to the model is as follows.

 

import torch.nn as nn
from .token import TokenEmbedding
from .position import PositionalEmbedding
from .segment import SegmentEmbedding

"""
bert.py
"""

class BERTEmbedding(nn.Module):

    def __init__(self, vocab_size, embed_size, dropout=0.1):
   
        super().__init__()
        self.token = TokenEmbedding(vocab_size=vocab_size, embed_size=embed_size)
        self.position = PositionalEmbedding(d_model=self.token.embedding_dim)
        self.segment = SegmentEmbedding(embed_size=self.token.embedding_dim)
        self.dropout = nn.Dropout(p=dropout)
        self.embed_size = embed_size

    def forward(self, sequence, segment_label):
        x = self.token(sequence) + self.position(sequence) + self.segment(segment_label)
        return self.dropout(x)
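
A quick shape check with hypothetical values (not from the repo): token ids and segment labels of shape (batch, seq_len) come out as summed embeddings of shape (batch, seq_len, embed_size).

import torch

# hypothetical shape check : batch of 2, sequence length 8, vocab of 100 tokens
embedding = BERTEmbedding(vocab_size=100, embed_size=512)

sequence = torch.randint(0, 100, (2, 8))     # token ids, (batch, seq_len)
segment_label = torch.ones(2, 8).long()      # 1 = sentence A segment (0 is reserved for padding)
out = embedding(sequence, segment_label)
print(out.shape)                             # torch.Size([2, 8, 512])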

 


 

3. BERT model

 

The model architecture for BERT training is now almost complete.

It uses the classes from the py files created earlier.

 

import torch.nn as nn

from .transformer import TransformerBlock
from .embedding import BERTEmbedding

"""
bert.py
"""

class BERT(nn.Module) :
    
    def __init__(self, vocab_size, hidden = 768, n_layers = 12, attn_heads = 12, dropout = 0.1) :
        super().__init__()
        self.hidden = hidden
        self.n_layers = n_layers
        self.attn_heads = attn_heads
        
        # paper : use 4*hidden_size for ff network hidden size
        self.feed_forward_hidden = hidden * 4
        
        # embedding for BERT = token + segment + position
        self.embedding = BERTEmbedding(vocab_size = vocab_size, embed_size = hidden)
        
        # transformer block
        self.transformer_blocks = nn.ModuleList(
            [TransformerBlock(hidden, attn_heads, hidden * 4, dropout) for _ in range(n_layers)]
        )
        
    def forward(self, x, segment_info) :
        # attention masking : mask out padding tokens (id 0), shape (batch, 1, seq_len, seq_len)
        mask = (x > 0).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1)
        
        # embedding the indexed sequence to sequence of vectors
        x = self.embedding(x, segment_info)
        
        # run multiple transformer block
        for transformer in self.transformer_blocks :
            x = transformer.forward(x, mask)
        
        return x
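
To make the masking step concrete, here is a hypothetical example: padding positions (token id 0) become False entries of a (batch, 1, seq_len, seq_len) mask, which the Attention module then fills with -1e9 before the softmax.

import torch

# hypothetical padded batch : two sequences of length 4, 0 = padding
x = torch.tensor([[5, 7, 2, 0],
                  [3, 0, 0, 0]])

mask = (x > 0).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1)
print(mask.shape)    # torch.Size([2, 1, 4, 4])
print(mask[1, 0])    # every row is [True, False, False, False] : only the first token can be attended to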

 

Now we define the language models trained during the pre-training stage: the Masked Language Model (MLM) and Next Sentence Prediction (NSP).

 

3-1. MLM

"""
language_model.py
"""

# predict the original token at each masked position : vocab_size-way classification over the vocabulary
class MaskedLanguageModel(nn.Module) :
    def __init__(self, hidden, vocab_size) :
        super().__init__()
        self.linear = nn.Linear(hidden, vocab_size)
        self.softmax = nn.LogSoftmax(dim = -1)
        
    def forward(self, x) :
        return self.softmax(self.linear(x))

 

3-2. NSP

"""
language_model.py
"""

# binary classification (is_next / not_next) using the hidden state of the first token ([CLS] position)
class NextSentencePrediction(nn.Module) :
    
    def __init__(self, hidden) :
        super().__init__()
        self.linear = nn.Linear(hidden, 2)
        self.softmax = nn.LogSoftmax(dim = -1)
        
    def forward(self, x) :
        return self.softmax(self.linear(x[:, 0]))

 

The full language model that runs both of the above training objectives is as follows.

 

"""
language_model.py
"""

class BERTLM(nn.Module) :
    
    def __init__(self, bert : BERT, vocab_size) :
        super().__init__()
        self.bert = bert
        self.next_sentence = NextSentencePrediction(self.bert.hidden)
        self.mask_lm = MaskedLanguageModel(self.bert.hidden, vocab_size)
        
    def forward(self, x, segment_label) :
        x = self.bert(x, segment_label)
        return self.next_sentence(x), self.mask_lm(x)
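
Putting the pieces together, a forward pass returns an NSP output of shape (batch, 2) and an MLM output of shape (batch, seq_len, vocab_size). A hypothetical shape check with a small configuration:

import torch

vocab_size = 100
bert = BERT(vocab_size=vocab_size, hidden=256, n_layers=2, attn_heads=4)
model = BERTLM(bert, vocab_size)

x = torch.randint(1, vocab_size, (2, 8))    # token ids (non-zero so nothing is treated as padding)
segment_label = torch.ones(2, 8).long()

next_sent_output, mask_lm_output = model(x, segment_label)
print(next_sent_output.shape)               # torch.Size([2, 2])
print(mask_lm_output.shape)                 # torch.Size([2, 8, 100])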

 


 

4. Training

 

The implementation of the architecture is now complete.

 

Although training is not actually run here, the data for training can be found at the link below.

https://github.com/codertimo/BERT-pytorch/tree/master/bert_pytorch/dataset

 


 

import torch
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import DataLoader

from ..model import BERTLM, BERT
from .optim_schedule import ScheduledOptim

import tqdm

class BERTTrainer :
    
    def __init__(self, bert : BERT, vocab_size : int, train_dataloader : DataLoader, test_dataloader : DataLoader = None,
                 lr : float = 1e-4, betas = (0.9, 0.999), weight_decay : float = 0.01, warmup_steps = 10000,
                 with_cuda : bool = True, cuda_devices = None, log_freq : int = 10) :
        
        """
        :param bert: BERT model which you want to train
        :param vocab_size: total word vocab size
        :param train_dataloader: train dataset data loader
        :param test_dataloader: test dataset data loader [can be None]
        :param lr: learning rate of optimizer
        :param betas: Adam optimizer betas
        :param weight_decay: Adam optimizer weight decay param
        :param with_cuda: training with cuda
        :param log_freq: logging frequency of the batch iteration
        """
        
        # Setup cuda device for BERT training, argument -c, --cuda should be true
        cuda_condition = torch.cuda.is_available() and with_cuda
        self.device = torch.device("cuda:0" if cuda_condition else "cpu")

        # This BERT model will be saved every epoch
        self.bert = bert
        # Initialize the BERT Language Model, with BERT model
        self.model = BERTLM(bert, vocab_size).to(self.device)

        # Distributed GPU training if CUDA can detect more than 1 GPU
        if with_cuda and torch.cuda.device_count() > 1:
            print("Using %d GPUS for BERT" % torch.cuda.device_count())
            self.model = nn.DataParallel(self.model, device_ids=cuda_devices)

        # Setting the train and test data loader
        self.train_data = train_dataloader
        self.test_data = test_dataloader

        # Setting the Adam optimizer with hyper-param
        self.optim = Adam(self.model.parameters(), lr=lr, betas=betas, weight_decay=weight_decay)
        self.optim_schedule = ScheduledOptim(self.optim, self.bert.hidden, n_warmup_steps=warmup_steps)

        # Using Negative Log Likelihood Loss function for predicting the masked_token
        self.criterion = nn.NLLLoss(ignore_index=0)

        self.log_freq = log_freq

        print("Total Parameters:", sum([p.nelement() for p in self.model.parameters()]))

    def train(self, epoch):
        self.iteration(epoch, self.train_data)

    def test(self, epoch):
        self.iteration(epoch, self.test_data, train=False)

    def iteration(self, epoch, data_loader, train=True):
        """
        loop over the data_loader for training or testing
        if on train status, backward operation is activated
        and also auto save the model every epoch
        :param epoch: current epoch index
        :param data_loader: torch.utils.data.DataLoader for iteration
        :param train: boolean value of is train or test
        :return: None
        """
        str_code = "train" if train else "test"

        # Setting the tqdm progress bar
        data_iter = tqdm.tqdm(enumerate(data_loader),
                              desc="EP_%s:%d" % (str_code, epoch),
                              total=len(data_loader),
                              bar_format="{l_bar}{r_bar}")

        avg_loss = 0.0
        total_correct = 0
        total_element = 0

        for i, data in data_iter:
            # 0. batch_data will be sent into the device(GPU or cpu)
            data = {key: value.to(self.device) for key, value in data.items()}

            # 1. forward the next_sentence_prediction and masked_lm model
            next_sent_output, mask_lm_output = self.model.forward(data["bert_input"], data["segment_label"])

            # 2-1. NLL(negative log likelihood) loss of is_next classification result
            next_loss = self.criterion(next_sent_output, data["is_next"])

            # 2-2. NLLLoss of predicting masked token word
            mask_loss = self.criterion(mask_lm_output.transpose(1, 2), data["bert_label"])

            # 2-3. Adding next_loss and mask_loss : 3.4 Pre-training Procedure
            loss = next_loss + mask_loss

            # 3. backward and optimization only in train
            if train:
                self.optim_schedule.zero_grad()
                loss.backward()
                self.optim_schedule.step_and_update_lr()

            # next sentence prediction accuracy
            correct = next_sent_output.argmax(dim=-1).eq(data["is_next"]).sum().item()
            avg_loss += loss.item()
            total_correct += correct
            total_element += data["is_next"].nelement()

            post_fix = {
                "epoch": epoch,
                "iter": i,
                "avg_loss": avg_loss / (i + 1),
                "avg_acc": total_correct / total_element * 100,
                "loss": loss.item()
            }

            if i % self.log_freq == 0:
                data_iter.write(str(post_fix))

        print("EP%d_%s, avg_loss=" % (epoch, str_code), avg_loss / len(data_iter), "total_acc=",
              total_correct * 100.0 / total_element)

    def save(self, epoch, file_path="output/bert_trained.model"):
        """
        Saving the current BERT model on file_path
        :param epoch: current epoch number
        :param file_path: model output path which gonna be file_path+"ep%d" % epoch
        :return: final_output_path
        """
        output_path = file_path + ".ep%d" % epoch
        torch.save(self.bert.cpu(), output_path)
        self.bert.to(self.device)
        print("EP:%d Model Saved on:" % epoch, output_path)
        return output_path
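
The trainer imports ScheduledOptim from optim_schedule, which is not shown in this post. Below is a minimal sketch of the warm-up scheduler it could implement, following the Noam schedule from the Transformer paper (lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)); the repo's version may differ in detail.

"""
optim_schedule.py (sketch)
"""

import numpy as np


class ScheduledOptim:
    """A simple wrapper around an optimizer that handles learning-rate warm-up."""

    def __init__(self, optimizer, d_model, n_warmup_steps):
        self._optimizer = optimizer
        self.n_warmup_steps = n_warmup_steps
        self.n_current_steps = 0
        self.init_lr = np.power(d_model, -0.5)

    def step_and_update_lr(self):
        # update the learning rate, then take an optimizer step
        self._update_learning_rate()
        self._optimizer.step()

    def zero_grad(self):
        self._optimizer.zero_grad()

    def _get_lr_scale(self):
        # increases linearly during warm-up, then decays with 1/sqrt(step)
        return np.min([
            np.power(self.n_current_steps, -0.5),
            np.power(self.n_warmup_steps, -1.5) * self.n_current_steps])

    def _update_learning_rate(self):
        self.n_current_steps += 1
        lr = self.init_lr * self._get_lr_scale()
        for param_group in self._optimizer.param_groups:
            param_group['lr'] = lr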

 

 

The code above and the directory structure have been archived on GitHub.

https://github.com/needmoreamericano/NLP_Scratch2/tree/main/BERT

 



Ref

 

 

GitHub - codertimo/BERT-pytorch: Google AI 2018 BERT pytorch implementation
https://github.com/codertimo/BERT-pytorch
 
