Converting PyTorch Code to HF


Introduction

This post describes how to convert existing PyTorch code into a format supported by HF, so that it can be plugged into HF workflows and make use of HF utilities. The conversion mainly involves the following parts:

  • Config
  • Model
  • Trainer
  • Dataset

Because the code inside a checkpoint uses relative imports, it is recommended to put the configuration_xxx.py and modeling_xxx.py files in the same directory during the conversion and to add an __init__.py file.
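A minimal sketch of such a package layout (hypothetical directory name; the file and class names follow the LtgBert example used below):

# ltgbert/
# ├── __init__.py
# ├── configuration_ltgbert.py   # defines LtgBertConfig
# └── modeling_ltgbert.py        # defines LtgBertForMaskedLM, etc.
#
# __init__.py can simply re-export the classes through relative imports:
from .configuration_ltgbert import LtgBertConfig
from .modeling_ltgbert import LtgBertForMaskedLM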

Config

Reference: Building custom models (huggingface.co)

Example:

import json
import sys
from typing import List

from transformers import PretrainedConfig


class LtgBertConfig(PretrainedConfig):
    """Configuration class to store the configuration of a `LtgBertModel`."""

    model_type = "LtgBert"

    def __init__(self,
                 vocab_size_or_config_json_file=16384,
                 hidden_size=768,
                 num_hidden_layers=12,
                 num_attention_heads=12,
                 intermediate_size=3072,
                 hidden_act="gelu",
                 hidden_dropout_prob=0.1,
                 attention_probs_dropout_prob=0.1,
                 max_position_embeddings=512,
                 type_vocab_size=2,
                 initializer_range=0.02,
                 output_all_encoded_layers=False,
                 require_all_hidden_states=True,
                 batch_first=True,
                 **kwargs):
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range
        self.output_all_encoded_layers = output_all_encoded_layers
        self.require_all_hidden_states = require_all_hidden_states
        self.batch_first=batch_first

        if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2
                        and isinstance(vocab_size_or_config_json_file, unicode)):
            with open(vocab_size_or_config_json_file, "r", encoding='utf-8') as reader:
                json_config = json.loads(reader.read())
            for key, value in json_config.items():
                self.__dict__[key] = value
        elif isinstance(vocab_size_or_config_json_file, int):
            self.vocab_size = vocab_size_or_config_json_file
        else:
            raise ValueError("First argument must be either a vocabulary size (int) "
                             "or the path to a pretrained model config file (str)")
        super(LtgBertConfig, self).__init__(**kwargs)

The following requirements must be met:

  • Inherit from PretrainedConfig
  • The __init__ method accepts kwargs and passes them on via super().__init__

model_type is used when registering the model with the AutoClasses; setting it is recommended.
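As a quick check, the custom config behaves like any built-in configuration; a minimal sketch, assuming LtgBertConfig is importable and using a hypothetical ./ltgbert-ckpt directory:

config = LtgBertConfig(hidden_size=768, num_hidden_layers=12)
config.save_pretrained("./ltgbert-ckpt")                   # writes config.json
reloaded = LtgBertConfig.from_pretrained("./ltgbert-ckpt")
print(reloaded.model_type)                                 # "LtgBert"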

Model

Reference: Building custom models (huggingface.co)

Example:

import torch.nn.functional as F
from transformers import PreTrainedModel
from transformers.modeling_outputs import MaskedLMOutput


class LtgBertForMaskedLM(PreTrainedModel):
    config_class = LtgBertConfig

    def __init__(self, config, activation_checkpointing=False):
        super().__init__(config)
        # Alternatively, make this a subclass of the original model: LtgBertForMaskedLM(Bert)
        self.model = Bert(  # Bert is the original PyTorch implementation
            config=config,
            activation_checkpointing=activation_checkpointing
        )
        self.require_all_hidden_states = config.require_all_hidden_states
        self.batch_first = config.batch_first
  
    def forward(self, input_ids, attention_mask, masked_lm_labels=None):
        if self.batch_first:
            # the underlying model puts the batch on the second dimension
            input_ids=input_ids.transpose(0,1)
            if masked_lm_labels is not None:
                masked_lm_labels=masked_lm_labels.transpose(0,1)
        subword_prediction=self.model(input_ids, attention_mask, masked_lm_labels=masked_lm_labels)
        loss=None
        if masked_lm_labels is not None:
            target_ids = masked_lm_labels.flatten()
            target_ids = target_ids[target_ids != -100]
            loss = F.cross_entropy(subword_prediction, target_ids)
        all_hidden_states=None
        if self.require_all_hidden_states:
            all_hidden_states=self.model.get_contextualized(input_ids=input_ids,attention_mask=attention_mask)
        if self.batch_first:
            if len(subword_prediction.size())>2:
                subword_prediction=subword_prediction.transpose(0,1)
            if all_hidden_states is not None:
                all_hidden_states=[it.transpose(0,1) for it in all_hidden_states]
        return MaskedLMOutput(
            loss=loss,
            logits=subword_prediction,
            hidden_states=all_hidden_states,
            attentions=None
        )

For a custom model, you usually register one model class per AutoClass, so you often need to write several custom model classes.

config_class is used when registering the model with the AutoClasses; setting it is recommended.

Following the principle of keeping things simple, the official guide expects self.model to hold the original model, e.g. the model defined in plain PyTorch.

Pay attention to the return format of forward: transformers.modeling_outputs defines the output format for each kind of model's forward.
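These output classes can be accessed both by attribute and by key; a small sketch with MaskedLMOutput (the tensor shapes are arbitrary):

import torch
from transformers.modeling_outputs import MaskedLMOutput

out = MaskedLMOutput(loss=torch.tensor(1.23), logits=torch.zeros(2, 128, 16384))
print(out.loss)                 # attribute access
print(out["logits"].shape)      # dict-style access also works
print(out.keys())               # fields left as None are omitted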

Each specific downstream task has a similar model class. For some methods such as forward, it is best to follow existing implementations, because the HF framework may pass extra arguments when it invokes a model for a specific task. For example, for the sequence-classification task (SequenceClassification), the relevant forward is:

def forward(
    self,
    input_ids: Optional[torch.Tensor] = None,
    attention_mask: Optional[torch.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    inputs_embeds: Optional[torch.Tensor] = None,
    return_dict: Optional[bool] = None,
    labels: Optional[torch.LongTensor] = None,
) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
    if self.batch_first:
        # the underlying model puts the batch on the second dimension
        input_ids = input_ids.transpose(0, 1)
    contextualized_embeddings = self.model.get_contextualized(input_ids, attention_mask)
    if self.batch_first:
        contextualized_embeddings = contextualized_embeddings.transpose(0, 1)
    logits = self.head(contextualized_embeddings[:, 0, :])

    loss = None
    if labels is not None:
        if self.config.problem_type is None:
            if self.num_labels == 1:
                self.config.problem_type = "regression"
            elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                self.config.problem_type = "single_label_classification"
            else:
                self.config.problem_type = "multi_label_classification"

        if self.config.problem_type == "regression":
            loss_fct = nn.MSELoss()
            if self.num_labels == 1:
                loss = loss_fct(logits.squeeze(), labels.squeeze())
            else:
                loss = loss_fct(logits, labels)
        elif self.config.problem_type == "single_label_classification":
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        elif self.config.problem_type == "multi_label_classification":
            loss_fct = nn.BCEWithLogitsLoss()
            loss = loss_fct(logits, labels)

    assert output_attentions is None
    assert output_hidden_states is None
    return SequenceClassifierOutput(
        loss=loss,
        logits=logits,
        hidden_states=contextualized_embeddings if output_hidden_states else None,
        attentions=None
    )

Here the HF framework will add fields such as problem_type to the config.
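For example, such fields can be passed straight through from_pretrained and are picked up by the loss branches above; a sketch, assuming a sequence-classification class has been registered (see the next section) and using a hypothetical checkpoint path:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "./ltgbert-ckpt",                            # hypothetical checkpoint directory
    num_labels=3,
    problem_type="single_label_classification",
)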

Registration

If config.json in the checkpoint folder does not contain an auto_map that declares the AutoClass registration:

"auto_map": {
  "AutoConfig": "configuration_ltgbert.LtgBertConfig",
  "AutoModelForMaskedLM": "modeling_ltgbert.LtgBertForMaskedLM"
}

then it has to be added manually; in the code that loads the checkpoint, add:

AutoConfig.register("LtgBert", LtgBertConfig)
AutoModelForMaskedLM.register(LtgBertConfig, LtgBertForMaskedLM)
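
After these two register calls, the usual Auto API resolves to the local classes; a minimal sketch with a hypothetical checkpoint path:

from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("./ltgbert-ckpt")
model = AutoModelForMaskedLM.from_pretrained("./ltgbert-ckpt", config=config)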

If you want the saved config.json to automatically include auto_map, you can add the following code (not needed if the model is loaded from a checkpoint):

LtgBertConfig.register_for_auto_class()
LtgBertForMaskedLM.register_for_auto_class("AutoModelForMaskedLM")

Later I found that registration alone can still leave the auto_map in config.json incomplete (I have not investigated why); as a workaround, you can set it explicitly in the config's __init__:

def __init__(self, ...):
    ...
    self.auto_map = {
        "AutoConfig": "configuration_ltgbert.LtgBertConfig",
        "AutoModelForMaskedLM": "modeling_ltgbert.LtgBertForMaskedLM"
    }

Trainer

Converting the training loop mainly involves HF's Trainer class; see Trainer (huggingface.co).

Trainer breaks training into several stages: you can customize the workflow by subclassing it and overriding the relevant methods, and hyperparameters are set through its arguments. See the referenced documentation for details.
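A minimal sketch of the usual pattern (model, train_dataset, and eval_dataset are placeholders; to customize a stage, subclass Trainer and override the corresponding method, e.g. compute_loss):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=32,
    learning_rate=5e-5,
    num_train_epochs=3,
    logging_steps=100,
)

trainer = Trainer(
    model=model,                  # e.g. the LtgBertForMaskedLM instance above
    args=training_args,
    train_dataset=train_dataset,  # __getitem__ must return a dict, see below
    eval_dataset=eval_dataset,
)
trainer.train()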

Dataset

The dataset can inherit from torch.utils.data.Dataset, but note __getitem__: by default it must return a dict, so that its entries can be matched to the model's forward arguments.
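A minimal sketch, assuming the examples are already tokenized into ID lists; the dict keys must match the forward signature of the model being trained:

import torch
from torch.utils.data import Dataset

class MlmDataset(Dataset):
    def __init__(self, examples):
        self.examples = examples  # list of token-id lists

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ids = torch.tensor(self.examples[idx], dtype=torch.long)
        # each item is a dict whose keys match the model's forward() arguments
        return {
            "input_ids": ids,
            "attention_mask": torch.ones_like(ids, dtype=torch.bool),
            "labels": ids.clone(),
        }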

References

  • Building custom models (huggingface.co)
  • Trainer (huggingface.co)
