Converting PyTorch Code to the HF Format

Tags: Template, Deep Learning, PyTorch, HuggingFace, Trainer, config, model, dataset

Published: 2024-08-06

Updated: 2024-12-10

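Introduction

This post explains how to convert existing PyTorch code into the format that HF (Hugging Face) supports, so that it can be dropped into the HF workflow and use HF utility functions. The conversion mainly involves the following parts: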

Config
Model
Trainer
Dataset

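Because the code inside a ckpt uses relative imports, it is recommended to keep configuration_xxx.py and modeling_xxx.py in the same directory during the conversion and to add an __init__.py file there.

In the HF framework, a model's ckpt folder usually contains the following files (assuming the model is ltgbert):

config.json: the model's concrete settings, such as num_heads and act_fn.
configuration_ltgbert.py: the code for the model's config class.
modeling_ltgbert.py: the code for the model classes and their component modules, plus some helper functions.
pytorch_model.bin: the model weights, i.e. the actual parameter values.
special_tokens_map.json: the special tokens of the model's tokenizer.
tokenizer_config.json: the tokenizer's settings, analogous to config.json.
tokenizer.json: the tokens and their ids, effectively the tokenizer's "weights".

Some ckpts with a custom tokenizer may also contain the code that defines it; this is not covered here. ~~(to be filled in later)~~

In some ckpts the weights are split across multiple files or stored in other formats; this is not covered here either.

The config and model class code may be absent from the folder, because HF keeps the code for commonly used models in the official library. On GitHub that folder is: transformers/src/transformers/models at main · huggingface/transformers.

Although a ckpt contains many files, the framework saves them for you automatically. The saving process can also be customized by overriding the relevant functions. ~~(to be filled in later)~~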
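As a quick sanity check on the layout described above, the sketch below reports which of the listed files are missing from a ckpt folder. The helper name `missing_checkpoint_files` and the decision to treat exactly these seven files as "expected" are assumptions made for illustration; a real ckpt may legitimately use sharded or differently formatted weights.

```python
from pathlib import Path

# Files listed in this post for a custom-code ckpt (model name: ltgbert).
# Treating exactly these as "expected" is an assumption of this sketch.
EXPECTED_FILES = (
    "config.json",
    "configuration_ltgbert.py",
    "modeling_ltgbert.py",
    "pytorch_model.bin",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "tokenizer.json",
)


def missing_checkpoint_files(ckpt_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_checkpoint_files("path/to/ckpt")` on a complete folder returns an empty list.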

Config

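Reference: Building custom models (huggingface.co)

The code mainly lives in configuration_xxx.py.

The config determines, through hyperparameters, the concrete structure of a model whose basic architecture stays fixed. It is passed in when the model is initialized or when weights are loaded, and the model is set up from it.

Example: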

import json

from transformers import PretrainedConfig


class LtgBertConfig(PretrainedConfig):
    """Configuration class to store the configuration of a `LtgBertModel`."""

    model_type = "LtgBert"

    def __init__(self,
                 vocab_size_or_config_json_file=16384,
                 hidden_size=768,
                 num_hidden_layers=12,
                 num_attention_heads=12,
                 intermediate_size=3072,
                 hidden_act="gelu",
                 hidden_dropout_prob=0.1,
                 attention_probs_dropout_prob=0.1,
                 max_position_embeddings=512,
                 type_vocab_size=2,
                 initializer_range=0.02,
                 output_all_encoded_layers=False,
                 require_all_hidden_states=True,
                 batch_first=True,
                 **kwargs):
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range
        self.output_all_encoded_layers = output_all_encoded_layers
        self.require_all_hidden_states = require_all_hidden_states
        self.batch_first = batch_first

        if isinstance(vocab_size_or_config_json_file, str):
            # A path was given: load every field from the JSON config file.
            with open(vocab_size_or_config_json_file, "r", encoding="utf-8") as reader:
                json_config = json.load(reader)
            for key, value in json_config.items():
                self.__dict__[key] = value
        elif isinstance(vocab_size_or_config_json_file, int):
            self.vocab_size = vocab_size_or_config_json_file
        else:
            raise ValueError("First argument must be either a vocabulary size (int) "
                             "or the path to a pretrained model config file (str)")
        # Forward the remaining keyword arguments to PretrainedConfig.
        super().__init__(**kwargs)

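Requirements:

It must inherit from PretrainedConfig.
__init__ must accept kwargs and pass them on through super().__init__.

model_type is the key under which the config is registered with the AutoClass machinery, so setting it is recommended.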
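To make the two requirements concrete, here is a minimal sketch of a config that satisfies them and survives a save/load round trip. `TinyConfig`, its `model_type` string, and its two fields are hypothetical names chosen for illustration; they are not part of the LtgBert code.

```python
import tempfile

from transformers import PretrainedConfig


class TinyConfig(PretrainedConfig):
    # Hypothetical identifier, used only in this sketch.
    model_type = "tiny-demo"

    def __init__(self, hidden_size=64, num_layers=2, **kwargs):
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # Requirement 2: accept **kwargs and forward them to the parent class.
        super().__init__(**kwargs)


def roundtrip(config):
    """Save the config as config.json in a temp folder and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        config.save_pretrained(tmp)
        return TinyConfig.from_pretrained(tmp)


reloaded = roundtrip(TinyConfig(hidden_size=128))
```

The reloaded object keeps the overridden `hidden_size` and falls back to the default for `num_layers`, which is the behaviour the framework relies on when it reads config.json from a ckpt.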

Model

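Reference: Building custom models (huggingface.co)

modeling_xxx.py holds the model's functions, modules, and main classes.

Functions are defined as usual; constants can also be placed alongside them.

Modules generally need to inherit from nn.Module; see the PyTorch tutorials for background on modules.

The model first needs a base class, which must be a subclass of transformers.PreTrainedModel. Its self.model should be the actual model body. It must implement __init__ and forward, and some parameters have fixed names; the model source code in the official library is a good reference.

Each downstream task then needs its own class, for example LlamaForCausalLM for autoregressive tasks. The requirements are mostly the same as for the base class, but such a class usually adds a small task-specific component such as a classification head, and some of the fixed parameters of forward may change. Again, imitating existing source code is usually enough.

In addition, each model class may have some special attributes, such as _tp_plan = {"lm_head": "colwise_rep"}, which the framework uses to make decisions about the model. Leaving them unset is usually fine, and the framework reports it when the code needs an attribute that is missing.

Taking llama as an example, its model definition file contains the following classes:

Module classes: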
- class LlamaRMSNorm(nn.Module)
- class LlamaRotaryEmbedding(nn.Module)
- class LlamaLinearScalingRotaryEmbedding(LlamaRotaryEmbedding)
- class LlamaDynamicNTKScalingRotaryEmbedding(LlamaRotaryEmbedding)
- class LlamaMLP(nn.Module)
- class LlamaAttention(nn.Module)
- class LlamaFlashAttention2(LlamaAttention)
- class LlamaSdpaAttention(LlamaAttention)
- class LlamaDecoderLayer(nn.Module)
Model classes:
- class LlamaPreTrainedModel(PreTrainedModel)
- class LlamaModel(LlamaPreTrainedModel)
- class LlamaForCausalLM(LlamaPreTrainedModel, GenerationMixin)
- class LlamaForSequenceClassification(LlamaPreTrainedModel)
- class LlamaForQuestionAnswering(LlamaPreTrainedModel)
- class LlamaForTokenClassification(LlamaPreTrainedModel)
Functions:
- The functions are not listed here, since they are straightforward to understand.

Example:

import torch.nn.functional as F
from transformers import PreTrainedModel
from transformers.modeling_outputs import MaskedLMOutput

from .configuration_ltgbert import LtgBertConfig


class LtgBertForMaskedLM(PreTrainedModel):
    config_class = LtgBertConfig

    def __init__(self, config, activation_checkpointing=False):
        super().__init__(config)
        # Bert is the original PyTorch model. Instead of holding it as a member,
        # this class could also inherit from it: LtgBertForMaskedLM(Bert).
        self.model = Bert(
            config=config,
            activation_checkpointing=activation_checkpointing
        )
        self.require_all_hidden_states = config.require_all_hidden_states
        self.batch_first = config.batch_first

    def forward(self, input_ids, attention_mask, masked_lm_labels=None):
        if self.batch_first:
            # The wrapped model expects the batch in the second dimension.
            input_ids = input_ids.transpose(0, 1)
            if masked_lm_labels is not None:
                masked_lm_labels = masked_lm_labels.transpose(0, 1)
        subword_prediction = self.model(input_ids, attention_mask, masked_lm_labels=masked_lm_labels)
        loss = None
        if masked_lm_labels is not None:
            # Only positions whose label is not -100 contribute to the loss.
            target_ids = masked_lm_labels.flatten()
            target_ids = target_ids[target_ids != -100]
            loss = F.cross_entropy(subword_prediction, target_ids)
        all_hidden_states = None
        if self.require_all_hidden_states:
            all_hidden_states = self.model.get_contextualized(input_ids=input_ids, attention_mask=attention_mask)
        if self.batch_first:
            # Move the batch dimension back to the front for the caller.
            if len(subword_prediction.size()) > 2:
                subword_prediction = subword_prediction.transpose(0, 1)
            if all_hidden_states is not None:
                all_hidden_states = [it.transpose(0, 1) for it in all_hidden_states]
        return MaskedLMOutput(
            loss=loss,
            logits=subword_prediction,
            hidden_states=all_hidden_states,
            attentions=None
        )

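For a custom model, each AutoClass usually has one model class registered to it, so you often need to write several custom model classes.

config_class links the model to its config class and is what registration with an AutoClass relies on, so setting it is recommended.

Following the principle of simplicity, the official guidance is for self.model to hold the original model, for example the one defined in plain PyTorch.

Pay attention to the return format of forward: transformers.modeling_outputs defines the output format for each kind of model's forward.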
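The classes in transformers.modeling_outputs behave like a mix of dataclass, dict, and tuple, which is why downstream HF code can consume them uniformly. The sketch below builds a `MaskedLMOutput` from dummy tensors (the shapes are arbitrary placeholders, not LtgBert's real shapes) and reads it back in the three supported ways.

```python
import torch
from transformers.modeling_outputs import MaskedLMOutput

# Dummy tensors standing in for a real forward pass (batch=2, seq=4, vocab=10).
logits = torch.zeros(2, 4, 10)
loss = torch.tensor(1.5)

output = MaskedLMOutput(loss=loss, logits=logits)

# The same field is reachable as an attribute or by key.
by_attr = output.logits
by_key = output["logits"]

# to_tuple() keeps only the fields that are not None (here: loss, logits).
as_tuple = output.to_tuple()
```

Fields that were never set, such as `hidden_states` and `attentions` here, are simply `None` and are skipped in the tuple form.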

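Each specific subtask has a similar model class. For functions such as forward, it is best to follow existing code, because the HF framework may pass extra parameters when using a task-specific model. For example, for sequence classification (SequenceClassification), the forward of the corresponding model looks like this: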

def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        return_dict: Optional[bool] = None,
        labels: Optional[torch.LongTensor] = None,
    ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
        if self.batch_first:
            # The wrapped model expects the batch in the second dimension.
            input_ids = input_ids.transpose(0, 1)
        contextualized_embeddings = self.model.get_contextualized(input_ids, attention_mask)
        if self.batch_first:
            contextualized_embeddings = contextualized_embeddings.transpose(0, 1)
        # Classify from the representation of the first token.
        logits = self.head(contextualized_embeddings[:, 0, :])
        loss = None
        if labels is not None:
            # Infer the problem type once, from num_labels and the label dtype.
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = nn.MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = nn.CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = nn.BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)

        # Returning attentions or hidden states is not supported here.
        assert output_attentions is None
        assert output_hidden_states is None
        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=contextualized_embeddings if output_hidden_states else None,
            attentions=None
        )

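Here the HF framework adds fields such as problem_type to the config.

Registration

If the config.json in the ckpt folder does not contain an auto_map that declares the AutoClass registrations, such as: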

"auto_map": {
  "AutoConfig": "configuration_ltgbert.LtgBertConfig",
  "AutoModelForMaskedLM": "modeling_ltgbert.LtgBertForMaskedLM",
  "AutoModelForSequenceClassification": "modeling_ltgbert.LtgBertForSequenceClassification",
  "AutoModelForCausalLM": "modeling_ltgbert.LtgBertForCausalLM"
}

then they have to be registered manually by adding the following to the code that loads the ckpt:

AutoConfig.register("LtgBert", LtgBertConfig)
AutoModelForMaskedLM.register(LtgBertConfig, LtgBertForMaskedLM)
AutoModelForSequenceClassification.register(LtgBertConfig, LtgBertForSequenceClassification)
AutoModelForCausalLM.register(LtgBertConfig, LtgBertForCausalLM)

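If you want config.json to include auto_map automatically when the model is saved, add the following to the model code file (this is not needed if the model was loaded from a ckpt):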

LtgBertConfig.register_for_auto_class()
LtgBertForMaskedLM.register_for_auto_class("AutoModelForMaskedLM")
LtgBertForSequenceClassification.register_for_auto_class("AutoModelForSequenceClassification")
LtgBertForCausalLM.register_for_auto_class("AutoModelForCausalLM")

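I later found that registration alone can still leave auto_map incomplete in config.json (I have not investigated why yet). In that case, consider forcing it directly in the config's __init__: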

def __init__(self,....):
   ...
   self.auto_map={
    "AutoConfig": "configuration_ltgbert.LtgBertConfig",
    "AutoModelForMaskedLM": "modeling_ltgbert.LtgBertForMaskedLM",
    "AutoModelForSequenceClassification": "modeling_ltgbert.LtgBertForSequenceClassification",
    "AutoModelForCausalLM": "modeling_ltgbert.LtgBertForCausalLM"
   }
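For a ckpt that has already been saved with an incomplete auto_map, another possible workaround is to patch config.json on disk. This is only a sketch and not part of the workflow above: the helper name `ensure_auto_map` is hypothetical, and it assumes that entries already present in the file should take precedence over the defaults.

```python
import json
from pathlib import Path

# Same mapping as in the config.json example above.
AUTO_MAP = {
    "AutoConfig": "configuration_ltgbert.LtgBertConfig",
    "AutoModelForMaskedLM": "modeling_ltgbert.LtgBertForMaskedLM",
    "AutoModelForSequenceClassification": "modeling_ltgbert.LtgBertForSequenceClassification",
    "AutoModelForCausalLM": "modeling_ltgbert.LtgBertForCausalLM",
}


def ensure_auto_map(ckpt_dir, auto_map=AUTO_MAP):
    """Add missing auto_map entries to config.json; return True if the file changed."""
    config_path = Path(ckpt_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    existing = config.get("auto_map", {})
    # Entries already in the file win over the defaults (an assumption of this sketch).
    merged = {**auto_map, **existing}
    if merged == existing:
        return False
    config["auto_map"] = merged
    config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
    return True
```

Running it twice on the same folder is safe: the second call finds nothing to add and leaves the file untouched.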

Trainer

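Converting the training loop mainly involves HF's Trainer class; see Trainer (huggingface.co) and Trainer (huggingface.co).

Trainer splits training into several stages. The flow is customized by subclassing and overriding the relevant methods, and hyperparameters are set through arguments; see the references for details.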
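As one illustration of "subclass and override", the sketch below replaces the loss computation with a class-weighted cross entropy. The class name `WeightedLossTrainer` and the weights are hypothetical, and it assumes a classification model whose output has a `logits` field and whose batches contain a `labels` key.

```python
import torch
from torch import nn
from transformers import Trainer


class WeightedLossTrainer(Trainer):
    """Trainer that computes a class-weighted cross entropy instead of the model's own loss."""

    # Hypothetical per-class weights for a 2-class task.
    class_weights = (1.0, 3.0)

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments (such as num_items_in_batch) that
        # newer transformers versions pass to compute_loss.
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        weights = torch.tensor(self.class_weights, device=logits.device, dtype=logits.dtype)
        loss = nn.functional.cross_entropy(logits, labels, weight=weights)
        return (loss, outputs) if return_outputs else loss
```

The subclass is then used exactly like Trainer itself, with the model, a TrainingArguments instance, and the datasets passed to its constructor.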

Dataset

A dataset can inherit from torch.utils.data.Dataset, but pay attention to __getitem__: by default it should return a dict, whose keys are passed to the model as keyword arguments.
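A minimal sketch of such a dataset follows. `ToyClassificationDataset` and its constructor arguments are hypothetical; the key names assume a model whose forward accepts input_ids, attention_mask, and labels, as in the sequence classification example earlier.

```python
import torch
from torch.utils.data import Dataset


class ToyClassificationDataset(Dataset):
    """Pads token-id lists to a fixed length and returns dict-shaped examples."""

    def __init__(self, token_ids, labels, max_length=8, pad_id=0):
        assert len(token_ids) == len(labels)
        self.token_ids = token_ids
        self.labels = labels
        self.max_length = max_length
        self.pad_id = pad_id

    def __len__(self):
        return len(self.token_ids)

    def __getitem__(self, index):
        ids = self.token_ids[index][: self.max_length]
        padding = self.max_length - len(ids)
        return {
            # Keys must match the parameter names of the model's forward().
            "input_ids": torch.tensor(ids + [self.pad_id] * padding, dtype=torch.long),
            "attention_mask": torch.tensor([1] * len(ids) + [0] * padding, dtype=torch.long),
            "labels": torch.tensor(self.labels[index], dtype=torch.long),
        }


dataset = ToyClassificationDataset([[5, 6, 7], [8, 9]], labels=[1, 0], max_length=4)
```

Because every example already has the same length, the default collation can stack these dicts into batches without a custom collator.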

References

Building custom models (huggingface.co)

Trainer (huggingface.co)

Trainer (huggingface.co)

bg51717

https://bg51717.github.io/61054/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit bg51717 when reposting!
