Introduction
This post explains how to convert an existing PyTorch codebase into a format supported by HuggingFace (HF), so that it can be dropped into the HF workflow and use HF utilities. The conversion mainly involves the following parts:

- Config
- Model
- Trainer
- Dataset
Because the code inside a checkpoint is loaded with relative imports, it is recommended during the conversion to put `configuration_xxx.py` and `modeling_xxx.py` in the same directory and to add an `__init__.py` file.
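For instance, the checkpoint folder might be laid out as follows (the file names match the `auto_map` examples later in this post; adjust them to your own model name):

```
ckpt/
├── __init__.py
├── configuration_ltgbert.py   # LtgBertConfig
├── modeling_ltgbert.py        # LtgBertForMaskedLM and other task models
└── config.json
```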
Config
Reference: Building custom models (huggingface.co)
Example:
```python
import json

from transformers import PretrainedConfig


class LtgBertConfig(PretrainedConfig):
    """Configuration class to store the configuration of a `LtgBertModel`."""

    model_type = "LtgBert"

    def __init__(
        self,
        vocab_size_or_config_json_file=16384,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=512,
        type_vocab_size=2,
        initializer_range=0.02,
        output_all_encoded_layers=False,
        require_all_hidden_states=True,
        batch_first=True,
        **kwargs,
    ):
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range
        self.output_all_encoded_layers = output_all_encoded_layers
        self.require_all_hidden_states = require_all_hidden_states
        self.batch_first = batch_first
        # The first argument is either a vocabulary size or the path to a JSON config file.
        if isinstance(vocab_size_or_config_json_file, str):
            with open(vocab_size_or_config_json_file, "r", encoding="utf-8") as reader:
                json_config = json.loads(reader.read())
            for key, value in json_config.items():
                self.__dict__[key] = value
        elif isinstance(vocab_size_or_config_json_file, int):
            self.vocab_size = vocab_size_or_config_json_file
        else:
            raise ValueError(
                "First argument must be either a vocabulary size (int) "
                "or the path to a pretrained model config file (str)"
            )
        super().__init__(**kwargs)
```
The config must satisfy the following:

- It inherits from `PretrainedConfig`.
- `__init__` accepts `**kwargs` and passes them on via `super().__init__`.
- `model_type` is what registers the model with the Auto classes; setting it is recommended.
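As a quick check that the config behaves like a normal HF config, here is a minimal sketch, assuming the class above lives in `configuration_ltgbert.py` and the output directory name is illustrative:

```python
from configuration_ltgbert import LtgBertConfig

config = LtgBertConfig(vocab_size_or_config_json_file=16384, hidden_size=768)
config.save_pretrained("./ltg_bert_ckpt")               # writes config.json
reloaded = LtgBertConfig.from_pretrained("./ltg_bert_ckpt")
print(reloaded.model_type, reloaded.hidden_size)        # "LtgBert" 768
```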
Model
Reference: Building custom models (huggingface.co)
Example:
```python
import torch.nn.functional as F
from transformers import PreTrainedModel
from transformers.modeling_outputs import MaskedLMOutput

from .configuration_ltgbert import LtgBertConfig
# `Bert` is the original, plain-PyTorch model; import it from wherever it is defined.


class LtgBertForMaskedLM(PreTrainedModel):
    config_class = LtgBertConfig

    def __init__(self, config, activation_checkpointing=False):
        super().__init__(config)
        self.model = Bert(
            config=config,
            activation_checkpointing=activation_checkpointing,
        )
        self.require_all_hidden_states = config.require_all_hidden_states
        self.batch_first = config.batch_first

    def forward(self, input_ids, attention_mask, masked_lm_labels=None):
        if self.batch_first:
            # The underlying model keeps the batch on the second dimension.
            input_ids = input_ids.transpose(0, 1)
            if masked_lm_labels is not None:
                masked_lm_labels = masked_lm_labels.transpose(0, 1)
        subword_prediction = self.model(input_ids, attention_mask, masked_lm_labels=masked_lm_labels)

        loss = None
        if masked_lm_labels is not None:
            target_ids = masked_lm_labels.flatten()
            target_ids = target_ids[target_ids != -100]
            loss = F.cross_entropy(subword_prediction, target_ids)

        all_hidden_states = None
        if self.require_all_hidden_states:
            all_hidden_states = self.model.get_contextualized(input_ids=input_ids, attention_mask=attention_mask)

        if self.batch_first:
            if len(subword_prediction.size()) > 2:
                subword_prediction = subword_prediction.transpose(0, 1)
            if all_hidden_states is not None:
                all_hidden_states = [it.transpose(0, 1) for it in all_hidden_states]

        return MaskedLMOutput(
            loss=loss,
            logits=subword_prediction,
            hidden_states=all_hidden_states,
            attentions=None,
        )
```
Notes:

- A custom model is usually registered under each Auto class it should support, so you will often write several custom model classes (one per task).
- `config_class` links the model to its configuration class so the Auto classes can resolve it; setting it is recommended.
- To keep the wrapper minimal, the official guide stores the original model (e.g. the one defined in plain PyTorch) in `self.model`.
- The `forward` method must return results in the expected format; `transformers.modeling_outputs` defines the output format of `forward` for each kind of model.
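For reference, these output containers can be imported directly from `transformers.modeling_outputs`; a few commonly used ones (non-exhaustive) are:

```python
from transformers.modeling_outputs import (
    MaskedLMOutput,                # masked language modeling
    SequenceClassifierOutput,      # sequence classification
    TokenClassifierOutput,         # token classification
    QuestionAnsweringModelOutput,  # extractive question answering
)
```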
Each specific downstream task has a corresponding model class. For methods such as `forward`, it is best to follow existing implementations, because the HF framework may pass extra arguments when it drives a task-specific model. For example, for the sequence classification task (SequenceClassification), the relevant model's `forward` looks like this:
```python
# Method of the sequence-classification model; Optional/Union/Tuple come from
# typing, nn from torch.nn.
def forward(
    self,
    input_ids: Optional[torch.Tensor] = None,
    attention_mask: Optional[torch.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    inputs_embeds: Optional[torch.Tensor] = None,
    return_dict: Optional[bool] = None,
    labels: Optional[torch.LongTensor] = None,
) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
    if self.batch_first:
        # The underlying model keeps the batch on the second dimension.
        input_ids = input_ids.transpose(0, 1)
    contextualized_embeddings = self.model.get_contextualized(input_ids, attention_mask)
    if self.batch_first:
        contextualized_embeddings = contextualized_embeddings.transpose(0, 1)
    logits = self.head(contextualized_embeddings[:, 0, :])

    loss = None
    if labels is not None:
        if self.config.problem_type is None:
            if self.num_labels == 1:
                self.config.problem_type = "regression"
            elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                self.config.problem_type = "single_label_classification"
            else:
                self.config.problem_type = "multi_label_classification"

        if self.config.problem_type == "regression":
            loss_fct = nn.MSELoss()
            if self.num_labels == 1:
                loss = loss_fct(logits.squeeze(), labels.squeeze())
            else:
                loss = loss_fct(logits, labels)
        elif self.config.problem_type == "single_label_classification":
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        elif self.config.problem_type == "multi_label_classification":
            loss_fct = nn.BCEWithLogitsLoss()
            loss = loss_fct(logits, labels)

    assert output_attentions is None
    assert output_hidden_states is None
    return SequenceClassifierOutput(
        loss=loss,
        logits=logits,
        hidden_states=contextualized_embeddings if output_hidden_states else None,
        attentions=None,
    )
```
Here the HF framework fills in fields such as `problem_type` on the config.
Registration
If the `config.json` in the checkpoint folder does not contain an `auto_map` entry declaring the AutoClass registration, such as:
```json
"auto_map": {
    "AutoConfig": "configuration_ltgbert.LtgBertConfig",
    "AutoModelForMaskedLM": "modeling_ltgbert.LtgBertForMaskedLM"
}
```
then you need to register manually, by adding the following to the code that loads the checkpoint:
```python
AutoConfig.register("LtgBert", LtgBertConfig)
AutoModelForMaskedLM.register(LtgBertConfig, LtgBertForMaskedLM)
```
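Either way, once the mapping is in place the checkpoint can be loaded through the Auto classes. A minimal sketch (the checkpoint path is illustrative; `trust_remote_code=True` is needed when the custom code ships inside the checkpoint folder via `auto_map`):

```python
from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("./ltg_bert_ckpt", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("./ltg_bert_ckpt", trust_remote_code=True)
```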
If you want `config.json` to automatically include `auto_map` when the model is saved, add the following code (not needed if the model was loaded from a checkpoint):
```python
LtgBertConfig.register_for_auto_class()
LtgBertForMaskedLM.register_for_auto_class("AutoModelForMaskedLM")
```
I later found that registration alone can still leave the `auto_map` in `config.json` incomplete (the cause has not been investigated yet). A workaround is to set it explicitly in the config's `__init__`:
```python
def __init__(self, ...):
    ...
    self.auto_map = {
        "AutoConfig": "configuration_ltgbert.LtgBertConfig",
        "AutoModelForMaskedLM": "modeling_ltgbert.LtgBertForMaskedLM",
    }
```
Trainer
Converting the training loop mainly involves HF's `Trainer` class; refer to the Trainer documentation (huggingface.co). `Trainer` splits training into several stages: you can customize the pipeline by subclassing it and overriding the relevant methods, and set hyperparameters through its arguments. See the references for details.
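As an illustration only (not the exact setup used in this post), here is a sketch of overriding one stage via `compute_loss` and passing hyperparameters through `TrainingArguments`; `my_model`, `train_set`, and the argument values are placeholders:

```python
from transformers import Trainer, TrainingArguments

class MyTrainer(Trainer):
    # **kwargs keeps the override compatible with newer Trainer versions that
    # pass extra keyword arguments (e.g. num_items_in_batch).
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Customize one stage: here we simply reuse the loss the model computes.
        outputs = model(**inputs)
        loss = outputs.loss
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="./out",               # placeholder values
    per_device_train_batch_size=32,
    learning_rate=1e-4,
    num_train_epochs=3,
)

trainer = MyTrainer(model=my_model, args=args, train_dataset=train_set)
trainer.train()
```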
Dataset
The dataset can inherit from `torch.utils.data.Dataset`, but pay attention to `__getitem__`: by default it needs to return a `dict`, so that each field can be passed to the model as a keyword argument.
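A minimal sketch of such a dataset (the field names are assumptions; they must match the keyword arguments of your model's `forward`, plus the label name the `Trainer` expects):

```python
import torch
from torch.utils.data import Dataset

class MaskedLMDataset(Dataset):
    def __init__(self, encodings):
        # `encodings` is assumed to be a dict of equal-length lists, e.g.
        # {"input_ids": [...], "attention_mask": [...], "labels": [...]}.
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        # Return a dict: each key becomes a keyword argument of the model's forward.
        return {key: torch.tensor(values[idx]) for key, values in self.encodings.items()}
```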
References

- Building custom models (huggingface.co)
- Trainer (huggingface.co)