Introduction
This post describes how to convert an existing PyTorch codebase into the format expected by Hugging Face (HF), so that it can be dropped into the HF workflow and make use of HF utilities. The conversion mainly touches the following parts:
- Config
- Model
- Trainer
- Dataset
Because the code stored alongside a checkpoint uses relative imports, it is recommended during the conversion to put configuration_xxx.py and modeling_xxx.py in the same directory and to add an __init__.py file.
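For example, the layout could look like the sketch below (the file names match the auto_map shown later in this post; adapt them to your own model):

ltgbert/
    __init__.py
    configuration_ltgbert.py   # defines LtgBertConfig
    modeling_ltgbert.py        # defines LtgBertForMaskedLM and other task-specific models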
Config
Reference: Building custom models (huggingface.co)
Example:
import json

from transformers import PretrainedConfig


class LtgBertConfig(PretrainedConfig):
    """Configuration class to store the configuration of a `LtgBertModel`."""

    # model_type is used when registering the config/model with the Auto classes
    model_type = "LtgBert"

    def __init__(self,
                 vocab_size_or_config_json_file=16384,
                 hidden_size=768,
                 num_hidden_layers=12,
                 num_attention_heads=12,
                 intermediate_size=3072,
                 hidden_act="gelu",
                 hidden_dropout_prob=0.1,
                 attention_probs_dropout_prob=0.1,
                 max_position_embeddings=512,
                 type_vocab_size=2,
                 initializer_range=0.02,
                 output_all_encoded_layers=False,
                 require_all_hidden_states=True,
                 batch_first=True,
                 **kwargs):
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range
        self.output_all_encoded_layers = output_all_encoded_layers
        self.require_all_hidden_states = require_all_hidden_states
        self.batch_first = batch_first
        # the first argument is either a vocabulary size or a path to a JSON config file
        if isinstance(vocab_size_or_config_json_file, str):
            with open(vocab_size_or_config_json_file, "r", encoding="utf-8") as reader:
                json_config = json.loads(reader.read())
            for key, value in json_config.items():
                self.__dict__[key] = value
        elif isinstance(vocab_size_or_config_json_file, int):
            self.vocab_size = vocab_size_or_config_json_file
        else:
            raise ValueError("First argument must be either a vocabulary size (int) "
                             "or the path to a pretrained model config file (str)")
        # remaining kwargs must be forwarded to PretrainedConfig
        super().__init__(**kwargs)
The config must satisfy:
- it inherits from PretrainedConfig;
- its __init__ accepts **kwargs and forwards them via super().__init__.
model_type is what ties the config to the Auto classes when registering; setting it is recommended.
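A quick way to check that the config plays well with the HF machinery is to round-trip it through save_pretrained / from_pretrained (a minimal sketch; the directory name and field values are arbitrary):

config = LtgBertConfig(hidden_size=768, num_hidden_layers=12)
config.save_pretrained("ltgbert-test")                  # writes ltgbert-test/config.json
reloaded = LtgBertConfig.from_pretrained("ltgbert-test")
print(reloaded.model_type, reloaded.hidden_size)        # LtgBert 768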
Model
Reference: Building custom models (huggingface.co)
Example:
import torch.nn.functional as F
from transformers import PreTrainedModel
from transformers.modeling_outputs import MaskedLMOutput

# `Bert` below is the original plain-PyTorch model; import it from wherever it lives in your code


class LtgBertForMaskedLM(PreTrainedModel):
    config_class = LtgBertConfig

    def __init__(self, config, activation_checkpointing=False):
        super().__init__(config)
        # alternatively, inheritance could be used instead of composition:
        # class LtgBertForMaskedLM(Bert): ...
        self.model = Bert(
            config=config,
            activation_checkpointing=activation_checkpointing
        )
        self.require_all_hidden_states = config.require_all_hidden_states
        self.batch_first = config.batch_first

    def forward(self, input_ids, attention_mask, masked_lm_labels=None):
        if self.batch_first:
            # the original model puts the batch in the second dimension
            input_ids = input_ids.transpose(0, 1)
            if masked_lm_labels is not None:
                masked_lm_labels = masked_lm_labels.transpose(0, 1)
        subword_prediction = self.model(input_ids, attention_mask, masked_lm_labels=masked_lm_labels)

        loss = None
        if masked_lm_labels is not None:
            target_ids = masked_lm_labels.flatten()
            target_ids = target_ids[target_ids != -100]
            loss = F.cross_entropy(subword_prediction, target_ids)

        all_hidden_states = None
        if self.require_all_hidden_states:
            all_hidden_states = self.model.get_contextualized(input_ids=input_ids, attention_mask=attention_mask)

        if self.batch_first:
            if len(subword_prediction.size()) > 2:
                subword_prediction = subword_prediction.transpose(0, 1)
            if all_hidden_states is not None:
                all_hidden_states = [it.transpose(0, 1) for it in all_hidden_states]

        # return one of the output classes defined in transformers.modeling_outputs
        return MaskedLMOutput(
            loss=loss,
            logits=subword_prediction,
            hidden_states=all_hidden_states,
            attentions=None
        )
For a custom model, a separate class is usually registered on each Auto class, so you typically end up writing several custom model classes. The config_class attribute ties the model to its configuration class and is what the Auto classes use to map a config to the model; setting it is recommended. For the sake of simplicity, the official guide expects self.model to hold the original model, e.g. the one defined in plain PyTorch. The forward method needs to return results in the right format; transformers.modeling_outputs defines the output class for each kind of model.
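Assuming the original Bert implementation is importable, a quick sanity check of the wrapper above could look like this (a sketch; the shapes are arbitrary and the attention-mask format depends on the original code):

import torch

config = LtgBertConfig(vocab_size_or_config_json_file=16384)
model = LtgBertForMaskedLM(config)

input_ids = torch.randint(0, config.vocab_size, (2, 128))    # (batch, seq_len)
attention_mask = torch.zeros(2, 128, dtype=torch.bool)       # padding mask, format assumed
out = model(input_ids=input_ids, attention_mask=attention_mask)
print(type(out), out.logits.shape)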
Each specific downstream task has its own, similar model class. For some methods such as forward it is best to follow existing implementations, because the HF framework may pass extra arguments when it drives a task-specific model. For example, for the sequence-classification task (SequenceClassification), the forward of the corresponding model is:
def forward(
    self,
    input_ids: Optional[torch.Tensor] = None,
    attention_mask: Optional[torch.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    inputs_embeds: Optional[torch.Tensor] = None,
    return_dict: Optional[bool] = None,
    labels: Optional[torch.LongTensor] = None,
) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
    if self.batch_first:
        # the original model puts the batch in the second dimension
        input_ids = input_ids.transpose(0, 1)
    contextualized_embeddings = self.model.get_contextualized(input_ids, attention_mask)
    if self.batch_first:
        contextualized_embeddings = contextualized_embeddings.transpose(0, 1)
    logits = self.head(contextualized_embeddings[:, 0, :])

    loss = None
    if labels is not None:
        # problem_type is filled in by the HF framework (see the note below)
        if self.config.problem_type is None:
            if self.num_labels == 1:
                self.config.problem_type = "regression"
            elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                self.config.problem_type = "single_label_classification"
            else:
                self.config.problem_type = "multi_label_classification"

        if self.config.problem_type == "regression":
            loss_fct = nn.MSELoss()
            if self.num_labels == 1:
                loss = loss_fct(logits.squeeze(), labels.squeeze())
            else:
                loss = loss_fct(logits, labels)
        elif self.config.problem_type == "single_label_classification":
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        elif self.config.problem_type == "multi_label_classification":
            loss_fct = nn.BCEWithLogitsLoss()
            loss = loss_fct(logits, labels)

    assert output_attentions is None
    assert output_hidden_states is None
    return SequenceClassifierOutput(
        loss=loss,
        logits=logits,
        hidden_states=contextualized_embeddings if output_hidden_states else None,
        attentions=None
    )
Here the HF framework will add fields such as problem_type to the config.
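The forward above also uses self.model, self.head and self.num_labels; a possible __init__ for such a wrapper could look like the sketch below (the class name and attribute layout are assumptions, not taken from the original code):

import torch.nn as nn
from transformers import PreTrainedModel

class LtgBertForSequenceClassification(PreTrainedModel):
    config_class = LtgBertConfig

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels       # num_labels is handled by PretrainedConfig
        self.batch_first = config.batch_first
        self.model = Bert(config=config)          # the original plain-PyTorch encoder
        # simple classification head on top of the first token's representation
        self.head = nn.Linear(config.hidden_size, config.num_labels)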
Registration
If the config.json in the checkpoint folder has no auto_map entry declaring the Auto class registration:
"auto_map": {
"AutoConfig": "configuration_ltgbert.LtgBertConfig",
"AutoModelForMaskedLM": "modeling_ltgbert.LtgBertForMaskedLM"
}
then it has to be added manually; in the code that loads the checkpoint, add:

from transformers import AutoConfig, AutoModelForMaskedLM

AutoConfig.register("LtgBert", LtgBertConfig)
AutoModelForMaskedLM.register(LtgBertConfig, LtgBertForMaskedLM)
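Either way, once the mapping is in place the checkpoint can be loaded through the Auto classes; when relying on auto_map in config.json, trust_remote_code=True is also required (a sketch, assuming the checkpoint sits in ./ckpt):

from transformers import AutoModelForMaskedLM

# after manual register() calls, the locally imported classes are used
model = AutoModelForMaskedLM.from_pretrained("./ckpt")

# with auto_map in config.json, the code shipped alongside the checkpoint is loaded
model = AutoModelForMaskedLM.from_pretrained("./ckpt", trust_remote_code=True)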
If you want config.json to automatically contain auto_map when the model is saved, add the following (not needed if the model was loaded from a checkpoint):
LtgBertConfig.register_for_auto_class()
LtgBertForMaskedLM.register_for_auto_class("AutoModelForMaskedLM")
I later found that registering alone can still leave auto_map in config.json incomplete (I have not investigated why); a workaround is to set it explicitly in the config's __init__:
def __init__(self, ...):
    ...
    self.auto_map = {
        "AutoConfig": "configuration_ltgbert.LtgBertConfig",
        "AutoModelForMaskedLM": "modeling_ltgbert.LtgBertForMaskedLM"
    }
Trainer
Converting the training loop mainly involves the HF Trainer class; see the Trainer guide and the Trainer API reference (huggingface.co). Trainer breaks training into several stages: the procedure can be customized by subclassing and overriding the relevant methods, and hyperparameters are set through its arguments. See the references for details.
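A minimal sketch of what this can look like (the model/dataset variables and hyperparameters are placeholders; compute_loss is just one example of an overridable stage):

from transformers import Trainer, TrainingArguments

class MyTrainer(Trainer):
    # customize one stage of the loop, e.g. how the loss is computed
    # **kwargs absorbs extra arguments that newer transformers versions may pass
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(**inputs)
        loss = outputs.loss
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=32,
    learning_rate=1e-4,
    num_train_epochs=3,
)

trainer = MyTrainer(
    model=model,                  # e.g. LtgBertForMaskedLM from above
    args=args,
    train_dataset=train_dataset,  # see the Dataset section below
)
trainer.train()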
Dataset
The dataset can inherit from torch.utils.data.Dataset, but pay attention to __getitem__: by default it should return a dict, whose keys are mapped onto the arguments of the model's forward.
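A minimal sketch (the key names follow the forward signature of the masked-LM model above; tokenization is assumed to happen elsewhere):

import torch
from torch.utils.data import Dataset

class MlmDataset(Dataset):
    def __init__(self, examples):
        # examples: a list of pre-tokenized samples, prepared elsewhere
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        # the keys must match the argument names of the model's forward()
        return {
            "input_ids": torch.tensor(ex["input_ids"]),
            "attention_mask": torch.tensor(ex["attention_mask"]),
            "masked_lm_labels": torch.tensor(ex["masked_lm_labels"]),
        }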