人智 "Transformer and Multi-Task Learning"

真真夜夜 · 2023/09/10

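Learning content

This study series is built around the AIWIN Chinese insurance few-shot multi-task competition. It covers transformers and the specific techniques and details used in the competition. We will study how the BERT model works, basic operations with the transformers library, and BERT pre-training.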

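Task	Difficulty
Task 1: AIWIN competition registration	Low, 1
Task 2: Introduction to BERT and NLP tasks	Low, 1
Task 3: Using transformers	Low, 1
Task 4: BERT downstream tasks	Medium, 2
Task 5: BERT pre-training	High, 3
Task 6: Prompt basics	High, 3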

Task 1

Register for the competition: http://ailab.aiwin.org.cn/competitions/68
Run the competition baseline and submit the results
BERT reference materials

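Getting started with BERT & transformer

Natural language processing: text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.

Computer vision: image classification, object detection, and segmentation.

Audio: automatic speech recognition and audio classification.

Multimodal: table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.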

Task	Description	Modality	Pipeline identifier
Text classification	assign a label to a given sequence of text	NLP	pipeline(task="sentiment-analysis")
Text generation	generate text given a prompt	NLP	pipeline(task="text-generation")
Summarization	generate a summary of a sequence of text or document	NLP	pipeline(task="summarization")
Image classification	assign a label to an image	Computer vision	pipeline(task="image-classification")
Image segmentation	assign a label to each individual pixel of an image (supports semantic, panoptic, and instance segmentation)	Computer vision	pipeline(task="image-segmentation")
Object detection	predict the bounding boxes and classes of objects in an image	Computer vision	pipeline(task="object-detection")
Audio classification	assign a label to some audio data	Audio	pipeline(task="audio-classification")
Automatic speech recognition	transcribe speech into text	Audio	pipeline(task="automatic-speech-recognition")
Visual question answering	answer a question about the image, given an image and a question	Multimodal	pipeline(task="vqa")
Document question answering	answer a question about the document, given a document and a question	Multimodal	pipeline(task="document-question-answering")
Image captioning	generate a caption for a given image	Multimodal	pipeline(task="image-to-text")

Quick start

1. We start with sentiment classification. pipeline() downloads and caches a default pretrained model and tokenizer for sentiment analysis.

Python:

from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier("We are very happy to show you the 🤗 Transformers library.")

Output:
[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

2. If you have more than one input, pass them to pipeline() as a list; it returns a list of dictionaries:

Python:

results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

Output:
label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
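
The second score in the output above (0.5309) is only slightly over 0.5, so the NEGATIVE label is a weak call. Here is a minimal sketch of separating confident predictions from uncertain ones, assuming the list-of-dicts shape shown above. The `split_by_confidence` helper and the 0.9 threshold are illustrative choices, not part of the transformers API:

```python
def split_by_confidence(results, threshold=0.9):
    """Split pipeline-style results into confident and uncertain predictions."""
    confident, uncertain = [], []
    for result in results:
        # Each result is a dict with a "label" and a "score" in [0, 1].
        target = confident if result["score"] >= threshold else uncertain
        target.append(result)
    return confident, uncertain


# Values copied from the output above (scores rounded to 4 decimals).
results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.5309},
]
confident, uncertain = split_by_confidence(results)
print("confident:", [r["label"] for r in confident])
print("uncertain:", [r["label"] for r in uncertain])
```

The right threshold depends on the task; 0.9 is only a starting point to tune against labeled data.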

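3. pipeline() can use any model from the Hub. For example, if you need a model that handles French text, use the tags on the Hub to filter for a suitable one. The top filtered result is a multilingual BERT model fine-tuned for sentiment analysis, which you can use on French text.

Python: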

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

4. Use AutoModelForSequenceClassification and AutoTokenizer to load the pretrained model and its associated tokenizer:

Python:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

5. Specify the model and tokenizer in pipeline(), and you can now apply the classifier to French text:

Python:

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.")

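AutoTokenizer

A tokenizer preprocesses text into arrays of numbers that serve as the model's input. Several rules govern tokenization, including how to split a word and at what level words are split. Most importantly, instantiate the tokenizer with the same model name as the pretrained model so that both use the same tokenization rules.

Python: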

from transformers import AutoTokenizer
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)

Pass your text to the tokenizer:

Python:

encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.")
print(encoding)

Output:
{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

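The tokenizer returns a dictionary containing:
input_ids: the numerical representations of your tokens.
token_type_ids: the segment each token belongs to (all 0 for a single sentence).
attention_mask: indicates which tokens should be attended to.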
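
In the printed output, the first and last ids are 101 and 102, and one id is 100. In BERT-style vocabularies these ids conventionally stand for [CLS], [SEP], and [UNK] (a token the vocabulary does not contain). The sketch below checks this on the ids printed above. The `SPECIAL_TOKENS` table and the `describe` helper are illustrative assumptions. In practice `tokenizer.convert_ids_to_tokens` gives the real mapping:

```python
# Conventional special-token ids in BERT vocabularies (assumed here, not read from the model).
SPECIAL_TOKENS = {0: "[PAD]", 100: "[UNK]", 101: "[CLS]", 102: "[SEP]", 103: "[MASK]"}

# input_ids copied from the tokenizer output above.
input_ids = [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102]


def describe(ids):
    """Label special ids by name and leave ordinary vocabulary ids as numbers."""
    return [SPECIAL_TOKENS.get(i, i) for i in ids]


described = describe(input_ids)
print(described)
```
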
The tokenizer also accepts a list of inputs, and can pad and truncate the text to return a batch of uniform length.

Python:

pt_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
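
To make the effect of padding=True and truncation=True concrete, here is a simplified pure-Python sketch. It is not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and the pad id 0 is assumed. It shows how shorter sequences are padded to the longest one and how the attention mask marks padding with 0.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest one."""
    # Note: real tokenizers keep the closing special token when truncating;
    # this sketch simply cuts the tail.
    truncated = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for two sentences of different length.
batch = pad_and_truncate([[101, 7, 8, 9, 102], [101, 7, 102]], max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```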

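Transformers provides a simple, unified way to load pretrained instances: you load an AutoModel the same way you load an AutoTokenizer. The only difference is choosing the right AutoModel for the task. For text (or sequence) classification, load AutoModelForSequenceClassification:

Python: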

from transformers import AutoModelForSequenceClassification
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)

Now pass the preprocessed batch of inputs directly to the model. You only need to unpack the dictionary by adding **:

Python:

pt_outputs = pt_model(**pt_batch)
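
The ** operator turns each key of the dictionary into a keyword argument, so the call above is equivalent to passing input_ids, attention_mask, and the other entries by name. A tiny sketch with a stand-in function follows. `fake_model` and its values are hypothetical and only illustrate the unpacking:

```python
def fake_model(input_ids, attention_mask, token_type_ids=None):
    """Stand-in for a model's forward call; reports which arguments arrived."""
    return {
        "n_sequences": len(input_ids),
        "got_token_type_ids": token_type_ids is not None,
    }


batch = {
    "input_ids": [[101, 7, 102], [101, 8, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}

# These two calls are equivalent.
unpacked = fake_model(**batch)
explicit = fake_model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
print(unpacked == explicit)
```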

The model outputs its final activations in the logits attribute. Apply the softmax function to the logits to obtain probabilities:

Python:

from torch import nn
pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
print(pt_predictions)
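
Softmax exponentiates each logit and divides by the sum of the exponentials, so each row becomes a probability distribution. Here is a minimal pure-Python sketch of the same computation for one row. The logits are made-up values, not model output:

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing without changing the result.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a five-class classifier.
probs = softmax([-1.0, 0.0, 1.0, 2.0, 3.0])
print([round(p, 4) for p in probs])
print(probs.index(max(probs)))  # index of the most likely class
```

Taking the index of the largest probability gives the predicted class, which is the same as taking the index of the largest logit.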

Once the model is fine-tuned, save it together with its tokenizer using PreTrainedModel.save_pretrained():

Python:

pt_save_directory = "./pt_save_pretrained"
tokenizer.save_pretrained(pt_save_directory)
pt_model.save_pretrained(pt_save_directory)

When you want to use the model again, reload it with PreTrainedModel.from_pretrained():

Python:

pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained")

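One Transformers feature is the ability to save a model and reload it as either a PyTorch or a TensorFlow model. The from_pt or from_tf parameter converts the model from one framework to the other:

Python:

# TensorFlow -> PyTorch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed path: a directory where a TensorFlow model was saved earlier with save_pretrained()
tf_save_directory = "./tf_save_pretrained"
tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)
pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True)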

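Building a custom model

You can modify the model's configuration class to change how the model is built. The configuration specifies a model's attributes, such as the number of hidden layers or attention heads. When you initialize a model from a custom configuration class, you start from scratch: the model attributes are randomly initialized, so you need to train the model before you can use it to get meaningful results.

Start by importing AutoConfig, then load the pretrained model you want to modify. In AutoConfig.from_pretrained() you can specify the attribute to change, such as the number of attention heads:

Python: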

from transformers import AutoConfig
my_config = AutoConfig.from_pretrained("distilbert-base-uncased", n_heads=12)
# n_heads sets the number of attention heads in the model

Create a model from your custom configuration with AutoModel.from_config():

Python:

from transformers import AutoModel
my_model = AutoModel.from_config(my_config)
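
One constraint worth keeping in mind when changing the number of attention heads: multi-head attention splits the hidden dimension evenly across heads, so the hidden size must be divisible by the head count. Below is a small sketch of that check, assuming a hidden size of 768 (the value used by distilbert-base-uncased). `head_dim` is a hypothetical helper, not a transformers function:

```python
def head_dim(hidden_size, n_heads):
    """Return the per-head dimension, or raise if the split is uneven."""
    if hidden_size % n_heads != 0:
        raise ValueError(
            f"hidden size {hidden_size} is not divisible by {n_heads} heads"
        )
    return hidden_size // n_heads


for n in (8, 10, 12):
    try:
        print(n, "heads ->", head_dim(768, n), "dimensions per head")
    except ValueError as err:
        print(n, "heads ->", err)
```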

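Trainer - a PyTorch optimized training loop

All models are standard torch.nn.Module objects, so you can use them in any typical training loop. You can write your own training loop, but Transformers provides a Trainer class for PyTorch that contains the basic training loop and adds features such as distributed training and mixed precision.

1. Start with a PreTrainedModel or a torch.nn.Module:

Python: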

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

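2. TrainingArguments holds the training hyperparameters you can change, such as the learning rate, batch size, and number of epochs. Any argument you do not specify falls back to its default value.

Python: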

from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="path/to/save/folder/",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
)
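
These settings determine how many optimizer steps a run takes. The sketch below works through the arithmetic, assuming a single device, no gradient accumulation, and a training set of 8,530 examples. That count is assumed here to match the rotten_tomatoes train split used in this walkthrough, so check `len(dataset["train"])` for the real figure. `total_steps` is a hypothetical helper:

```python
import math


def total_steps(n_examples, per_device_batch_size, num_epochs, n_devices=1):
    """Optimizer steps for a run without gradient accumulation."""
    effective_batch = per_device_batch_size * n_devices
    # The last batch of an epoch may be smaller, so round up.
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return steps_per_epoch * num_epochs


steps = total_steps(n_examples=8530, per_device_batch_size=8, num_epochs=2)
print(steps)
```

Knowing the step count up front helps when choosing logging and checkpoint intervals.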

3. Load a preprocessing class such as a tokenizer, image processor, feature extractor, or processor:

Python:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

4. Load a dataset:

Python:

from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes")

5. Create a function that tokenizes the dataset, then apply it over the entire dataset with map:

Python:

def tokenize_dataset(dataset):
    return tokenizer(dataset["text"])
dataset = dataset.map(tokenize_dataset, batched=True)
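
With batched=True, the function receives a dict whose values are lists covering many rows at once, rather than a single example. The sketch below imitates that calling convention in plain Python. It is a simplified stand-in, not how the datasets library implements map. `batched_map` and `fake_tokenize` are hypothetical, and whitespace splitting replaces a real tokenizer:

```python
def batched_map(rows, fn, batch_size=2):
    """Apply fn to column-wise batches and merge the new columns into each row."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        columns = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(columns)
        for i, row in enumerate(chunk):
            merged = dict(row)
            merged.update({key: values[i] for key, values in new_columns.items()})
            out.append(merged)
    return out


def fake_tokenize(batch):
    # Whitespace split stands in for a real tokenizer call on batch["text"].
    return {"tokens": [text.split() for text in batch["text"]]}


rows = [{"text": "a fine film"}, {"text": "dull"}, {"text": "not bad at all"}]
mapped = batched_map(rows, fake_tokenize)
print(mapped[0])
```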

6. Create a DataCollatorWithPadding, which builds a batch of examples from the dataset and pads them to a common length:

Python:

from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
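
Because the dataset was tokenized without padding, each batch only needs to be padded to its own longest sequence rather than to a global maximum. Below is a simplified sketch of that idea, not the library's implementation. `collate_with_padding`, the pad id 0, and the token ids are assumptions for illustration:

```python
def collate_with_padding(features, pad_id=0):
    """Pad a list of tokenized examples to the longest one in this batch."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


short_batch = collate_with_padding([
    {"input_ids": [101, 5, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 102], "attention_mask": [1, 1]},
])
long_batch = collate_with_padding([
    {"input_ids": [101, 5, 6, 7, 8, 102], "attention_mask": [1] * 6},
    {"input_ids": [101, 5, 102], "attention_mask": [1] * 3},
])
# Each batch is padded only as far as it needs.
print(len(short_batch["input_ids"][0]), len(long_batch["input_ids"][0]))
```

Padding per batch avoids spending compute on pad tokens that a single dataset-wide length would add to short batches.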

Now gather all these classes in a Trainer:

Python:

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # newer transformers releases name this argument processing_class
    data_collator=data_collator,
)

Call train() to start training:

Python:

trainer.train()

For tasks that use a sequence-to-sequence model, such as translation or summarization, use the Seq2SeqTrainer and Seq2SeqTrainingArguments classes instead.

Task 1: updates in progress ...
