Using LLMs for Text Classification: Fine-Tune the Base Model or the Chat Model?

Datawhale editor's pick

The following article is from BaoBao Algorithm Notes and was written by LeonYi.


Author: LeonYi. Link: https://www.zhihu.com/question/632473480/answer/75664255663

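This article implements a text classification task with Qwen2ForSequenceClassification.

I. Experimental results and conclusions

Over the past few months I ran many experiments on classification scenarios with large models and accumulated some experience.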

1. Short text

1) On query sentiment classification, LLMs generally do worse than BERT.

P.S. This conclusion is broadly consistent with the one in https://segmentfault.com/a/11...

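2. Long text

1) On long text from ASR-transcribed calls, BERT with truncation at 512 tokens does worse than an LLM.

The LLM input is not truncated (if both were truncated at 512, the results might be similar).
A BERT variant that uses a sliding text window was not included in the comparison.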
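The sliding-window BERT variant mentioned above was not tested in the article. As a rough illustration of what such a baseline involves, here is a minimal sketch that splits a long token sequence into overlapping windows and mean-pools the per-window class scores. The function names, the stride of 256, and the mean-pooling choice are all assumptions made for illustration.

```python
from typing import List


def sliding_windows(token_ids: List[int], max_len: int = 512, stride: int = 256) -> List[List[int]]:
    """Split a token sequence into overlapping windows of at most max_len tokens."""
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    if len(token_ids) <= max_len:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
        start += stride
    return windows


def average_scores(per_window_scores: List[List[float]]) -> List[float]:
    """Mean-pool per-window class scores into one document-level score vector (an assumed aggregation)."""
    n = len(per_window_scores)
    return [sum(column) / n for column in zip(*per_window_scores)]


# A 1,200-token document becomes four overlapping windows.
windows = sliding_windows(list(range(1200)), max_len=512, stride=256)
print(len(windows), [len(w) for w in windows])
```

Each window would be classified separately by the 512-token model; max-pooling or voting are equally plausible ways to combine the window scores.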

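2) Base vs. Instruct

With little data, fine-tuning the Base model is less effective than fine-tuning the Instruct model. Instruct models pay an alignment tax, but for small-data fine-tuning they still do better than a Base model that has never seen instructions.

3) SFT vs. LoRA

With little data (fewer than 10K samples in total; adjust this threshold according to the per-label counts), full-parameter SFT is less effective than LoRA, and tuning hyperparameters for SFT is also expensive.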

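3. Improvement options for classification scenarios

1) Generative fine-tuning only

Mixing in similar data from different business lines within the same domain brings some improvement.

The data distribution, especially text length, should not differ too much. When the gap is too large, mixing the data actually hurts (for example, data averaging 1.2K in length mixed with data averaging 5K).
Mixing ratio: roughly 2:1, although this may need testing per scenario.
Mixing order: I used random sampling and did not check whether training on the sources separately, one after another, makes a difference.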
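The mixing recipe in the list above can be sketched as follows. This is a minimal illustration that assumes the 2:1 ratio means two target-business samples for every auxiliary sample, which the article does not specify; `mix_datasets` and the `source` field are hypothetical names.

```python
import random
from typing import List, Sequence


def mix_datasets(primary: Sequence[dict], auxiliary: Sequence[dict],
                 ratio: float = 2.0, seed: int = 42) -> List[dict]:
    """Keep every primary sample, add about len(primary) / ratio auxiliary samples, then shuffle."""
    rng = random.Random(seed)
    n_aux = min(len(auxiliary), int(len(primary) / ratio))
    mixed = list(primary) + rng.sample(list(auxiliary), n_aux)
    rng.shuffle(mixed)  # random order, as in the article's experiments
    return mixed


# Toy data: 100 samples from the target business, 80 from a neighbouring one.
primary = [{"content": f"target {i}", "source": "a"} for i in range(100)]
auxiliary = [{"content": f"neighbour {i}", "source": "b"} for i in range(80)]
mixed = mix_datasets(primary, auxiliary)
print(len(mixed), sum(1 for s in mixed if s["source"] == "b"))
```

Before mixing, it is worth comparing the length statistics of the two sources, since the article reports that a large gap in average length made mixing counterproductive.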

Optimize the prompt: add a concise description of each category label to the prompt, and try this on short texts.
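One way to add label descriptions to a classification prompt is sketched below. The wording of the template, the function name `build_prompt`, and the three sentiment labels with their descriptions are illustrative assumptions, not the prompt used in the article.

```python
from typing import Dict


def build_prompt(text: str, label_descriptions: Dict[str, str]) -> str:
    """Build a classification prompt that lists each label with a one-line description."""
    lines = ["Classify the text into exactly one of the following categories."]
    for name, description in label_descriptions.items():
        lines.append(f"- {name}: {description}")
    lines.append("Answer with the category name only.")
    lines.append(f"Text: {text}")
    lines.append("Category:")
    return "\n".join(lines)


# Hypothetical label set for a three-class sentiment task.
labels = {
    "positive": "the customer expresses satisfaction or praise",
    "negative": "the customer complains or expresses frustration",
    "neutral": "a factual question or statement without clear sentiment",
}
print(build_prompt("The app keeps crashing and nobody answers my calls.", labels))
```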

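2) Classification-head fine-tuning + generative fine-tuning

With a larger dataset (more than 10K samples, with enough samples per label on average), try fine-tuning the Base model rather than the Instruct model.
Data augmentation: try building pseudo-labeled samples from unlabeled data (labels extracted by prompting plus labels predicted by the fine-tuned model).
With a large amount of data, try full-parameter SFT.
In LoRA fine-tuning, also train the LLM embedding layer (unverified).
Try distilling the large model into a small one (pain points: large models are hard to tune, training is costly, and deployment still calls for a small model).
Try LoRA variants.
Hyperparameter tuning: I tried Optuna's automatic search, but the results were not great. Usually I just adjust lr, epoch, and rank.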
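The pseudo-labeling idea in the list above combines two label sources. The article does not say how they are combined, so the sketch below assumes the simplest rule: keep an unlabeled text only when the prompt-based label and the fine-tuned model's label agree. Both labelers are stand-in callables, and the toy rules are purely illustrative.

```python
from typing import Callable, Iterable, List, Tuple


def agreement_pseudo_labels(texts: Iterable[str],
                            prompt_labeler: Callable[[str], str],
                            model_labeler: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Return (text, label) pairs for texts on which both labelers agree (assumed filtering rule)."""
    kept = []
    for text in texts:
        prompted = prompt_labeler(text)
        predicted = model_labeler(text)
        if prompted == predicted:
            kept.append((text, prompted))
    return kept


# Stand-ins for a prompted large model and a fine-tuned classifier.
def toy_prompt_labeler(text: str) -> str:
    return "negative" if "refund" in text else "positive"


def toy_model_labeler(text: str) -> str:
    return "negative" if "refund" in text or "slow" in text else "positive"


unlabeled = ["I want a refund", "Great returns this year", "Support is slow"]
print(agreement_pseudo_labels(unlabeled, toy_prompt_labeler, toy_model_labeler))
```

The disagreements (here, the third text) are natural candidates for manual review rather than for training.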

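3) Heavyweight options

Continue pre-training the base model on domain data (incremental pre-training), then fine-tune it with instructions.

This approach still needs to be validated. If it works, the advantage is that the resulting base model can be fine-tuned for a variety of tasks in the domain.
I tried a model that had been instruction-tuned from Qwen2-7B-Instruct on data from a neighbouring domain, but the result was worse than fine-tuning Qwen2-7B-Instruct directly. The reason is unclear because the details of that model's training procedure are unknown.

The top priority is still data collection. After that, try the different approaches, adding or removing pieces as needed.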

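4. Points to note

Learning rate: the more parameters you train, the smaller the learning rate should be.
Label noise: mislabeled samples should be found during error analysis and then removed or corrected.
Classification business rules: in complex scenarios, fix the complete annotation rules in advance to avoid rework (including what the model can and cannot handle).

Areas for improvement:

Dynamic text padding has not been implemented.
The multi-label version of the classification head has not been validated.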
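Dynamic padding means padding each batch only to the length of its longest sequence instead of to a global maximum. A minimal sketch of the idea in plain Python follows. The function name `pad_batch` is hypothetical, and the pad id 151643 is used only as an example value because it appears as `padding_idx` in the model structure printed later in the article.

```python
from typing import Dict, List


def pad_batch(batch: List[List[int]], pad_id: int, side: str = "left") -> Dict[str, List[List[int]]]:
    """Pad every sequence in the batch to the length of the longest one in that batch."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        if side == "left":
            input_ids.append([pad_id] * n_pad + list(ids))
            attention_mask.append([0] * n_pad + [1] * len(ids))
        elif side == "right":
            input_ids.append(list(ids) + [pad_id] * n_pad)
            attention_mask.append([1] * len(ids) + [0] * n_pad)
        else:
            raise ValueError("side must be 'left' or 'right'")
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two short sequences are padded to length 3, not to a fixed global maximum.
print(pad_batch([[11, 12, 13], [21]], pad_id=151643))
```

In a HuggingFace pipeline the same effect is usually obtained by tokenizing without padding and letting a padding collator pad each batch at load time.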

II. Text classification: from BERT to LLM

Qwen2ForSequenceClassification, LlamaForSequenceClassification, and BertForSequenceClassification can all be used for text classification tasks.

With the AutoModelForSequenceClassification class in HuggingFace transformers, the matching model class is loaded automatically for sequence classification.

Qwen2ForSequenceClassification and BertForSequenceClassification follow the same logic: both add a linear layer on top of the model's output to perform classification.

All the modifications previously made to BERT can be carried over to LLMs, for example BERT-CRF and BERT-SUM.

2.1 BertForSequenceClassification

BertForSequenceClassification is a ready-made class for text classification that inherits from BertPreTrainedModel.

The number of classes is passed in as num_labels. As the constructor shows, the class consists of three parts: BertModel, Dropout, and a linear classifier.

```python
class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super(BertForSequenceClassification, self).__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)

        self.init_weights()
```

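BertModel extracts the text features (embeddings), Dropout guards against overfitting, and the Linear layer acts as a simple classifier. If classification needs a more complex network structure, you can rewrite the class on this basis.

The `forward()` function already defines the loss, so it does not need to be implemented separately for training. The return value contains four items: (loss), logits, (hidden_states), and (attentions).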

```python
# Excerpt from the library source; elided parts are marked with "..."
def forward(...):
    ...
    if labels is not None:
        if self.num_labels == 1:
            # We are doing regression
            loss_fct = MSELoss()
            loss = loss_fct(logits.view(-1), labels.view(-1))
        else:
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        outputs = (loss,) + outputs
    return outputs  # (loss), logits, (hidden_states), (attentions)
```

2.2 Qwen2ForSequenceClassification

Next, let's look at Qwen2ForSequenceClassification.

```shell
Qwen2ForSequenceClassification(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 1024, padding_idx=151643)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
          (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
          (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm()
        (post_attention_layernorm): Qwen2RMSNorm()
      )
    )
    (norm): Qwen2RMSNorm()
  )
  (score): Linear(in_features=1024, out_features=3, bias=False)
)
```

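The official Qwen2 implementation has three built-in modes.

Single-label classification

The loss is CrossEntropyLoss.
The logit for the single gold label is selected and the negative log-likelihood is computed.

Multi-label classification

The loss is BCEWithLogitsLoss.
The labels are multi-hot. Each predicted logit goes through a sigmoid, a binary cross-entropy is computed against the label in the same dimension, and the per-dimension losses are aggregated (averaged under the default reduction).

Regression

The loss is MSELoss.
The default is one-dimensional regression (regression can serve as a reward model that predicts a score).

The three modes take different label inputs.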
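To make the difference in label formats concrete, here is a per-sample sketch of the three losses in plain Python. It mirrors the textbook definitions of cross-entropy, binary cross-entropy with logits (mean over label dimensions), and squared error; the function names are hypothetical and the library itself works on batched tensors.

```python
import math
from typing import List


def cross_entropy(logits: List[float], label: int) -> float:
    """Single-label mode: label is one class index; loss = -log softmax(logits)[label]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def bce_with_logits(logits: List[float], labels: List[float]) -> float:
    """Multi-label mode: labels is a multi-hot vector; mean of per-dimension binary cross-entropies."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def squared_error(prediction: float, target: float) -> float:
    """Regression mode: label is a float score; squared error for one sample."""
    return (prediction - target) ** 2


logits = [2.0, -1.0, 0.5]
print(cross_entropy(logits, label=0))               # label: an integer class index
print(bce_with_logits(logits, [1.0, 0.0, 1.0]))     # label: a multi-hot float vector
print(squared_error(prediction=0.7, target=1.0))    # label: a real-valued score
```

The label dtype matters in practice: in the official code shown next, integer labels with more than one class select the single-label path, while float labels select the multi-label path.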

```python
class Qwen2ForSequenceClassification(Qwen2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = Qwen2Model(config)
        self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size = input_ids.shape[0]
        else:
            batch_size = inputs_embeds.shape[0]

        if self.config.pad_token_id is None and batch_size != 1:
            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
                sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
                sequence_lengths = sequence_lengths % input_ids.shape[-1]
                sequence_lengths = sequence_lengths.to(logits.device)
            else:
                sequence_lengths = -1

        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]

        loss = None
        if labels is not None:
            labels = labels.to(logits.device)
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(pooled_logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(pooled_logits, labels)
        if not return_dict:
            output = (pooled_logits,) + transformer_outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )
```
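Unlike BERT, which classifies from the first token, the code above picks the logits at one position per sequence through `sequence_lengths`. The sketch below re-implements that index computation for a single row in plain Python, to show which token is selected under right padding, left padding, and no padding. The helper name `pooled_position` is hypothetical, and the token ids are made up.

```python
from typing import List


def pooled_position(input_ids: List[int], pad_token_id: int) -> int:
    """One-row mirror of: (torch.eq(input_ids, pad).int().argmax(-1) - 1) % seq_len."""
    is_pad = [int(token == pad_token_id) for token in input_ids]
    # argmax returns the first maximal position: the first pad token,
    # or position 0 when the row contains no pad token at all.
    first_max = is_pad.index(max(is_pad))
    return (first_max - 1) % len(input_ids)


PAD = 0
print(pooled_position([11, 12, 13, PAD, PAD], PAD))  # right padding -> 2, the last real token
print(pooled_position([PAD, PAD, 11, 12, 13], PAD))  # left padding  -> 4, the last position
print(pooled_position([11, 12, 13], PAD))            # no padding    -> 2, the last position
```

In all three layouts the selected position is the last real token, which is the only position in a causal model that has attended to the whole input. This is also why the fine-tuning script later sets `model.config.pad_token_id`: without it, batches larger than one raise the `ValueError` shown in the code.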

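III. Fine-tuning Qwen2ForSequenceClassification with LoRA

After LoRA fine-tuning, the LoRA weights are merged into the base model and the merged model is saved. The reason given is that the PEFT code, at the time of writing, had no option for saving the parameters of the classification head's linear layer, so the LoRA weights alone are not enough to restore the trained Qwen2ForSequenceClassification model.

If needed, you can also modify the code slightly.

The code here was tested in the environment provided by ModelScope.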

```python
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "qwen/Qwen2.5-3B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# Prompt: "I don't feel like studying, what should I do? I'm interested, but my procrastination has kicked in."
prompt = "不想学习怎么办？有兴趣，但是拖延症犯了"
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

```shell
Downloading [config.json]: 100%|██████████| 661/661 [00:00<00:00, 998B/s]
Downloading [configuration.json]: 100%|██████████| 2.00/2.00 [00:00<00:00, 2.30B/s]
Downloading [generation_config.json]: 100%|██████████| 242/242 [00:00<00:00, 557B/s]
Downloading [LICENSE]: 100%|██████████| 7.21k/7.21k [00:00<00:00, 11.7kB/s]
Downloading [merges.txt]: 100%|██████████| 1.59M/1.59M [00:00<00:00, 3.01MB/s]
Downloading [model-00001-of-00002.safetensors]: 100%|██████████| 3.70G/3.70G [00:10<00:00, 373MB/s]
Downloading [model-00002-of-00002.safetensors]: 100%|██████████| 2.05G/2.05G [00:06<00:00, 332MB/s]
Downloading [model.safetensors.index.json]: 100%|██████████| 34.7k/34.7k [00:00<00:00, 56.8kB/s]
Downloading [README.md]: 100%|██████████| 4.79k/4.79k [00:00<00:00, 10.3kB/s]
Downloading [tokenizer.json]: 100%|██████████| 6.71M/6.71M [00:00<00:00, 8.58MB/s]
Downloading [tokenizer_config.json]: 100%|██████████| 7.13k/7.13k [00:00<00:00, 13.8kB/s]
Downloading [vocab.json]: 100%|██████████| 2.65M/2.65M [00:00<00:00, 5.09MB/s]
/usr/local/lib/python3.10/site-packages/accelerate/utils/modeling.py:1405: UserWarning: Current model requires 234882816 bytes of buffer for offloaded layers, which seems does not fit any GPU's remaining memory. If you are experiencing a OOM later, please consider using offload_buffers=True.
  warnings.warn(
```

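The tug-of-war between interest and procrastination can be really stressful. Here are some tips to help you overcome procrastination and keep studying:

Set small goals: break a big goal into small ones. Each small goal you complete is a small win that boosts motivation and your sense of achievement.
Make a plan: draw up a detailed study plan that suits you and stick to it as far as possible. Remember to leave time for breaks and to balance work and rest.
Keep a positive attitude: be patient and understanding with yourself, and do not give in to temporary difficulties. Remember that making progress is itself a process of growth.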

ModelScope does not support LoRA, so I looked up the local path of the downloaded model.

```python
print(model.model_dir)
```

```shell
/mnt/workspace/.cache/modelscope/hub/qwen/Qwen2___5-3B-Instruct
```

List the files:

```shell
config.json             merges.txt                        README.md
configuration.json      model-00001-of-00002.safetensors  tokenizer_config.json
generation_config.json  model-00002-of-00002.safetensors  tokenizer.json
LICENSE                 model.safetensors.index.json      vocab.json
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
```

Fine-tuning code

```python
### Initial settings and random seeds
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
import numpy as np
import pandas as pd
import random

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```

Twenty samples were generated with a large model from a prompt.

```python
import json
import os

# Ten fund reviews, one per line
x = '''这基金表现也太差了吧，买了半年了还亏着呢。
管理费收得比别的基金都高，感觉就是在给基金公司打工。
想查查具体投了啥，结果发现透明度低得要命，啥也看不清楚。
基金经理换来换去的，都不知道到底谁在管我的钱。
客服电话打过去半天才有人接，问个问题还得等上好几天才有回复。
市场稍微有点风吹草动，这基金就跌得比谁都快。
投资组合里全是同一行业的股票，风险大得让人睡不着觉。
长期持有也没见赚多少钱，还不如存银行定期。
分红政策一会儿一个样，根本没法做财务规划。
当初宣传时说得好听，实际操作起来完全不是那么回事。'''
x_samples = x.split("\n")

# Ten more fund reviews, one per line
y = '''这基金真的稳啊，买了之后收益一直挺不错的，感觉很靠谱！
管理团队超级专业，每次市场波动都能及时调整策略，让人放心。
透明度很高，随时都能查到投资组合的情况，心里有数。
基金经理经验老道，看准了几个大机会，赚了不少。
客服态度特别好，有问题总能很快得到解答，服务真是没得说。
即使在市场不好的时候，这基金的表现也比大多数同类产品强。
分散投资做得很好，风险控制得很到位，睡个安稳觉没问题。
长期持有的话，回报率真的非常可观，值得信赖。
分红政策明确而且稳定，每年都能按时收到分红，计划财务很方便。
宣传时承诺的那些好处都实现了，真心觉得选对了这只基金。'''
y_samples = y.split("\n")

# Build the records. "标注类别" means "annotated category"; 正向 = positive, 负向 = negative.
# Note: the x reviews read as complaints yet are tagged 正向 here, and the y reviews are tagged 负向.
# The tags are kept as in the original so that they match the output printed below.
x_data = [{"content": i, "label": 0, "标注类别": "正向"} for i in x_samples]
y_data = [{"content": i, "label": 1, "标注类别": "负向"} for i in y_samples]


def save_json(path, data):
    # Serialize the Python objects to a JSON file
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)


os.makedirs('data', exist_ok=True)
save_json('data/classify_train.json', x_data[:6] + y_data[:6])
save_json('data/classify_valid.json', x_data[6:8] + y_data[6:8])
save_json('data/classify_test.json', x_data[8:] + y_data[8:])
```

Loading the data

```python
import json
from tqdm import tqdm
from loguru import logger
from datasets import Dataset, load_dataset


def get_dataset_from_json(json_path, cols):
    with open(json_path, "r", encoding="utf-8") as file:
        data = json.load(file)
    df = pd.DataFrame(data)
    dataset = Dataset.from_pandas(df[cols], split='train')
    return dataset


# load_dataset is too slow at loading a JSON dataset, so go through pandas instead
cols = ['content', 'label', '标注类别']
train_ds = get_dataset_from_json('data/classify_train.json', cols)
logger.info(f"TrainData num: {len(train_ds)}")
valid_ds = get_dataset_from_json('data/classify_valid.json', cols)
logger.info(f"ValidData num: {len(valid_ds)}")
test_ds = get_dataset_from_json('data/classify_test.json', cols)
logger.info(f"TestData num: {len(test_ds)}")
```

```python
print(train_ds[0])
```

```shell
{'content': '这基金表现也太差了吧，买了半年了还亏着呢。', 'label': 0, '标注类别': '正向'}
```

Prepare the dataset (this implements simple truncation and padding, without dynamic padding).

```python
id2label = {0: "正向", 1: "负向"}
label2id = {v: k for k, v in id2label.items()}

from transformers import AutoTokenizer, DataCollatorWithPadding
# from modelscope import AutoTokenizer, DataCollatorWithPadding

model_name_or_path = "/mnt/workspace/.cache/modelscope/hub/qwen/Qwen2___5-3B-Instruct"
model_name = model_name_or_path.split("/")[-1]
print(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side='left')
tokenizer.add_special_tokens({'pad_token': '<|endoftext|>'})
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

MAX_LEN = 24
txt_colname = 'content'


def preprocess_function(examples):
    # Padding at this stage is inefficient; dynamic per-batch padding would be better
    return tokenizer(examples[txt_colname], max_length=MAX_LEN, padding=True, truncation=True)


tokenized_train = train_ds.map(preprocess_function, num_proc=64, batched=True)
tokenized_valid = valid_ds.map(preprocess_function, num_proc=64, batched=True)
```

Evaluation code with sklearn

```python
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    f1_score,
    precision_score,
    recall_score
)


def evals(test_ds, model):
    """Predict one text at a time and return (y_true, y_pred) as label-id arrays."""
    k_list = [x[txt_colname] for x in test_ds]
    model.eval()

    k_result = []
    for idx, txt in tqdm(enumerate(k_list)):
        model_inputs = tokenizer([txt], max_length=MAX_LEN, truncation=True, return_tensors="pt").to(model.device)
        logits = model(**model_inputs).logits
        res = int(torch.argmax(logits, axis=1).cpu())
        k_result.append(id2label.get(res))

    y_true = np.array(test_ds['label'])
    y_pred = np.array([label2id.get(x) for x in k_result])
    return y_true, y_pred


def compute_metrics(eval_pred):
    """Metric callback for Trainer: weighted F1 on the evaluation set."""
    predictions, label = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"f1": f1_score(y_true=label, y_pred=predictions, average='weighted')}


def compute_valid_metrics(eval_pred):
    """Print accuracy and micro/macro/weighted precision, recall, and F1. Returns None."""
    predictions, label = eval_pred
    y_true, y_pred = label, predictions
    accuracy = accuracy_score(y_true, y_pred)
    print(f'Accuracy: {accuracy}')
    metric_types = ['micro', 'macro', 'weighted']
    for metric_type in metric_types:
        precision = precision_score(y_true, y_pred, average=metric_type)
        recall = recall_score(y_true, y_pred, average=metric_type)
        f1 = f1_score(y_true, y_pred, average=metric_type)
        print(f'{metric_type} Precision: {precision}')
        print(f'{metric_type} Recall: {recall}')
        print(f'{metric_type} F1 Score: {f1}')
```
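To see what the macro-averaged numbers mean, the sketch below computes them by hand in plain Python. The label vectors are an assumption: the article does not show the individual predictions, so this uses one confusion pattern on four samples (two per class, one mistake) that happens to reproduce the block of the run log reporting accuracy 0.75. The helper names are hypothetical.

```python
from typing import List, Tuple


def per_class_prf(y_true: List[int], y_pred: List[int], cls: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_average(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Unweighted mean of the per-class precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = [per_class_prf(y_true, y_pred, c) for c in classes]
    n = len(classes)
    return (sum(s[0] for s in scores) / n,
            sum(s[1] for s in scores) / n,
            sum(s[2] for s in scores) / n)


# Assumed predictions: one class-1 sample is misclassified as class 0.
y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 1]
macro_p, macro_r, macro_f1 = macro_average(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_p, macro_r, macro_f1)
```

Because both classes have the same support here, the weighted averages coincide with the macro averages. With only four evaluation samples, a single mistake moves accuracy by 25 points, so these toy numbers say little about model quality.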

Loading the model and training with Trainer

```python
import torch
from transformers import AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from peft import get_peft_config, PeftModel, PeftConfig, get_peft_model, LoraConfig, TaskType

rank = 64
alpha = rank * 2

training_args = TrainingArguments(
    output_dir=f"./output/{model_name}/seqence_classify/",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True
)

# LoRA on all attention and MLP projections
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    inference_mode=False,
    r=rank,
    lora_alpha=alpha,
    lora_dropout=0.1
)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name_or_path,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    # Needs the flash-attn package; drop this argument to use the default attention
    attn_implementation="flash_attention_2"
)
# The classification head needs a pad token id to locate the last real token
model.config.pad_token_id = tokenizer.pad_token_id

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

logger.info(f"start Trainingrank: {rank}")
trainer.train()

# compute_valid_metrics prints its results and returns None, so the logger lines below log "None"
logger.info(f"Valid Set, rank: {rank}")
y_true, y_pred = evals(valid_ds, model)
metrics = compute_valid_metrics((y_pred, y_true))
logger.info(metrics)

logger.info(f"Test Set, rank: {rank}")
y_true, y_pred = evals(test_ds, model)
metrics = compute_valid_metrics((y_pred, y_true))
logger.info(metrics)

# Fold the LoRA weights into the base model and save the full model, including the score head
saved_model = model.merge_and_unload()
saved_model.save_pretrained('/model/qwen2-3b/seqcls')
```

Removing LoraConfig and get_peft_model from this script gives the full-parameter SFT code.
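The `merge_and_unload()` call at the end of the training script folds each LoRA update into its base weight. For a standard LoRA layer this is W' = W + (alpha / r) · B·A. The sketch below checks on a toy example in plain Python that the merged weight gives the same output as the unmerged layer. The matrices are made up and the helper names are hypothetical; with `alpha = 2 * rank`, as in the script, the scale factor is 2.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def merge_lora(w: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, r: int) -> Matrix:
    """Return W + (alpha / r) * B @ A, with W (out x in), A (r x in), B (out x r)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy layer: in_features=3, out_features=2, rank r=1, alpha=2 (so alpha / r = 2).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[0.5, -1.0, 0.25]]
B = [[2.0], [-1.0]]
x = [1.0, 2.0, 3.0]

# Unmerged forward pass: W x + (alpha / r) * B (A x)
unmerged = [base + 2.0 * update for base, update in zip(matvec(W, x), matvec(B, matvec(A, x)))]
merged = matvec(merge_lora(W, A, B, alpha=2.0, r=1), x)
print(unmerged, merged)
```

After merging, the saved checkpoint is an ordinary Qwen2ForSequenceClassification model and can be loaded without PEFT.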

Model structure

```shell
PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): Qwen2ForSequenceClassification(
      (model): Qwen2Model(
        (embed_tokens): Embedding(151936, 2048)
        (layers): ModuleList(
          (0-35): 36 x Qwen2DecoderLayer(
            (self_attn): Qwen2SdpaAttention(
              (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
              (k_proj): Linear(in_features=2048, out_features=256, bias=True)
              (v_proj): Linear(in_features=2048, out_features=256, bias=True)
              (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
              (rotary_emb): Qwen2RotaryEmbedding()
            )
            (mlp): Qwen2MLP(
              (gate_proj): Linear(in_features=2048, out_features=11008, bias=False)
              (up_proj): Linear(in_features=2048, out_features=11008, bias=False)
              (down_proj): Linear(in_features=11008, out_features=2048, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
            (post_attention_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
          )
        )
        (norm): Qwen2RMSNorm((2048,), eps=1e-06)
      )
      (score): Linear(in_features=2048, out_features=2, bias=False)
    )
  )
)
```

Prediction

```python
# Input: "Give me my money back, what a trash fund"
txt = "退钱，什么辣鸡基金"
model_inputs = tokenizer([txt], max_length=MAX_LEN, truncation=True, return_tensors="pt").to(saved_model.device)
logits = saved_model(**model_inputs).logits
res = int(torch.argmax(logits, axis=1).cpu())
print(id2label[res])
```

```shell
负向
```

Output type

```shell
SequenceClassifierOutputWithPast(loss=None, logits=tensor([[-0.1387, 2.3438]], device='cuda:0', grad_fn=<IndexBackward0>), past_key_values=((tensor([[[[ -3.3750, 0.3164, 2.3125, ..., 56.5000, 26.0000, 87.0000],
          [ -4.6875, 3.0312, 0.6875, ..., 57.7500, 24.3750, 86.0000],
          [ -0.7109, 1.1094, -0.7383, ..., 56.7500, 24.8750, 86.5000],
          ...,
          ...,
          [-0.2188, 0.2148, 0.4375, ..., -0.1016, 0.9336, -1.1016],
          [ 1.3281, 0.3359, 1.3125, ..., -0.3906, 0.0312, -0.0391],
          [ 0.8789, 0.5312, 1.4297, ..., 0.1797, -0.9609, -0.6445]]]], device='cuda:0'))), hidden_states=None, attentions=None)
```

Run log and test results

```shell
Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at /mnt/workspace/.cache/modelscope/hub/qwen/Qwen2___5-3B-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2024-10-09 23:54:07.615 | INFO | __main__:<module>:53 - start Trainingrank: 64
trainable params: 119,738,368 || all params: 3,205,681,152 || trainable%: 3.7352
[6/6 00:08, Epoch 3/3]
Epoch  Training Loss  Validation Loss  F1
1      No log         0.988281         0.333333
2      No log         0.527344         0.733333
3      No log         0.453125         1.000000
2024-10-09 23:54:17.371 | INFO | __main__:<module>:56 - Valid Set, rank: 64
4it [00:00, 8.03it/s]
2024-10-09 23:54:17.896 | INFO | __main__:<module>:59 - None
2024-10-09 23:54:17.897 | INFO | __main__:<module>:61 - Test Set, rank: 64
Accuracy: 1.0
micro Precision: 1.0
micro Recall: 1.0
micro F1 Score: 1.0
macro Precision: 1.0
macro Recall: 1.0
macro F1 Score: 1.0
weighted Precision: 1.0
weighted Recall: 1.0
weighted F1 Score: 1.0
4it [00:00, 13.58it/s]
2024-10-09 23:54:18.218 | INFO | __main__:<module>:64 - None
Accuracy: 0.75
micro Precision: 0.75
micro Recall: 0.75
micro F1 Score: 0.75
macro Precision: 0.8333333333333333
macro Recall: 0.75
macro F1 Score: 0.7333333333333334
weighted Precision: 0.8333333333333333
weighted Recall: 0.75
weighted F1 Score: 0.7333333333333334
```
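The `trainable params: 119,738,368` line in the log can be reproduced by hand. Each LoRA-adapted linear layer adds r × (in_features + out_features) parameters, and the layer shapes can be read off the model structure printed earlier. The sketch below assumes that the `score` head is also counted as trainable; under that assumption the total matches the log exactly. The function and constant names are hypothetical.

```python
from typing import Dict, Tuple


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# (in_features, out_features) of the targeted projections, taken from the printed structure.
QWEN25_3B_TARGETS: Dict[str, Tuple[int, int]] = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 11008),
    "up_proj": (2048, 11008),
    "down_proj": (11008, 2048),
}


def trainable_params(targets: Dict[str, Tuple[int, int]], num_layers: int, r: int,
                     hidden_size: int, num_labels: int) -> int:
    per_layer = sum(lora_params(i, o, r) for i, o in targets.values())
    head = hidden_size * num_labels  # assumed: the bias-free score head is trained too
    return per_layer * num_layers + head


total = trainable_params(QWEN25_3B_TARGETS, num_layers=36, r=64, hidden_size=2048, num_labels=2)
percent = 100 * total / 3_205_681_152  # "all params" from the log
print(f"{total:,} trainable ({percent:.4f}%)")
```

The same arithmetic gives a quick estimate of how the trainable share changes with the rank, for example with rank 96 as used in the long-text experiments.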

IV. Results from my own tests

4.1 Short text

The task is sentiment polarity classification of single-turn customer queries from general conversations.

Three-class classification, maximum length 128 characters, about 6K training samples.

On queries, the LLM cannot beat a basic BERT:

```shell
Accuracy: 0.9334389857369255
micro Precision: 0.9334389857369255
micro Recall: 0.9334389857369255
micro F1 Score: 0.9334389857369255
macro Precision: 0.9292774942877138
macro Recall: 0.9550788300142491
macro F1 Score: 0.9388312342456646
weighted Precision: 0.9418775412386249
weighted Recall: 0.9334389857369255
weighted F1 Score: 0.93383533375322

              precision    recall  f1-score   support
           0       1.00      0.88      0.93       334
           1       0.94      0.99      0.97       101
           2       0.85      0.99      0.92       196
    accuracy                           0.93
   macro avg       0.93
weighted avg       0.94
```

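I compared against Chinese-RoBerta-large-wwm and its various derivatives. The 7B, 3B, 1.5B, and 0.5B LLMs showed no advantage and fell short of both the Large and Base models. Even a cut-down configuration at the scale of a few tens of M parameters reached around 85, so the LLM advantage was not evident.

Conclusions:

In short-text scenarios, the LLM's advantages are in cases with few samples or an imbalanced sample distribution, and in generating pseudo-labels with a 72B-scale model through prompting and few-shot examples.

If the sample size is in the tens of thousands and the scenario is valuable enough, you can try a model of 14B or larger, tune the hyperparameters to see how many points it gains, and then distill it.

Otherwise, most short-text scenarios do not need a large model, unless the task is generative, such as query expansion or multi-turn query rewriting.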

4.2 Long text

This experiment uses ASR-transcribed text.

The training set contains 4,918 samples, with an average length of 740 characters, a maximum length of 4,631 characters, and a 75th percentile of 918 characters.

With LoRA fine-tuning, results are usually better after 2 epochs, and the rank parameter needs to be tuned appropriately.

epoch=1, rank=96, alpha=2*rank

```shell
Accuracy: 0.8415637860082305
micro Precision: 0.8415637860082305
micro Recall: 0.8415637860082305
micro F1 Score: 0.8415637860082305
macro Precision: 0.8075007129137883
macro Recall: 0.770659344467927
macro F1 Score: 0.7726373117446225
weighted Precision: 0.8509932419375813
weighted Recall: 0.8415637860082305
weighted F1 Score: 0.8420807262647815

              precision    recall  f1-score   support
           0       0.95      0.83      0.89       163
           1       0.76      0.77      0.77        66
           2       0.78      0.89      0.83        63
           3       0.81      0.81      0.81        42
           4       0.80      0.93      0.86        30
           5       0.48      0.56      0.51        18
           6       1.00      0.43      0.60         7
           7       0.88      0.95      0.92        97
```

epoch=3, rank=96, alpha=2*rank

```shell
Accuracy: 0.8847736625514403
micro Precision: 0.8847736625514403
micro Recall: 0.8847736625514403
micro F1 Score: 0.8847736625514403
macro Precision: 0.8765027065399982
macro Recall: 0.8400805218716799
macro F1 Score: 0.8527883278910355
weighted Precision: 0.8903846924862034
weighted Recall: 0.8847736625514403
weighted F1 Score: 0.8852820009557909

              precision    recall  f1-score   support
           0       0.94      0.89      0.91       163
           1       0.77      0.85      0.81        66
           2       0.81      0.88      0.83        42
           3       0.79      0.90      0.86        63
           4       1.00      0.93      0.97        30
           5       0.92      0.61      0.73        18
           6       0.83      0.71      0.77         7
           7       0.96      0.94      0.95        97
```

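V. Related materials

The following provides a very detailed guide to fine-tuning an LLM classification head and is well worth reading:

Qiu Zhenyu: An exploration of using large models for traditional NLP tasks https://zhuanlan.zhihu.com/p/...

Process and performance comparison of fine-tuning RoBERTa, Llama 2, and Mistral with LoRA on a disaster-tweet analysis scenario: https://segmentfault.com/a/11...

SFT classification-head fine-tuning code (essentially the same code with the LoRA lines removed): https://github.com/muyaostudi..._seq_cls

Code on Zhihu:

Hao Ke'ai: Building a text classification model with LlamaForSequenceClassification https://zhuanlan.zhihu.com/p/...

Related code can be found on GitHub, and Kaggle is also worth checking. These two are the main places to look.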
