
dataset_text_field of SFTTrainer for parallel corpus


I plan to use SFTTrainer (Supervised Fine-tuning Trainer) to fine-tune a sequence-to-sequence model. Here is the code:

    from datasets import load_dataset
    from trl import SFTTrainer
    from transformers import AutoModelForSeq2SeqLM

    dataset = load_dataset("my-parallel-corpus", split="train")

    model = AutoModelForSeq2SeqLM.from_pretrained("my-seq2seq-model")

    trainer = SFTTrainer(
        model,
        train_dataset=dataset,
        dataset_text_field="text",  # what should I put here?
        max_seq_length=512,
    )
    trainer.train()

Sample records from the train split of my parallel corpus look like this:

    { "fr": "Bonjour", "en": "Good morning" }
    { "fr": "Au revoir", "en": "Goodbye" }

My question is: what should I put as dataset_text_field when each record in the Hugging Face dataset is a JSON object with separate "fr" and "en" fields, as in the sample, rather than a single text field?
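
For illustration, here is a minimal sketch of one way the two fields could be merged into a single "text" column that dataset_text_field can point at. The helper name to_text and the prompt template are hypothetical and not part of trl or datasets. Whether a merged string is a sensible input for a seq2seq model is also an assumption: SFTTrainer is mainly documented for causal language models.

```python
def to_text(example, src="fr", tgt="en"):
    """Merge one parallel-corpus record into a single training string.

    The template below is an arbitrary, hypothetical choice; any format
    that keeps source and target distinguishable would work the same way.
    """
    return {"text": f"Translate {src} to {tgt}: {example[src]}\n{example[tgt]}"}


# The two sample records from the question, as plain dicts.
records = [
    {"fr": "Bonjour", "en": "Good morning"},
    {"fr": "Au revoir", "en": "Goodbye"},
]

merged = [to_text(r) for r in records]
print(merged[0]["text"])
```

With a datasets.Dataset, the same function could be applied as dataset = dataset.map(to_text), after which dataset_text_field="text" would refer to the new column. If the source and target need to stay separate as encoder input and decoder labels, a seq2seq-specific trainer such as transformers' Seq2SeqTrainer may be a more natural fit than SFTTrainer.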

