I plan to use SFTTrainer (Supervised Fine-tuning Trainer) to fine-tune a sequence-to-sequence model. Here is the code:
from datasets import load_dataset
from trl import SFTTrainer
from transformers import AutoModelForSeq2SeqLM

dataset = load_dataset("my-parallel-corpus", split="train")
model = AutoModelForSeq2SeqLM.from_pretrained("my-seq2seq-model")

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    dataset_text_field="text",  # what should I put here?
    max_seq_length=512,
)
trainer.train()
Sample records from the train split of my parallel corpus look like this:
{ "fr": "Bonjour", "en": "Good morning" }
{ "fr": "Au revoir", "en": "Goodbye" }
My question is: what should I put as dataset_text_field if the dataset is in HuggingFace JSON string format?