I am trying to fine-tune the Whisper model by following the article. If you want to see the code, please look at the Colab link.
The only change I want to make is to replace the Common Voice dataset used in the article with my own dataset.
When I use the prepared Common Voice dataset, it works very well. The Common Voice dataset appears to use a pre-cached .arrow file.
Because of this, it is fast and the whole process runs smoothly. With my own dataset, however, it does not work.
Specifically, the following call takes a very long time and does not work.
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=2)
My guess is that the difference comes from the pre-cached data. I load my dataset with the simple code below.
My code does not create a cache file of the decoded audio arrays.
import os
import re

import numpy as np
import soundfile as sf
from datasets import Dataset

# getDirList, getFileList and getJson are my own helper functions (not shown).

class DataLoader_AIHub:
    def __init__(self, rootPath):
        self.rootPath = rootPath

    def getData(self, max_files_to_load, startPoint=0):
        rootPath_audio = os.path.join(self.rootPath, 'audio')
        audioDirPaths = getDirList(rootPath_audio)
        total_files_loaded = 0
        data_list = []
        for audioDir in audioDirPaths:
            audioFileNames = getFileList(audioDir)
            audioFilePaths = [audioDir + '/' + str(item) for item in audioFileNames]
            labelFilePaths = [item.replace('/audio/', '/label/').replace('.wav', '.json') for item in audioFilePaths]
            for audioPath, labelPath in zip(audioFilePaths, labelFilePaths):
                jsonInfo = getJson(labelPath)
                # Skip transcripts that contain '('.
                if '(' in jsonInfo['발화정보']['stt']:
                    continue
                # Skip the first startPoint usable files.
                if startPoint > total_files_loaded:
                    total_files_loaded += 1
                    continue
                # Decode the whole file into memory as float32.
                audio, sr = sf.read(audioPath)
                audioArray = audio.astype(np.float32)
                record = {
                    'audio': {'path': audioPath, 'array': audioArray, 'sampling_rate': sr},
                    'sentence': re.sub('\r\n', '', jsonInfo['발화정보']['stt']),
                    'age': jsonInfo['녹음자정보']['age'],
                    'gender': jsonInfo['녹음자정보']['gender'],
                }
                data_list.append(record)
                total_files_loaded += 1
                if total_files_loaded >= max_files_to_load + startPoint:
                    return Dataset.from_list(data_list)
        return Dataset.from_list(data_list)
(It is a Korean dataset.)
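For comparison, a lighter variant of the loader might collect only file paths and transcripts and leave the decoding for later. This is a minimal sketch, not something I have confirmed fixes the problem: `iter_records` is a hypothetical name, it uses only the standard library instead of my `getDirList`/`getFileList`/`getJson` helpers, and it assumes the same `audio/` and `label/` directory layout and JSON keys as the loader above.

```python
import json
import os


def iter_records(root_path, max_files, start=0):
    """Yield {'path', 'sentence'} dicts without decoding any audio."""
    audio_root = os.path.join(root_path, "audio")
    seen = 0  # usable files encountered so far, including skipped ones
    for dir_name in sorted(os.listdir(audio_root)):
        audio_dir = os.path.join(audio_root, dir_name)
        if not os.path.isdir(audio_dir):
            continue
        for file_name in sorted(os.listdir(audio_dir)):
            if not file_name.endswith(".wav"):
                continue
            audio_path = os.path.join(audio_dir, file_name)
            # Assumed convention: audio/<dir>/x.wav pairs with label/<dir>/x.json
            label_path = os.path.join(
                root_path, "label", dir_name, file_name[:-4] + ".json"
            )
            with open(label_path, encoding="utf-8") as f:
                info = json.load(f)
            sentence = info["발화정보"]["stt"]
            # Skip transcripts that contain '(' as in the original loader.
            if "(" in sentence:
                continue
            seen += 1
            if seen <= start:
                continue  # still inside the skipped prefix
            yield {"path": audio_path, "sentence": sentence.replace("\r\n", "")}
            if seen >= start + max_files:
                return
```

Because this is a generator, nothing is held in memory beyond the current record, and each record is just two short strings instead of a full waveform.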
The audio files (.wav) are sampled at 16 kHz, and audioArray is the decoded waveform array. I presume the .arrow file stores these decoded arrays.
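To check the 16 kHz sampling rate and get a feel for how large the decoded arrays are, something like the following could be run on a single file. It is a rough sketch that uses Python's standard `wave` module rather than `soundfile`; `wav_summary` is a hypothetical helper name, and the size estimate assumes one 4-byte float32 value per sample, as with `audioArray` above.

```python
import wave


def wav_summary(path):
    """Return sampling rate, duration and estimated float32 size of a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        channels = wav.getnchannels()
    return {
        "sampling_rate": rate,
        "seconds": frames / rate,
        # One 4-byte float32 per sample per channel once decoded.
        "float32_bytes": frames * channels * 4,
    }
```

By this estimate, a 10-second mono clip at 16 kHz takes about 640 KB once decoded (160,000 samples × 4 bytes), so the memory held by `data_list` grows quickly with the number of files.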
Am I doing something wrong?