I want to train a few chosen models (MobileNet, Xception and ResNet50) on a facial emotion recognition task. I am using the FER2013 dataset; however, I don't need to recognize all of the included emotions, only sad, angry, fearful, neutral and happy, so it's 5 labels in total. Because the dataset is imbalanced, I applied class weights. I'm training the models from scratch with Keras and TensorFlow.
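For reference, this is a minimal sketch of how such class weights are typically computed, using scikit-learn's `compute_class_weight` (an assumption on my side; the `train_labels` array below is placeholder data, not the real FER2013 labels):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical integer labels for the 5 kept classes, e.g.
# 0=angry, 1=fearful, 2=happy, 3=neutral, 4=sad (placeholder data).
train_labels = np.array([0, 0, 2, 2, 2, 2, 3, 4, 1, 2])

# 'balanced' makes each weight inversely proportional to class frequency:
# weight_c = n_samples / (n_classes * count_c)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(train_labels),
                               y=train_labels)
class_weight = dict(enumerate(weights))

# Passed to training as: model.fit(..., class_weight=class_weight)
```

With this scheme, rare classes (like "fearful" in the placeholder data) get a weight above 1 and frequent ones below 1.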
Based on Papers With Code (~70% with Inception, for example), I would expect to reach accuracy around 70% or even higher, since those results are for the full 7-class dataset and I'm only using 5 classes.
Unfortunately, the best the models reach is ~65% (Xception), ~62% (ResNet50) and ~63% (MobileNet) before they start to overfit.
For data augmentation I'm using the following transformations:
```python
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    fill_mode='constant',
    cval=0,
    horizontal_flip=True,
)
```
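For completeness, this is roughly how that generator gets attached to the training images with `flow_from_directory` (the directory layout, `target_size` and `color_mode` here are my assumptions, based on FER2013 being 48x48 grayscale; the empty temp directories only exist to make the sketch self-contained):

```python
import os
import tempfile
import tensorflow as tf

# train_datagen as defined above
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255, width_shift_range=0.1, height_shift_range=0.1,
    zoom_range=0.1, fill_mode='constant', cval=0, horizontal_flip=True,
)

# Hypothetical layout: one sub-folder per kept emotion label.
root = tempfile.mkdtemp()
for emotion in ['angry', 'fearful', 'happy', 'neutral', 'sad']:
    os.makedirs(os.path.join(root, emotion))

train_gen = train_datagen.flow_from_directory(
    root,
    target_size=(48, 48),     # FER2013 images are 48x48
    color_mode='grayscale',   # single channel; fine when training from scratch
    class_mode='categorical',
    batch_size=16,
)
```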
I'm using the SGD optimizer with an initial learning rate of 1e-3, momentum of 0.9 and weight decay of 1e-4 (I have tried 1e-6 and 1e-2 with no improvement). The learning rate is halved after every 10 epochs without improvement. Batch size is 16, as a batch size of 8 brought no gains and only made the training process longer.
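The optimizer and learning-rate schedule described above can be sketched as follows (the `min_lr` floor and the `val_loss` monitor are my assumptions; note that the `weight_decay` argument on `tf.keras.optimizers.SGD` only exists from TF 2.11 onward, and on older versions an L2 `kernel_regularizer` is the usual substitute):

```python
import tensorflow as tf

# SGD as described: lr=1e-3, momentum=0.9, weight decay 1e-4.
# weight_decay requires TF >= 2.11; older versions would need
# L2 kernel regularization on the layers instead.
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3,
                                    momentum=0.9,
                                    weight_decay=1e-4)

# Halve the LR after 10 epochs without val_loss improvement.
lr_schedule = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                                   factor=0.5,
                                                   patience=10,
                                                   min_lr=1e-6)  # assumed floor

# Used as: model.fit(..., callbacks=[lr_schedule], class_weight=class_weight)
```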
As an example, here are the metrics from training Xception (batch size = 16, initial learning rate = 0.001, momentum = 0.9, weight decay = 1e-4):
Training accuracy, testing accuracy, training loss and testing loss curves: [plots omitted]
The best accuracy for this model was 65.64%.
What could be improved in my training method? Is there any way to achieve better results?