Facial Emotion Detection with Convolutional Neural Networks

Introduction

Emotions are an integral part of human communication, influencing everything from personal interactions to professional decisions. With the rapid advancements in Artificial Intelligence, building systems that can recognize and interpret human emotions has become increasingly relevant. In this blog, we’ll take a beginner-friendly walkthrough of a project that explores Facial Emotion Detection using deep learning.

The aim of this project is to develop a model that can classify facial expressions into one of seven emotions: Angry, Disgust, Fear, Happy, Sad, Surprise, and Neutral. From preprocessing raw image data to implementing cutting-edge techniques like transfer learning, we’ll take systematic steps to build a robust emotion detection system.

We’ll begin with a dataset containing labeled images of facial expressions. The data presents unique challenges, including class imbalance and variations in lighting and facial position. Through a combination of preprocessing techniques and experimentation with various deep learning models, we arrive at a solution that not only improves accuracy but also makes the model more robust.

Learning Outcomes

By the end of this blog, you’ll learn:

  • How to handle imbalanced data with techniques like class weights.

  • The benefits and challenges of using transfer learning in deep learning projects.

  • How to evaluate a model’s performance using AUC-ROC curves, confusion matrices, and more.

  • The thought process behind selecting and fine-tuning a pretrained model like ResNet50V2.

  • How to deploy a trained model using Gradio to make it user-friendly.

Dataset

The dataset used in this project comes from Kaggle and consists of 48x48 pixel grayscale images of human faces. These images are pre-processed to ensure the faces are roughly centered and occupy a consistent space within the frame.

Nature of the Dataset

The dataset contains seven emotion classes:
0 = Angry, 1 = Disgust, 2 = Fear, 3 = Happy, 4 = Sad, 5 = Surprise, 6 = Neutral
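
The code in the rest of this post assumes the images are organized into one folder per emotion inside separate train and test directories, roughly like the sketch below (an assumed layout based on how the generators are used, not the exact Kaggle listing):

train/
    angry/
    disgust/
    fear/
    happy/
    neutral/
    sad/
    surprise/
test/
    (the same seven subfolders)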

Exploring the Data

To better understand the distribution of classes in the dataset, we’ll visualize the number of images per class for both the training and testing subsets.

import os
import pandas as pd
import matplotlib.pyplot as plt

def count_files_in_subdirs(directory, set_name):
    """Count the number of images in each emotion subfolder of a directory."""
    counts = {}
    for item in os.listdir(directory):
        item_path = os.path.join(directory, item)
        if os.path.isdir(item_path):
            counts[item] = len(os.listdir(item_path))
    df = pd.DataFrame(counts, index=[set_name])
    return df

train_count = count_files_in_subdirs(train_dir, 'train')
test_count = count_files_in_subdirs(test_dir, 'test')

# Plot the class distribution for the training and test sets
train_count.transpose().plot(kind='bar')
test_count.transpose().plot(kind='bar')
plt.show()

Class distribution in the training set: "Happy" has the highest count, followed by "Neutral", "Sad", "Angry", "Fear", and "Surprise", while "Disgust" has by far the lowest.

Example Images from Each Emotion

To gain further insight, we visualized one image for each emotion class from the training data. Here’s the code we used:

# Display one sample image (the 43rd file) from each emotion folder
emotions = os.listdir(train_dir)
plt.figure(figsize=(15, 10))

for i, emotion in enumerate(emotions, 1):
    folder = os.path.join(train_dir, emotion)
    img_path = os.path.join(folder, os.listdir(folder)[42])
    img = plt.imread(img_path)
    plt.subplot(3, 4, i)
    plt.imshow(img, cmap='gray')
    plt.title(emotion)
    plt.axis('off')
plt.show()

Building a Custom CNN Model

To kick things off, we’ll design a Custom Convolutional Neural Network (CNN) from scratch. Starting with a custom architecture allows us to understand the data better, experiment with different layers, and establish a baseline for performance before moving on to more advanced techniques like transfer learning.

Designing the Architecture

The CNN we’ll build consists of multiple convolutional layers, each followed by activation functions and pooling layers. These are paired with fully connected layers at the end to map the learned features to our seven emotion classes. The architecture is simple yet effective for testing initial hypotheses about the data.

Here’s a brief look at the architecture:

model = Sequential()

model.add(Conv2D(32, kernel_size=(3, 3), kernel_initializer="glorot_uniform", padding='same', input_shape=(img_width, img_height, 1)))
model.add(Activation('relu'))
model.add(Conv2D(64, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(2, 2))
model.add(Dropout(0.25))

model.add(Conv2D(128, kernel_size=(3, 3), padding='same', kernel_regularizer=regularizers.l2(0.01)))
model.add(Activation('relu'))
model.add(Conv2D(256, kernel_size=(3, 3), kernel_regularizer=regularizers.l2(0.01)))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(512, kernel_size=(3, 3), padding='same', kernel_regularizer=regularizers.l2(0.01)))
model.add(Activation('relu'))
model.add(Conv2D(512, kernel_size=(3, 3), padding='same', kernel_regularizer=regularizers.l2(0.01)))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(1024))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(num_classes))
model.add(Activation('softmax'))

Things to Note:

  • Padding is set to 'same' so the spatial dimensions of the output match those of the input.

  • We use kernel_initializer='glorot_uniform' (which is also the default option).

  • BatchNormalization normalizes the activations of the preceding layer, which stabilizes and speeds up training.

  • kernel_regularizer adds an L2 penalty on the weights, discouraging them from growing too large, which improves generalization and reduces the risk of overfitting.

  • We use the softmax activation in the output layer because emotion detection is a multi-class classification problem: softmax converts the raw scores into a probability distribution over the seven classes, with values between 0 and 1 that sum to 1, which is exactly what ReLU cannot provide. See the short numeric sketch below.
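
To make the last point concrete, here is a tiny standalone NumPy example (unrelated to the model itself) showing how softmax turns arbitrary scores into a probability distribution:

import numpy as np

# Toy example: softmax maps raw scores (logits) to values in (0, 1) that sum to 1
logits = np.array([2.0, 1.0, 0.1])
probs = np.exp(logits) / np.sum(np.exp(logits))
print(probs)        # approximately [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0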

Training the Model

We’ll compile the model with the categorical cross-entropy loss function (since it’s a multi-class classification task) and the Adam optimizer for efficient convergence. During training, we’ll monitor the model’s accuracy and loss on both training and validation data.

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(
    train_generator,
    steps_per_epoch=train_steps_per_epoch,
    epochs=10,
    validation_data=validation_generator,
    validation_steps=validation_steps_epoch,
    callbacks=callbacks)
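
Note that the callbacks list passed to model.fit() is not defined in the snippet above. The exact configuration isn't shown here, but a typical setup might look like the following sketch (the patience values and checkpoint filename are assumptions):

from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint

# Hypothetical callback setup: stop early when validation loss stalls,
# lower the learning rate on plateaus, and keep the best weights on disk.
callbacks = [
    EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, min_lr=1e-6),
    ModelCheckpoint('custom_cnn_best.keras', monitor='val_accuracy', save_best_only=True),
]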

Evaluating the Model

Observations and Limitations

After training, the Custom CNN achieves moderate accuracy on the validation set. However, the model struggles with certain classes, especially those with fewer examples like Disgust. Additionally, variations in lighting and facial orientations make it challenging for the model to generalize well.

These limitations lead us to explore more advanced approaches, including data augmentation and transfer learning, to enhance performance and robustness.

Making Predictions and Visualizing Results

After training and fine-tuning the model, it's time to put it to the test and see how well it classifies facial emotions. In this section, we make predictions on random images from the test dataset and visually evaluate the results.

Emotion_Classes = ['Angry', 'Disgust', 'Fear', 'Happy', 'Neutral', 'Sad', 'Surprise']
# Pick a random batch from the test generator and 10 random image indices within it
batch_size = test_generator.batch_size
Random_batch = np.random.randint(0, len(test_generator) - 1)
Random_Img_Index = np.random.randint(0, batch_size, 10)
fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(10, 5),
                         subplot_kw={'xticks': [], 'yticks': []})

for i, ax in enumerate(axes.flat):
    Random_Img = test_generator[Random_batch][0][Random_Img_Index[i]]
    Random_Img_Label = np.argmax(test_generator[Random_batch][1][Random_Img_Index[i]], axis=0)
    Model_Prediction = np.argmax(model.predict(tf.expand_dims(Random_Img, axis=0), verbose=0), axis=1)[0]
    ax.imshow(Random_Img.squeeze(), cmap='gray')
    color = "green" if Emotion_Classes[Random_Img_Label] == Emotion_Classes[Model_Prediction] else "red"
    ax.set_title(f"True: {Emotion_Classes[Random_Img_Label]}\nPredicted: {Emotion_Classes[Model_Prediction]}", color=color)

plt.tight_layout()
plt.show()

Data Augmentation: Enhancing the Dataset

Data augmentation plays a crucial role in improving the model's generalization ability by artificially increasing the diversity of the training data. In this section, we apply several transformations to our images, such as rotation, shifting, shearing, and flipping, to make the model more robust to different facial expressions and orientations.

We start by setting up an ImageDataGenerator with augmentation options for the training set, ensuring that the model sees varied images during training.

# Initializing the ImageDataGenerator with data augmentation options
data_generator = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest',
    validation_split=0.2  # 20% of the data will be used for validation
)

train_generator = data_generator.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='categorical',
    color_mode='grayscale',
    subset='training')

validation_generator = data_generator.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='categorical',
    color_mode='grayscale',
    subset='validation')

The ImageDataGenerator transforms the images during training to introduce variety and make the model less prone to overfitting. For example, random rotations, horizontal flips, and shifts are applied to each image, allowing the model to learn from a wider variety of scenarios.

Below is a sample of one of the augmented images:

from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Load a single training image and add a batch dimension
img = load_img(image_path, color_mode='grayscale', target_size=(img_width, img_height))
img_array = img_to_array(img)
img_array = img_array.reshape((1,) + img_array.shape)
# Generate augmented versions of the image
aug_iter = data_generator.flow(img_array, batch_size=1)
aug_img = next(aug_iter)[0]
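
To see the effect of augmentation more clearly, a small grid of augmented variants of the same image can be displayed. This sketch reuses the aug_iter defined above:

import matplotlib.pyplot as plt

# Show several augmented variants of the same source image side by side
fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for ax in axes:
    augmented = next(aug_iter)[0]  # one augmented image, already rescaled to [0, 1]
    ax.imshow(augmented.squeeze(), cmap='gray')
    ax.axis('off')
plt.tight_layout()
plt.show()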

We then train the same custom CNN on this augmented data; its performance is summarized below.

Observation:

After training the custom CNN model on the augmented dataset, we observe a moderate improvement in accuracy, with a training accuracy of 57.64% and validation accuracy of 59.28%. While the model performs decently, there's still room for improvement, particularly in handling the complexity of facial emotion recognition. The performance plateau suggests that further tuning or more advanced architectures might yield better results.

At this stage, we transition to transfer learning by leveraging a pre-trained model, VGG16, which has been trained on large-scale image datasets like ImageNet. Transfer learning allows us to take advantage of the knowledge already embedded in the VGG16 model, enabling faster convergence and better generalization, especially when working with relatively smaller datasets like ours.

Transfer Learning: VGGNet

To address the limitations of our custom CNN, we now shift to transfer learning with the VGG16 architecture. VGG16, pre-trained on ImageNet, is a powerful feature extractor with millions of parameters that already capture visual patterns like edges, textures, and shapes. By reusing its learned features, we aim to significantly improve the performance of our model on facial emotion recognition. Note that we also freeze the earlier layers of the model.

Why Freeze Layers?

When utilizing pre-trained models, freezing earlier layers ensures that the basic visual features (like edges or corners) remain intact while reducing computational cost. VGG16’s earlier layers are generalized enough to apply across diverse datasets, making them ideal for feature extraction. However, to adapt the model to our task, we unfreeze the last few layers, allowing them to learn task-specific features such as facial expressions. This strategy strikes a balance between leveraging pre-trained knowledge and fine-tuning for our dataset.

# Load VGG16 excluding the end dense layers
vgg = VGG16(input_shape=(224, 224, 3), include_top=False, weights='imagenet')
# Freeze all layers except the last three
for layer in vgg.layers[:-3]:
    layer.trainable = False
# Adding custom Dense layers
x = Flatten()(vgg.output)
x = Dense(1024, activation='relu', kernel_initializer='he_normal')(x)
x = Dropout(0.5)(x)
x = Dense(512, activation='relu', kernel_initializer='he_normal')(x)
x = Dropout(0.5)(x)
output = Dense(7, activation='softmax', kernel_initializer='he_normal')(x)

model = Model(inputs=vgg.input, outputs=output)
model.compile(
    loss='categorical_crossentropy',
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
    metrics=['accuracy']
)
history = model.fit(
    train_generator,
    epochs=50,
    validation_data=test_generator,
    class_weight=class_weights_dict
)
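
The class_weight argument above relies on a class_weights_dict that isn't defined in the snippet. Here is a minimal sketch of how such a dictionary can be built from the training generator, assuming scikit-learn is available:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Weight each class inversely to its frequency so that rare classes
# (such as 'disgust') contribute more to the loss.
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes
)
class_weights_dict = dict(enumerate(class_weights))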

Model Evaluation

After training the VGG16 model, we evaluated its performance on both the training and validation datasets. The results are as follows:

  • Training Accuracy: 55.93%

  • Validation Accuracy: 55.00%

While the model shows some improvement compared to our custom CNN, the overall accuracy still leaves room for refinement. To visualize its learning trend, we plotted a line chart comparing training and validation accuracy across epochs.
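
Such a chart can be produced directly from the History object returned by model.fit(); here is a minimal sketch:

import matplotlib.pyplot as plt

# Plot training vs. validation accuracy per epoch from the History object
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('VGG16 transfer learning: accuracy per epoch')
plt.legend()
plt.show()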

Confusion Matrix Insights

A closer look at the confusion matrix reveals a significant bias in the model's predictions. Many emotions are misclassified as "happy," indicating that the model is struggling to differentiate between subtle emotional expressions. This could stem from imbalanced training data or the complexity of the task itself.
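
For completeness, here is a minimal sketch of how such a confusion matrix can be computed with scikit-learn, assuming the test generator was created with shuffle=False so that its labels line up with the prediction order:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Predict on the whole test set and compare against the generator's labels
y_pred = np.argmax(model.predict(test_generator), axis=1)
y_true = test_generator.classes

cm = confusion_matrix(y_true, y_pred)
ConfusionMatrixDisplay(cm, display_labels=list(test_generator.class_indices)).plot(cmap='Blues')
plt.title('Confusion matrix on the test set')
plt.show()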

Transfer Learning - ResNet50

ResNet50 is a 50-layer Residual Network, a powerful deep learning architecture known for its ability to overcome vanishing-gradient issues through the introduction of residual (skip) connections. ResNet50V2, an improved version, incorporates better normalization and gradient flow, making it particularly suitable for complex tasks like emotion detection.

Building on our experience with VGG16, we now explore ResNet50V2 to leverage its deeper architecture and advanced feature extraction capabilities.

Implementation

Here’s how we integrated ResNet50V2 into our pipeline:

from tensorflow.keras.applications import ResNet50V2
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout, BatchNormalization

base_model = ResNet50V2(input_shape=(224, 224, 3), include_top=False, weights='imagenet')
def create_resnet50v2_model():
    model = Sequential([
        base_model,  
        Dropout(0.25),
        BatchNormalization(),
        Flatten(),
        Dense(64, activation='relu'),
        BatchNormalization(),
        Dropout(0.5),
        Dense(7, activation='softmax')  # 7 output classes
    ])
    return model

model = create_resnet50v2_model()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
train_history = model.fit(
    train_generator,
    steps_per_epoch=train_steps_per_epoch,
    epochs=30,
    validation_data=test_generator,
    validation_steps=test_steps_epoch,
    class_weight=class_weights_dict,
    callbacks=callbacks
)

Model Evaluation

The ResNet50V2 model showcased a significant improvement in performance, as reflected in the accuracy metrics. After training for 30 epochs, the model achieved:

  • Train Accuracy: 62.61%

  • Validation Accuracy: 60.80%

This improvement highlights ResNet50V2's ability to extract more nuanced features from the dataset, thanks to its deeper architecture and residual connections. The line chart illustrates the steady learning curve of the model, demonstrating consistent improvements in both training and validation accuracy over epochs.

The confusion matrix further validates the model's performance. The prominently dark diagonal boxes indicate that the ResNet50V2 model makes highly accurate predictions across most classes, effectively addressing the shortcomings of earlier models like the custom CNN and VGG16.

This robust performance makes ResNet50V2 a strong candidate for deployment in our emotion detection system.

AUC-ROC Plot for Each Class

To further evaluate the performance of our ResNet50V2 model, we analyzed the AUC-ROC curve, which provides a comprehensive measure of the model's classification ability across all classes. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) for each class, and the Area Under the Curve (AUC) quantifies the model's ability to distinguish between classes.

These values indicate that the model excels at predicting emotions such as Happy and Surprise, with AUC scores of 0.88 for both. However, emotions like Fear (0.64) and Sad (0.66) show room for improvement. Overall, the high AUC scores for most classes highlight the model's strong ability to distinguish between multiple emotions effectively.
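
The per-class curves can be generated with scikit-learn. The sketch below assumes a test_generator built with shuffle=False and reuses the Emotion_Classes list from earlier; y_prob holds the model's softmax outputs:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

# One-vs-rest ROC curve for each emotion class
y_prob = model.predict(test_generator)                          # shape: (num_samples, 7)
y_true = label_binarize(test_generator.classes, classes=np.arange(7))

plt.figure(figsize=(8, 6))
for i, emotion in enumerate(Emotion_Classes):
    fpr, tpr, _ = roc_curve(y_true[:, i], y_prob[:, i])
    plt.plot(fpr, tpr, label=f"{emotion} (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], 'k--', label='Chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('One-vs-rest ROC curves per emotion class')
plt.legend()
plt.show()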

Model Deployment

After achieving satisfactory results with the ResNet50V2 model, we will deploy the emotion detection system to make it accessible for real-world usage. For deployment, we will utilize Hugging Face Spaces alongside Gradio, providing a user-friendly interface for predictions.

What is Hugging Face Spaces?

Hugging Face Spaces is a platform that allows developers to deploy machine learning applications as web-based demos with minimal effort. It supports multiple frameworks and tools like Gradio, Streamlit, and Flask, making it an excellent choice for hosting lightweight apps.

What is Gradio?

Gradio is a Python library that simplifies creating interactive web-based interfaces for machine learning models. It enables users to upload images, text, or other inputs, receive predictions, and interact with the model in real time, all through a clean and intuitive interface.

Deployment Workflow

We will start with a Gradio interface, tested first locally and on Google Colab. While Colab offers quick deployment options, its sessions are temporary, which makes it unsuitable for long-term access. To overcome this limitation, we will deploy the app on Hugging Face Spaces.

Below is the deployment code:

import os
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.image import img_to_array
import gradio as gr

MODEL_PATH = 'Final_Resnet50_Best_model.keras'

if not os.path.exists(MODEL_PATH):
    raise FileNotFoundError(f"Model file '{MODEL_PATH}' is missing. Please upload the file to the repository.")
model = tf.keras.models.load_model(MODEL_PATH)

emotion_labels = {'angry': 0, 'disgust': 1, 'fear': 2, 'happy': 3, 'neutral': 4, 'sad': 5, 'surprise': 6}
index_to_emotion = {v: k for k, v in emotion_labels.items()}

def prepare_image(img_pil):
    """Preprocess the PIL image to fit your model's input requirements."""
    img = img_pil.resize((224, 224))
    img_array = img_to_array(img)
    img_array = np.expand_dims(img_array, axis=0)
    img_array /= 255.0 
    return img_array

def predict_emotion(image):
    """Predict the emotion from the uploaded image."""
    processed_image = prepare_image(image)
    prediction = model.predict(processed_image)
    predicted_class = np.argmax(prediction, axis=1)
    predicted_emotion = index_to_emotion.get(predicted_class[0], "Unknown Emotion")
    return predicted_emotion

interface = gr.Interface(
    fn=predict_emotion,
    inputs=gr.Image(type="pil"),
    outputs="text",
    title="Emotion Detection",
    description=(
        "Upload/Click an image or select a sample image to detect the emotion."
    ),
)

if __name__ == "__main__":
    interface.launch()

This Gradio-based app accepts an image as input, preprocesses it to match the model’s requirements, and predicts the emotion using the trained ResNet50V2 model. The app also includes sample images for demonstration purposes.

Access the Deployed Application

You can interact with the deployed application via Hugging Face Spaces:
🔗 Hugging Face Spaces App: Emotion Detection App

The code and resources for this deployment are available in the following GitHub repository:
🔗 GitHub Repository: Emotion Detection GitHub Repo

Endnote

Building the Facial Emotion Detection System has been an insightful journey that underscored the importance of methodical experimentation in data science. From data preprocessing and augmentation to leveraging advanced architectures like VGG16 and ResNet50, each step highlighted how thoughtful model selection and fine-tuning can drastically influence outcomes.

We also explored deployment with Hugging Face Spaces, making the model accessible to end-users and demonstrating how AI solutions can be brought closer to real-world applications. This project reinforced the value of experimentation, the importance of thoughtful architecture selection, and the power of making AI practical.

Thank you for reading this blog! We hope it provided actionable insights for your data science journey. Feel free to share your thoughts or questions—your feedback is invaluable! 🚀