Artificial Intelligence — Learning by Doing
Project: Speech Emotion Recognition using CNN with CREMA-D dataset using Google Colab
Part 1: Exploratory Data Analysis of CREMA-D dataset
It’s the year when ChatGPT has become the ‘talk of the town’. While ChatGPT is an “all-purpose” AI tool that works on human input in textual form, there are many other AI branches worth exploring, among them computer vision, natural language processing (covering both text and speech) and robotics. In this article series, we will build a project for something called “Speech Emotion Recognition”, a task that belongs to the sub-branch ‘Sentiment Analysis’ under the branch ‘Natural Language Processing’. For a general overview of AI branches you can refer to my previous article: Artificial Intelligence, why does it matter in engineering?
Note: This work was initiated thanks to 360DigiTMG, for a Capstone Project that originally used the RAVDESS dataset. The work published in this article, however, is a redo of that project with CREMA-D.
There will be 3 parts to this article: Part 1 — Exploratory Data Analysis, where the task is explained in general and we dig further to understand our chosen dataset (CREMA-D); Part 2 — Feature Extraction and Model Training, where we will train a CNN model, check its accuracy and improve it if necessary; and Part 3 — Deployment on Web Application, where we will deploy this model on Streamlit.
Before we dig further, let us learn a bit about what ‘Speech Emotion Recognition’, or SER for short, is. SER is the task of letting AI identify or predict someone’s emotion based on his/her speech. To simplify the explanation, below is a simple flow diagram showing what an SER system would do.
About CREMA-D Dataset
Before we go further, it is important to understand a little bit about the dataset itself.
CREMA-D is a dataset of 7,442 original clips from 91 actors. These clips were from 48 male and 43 female actors between the ages of 20 and 74, coming from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified).
Actors spoke from a selection of 12 sentences. Each sentence was presented using one of six different emotions (Anger, Disgust, Fear, Happy, Neutral and Sad) and one of four emotion levels (Low, Medium, High and Unspecified).
The filename of each CREMA-D audio file follows a naming convention consisting of 4 blocks separated by underscores (_). An example is shown in the image below.
- The first block: Actor ID
- The second block: Sentence selection (out of 12 types)
- The third block: Emotion of the spoken sentence (out of 6 categories)
- The fourth block: Emotion level (out of 4 choices)
Based on the example in the image above, “1001” is the actor ID, “DFA” is the sentence (“Don’t forget a jacket”), “ANG” is the angry emotion and “XX” means unspecified intensity. Click the play button below to hear this example clip.
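As a quick illustration of the convention (a minimal sketch; the filename 1001_DFA_ANG_XX.wav is assumed from the example above), the four blocks can be recovered simply by splitting on underscores:
# split the example filename into its 4 blocks
fname = "1001_DFA_ANG_XX.wav"
actor_id, sentence, emotion, level = fname.replace(".wav", "").split("_")
print(actor_id, sentence, emotion, level)  # 1001 DFA ANG XX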
To download the full dataset and read further details, especially on the types and options for each filename block, you can go to the original GitHub repository for CREMA-D or to Kaggle.
Step-by-step Exploratory Data Analysis of CREMA-D Dataset
Now that we have a basic understanding of the data, let us go deeper into the technical side of audio data exploration.
Step 1: Download dataset from Kaggle.
- Go to this URL: https://www.kaggle.com/datasets/ejlok1/cremad and click the download button. Then, upload the files into a new project folder in your Google Drive. This will normally take some time, depending on your network download and upload speeds.
- Alternatively, you can download the data from the website above directly into Google Drive by following this article: How to Download Kaggle Dataset into Google Colab via Google Drive.
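For reference, here is one way that route can look (a minimal sketch, assuming you have a Kaggle account, have uploaded your kaggle.json API token to the notebook session, and have already mounted Google Drive as described in Step 2):
# install the Kaggle CLI (often already available in Colab)
!pip install -q kaggle
# place the kaggle.json API token where the CLI expects it
!mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
# download and unzip CREMA-D straight into the Drive project folder
!kaggle datasets download -d ejlok1/cremad -p /content/drive/MyDrive/CREMA-D --unzip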
Step 2: Create a new notebook in Google Colab at this URL: https://colab.research.google.com/.
- Then, mount your Drive so you can access the dataset we just saved in Google Drive. For first-timers, Colab will normally request permission from your Google Account to mount the drive. After mounting, you can access your Google Drive files from Colab (a minimal code sketch of this mounting step is shown after the cd example below).
- You can rename the notebook at the top if you wish. In my case, I renamed it to “CREMA-D.ipynb”.
- To make accessing the files more convenient, you can change the working directory with the “cd” command. For example:
%cd /content/drive/MyDrive/CREMA-D/AudioWAV
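For completeness, the mounting step mentioned above is just two lines; run them before the cd command (/content/drive is Colab’s default mount point):
from google.colab import drive
# Colab will ask for authorization the first time this runs
drive.mount('/content/drive')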
Step 3: Import the necessary libraries using the following commands:
import os
import pandas as pd
import matplotlib.pyplot as plt
import librosa
import numpy as np
import librosa.display
import IPython.display as ipd
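Colab normally ships with all of these libraries preinstalled. If librosa (or any of the others) happens to be missing in your environment, you can install it from a notebook cell before importing:
!pip install librosa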
Step 4: Classify (group) the dataset using the following set of commands.
- Create one variable ‘CREMA’ as the folder path for all audio data files:
CREMA = '/content/drive/MyDrive/CREMA-D/AudioWAV/'
- Get all filenames into a list and display the first 5:
# list all filenames, sort them, then display the first 5
dir_list = os.listdir(CREMA)
dir_list.sort()
dir_list[:5]
- Classify the data (into lists) based on the file names:
emotionG = []
gender = []
emotionO = []
path = []
female_ids = [1002,1003,1004,1006,1007,1008,1009,1010,1012,1013,1018,1020,1021,
1024,1025,1028,1029,1030,1037,1043,1046,1047,1049,1052,1053,1054,
1055,1056,1058,1060,1061,1063,1072,1073,1074,1075,1076,1078,1079,
1082,1084,1089,1091]
temp_dict = {"SAD":"sad", "ANG": "angry", "DIS":"disgust", "FEA":"fear",
"HAP":"happy", "NEU":"neutral"}
def get_emotion_crema(filename, ids=female_ids, dc=temp_dict):
    # split the filename into its 4 blocks: actor ID, sentence, emotion, level
    filename = filename.split("_")
    # map the emotion code (e.g. "ANG") to a readable label (e.g. "angry")
    emotionG1 = dc[filename[2]]
    # determine gender from the actor ID
    if int(filename[0]) in ids:
        emotionG2 = "_female"
    else:
        emotionG2 = "_male"
    # combined label, e.g. "angry_female"
    emotionG = emotionG1 + emotionG2
    return (emotionG, emotionG1, emotionG2[1:])
for i in dir_list:
    emotionG.append(get_emotion_crema(i)[0])
    emotionO.append(get_emotion_crema(i)[1])
    gender.append(get_emotion_crema(i)[2])
    path.append(CREMA + i)
- Create a pandas dataframe (table) based on the lists we built earlier:
CREMA_df = pd.DataFrame(emotionG, columns = ['emotionG_label'])
CREMA_df['source'] = 'CREMA'
CREMA_df = pd.concat([CREMA_df,pd.DataFrame(gender, columns = ['gender'])],axis=1)
CREMA_df = pd.concat([CREMA_df,pd.DataFrame(emotionO, columns = ['emotion'])],axis=1)
CREMA_df = pd.concat([CREMA_df,pd.DataFrame(path, columns = ['path'])],axis=1)
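As a quick sanity check (a minimal sketch; the expected count of 7,442 comes from the dataset description above), the dataframe should contain one row per audio clip and five columns (emotionG_label, source, gender, emotion, path):
print(CREMA_df.shape)  # expected (7442, 5) if all clips are present in the folder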
Step 5: Display and analyze the dataset distribution using the following set of commands:
- See the first 5 rows of the table itself:
CREMA_df.head()
- See the count of each emotion+gender:
CREMA_df.emotionG_label.value_counts()
- Or better, make a pivot table:
CREMA_df_summary = CREMA_df.pivot_table(index='emotion', columns='gender', aggfunc=len, values = 'source')
CREMA_df_summary
- And plot it:
CREMA_df_summary.plot(kind='barh')
Step 6: Get a random audio file, and use the “librosa” package we imported earlier to read and plot the audio:
- Waveplot:
n_files = CREMA_df.shape[0]
# choose random number
rnd = np.random.randint(0,n_files)
# use the Librosa library to load and plot the random speech
fname = CREMA_df.path[rnd]
data, sampling_rate = librosa.load(fname, sr=44100)
plt.figure(figsize=(15, 5))
info = CREMA_df.iloc[rnd].values
title_txt = f'{info[2]} voice - {info[0]} emotion (speech) - [{os.path.basename(fname)}]'
plt.title(title_txt.upper(), size=16)
librosa.display.waveshow(data, sr=sampling_rate)  # on librosa versions older than 0.10, use librosa.display.waveplot instead
# play the audio
ipd.Audio(fname)
Above is the resulting wave plot; in my run the random clip happened to be an angry female voice, and you can even play the sound to hear it.
- Linear Spectrogram:
X = librosa.stft(data)
Xdb = librosa.amplitude_to_db(abs(X))
plt.figure(figsize=(15, 5))
librosa.display.specshow(Xdb, sr=sampling_rate, x_axis='time', y_axis='hz')
plt.colorbar(format="%+2.0f dB")
plt.title(title_txt.upper(), size=16)
plt.show()
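If you are curious, you can also view the same clip on a mel scale, which is closer to how audio is usually represented for feature extraction (a minimal sketch using librosa’s built-in mel-spectrogram; the parameter n_mels=128 is just an illustrative choice, not necessarily what we will use in Part 2):
# compute a mel-scaled spectrogram and convert power to decibels
S = librosa.feature.melspectrogram(y=data, sr=sampling_rate, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)
plt.figure(figsize=(15, 5))
librosa.display.specshow(S_db, sr=sampling_rate, x_axis='time', y_axis='mel')
plt.colorbar(format="%+2.0f dB")
plt.title('MEL SPECTROGRAM - ' + title_txt.upper(), size=16)
plt.show()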
Step 7: You can also plot multiple files per emotion side by side to see the differences between emotions, via the following code:
def plot_speech(fname, ind, axis):
    # load one audio file and overlay its waveform (semi-transparent) on the given subplot axis
    data, sampling_rate = librosa.load(fname)
    librosa.display.waveshow(data, sr=sampling_rate, alpha=.2, ax=axis)  # use waveplot on librosa < 0.10

emotions_list = list(temp_dict.values())
fig, ax = plt.subplots(len(emotions_list), 1, figsize=(15, 10))
for n, emotion_ in enumerate(emotions_list):
    pdata = CREMA_df.loc[CREMA_df.emotion == emotion_]
    for i, speech_file in enumerate(pdata.path.values[:20]):
        plot_speech(speech_file, i, ax[n])
    ax[n].set_title(emotion_.upper(), size=16)
plt.tight_layout()
plt.savefig('emotions.png')  # save before show(), otherwise the saved figure would be blank
plt.show()
That’s all for the EDA part of this project (which is basically the easiest and most fun part). See below for the full code:
In the second part of this article, we are going to extract important features from each individual audio file and start training a model based on the chosen features. I will include some basic explanation of the audio features we are going to extract.
Thanks for reading, I hope you enjoyed the article and the activity that comes with it. If you like this content, do follow me and stay tuned for more!
(Note: For now I do not have a strict timeline for publishing the second part, but I am going to do my best to write and publish it as soon as possible.)