Advancing Sentiment- and Commonsense-Aware Multimodal Dialogue Systems

ISTD
24 Aug 2020

Dialogue systems have multifaceted applications in customer service, virtual assistants, education, mental health, and many other areas. Such applications are often deployed as chatbots because of their flexibility. The popularity of chatbots is evidenced by their market value: according to Revechat, it stood at 2.6 billion USD in 2019 and is projected to grow to 9.4 billion USD by 2024. Furthermore, 80% of businesses were expected to employ some form of chatbot by 2020, an adoption often driven by cost savings of up to 30% in customer service.

Creating a human-like conversational system is a long-standing goal of artificial intelligence (AI). It is not a trivial task, as we human beings rely on several variables, such as emotions, sentiment, prior assumptions, intent, and personality traits, to participate in dialogues and monologues (refer to Fig 1). These variables control the language we generate and the way we understand the language we hear. It is therefore an overstatement to claim that a generic framework such as seq2seq can generate near-perfect natural language.

Fig 1: The example above illustrates how human-like dialogue systems demand deeper language understanding, including sentiment-aware dialogue generation, entity extraction, understanding the reason for the expressed sentiment, tackling multilingual text and code-mixed data where multiple languages are mixed, and detecting sarcasm.

DeCLaRe Lab at SUTD

At the DeCLaRe Lab at SUTD (refer to Fig 2), Asst Prof Soujanya Poria and his team focus on developing cutting-edge neural models, grounded in sound linguistic concepts, to solve challenging Natural Language Processing (NLP) tasks. One of the primary goals of the research lab is to create a human-like dialogue understanding system by leveraging key factors such as pragmatics, affect, empathy, multimodal cues, and commonsense.

Fig 2: The DeCLaRe lab at SUTD, set up and led by Asst Prof Poria in 2019, aims to solve NLP challenges such as dialogue comprehension and generation and commonsense reasoning through powerful, scalable algorithms. To find out more about the hidden messages behind the logo, click here.

The lab’s multi-disciplinary research team, which comprises researchers and collaborators from different parts of the world, is not limited to developing deep-learning models, but also embraces the curation of various datasets to advance the research infrastructure. This includes creating multimodal monologue and dialogue datasets — MOSEI, MELD, and MUStARD — for sentiment, emotion, dialogue act, and sarcasm classification.

Multimodal Language Understanding: Looking Beyond Textual Data

The research focus of this lab dates back to 2015, when Asst Prof Poria proposed a novel multi-stage fusion approach powered by deep neural network-based feature extraction from text, facial expressions, and speech for sentiment analysis and emotion recognition in monologues and dialogues. This research area, called multimodal sentiment analysis, is a more complex form of sentiment analysis that extends textual sentiment analysis to vision and speech, and is particularly useful for analyzing sentiment in real-time scenarios such as conversations and YouTube videos. The core concept of multimodal sentiment analysis is multimodal representation learning: fusing multiple modalities to improve on the performance of unimodal systems (refer to Fig 3).

Fig 3: A multimodal prediction model in which multiple modalities are combined to improve prediction performance.
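To make the fusion idea concrete, here is a minimal PyTorch sketch of concatenation-based fusion over pre-extracted text, audio, and visual features. The feature dimensions, class count, and layer sizes are illustrative placeholders, not those of the lab's published models.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Concatenation-based fusion of text, audio, and visual features.

    All dimensions are illustrative placeholders, not the ones used in
    the lab's published models.
    """

    def __init__(self, text_dim=300, audio_dim=74, visual_dim=35,
                 hidden_dim=128, num_classes=3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + audio_dim + visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),  # e.g. negative / neutral / positive
        )

    def forward(self, text_feat, audio_feat, visual_feat):
        # Fuse by concatenating the unimodal feature vectors, then classify.
        joint = torch.cat([text_feat, audio_feat, visual_feat], dim=-1)
        return self.fuse(joint)

# A batch of 4 utterances with random stand-in features per modality.
model = MultimodalFusion()
logits = model(torch.randn(4, 300), torch.randn(4, 74), torch.randn(4, 35))
print(logits.shape)  # torch.Size([4, 3])
```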

This approach, published at the Conference on Empirical Methods in Natural Language Processing (EMNLP), was one of the first deep learning approaches to multimodal sentiment analysis. Asst Prof Poria went on to author several research papers on this topic, published at top conferences including the Association for the Advancement of Artificial Intelligence (AAAI), the Association for Computational Linguistics (ACL), and the North American Chapter of the Association for Computational Linguistics (NAACL).

He worked alongside respected research collaborators from the University of Michigan (Prof Rada Mihalcea), Carnegie Mellon University (Prof L.P. Morency), Instituto Politécnico Nacional (Prof Alexander Gelbukh and Dr. Navonil Majumder), and Adobe Research (Dr. Niyati Chhaya) to advance the field of multimodal sentiment analysis to new heights. Today, multimodal language understanding is one of the most critical research areas in AI.

Building on this work, Asst Prof Poria and his teammates embarked on a major research direction focused on creating a human-like dialogue understanding system grounded in pragmatics, affect, empathy, multimodal cues, and commonsense.

Upon realizing the importance of context in improving the performance of multimodal classification systems, they devised an autoregressive network, implemented using long short-term memory (LSTM), to capture context in dialogues and monologues for sentiment classification.

Although simple, it was one of the early models to showcase the power of contextual information in improving benchmark performance on sentiment classification. This idea was published at one of the most reputed NLP conferences, the Association for Computational Linguistics (ACL).
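As a rough illustration of the idea (not the published architecture), the sketch below runs a bidirectional LSTM over pre-extracted utterance features so that each utterance's sentiment prediction is informed by its neighbours in the dialogue. All dimensions are made-up placeholders.

```python
import torch
import torch.nn as nn

class ContextualLSTM(nn.Module):
    """Classify each utterance using the surrounding dialogue as context.

    A bidirectional LSTM runs over pre-extracted utterance features so
    that each prediction sees its neighbouring utterances.
    """

    def __init__(self, utt_dim=100, hidden_dim=64, num_classes=3):
        super().__init__()
        self.context_rnn = nn.LSTM(utt_dim, hidden_dim,
                                   batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, utterances):
        # utterances: (batch, num_utts, utt_dim)
        context, _ = self.context_rnn(utterances)  # context-aware representations
        return self.classifier(context)            # per-utterance sentiment logits

model = ContextualLSTM()
dialogues = torch.randn(2, 10, 100)  # 2 dialogues, 10 utterances each
print(model(dialogues).shape)        # torch.Size([2, 10, 3])
```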

Inspired by this success, Asst Prof Poria and the team started inventing new ways of capturing context in more complex scenarios, for instance, in conversations where the contextual information depends on the interlocutors’ relationship. The dialogue understanding concept proposed by Asst Prof Poria and his co-authors from Instituto Politécnico Nacional, the University of Michigan, and Carnegie Mellon University was published in IEEE Access in 2019 (refer to Fig 4).

Fig 4: The diagram above illustrates the interaction among different variables during a dyadic conversation between persons A and B. Grey and white circles represent hidden and observed variables, respectively. P stands for personality, U for utterance, S for interlocutor state, I for interlocutor intent, E for emotion, and Topic for the topic of the conversation. This framework can easily be extended to multi-party conversations.

Recognizing Emotion in Dialogues

The vision of this research has since paved the way for several new research directions, such as Emotion Recognition in Conversations (ERC), Multimodal Sentiment Analysis, and Affective Dialogue Systems.

The field of ERC (refer to Fig 5), a term coined by Asst Prof Poria et al. in 2019, deals with the problem of recognizing emotions in dyadic or multi-party conversations by modeling speaker-specific context, the interlocutors’ dependencies, and so on. ERC is essential for understanding emotions and sentiment in conversations and for sentiment-aware dialogue generation. The research team has published several works on ERC at top NLP and AI venues, including EMNLP, ACL, and AAAI.

Fig 5: Emotion Recognition in Conversations (ERC), a task recently introduced by Asst Prof Poria and his colleagues, carried out on the popular television sitcom Friends. The task is to understand the emotions of the speakers in a conversation.

The idea of context and speaker-state modeling for emotion recognition and other classification tasks in dialogues has emerged as an essential task in NLP research and has gained significant attention from the NLP and AI community. In just one and a half years, this research field has produced more than 60 research papers, and two workshops have been dedicated to it.

Using DialogueRNN for Emotion Recognition and Dialogue Classification

One of the methods for addressing emotion recognition in conversations using context, which has received much attention from the AI community, is DialogueRNN. A paper on it was published at AAAI 2019. It uses a hierarchical recurrent neural network to model the interlocutors’ latent states and relationships, which shape the context for dialogue understanding and classification tasks such as speaker-state and emotion recognition.
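The snippet below is a heavily simplified, hypothetical sketch of the speaker-state idea behind DialogueRNN, not the published model: a global GRU tracks the conversation context, while each interlocutor keeps a private state that is updated only on their own turns and is used to classify each utterance's emotion.

```python
import torch
import torch.nn as nn

class SimpleSpeakerStateRNN(nn.Module):
    """Simplified speaker-state recurrence in the spirit of DialogueRNN.

    A global GRU tracks the conversation; each speaker has a private
    state updated only on their turns. Not the published implementation.
    """

    def __init__(self, utt_dim=100, state_dim=64, num_classes=6):
        super().__init__()
        self.global_cell = nn.GRUCell(utt_dim, state_dim)
        self.speaker_cell = nn.GRUCell(utt_dim + state_dim, state_dim)
        self.classifier = nn.Linear(state_dim, num_classes)

    def forward(self, utterances, speakers, num_speakers):
        # utterances: (num_turns, utt_dim); speakers: speaker id per turn
        state_dim = self.global_cell.hidden_size
        g = torch.zeros(1, state_dim)             # global context state
        q = torch.zeros(num_speakers, state_dim)  # one state per speaker
        logits = []
        for u, s in zip(utterances, speakers):
            u = u.unsqueeze(0)
            g = self.global_cell(u, g)            # update global context
            # Update only the current speaker's state, conditioned on context.
            q_s = self.speaker_cell(torch.cat([u, g], dim=-1),
                                    q[s].unsqueeze(0))
            q = q.clone()
            q[s] = q_s.squeeze(0)
            logits.append(self.classifier(q_s))   # classify from speaker state
        return torch.cat(logits, dim=0)           # (num_turns, num_classes)

model = SimpleSpeakerStateRNN()
utts = torch.randn(5, 100)  # 5 turns of pre-extracted utterance features
out = model(utts, speakers=[0, 1, 0, 1, 0], num_speakers=2)
print(out.shape)  # torch.Size([5, 6])
```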

This idea has great potential for application in many other sequential and contextual classification tasks. At EMNLP 2019, Asst Prof Poria, together with his students and collaborators, published DialogueGCN, a graph neural network for modeling context in dialogues, which demonstrated the efficacy of graph neural networks for context modeling across a wide range of dialogue understanding tasks.
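To illustrate graph-based context modeling in the spirit of DialogueGCN (the published model uses relational graph convolutions over speaker- and position-dependent edges), here is a toy graph convolution in which each utterance node averages information from itself and its temporal neighbours. Everything here is a simplified assumption for exposition.

```python
import torch
import torch.nn as nn

class UtteranceGCN(nn.Module):
    """Toy graph convolution over utterance nodes.

    Illustrates how graph-based models propagate dialogue context along
    edges; the real DialogueGCN uses relational graph convolutions.
    """

    def __init__(self, utt_dim=100, num_classes=6):
        super().__init__()
        self.weight = nn.Linear(utt_dim, utt_dim)
        self.classifier = nn.Linear(utt_dim, num_classes)

    def forward(self, utterances, adjacency):
        # utterances: (num_utts, utt_dim); adjacency: (num_utts, num_utts)
        # Normalize so each node averages over its neighbours.
        deg = adjacency.sum(dim=-1, keepdim=True).clamp(min=1)
        h = (adjacency / deg) @ self.weight(utterances)
        return self.classifier(torch.relu(h))

# Connect each utterance to itself and its immediate neighbours in time.
n = 6
adj = torch.eye(n)
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0

model = UtteranceGCN()
print(model(torch.randn(n, 100), adj).shape)  # torch.Size([6, 6])
```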

Detecting emotions in conversations will lead to more emotion-aware and empathetic response generation. Based on this premise, Asst Prof Poria’s team developed an empathetic chatbot, which they aim to deploy for various use cases, such as e-learning and patient care.

Making Sense of Commonsense

Recently, Asst Prof Poria has started working on infusing commonsense knowledge into deep learning models to improve NLP tasks. In his latest work, published at ACL 2020, he demonstrated the efficacy of commonsense knowledge in improving domain adaptation.

At present, he is actively researching how to extend this idea to enhance dialogue understanding and, eventually, enrich generation performance. The illustration below (refer to Fig 6) depicts two scenarios in which a chatbot engages with users from two different age groups (a young man and a child). When posed with the same query, the chatbot performs commonsense reasoning to understand that toys are played with primarily by children, and uses the age of the user (present as metadata) to generate engaging, age-appropriate replies. This whole reasoning process requires the bot to infer the relevant concepts and verify whether they fit together, that is, whether they follow common sense.

Fig 6: An illustration of commonsense-aware dialogue generation.
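A toy illustration of this reasoning step follows, assuming a tiny hand-written ConceptNet-style triple store; the triples, helper functions, and replies are hypothetical stand-ins, not a real knowledge-base API. The bot keeps a candidate reply only if its key concept is commonsense-compatible with what it knows about the user.

```python
# Hypothetical ConceptNet-style triples: (head, relation, tail).
TRIPLES = {
    ("toy", "UsedFor", "play"),
    ("toy", "RelatedTo", "child"),
    ("child", "Desires", "play"),
}

def related(a: str, b: str) -> bool:
    """Return True if two concepts share any direct commonsense relation."""
    return any({h, t} == {a, b} for h, _, t in TRIPLES)

def choose_reply(concepts, user_age, replies):
    """Pick a reply whose key concept fits the user, per common sense."""
    user_concept = "child" if user_age < 13 else "adult"
    for concept, reply in replies:
        if concept in concepts and related(concept, user_concept):
            return reply
    return "Tell me more!"  # fallback when no reply passes the check

print(choose_reply({"toy"}, user_age=8,
                   replies=[("toy", "Want to hear about a fun new toy?")]))
```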

Commonsense knowledge can substantially improve dialogue understanding and generation, as depicted in the figure below (refer to Fig 7). Commonsense knowledge was also found to help in predicting emotion shifts and in distinguishing closely related emotion classes such as anger and frustration.

Fig 7: Commonsense knowledge can lead to explainable dialogue understanding, helping models to understand, reason about, and explain events and situations. This particular example shows commonsense inference from a sequence of utterances in a two-party conversation. Person A’s first utterance indicates that he/she is tired of arguing with person B. The tone of the utterance also implies that person B is being yelled at by person A, which invokes a reaction of irritation in person B. Person B then asks, angrily, what he/she can do to help. This again makes person A annoyed and influences him/her to respond with anger. This kind of inferred commonsense knowledge about the reaction, effect, and intent of the speaker and the listener helps in predicting the emotional dynamics of the participants.

Looking forward, Asst Prof Poria believes more exciting research can be done towards creating human-like multimodal conversational agents that are commonsense-aware, empathetic, and capable of understanding human sentiment.

To find out more about the latest research outcomes from Asst Prof Poria and the team, head to the DeCLaRe lab website.

Impact on academia and industry

The dialogue understanding research at DeCLaRe lab has created a substantial impact on both academia and industry. These are a few examples:

  1. The paper “Recent trends in deep learning-based natural language processing” was awarded the prestigious IEEE Computational Intelligence Magazine (CIM) Outstanding Paper Award in 2020.
  2. Papers on multimodal sentiment analysis by the DeCLaRe lab have been cited more than 2,000 times, according to Google Scholar.
  3. The GitHub repository dedicated to emotion recognition in conversations (ERC) has been starred almost 500 times (https://github.com/declare-lab/conv-emotion), and the field has seen more than 60 research papers published in just 1.5 years (https://github.com/declare-lab/awesome-emotion-recognition-in-conversations).
  4. The DialogueRNN framework by the DeCLaRe lab has been included in deep-learning and NLP course modules at top universities: http://www.cs.utsa.edu/~fernandez/deeplearning/index.html and http://www.sci.brooklyn.cuny.edu/~levitan/nlp-psych/
  5. The DeCLaRe lab’s research on speaker-specific modeling to enhance context understanding for dialogue classification has been adopted in industry. For example, Aiberry, a next-generation AI company that leverages deep tech to screen for mental health disorders, has designed its product based on the lab’s speaker-specific contextual dialogue understanding research.
  6. Asst Prof Poria has been invited to give multiple talks on dialogue understanding at top international venues such as CICLing 2018 and SocialNLP@IJCAI 2019.

Featured in news and blogs

The DeCLaRe lab has also been featured and mentioned by various media outlets and blogs:

  1. DialogueGCN has been featured by Techtimes: https://www.techtimes.com/articles/246226/20191126/graph-convolutional-networks-are-bringing-emotion-recognition-closer-to-machines-here-s-how.htm
  2. DialogueGCN has been featured on multiple blogs:
    1. https://aduk.de/blog/how-the-neural-network-learned-to-understand-emotions-on-the-tv-series-friends/
    2. https://blog.csdn.net/lwgkzl/article/details/103134280
    3. https://towardsdatascience.com/emotion-recognition-using-graph-convolutional-networks-9f22f04b244e
  3. The personality recognition system developed by the DeCLaRe lab has been featured by Datanami: https://www.datanami.com/2017/09/21/deep-learning-reveals-new-insights-people/
  4. The sarcasm recognition system developed by the DeCLaRe lab has been featured by KDnuggets: https://www.kdnuggets.com/2018/06/detecting-sarcasm-deep-convolutional-neural-networks.html


About Asst Prof Soujanya Poria

Asst Prof Soujanya Poria started his research in NLP, with a primary focus on sentiment analysis, in 2014. Before joining SUTD as an Assistant Professor in 2019, he was a Presidential Postdoctoral Fellow at Nanyang Technological University (NTU).

Soujanya’s research interests include sentiment analysis, multimodal language understanding, dialogue understanding, distilling commonsense information for natural language processing, and semantics. Some of his notable works include aspect extraction, multimodal sentiment analysis using deep learning, and emotion recognition in conversations for affective and empathetic dialogue generation.

To date, Asst Prof Poria’s research articles have been cited almost 8,000 times, with an h-index of 43, according to Google Scholar.

At SUTD, Asst Prof Poria founded the DeCLaRe lab, a research group conducting cutting-edge research in NLP.

For more info on Asst Prof Poria, click here.