ISTD PhD Oral Defense presented by Li Xu – Towards Effective, Robust, and Continual Multi-modal Learning
Abstract
In the ever-evolving field of artificial intelligence (AI), deep learning has emerged as a pivotal technique driving remarkable advancements across various domains. Among its many branches, multi-modal learning stands out as a particularly significant approach, which involves integrating and processing information from multiple modalities of data, such as visual content and language information, to enhance the capabilities of AI systems. The primary objective of multi-modal learning is to leverage the complementary information present in different modalities to achieve better performance than using any single modality alone, mimicking the way humans perceive and understand the world.
However, despite the significant implications of multi-modal learning for both academic research and practical applications, several important challenges in this field remain largely unexplored. First, strong reasoning ability is fundamental for multi-modal models to perform complex tasks effectively, and a comprehensive and challenging benchmark plays a pivotal role in evaluating and enhancing that ability. Second, when evaluating the reasoning abilities of multi-modal models, we value not only reasoning accuracy but also computational efficiency. Third, prediction robustness is another crucial aspect of multi-modal models, as robust models can generalize better to new, unseen data. Finally, multi-modal models often operate in dynamic environments where new data is constantly generated. To handle this challenge, it is crucial to equip multi-modal models with continual learning ability, which allows them to adapt to new information and improve over time without forgetting previously learned knowledge.
To address these challenges, in this thesis we first investigate the reasoning capability of multi-modal models by introducing a comprehensive dataset for evaluating and enhancing it. We also design an efficient neural network to achieve reliable and computation-efficient multi-modal video reasoning. To improve the prediction robustness of multi-modal models, we introduce a meta-learning-based framework that focuses on novel compositions of learned concepts. Finally, we propose a network that can continually learn to solve new multi-modal reasoning tasks without forgetting previously learned ones.
Speaker’s Profile
Li Xu is a PhD candidate at the ISTD pillar of the Singapore University of Technology and Design (SUTD), where he is advised by Prof. Jun Liu. Prior to his study at SUTD, he received his B.Eng. degree from Southeast University in Nanjing, China. His primary research interests are in computer vision and multi-modal learning. During his PhD study, he was a runner-up for the PREMIA Best Student Paper Awards 2021.