ISTD PhD Oral Defense presented by Haoran Li – Overcoming the Limitations of Autoregressive and Non-Autoregressive Neural Models
Abstract
Language models are critical to the advancement of natural language processing and general artificial intelligence. In this thesis, we address two key limitations of language models: exposure bias in Autoregressive (AR) models and label bias in Non-Autoregressive (NAR) models.
First, we address exposure bias in small AR models, such as standard Transformer models, by modifying the training strategy. We propose a mixed cross-entropy loss (mixed CE) that better reconciles the dynamics of training and testing, and we demonstrate consistent gains across multiple machine translation benchmarks.
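The core idea behind a mixed cross-entropy objective can be sketched as follows: interpolate the usual CE term on gold tokens with a CE term on the tokens the model itself would predict, so training partially conditions on test-time behavior. This is a minimal illustrative sketch, not the exact formulation from the thesis; the function name, the `mix` parameter, and the simple argmax choice of self-predicted tokens are assumptions for illustration.

```python
import numpy as np

def mixed_ce_loss(log_probs, gold_ids, mix=0.5):
    """Illustrative sketch of a mixed cross-entropy objective.

    log_probs: (T, V) per-position log-probabilities from the decoder
    gold_ids:  (T,) ground-truth token ids
    mix:       weight placed on the model's own (argmax) predictions

    Combines the standard CE term on gold tokens with a CE term on the
    model's most likely tokens, so the training signal partly reflects
    what the model would actually generate at test time.
    """
    positions = np.arange(len(gold_ids))
    gold_term = -log_probs[positions, gold_ids]        # standard CE
    pred_ids = log_probs.argmax(axis=1)                # model's own choices
    pred_term = -log_probs[positions, pred_ids]        # CE on self-predictions
    return float(np.mean((1.0 - mix) * gold_term + mix * pred_term))
```

With `mix=0.0` this reduces to ordinary cross-entropy; increasing `mix` shifts weight toward the model's own predictions.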
Next, we explore mitigating exposure bias in large language models such as Llama-1/2. Since altering the pre-training strategy is impractical, we instead enhance the finetuning strategy during instruction tuning. We introduce probabilistic and contextual ranking feedback from stronger teacher models, such as text-davinci-003 and GPT-4, to guide the student model toward better outputs, leading to improved performance on a variety of test tasks.
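One common way to turn a teacher's ranking of candidate responses into a training signal is a pairwise margin loss over the student's sequence scores. The sketch below is a generic illustration of this idea, not the specific objective from the thesis; the function name, the `margin` value, and the use of raw sequence log-probabilities as scores are assumptions.

```python
import numpy as np

def pairwise_ranking_loss(student_logps, teacher_ranking, margin=1.0):
    """Illustrative sketch of ranking-feedback finetuning.

    student_logps:   length-n array of the student's log-probabilities,
                     one per candidate response to the same instruction
    teacher_ranking: candidate indices ordered best-to-worst by a
                     stronger teacher model (e.g. GPT-4 as a judge)

    For every (better, worse) pair, the student is penalised unless it
    scores the better candidate at least `margin` higher.
    """
    loss, pairs = 0.0, 0
    for i, better in enumerate(teacher_ranking):
        for worse in teacher_ranking[i + 1:]:
            gap = student_logps[better] - student_logps[worse]
            loss += max(0.0, margin - gap)   # hinge on the score gap
            pairs += 1
    return loss / pairs
```

When the student already orders candidates as the teacher does, with gaps larger than the margin, the loss is zero; disagreements with the teacher's ranking incur a penalty proportional to how inverted the scores are.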
We then hypothesize that extremely large language models (ELLMs) like GPT-4 experience minimal exposure bias, since their model distribution is already closely aligned with the ground-truth data distribution. To test this hypothesis, we design a scalable method called GLAN that generates large-scale synthetic data using ELLMs, and we use this synthetic data to finetune open-source models. The significant performance improvements observed with models finetuned on GLAN-generated data support our hypothesis.
Finally, we discuss label bias in the state-of-the-art NAR architecture, the Directed Acyclic Transformer (DAT). We mitigate the label bias by proposing two variants of DAT, which show improved performance on translation tasks.
Looking forward, we anticipate that future AR models will exhibit negligible exposure bias due to the increasingly close alignment of model distributions with ground truth data. This alignment offers potential for generating new data to support continual training. However, new challenges, such as exposure bias in tool-using LLMs, highlight the need for ongoing research. Future efforts should also focus on developing more robust NAR models and exploring new model architectures to enhance generation speed and performance.
Speaker’s Profile
Haoran Li is currently a Ph.D. candidate in the Information Systems Technology and Design (ISTD) pillar at the Singapore University of Technology and Design (SUTD), advised by Prof. Wei Lu. Prior to his study at SUTD, he received his B.Eng. degree from the University of Electronic Science and Technology of China. His research interests include large language models, autoregressive and non-autoregressive modeling, and machine translation.