
LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction

Bo Zou1, Chao Yang2, Yu Qiao2, Chengbin Quan1, Youjian Zhao1,3
1Tsinghua University, 2Shanghai AI Laboratory, 3Zhongguancun Laboratory

Abstract

Existing methods for fine-tuning LLMs, such as Adapter, Prefix-tuning, and LoRA, introduce extra modules or additional input sequences to inject new skills or knowledge, and may compromise the innate abilities of LLMs.

In this paper, we propose LLaMA-Excitor, a lightweight method that stimulates the LLMs' potential to better follow instructions by gradually paying more attention to worthwhile information.

Specifically, LLaMA-Excitor does not directly change the intermediate hidden states during the self-attention computation of the Transformer. Instead, we design the Excitor block as a bypass module for the similarity-score computation in LLMs' self-attention, reconstructing keys and changing the importance of values through a set of learnable prompts. LLaMA-Excitor ensures a self-adaptive allocation of additional attention to input instructions, thus effectively preserving LLMs' pre-trained knowledge when fine-tuning on low-quality instruction-following datasets.
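For intuition, the bypass can be written (in our own notation, not necessarily the paper's exact formulation) as keeping the values untouched and only biasing the similarity scores:

\mathrm{Attn}(x) = \mathrm{softmax}\!\left(\tfrac{QK^{\top}}{\sqrt{d}}\right)V \;\longrightarrow\; \mathrm{softmax}\!\left(\tfrac{QK^{\top}}{\sqrt{d}} + g\cdot\tfrac{Q\tilde{K}^{\top}}{\sqrt{d}}\right)V

where \tilde{K} denotes keys reconstructed from the learnable prompts and g is a learnable gate (zero-initialized in this sketch so tuning starts from the frozen model). Because V and the output projection are unchanged, the output remains a combination of the original values.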

Furthermore, we unify the modeling of multi-modal tuning and language-only tuning, extending LLaMA-Excitor to a powerful visual instruction follower without the need for complex multi-modal alignment.

Our approach is evaluated in language-only and multi-modal tuning experimental scenarios. Compared with LLaMA-7B, LLaMA-Excitor is the only PEFT method that maintains basic capabilities while achieving a +3.12% relative improvement on the MMLU benchmark. In visual instruction tuning, we achieve a new state-of-the-art image-captioning performance of 157.5 CIDEr on MSCOCO, and performance on ScienceQA (88.39%) comparable to cutting-edge models with more parameters and extensive vision-language pre-training.

Pipeline

LLaMA-Excitor involves learnable information in the reasoning process only indirectly, by changing the similarity matrices, which keeps the hidden states within the original distribution of LLaMA.

Pipeline of LLaMA-Excitor
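The following PyTorch sketch illustrates this score-only bypass under our own assumptions (single head, no causal mask; the names ExcitorAttention, num_prompts, and gate, and the prompt-based key reconstruction via a small cross-attention, are illustrative rather than the official implementation):

import math
import torch
import torch.nn as nn

class ExcitorAttention(nn.Module):
    def __init__(self, dim, num_prompts=16):
        super().__init__()
        # Frozen pre-trained projections (stand-ins for LLaMA's own weights).
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        for proj in (self.q_proj, self.k_proj, self.v_proj):
            proj.weight.requires_grad = False
        # Learnable prompts used to reconstruct keys in the bypass branch.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        # Zero-initialized gate: training starts exactly at the frozen model.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # x: (batch, seq_len, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        d = q.size(-1)
        # Original similarity scores; the values v are never modified.
        scores = q @ k.transpose(-2, -1) / math.sqrt(d)
        # Bypass: reconstruct keys from the learnable prompts by letting each
        # original key attend to the prompt set (our illustrative choice).
        w = (k @ self.prompts.t() / math.sqrt(d)).softmax(dim=-1)  # (B, T, P)
        k_tilde = w @ self.prompts                                 # (B, T, dim)
        bypass = q @ k_tilde.transpose(-2, -1) / math.sqrt(d)      # (B, T, T)
        # Only the similarity matrix changes; outputs remain combinations
        # of the original values.
        attn = (scores + self.gate * bypass).softmax(dim=-1)
        return attn @ v

# Example: y = ExcitorAttention(dim=512)(torch.randn(2, 8, 512))

Only the prompts and the gate are trainable in this sketch, which keeps the number of tuned parameters small.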

Multi-modal Extension

LLaMA-Excitor uniformly models multi-modal and language-only tuning and extends language models into powerful vision-language models in a low-budget way.

Multi-modal Extension
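One low-budget way to realize this unification, sketched below under our own assumptions (the module name UnifiedPromptBuilder, the use of frozen image-encoder patch features, and the shapes are illustrative, not the paper's exact design), is to project visual features into the same prompt space and hand them to the Excitor bypass alongside the learnable prompts; language-only tuning is simply the case with no visual features:

import torch
import torch.nn as nn

class UnifiedPromptBuilder(nn.Module):
    def __init__(self, dim=4096, visual_dim=1024, num_prompts=16):
        super().__init__()
        # Learnable prompts shared by language-only and multi-modal tuning.
        self.learnable_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        # Lightweight trainable projection from frozen image-encoder features.
        self.visual_to_prompt = nn.Linear(visual_dim, dim, bias=False)

    def forward(self, visual_feats=None):
        # visual_feats: (num_patches, visual_dim) from a frozen image encoder,
        # or None for language-only tuning. Both cases return prompts of the
        # same form, which the Excitor bypass uses to reconstruct keys.
        prompts = self.learnable_prompts
        if visual_feats is not None:
            prompts = torch.cat([prompts, self.visual_to_prompt(visual_feats)], dim=0)
        return prompts

Because the visual features enter only through this prompt path in the sketch, no separate vision-language alignment stage or update of the frozen LLM weights is needed.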

Qualitative Results

Textual Instruction-Following

LLaMA-Excitor provides more details and better character identification.

textual instruction-following

Visual Instruction-Following

LLaMA-Excitor accurately covers the content of human annotations and provides richer details.

visual instruction-following

Quantitative Results

Image Captioning on MSCOCO

LLaMA-Excitor significantly surpasses the previous SOTA!

image captioning on MSCOCO

Question Answering on ScienceQA

LLaMA-Excitor achieves performance comparable to SOTA models, even though it avoids heavy vision-language alignment pre-training and large-scale updates of LLM parameters.

question answering on ScienceQA

BibTeX

@inproceedings{zou2024llamaexcitor,
  title={{LLaMA}-Excitor: General Instruction Tuning via Indirect Feature Interaction},
  author={Zou, Bo and Yang, Chao and Qiao, Yu and Quan, Chengbin and Zhao, Youjian},
  booktitle={Conference on Computer Vision and Pattern Recognition 2024},
  year={2024},
  url={https://openreview.net/forum?id=e7lM7tGXC9}
}