We introduce ALaRM, the first framework modeling hierarchical rewards in reinforcement learning from human feedback (RLHF), which is designed to enhance the alignment of large language models (LLMs) with human preferences.
The framework addresses the limitations of current alignment approaches, which often struggle with the inconsistency and sparsity of human supervision signals, by integrating holistic rewards with aspect-specific rewards. This integration enables more precise and consistent guidance of language models toward desired outcomes, particularly in complex and open-ended text generation tasks. By employing a methodology that filters and combines multiple rewards based on their consistency, the framework provides a reliable mechanism for improving model alignment.
We validate our approach through applications in long-form question answering and machine translation tasks, employing gpt-3.5-turbo for pairwise comparisons, and demonstrate improvements over existing baselines. Our work underscores the effectiveness of hierarchical rewards modeling in refining LLM training processes for better human preference alignment.
Human oversight capabilities are finite, and demonstrations or preferences become noisy as tasks grow more complicated. We therefore ask: how can we obtain reliable and scalable supervision signals within limited human oversight capabilities?
As shown in the figure, the shaded "superior area" aligns better with human preference but is hard to reach with a noisy holistic reward alone. We propose to utilize multiple rewards hierarchically to obtain more accurate and consistent supervision signals.
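To make the hierarchical idea concrete, the sketch below shows one plausible way to combine a holistic reward with aspect-specific rewards. It is a minimal illustration, not the official ALaRM implementation: the threshold gating, the parameter names (`tau`, `aspect_weights`), and the example reward values are assumptions made for exposition.

```python
# Minimal sketch of hierarchical reward combination (illustrative, not the
# paper's released code). Assumption: the holistic reward is the primary
# signal, and weighted aspect-specific rewards are only added once the
# holistic reward clears a threshold `tau`.
from typing import Dict


def hierarchical_reward(
    holistic: float,
    aspect_rewards: Dict[str, float],
    aspect_weights: Dict[str, float],
    tau: float = 0.0,
) -> float:
    """Combine a holistic reward with aspect-specific rewards hierarchically."""
    # Below the threshold, rely on the holistic reward alone so that noisy
    # aspect signals cannot dominate clearly poor generations.
    if holistic < tau:
        return holistic
    # Above the threshold, add weighted aspect-specific rewards to push
    # generations further toward the "superior area".
    bonus = sum(aspect_weights[name] * value for name, value in aspect_rewards.items())
    return holistic + bonus


# Example usage with hypothetical aspect names, weights, and values.
reward = hierarchical_reward(
    holistic=0.7,
    aspect_rewards={"factuality": 0.4, "readability": 0.2},
    aspect_weights={"factuality": 0.5, "readability": 0.3},
    tau=0.5,
)
print(reward)  # 0.7 + 0.5 * 0.4 + 0.3 * 0.2 = 0.96
```

The gating reflects the motivation above: when the holistic signal already indicates a weak output, aspect-specific bonuses are withheld rather than allowed to inflate its reward.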
While the main results support the significance of Combination and Hierarchical Structure, we conduct extensive experiments to examine how Reward Selection affects ALaRM.
As shown in the table, the proactively selected rewards yield the best performance as evaluated by both the holistic reward and gpt-3.5-turbo, demonstrating the effectiveness of Reward Selection.
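For intuition, the sketch below shows one hypothetical way to select aspect-specific rewards by consistency: keep only those that usually agree with the preference order implied by the holistic reward (or by human labels) on a set of response pairs. The function names, pair construction, and agreement threshold are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of consistency-based reward selection (illustrative).
# Assumption: "consistency" is measured as the fraction of preference pairs
# (preferred_response, dispreferred_response) on which an aspect-specific
# reward ranks the preferred response higher.
from typing import Callable, Dict, List, Tuple

PreferencePair = Tuple[str, str]  # (preferred_response, dispreferred_response)
RewardFn = Callable[[str], float]


def select_consistent_rewards(
    aspect_reward_fns: Dict[str, RewardFn],
    pairs: List[PreferencePair],
    min_agreement: float = 0.6,
) -> List[str]:
    """Keep aspect rewards whose pairwise ranking agrees often enough with the reference preference."""
    selected = []
    for name, reward_fn in aspect_reward_fns.items():
        agree = sum(reward_fn(preferred) > reward_fn(dispreferred) for preferred, dispreferred in pairs)
        if agree / len(pairs) >= min_agreement:
            selected.append(name)
    return selected


# Example usage with toy reward functions on toy pairs.
toy_pairs = [("good answer", "bad"), ("detailed reply", "reply")]
toy_rewards = {"length": lambda text: float(len(text)), "constant": lambda text: 0.0}
print(select_consistent_rewards(toy_rewards, toy_pairs))  # ['length']
```

Under this reading, rewards that frequently contradict the reference preference are filtered out before combination, which is consistent with the ablation finding that proactive selection outperforms using unfiltered rewards.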
@article{lai2024alarm,
title={ALaRM: Align Language Models via Hierarchical Rewards Modeling},
author={Lai, Yuhang and Wang, Siyuan and Liu, Shujun and Huang, Xuanjing and Wei, Zhongyu},
journal={arXiv preprint arXiv:2403.06754},
year={2024}
}