Meta's New Self-Taught Evaluator: Revolutionizing AI Model Evaluation with Synthetic Data

TapTechNews, August 7 news: to reduce natural language processing (NLP) technology's reliance on human annotations when evaluating AI models, Meta has launched the Self-Taught Evaluator, which uses synthetic data to train AI.

NLP Technology Challenges

Advances in NLP technology enable large language models (LLMs) to perform complex language-related tasks accurately and to support more natural human-computer interaction.

However, a major challenge facing NLP technology today is that model evaluation relies heavily on human annotation.

Human-annotated data is crucial for training and validating models, but collecting it is both costly and time-consuming. Moreover, as models improve, previously collected annotations may need to be updated, making them less effective for evaluating newer models.

Current evaluation methods typically involve collecting large numbers of human preference judgments over model responses. Other approaches use automatic metrics for tasks with reference answers, or classifiers that directly output scores.

All of these methods have limitations, especially in complex scenarios such as creative writing or coding, where multiple valid answers may exist; this leads to high variance and makes human judgment expensive.

Self-Taught Evaluator

The Meta FAIR team has introduced a new approach called the Self-Taught Evaluator, which requires no human annotations and instead trains on synthetic data.

The process starts with a seed model, which generates contrasting synthetic preference pairs. The model then judges these pairs and iteratively improves, using its own judgments as training data in subsequent rounds. This approach exploits the model's ability to both generate and evaluate data, greatly reducing reliance on human annotation.
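As a rough illustration, here is a minimal Python sketch of how one such contrasting pair could be constructed. The `complete()` helper is a hypothetical stand-in for a call to the seed LLM; it is not part of Meta's release, and the exact prompts used by Meta may differ.

```python
# Minimal sketch of building one contrasting synthetic preference pair.
# `complete(prompt)` is a hypothetical stand-in for a seed-LLM call.

def complete(prompt: str) -> str:
    """Placeholder for a seed-LLM call; replace with a real inference client."""
    raise NotImplementedError("wire this up to your LLM of choice")

def make_preference_pair(instruction: str) -> dict:
    # 1. Baseline ("chosen") response to the original instruction.
    chosen = complete(instruction)

    # 2. Perturb the instruction so the model answers a subtly different,
    #    related request; its answer is then a plausible but lower-quality
    #    ("rejected") response to the ORIGINAL instruction.
    perturbed = complete(
        "Write a slightly modified version of this instruction that asks for "
        "something related but not identical:\n" + instruction
    )
    rejected = complete(perturbed)

    return {"instruction": instruction, "chosen": chosen, "rejected": rejected}
```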

TapTechNews summarizes the key steps as follows:

1. Use the seed LLM to generate a baseline response for a given instruction.

2. Create a modified version of the instruction and prompt the LLM with it, producing a new response of lower quality than the original.

These paired responses form the basis of the training data, and the Self-Taught Evaluator, acting as an LLM-as-a-Judge, generates reasoning traces and judgments for each pair.

By repeating this process, the model steadily improves the accuracy of its judgments using self-generated and self-evaluated data, forming a self-improving cycle (see the sketch below).
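The loop below sketches this self-improvement cycle under the same assumptions as the earlier snippet. `judge()` and `finetune()` are hypothetical placeholders, not Meta's actual training code, and the filtering rule shown is a simplified reading of the method.

```python
# Sketch of the self-improvement loop.
# `judge(model, instruction, a, b)` asks the current evaluator to produce a
# reasoning trace plus a verdict ("A" or "B"); `finetune(model, examples)`
# performs supervised fine-tuning. Both are hypothetical stand-ins.

def judge(model, instruction, response_a, response_b):
    """Placeholder: return (reasoning_trace, verdict) from the current model."""
    raise NotImplementedError

def finetune(model, examples):
    """Placeholder: fine-tune the model on (pair, reasoning, verdict) examples."""
    raise NotImplementedError

def self_taught_evaluator(model, preference_pairs, num_iterations=3):
    for _ in range(num_iterations):
        kept = []
        for pair in preference_pairs:
            trace, verdict = judge(
                model, pair["instruction"], pair["chosen"], pair["rejected"]
            )
            # Keep only judgments that pick the known-better ("chosen") response,
            # so the next round trains on reasoning traces with correct verdicts.
            if verdict == "A":
                kept.append({"pair": pair, "reasoning": trace, "verdict": verdict})
        model = finetune(model, kept)  # the next iteration uses the improved judge
    return model
```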

Results

The Meta FAIR team tested the Self-Taught Evaluator on the Llama-3-70B-Instruct model, raising accuracy on the RewardBench benchmark from 75.4 to 88.7. This matches or exceeds models trained with human annotations and outperforms common LLM judges such as GPT-4.

This significant improvement demonstrates the effectiveness of synthetic data for strengthening model evaluation. The researchers also ran multiple training iterations to further improve the model's performance.
