Overview

IFEval is an evaluation benchmark that measures how well language models follow “verifiable instructions.” These are instructions whose satisfaction can be checked objectively and programmatically, such as:

  • “Write in more than 400 words.”
  • “Mention the keyword ‘AI’ at least three times.”
  • “Write 450 to 500 words.”
  • “Output should be in JSON format.”
  • “Include a title enclosed in double square brackets, e.g., [[ title ]].”
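Instructions like these can be verified with simple deterministic checks. The sketch below is illustrative only; the function names and thresholds are assumptions, not IFEval's actual verifier code.

```python
import json
import re

# Hypothetical checkers mirroring the example instructions above.

def check_min_words(response: str, min_words: int = 400) -> bool:
    # "Write in more than 400 words."
    return len(response.split()) > min_words

def check_keyword_frequency(response: str, keyword: str = "AI",
                            min_count: int = 3) -> bool:
    # "Mention the keyword 'AI' at least three times."
    return len(re.findall(re.escape(keyword), response)) >= min_count

def check_json_format(response: str) -> bool:
    # "Output should be in JSON format."
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def check_title(response: str) -> bool:
    # "Include a title enclosed in double square brackets."
    return re.search(r"\[\[.+?\]\]", response) is not None
```

Because every check is a pure function of the response text, evaluation needs no human judgment or LLM grader.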

Dataset Construction

  1. Base Prompt Generation

    • Prompts are generated with one to three randomly selected verifiable instructions appended at the end.
  2. Illogical Prompt Removal

    • Few-shot prompting is used to detect and remove illogical prompts.
  3. Diversity Enhancement

    • Another few-shot prompting method is applied to rephrase prompts for increased linguistic diversity.
  4. Manual Verification

    • Rephrased prompts are manually checked and edited to ensure correctness.
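Step 1 above can be sketched as follows; the instruction catalogue and helper name are illustrative assumptions, not the benchmark's actual construction code.

```python
import random

# Example instruction pool (a tiny stand-in for the real catalogue).
INSTRUCTIONS = [
    "Write in more than 400 words.",
    "Mention the keyword 'AI' at least three times.",
    "Output should be in JSON format.",
    "Include a title enclosed in double square brackets, e.g., [[ title ]].",
]

def build_prompt(base_prompt: str, rng: random.Random) -> str:
    # Append 1-3 randomly selected verifiable instructions to the base prompt.
    k = rng.randint(1, 3)
    chosen = rng.sample(INSTRUCTIONS, k)
    return base_prompt.rstrip() + " " + " ".join(chosen)

prompt = build_prompt("Summarize the history of aviation.", random.Random(0))
```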

Evaluation Metrics

  • Prompt-level strict-accuracy: Percentage of prompts where all verifiable instructions are followed.
  • Instruction-level strict-accuracy: Percentage of individual verifiable instructions, pooled across all prompts, that are followed.
  • Prompt-level loose-accuracy: Prompt-level accuracy computed with a relaxed criterion, in which the response is first transformed (e.g., stripping markdown markers or removing the opening and closing lines) and counts as compliant if any variant passes.
  • Instruction-level loose-accuracy: Instruction-level accuracy computed with the same relaxed criterion.
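The prompt-level versus instruction-level distinction can be made concrete with a small sketch. The data shape (one list of per-instruction booleans per prompt) is an assumption for illustration:

```python
def prompt_level_accuracy(results: list[list[bool]]) -> float:
    # Fraction of prompts whose instructions were ALL followed.
    return sum(all(r) for r in results) / len(results)

def instruction_level_accuracy(results: list[list[bool]]) -> float:
    # Fraction of individual instructions followed, pooled across prompts.
    flat = [ok for r in results for ok in r]
    return sum(flat) / len(flat)

# Example: 3 prompts with 2, 1, and 3 attached instructions respectively.
strict = [[True, False], [True], [True, True, True]]
print(prompt_level_accuracy(strict))       # 2/3: one prompt misses an instruction
print(instruction_level_accuracy(strict))  # 5/6: five of six instructions pass
```

The prompt-level metric is the stricter of the two, since a single missed instruction fails the whole prompt.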

For more details, see the IFEval paper (arXiv:2311.07911).