Overview
IFEval is an evaluation benchmark designed to assess compliance with “verifiable instructions.” These are instructions that can be objectively verified, such as:
- “Write in more than 400 words.”
- “Mention the keyword ‘AI’ at least three times.”
- “Write 450 to 500 words.”
- “Output should be in JSON format.”
- “Include a title enclosed in double square brackets, e.g., [[ title ]].”
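Instructions like these can be verified with simple deterministic checks. The sketch below is illustrative only; the function names and exact rules are hypothetical, not the benchmark's actual implementation.

```python
import json
import re

# Hypothetical per-instruction verifiers; names and rules are illustrative.

def check_min_words(response: str, n: int) -> bool:
    # "Write in more than N words."
    return len(response.split()) > n

def check_keyword_frequency(response: str, keyword: str, n: int) -> bool:
    # "Mention the keyword K at least N times."
    return response.lower().count(keyword.lower()) >= n

def check_json_format(response: str) -> bool:
    # "Output should be in JSON format."
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def check_title(response: str) -> bool:
    # "Include a title enclosed in double square brackets."
    return re.search(r"\[\[.+?\]\]", response) is not None
```

Each check returns a boolean, so a response can be scored against every instruction attached to its prompt.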
Dataset Construction
- Base Prompt Generation: Prompts are generated with one to three randomly selected verifiable instructions appended at the end.
- Illogical Prompt Removal: Few-shot prompting is used to detect and remove illogical prompts.
- Diversity Enhancement: Another few-shot prompting method is applied to rephrase prompts for increased linguistic diversity.
- Manual Verification: Rephrased prompts are manually checked and edited to ensure correctness.
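The first step of the pipeline can be sketched as below. The instruction pool and function name are hypothetical; the point is only the "append one to three randomly selected verifiable instructions" mechanic.

```python
import random

# Hypothetical pool of verifiable instructions (examples from above).
INSTRUCTIONS = [
    "Write in more than 400 words.",
    "Mention the keyword 'AI' at least three times.",
    "Output should be in JSON format.",
    "Include a title enclosed in double square brackets, e.g., [[ title ]].",
]

def build_prompt(base: str, rng: random.Random) -> str:
    # Append one to three randomly selected verifiable instructions.
    k = rng.randint(1, 3)
    chosen = rng.sample(INSTRUCTIONS, k)
    return base + " " + " ".join(chosen)
```

The later steps (illogical-prompt filtering, rephrasing, manual review) involve LLM few-shot prompting and human judgment, so they are not sketched here.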
Evaluation Metrics
- Prompt-level strict-accuracy: Percentage of prompts in which every verifiable instruction is followed.
- Instruction-level strict-accuracy: Percentage of individual verifiable instructions, across all prompts, that are followed.
- Prompt-level loose-accuracy: Prompt-level accuracy computed with a relaxed criterion that also accepts transformed variants of the response (e.g., with markdown markers or a leading/trailing line removed).
- Instruction-level loose-accuracy: Instruction-level accuracy computed with the same relaxed criterion.
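The two strict metrics can be computed from per-instruction pass/fail results. The data below is hypothetical and only shows the aggregation.

```python
# Hypothetical pass/fail results: one inner list per prompt, one boolean
# per verifiable instruction attached to that prompt.
results = [
    [True, True],         # prompt 1: both instructions followed
    [True, False, True],  # prompt 2: one instruction missed
    [False],              # prompt 3: instruction missed
]

# Prompt-level strict-accuracy: a prompt counts only if ALL its
# instructions are followed.
prompt_level = sum(all(r) for r in results) / len(results)

# Instruction-level strict-accuracy: fraction of instructions followed,
# pooled across all prompts.
instruction_level = sum(sum(r) for r in results) / sum(len(r) for r in results)

print(prompt_level)       # 1 of 3 prompts fully followed
print(instruction_level)  # 4 of 6 instructions followed
```

The loose variants would use the same aggregation, but each boolean would be True if any relaxed variant of the response passes the check.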
For more details, see the paper (arXiv:2311.07911).
