Overview

IFEval is an evaluation benchmark that measures how well language models follow “verifiable instructions.” These are instructions whose satisfaction can be checked objectively and programmatically, such as:

  • “Write in more than 400 words.”
  • “Mention the keyword ‘AI’ at least three times.”
  • “Write 450 to 500 words.”
  • “Output should be in JSON format.”
  • “Include a title enclosed in double square brackets, e.g., [[ title ]].”
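Instructions like these can be verified with simple deterministic checks. The sketch below is illustrative only; the function names and thresholds are assumptions, not IFEval's actual verifier code.

```python
import json
import re

# Hypothetical checkers mirroring the example instructions above.

def check_min_words(response: str, min_words: int = 400) -> bool:
    # "Write in more than 400 words."
    return len(response.split()) > min_words

def check_keyword_frequency(response: str, keyword: str = "AI",
                            min_count: int = 3) -> bool:
    # "Mention the keyword 'AI' at least three times."
    return len(re.findall(re.escape(keyword), response)) >= min_count

def check_json_format(response: str) -> bool:
    # "Output should be in JSON format."
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def check_title(response: str) -> bool:
    # "Include a title enclosed in double square brackets."
    return re.search(r"\[\[.+?\]\]", response) is not None
```

Because every check is a pure function of the response text, evaluation needs no human judgment or LLM grader.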

Dataset Construction

  1. Base Prompt Generation

    • Prompts are generated with one to three randomly selected verifiable instructions appended at the end.
  2. Illogical Prompt Removal

    • Few-shot prompting is used to detect and remove illogical prompts.
  3. Diversity Enhancement

    • Another few-shot prompting method is applied to rephrase prompts for increased linguistic diversity.
  4. Manual Verification

    • Rephrased prompts are manually checked and edited to ensure correctness.
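Step 1 above can be sketched as follows; the instruction catalogue and helper name are illustrative assumptions, not the benchmark's actual construction code.

```python
import random

# Example instruction pool (a tiny stand-in for the real catalogue).
INSTRUCTIONS = [
    "Write in more than 400 words.",
    "Mention the keyword 'AI' at least three times.",
    "Output should be in JSON format.",
    "Include a title enclosed in double square brackets, e.g., [[ title ]].",
]

def build_prompt(base_prompt: str, rng: random.Random) -> str:
    # Append 1-3 randomly selected verifiable instructions to the base prompt.
    k = rng.randint(1, 3)
    chosen = rng.sample(INSTRUCTIONS, k)
    return base_prompt.rstrip() + " " + " ".join(chosen)

prompt = build_prompt("Summarize the history of aviation.", random.Random(0))
```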

Evaluation Metrics

  • Prompt-level strict-accuracy: Percentage of prompts where all verifiable instructions are followed.
  • Instruction-level strict-accuracy: Percentage of individual verifiable instructions, pooled across all prompts, that are followed.
  • Prompt-level loose-accuracy: Prompt-level accuracy computed with a relaxed criterion, in which the response is first transformed (e.g., stripping markdown markers or removing the opening and closing lines) and counts as compliant if any variant passes.
  • Instruction-level loose-accuracy: Instruction-level accuracy computed with the same relaxed criterion.
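The prompt-level versus instruction-level distinction can be made concrete with a small sketch. The data shape (one list of per-instruction booleans per prompt) is an assumption for illustration:

```python
def prompt_level_accuracy(results: list[list[bool]]) -> float:
    # Fraction of prompts whose instructions were ALL followed.
    return sum(all(r) for r in results) / len(results)

def instruction_level_accuracy(results: list[list[bool]]) -> float:
    # Fraction of individual instructions followed, pooled across prompts.
    flat = [ok for r in results for ok in r]
    return sum(flat) / len(flat)

# Example: 3 prompts with 2, 1, and 3 attached instructions respectively.
strict = [[True, False], [True], [True, True, True]]
print(prompt_level_accuracy(strict))       # 2/3: one prompt misses an instruction
print(instruction_level_accuracy(strict))  # 5/6: five of six instructions pass
```

The prompt-level metric is the stricter of the two, since a single missed instruction fails the whole prompt.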

For more details, see the IFEval paper (arXiv:2311.07911).