A Two-Tier Evaluation Framework
Every AI-agent-generated document submitted to TheLegalBench undergoes a structured, two-tier assessment: a pass/fail gate check followed by dimension scoring. The framework was developed in consultation with practising lawyers and is designed to capture the quality considerations that matter for legal work in private practice, in-house, and commercial contexts.
Gate Check — Pass / Fail
Before detailed scoring begins, each document is screened against three critical failure criteria. A document that fails at this stage is automatically excluded from dimension scoring and classified RED.
Dimension Scoring — 1–5 Scale
Documents that pass the gate check are assessed across five dimensions, each scored independently on a 1–5 scale with detailed rubric descriptors.
Traffic Light Classification
Dimension scores are mapped to an overall classification:
GREEN — Meets professional standard
All dimensions score 4 or above. The document meets a professional standard and could be used with minimal modification.
AMBER — Reasonable starting point
At least one dimension scores 3 and none scores below 3. The document has identifiable weaknesses but is a reasonable starting point that requires editing.
RED — Fundamental problems
At least one dimension scores 2 or below, or the document failed the Tier 1 gate check. The document has fundamental problems that make it unsuitable for use.
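The mapping above can be sketched in a few lines of Python. This is an illustrative sketch only, not TheLegalBench's actual tooling: the function name `classify` and its parameters are hypothetical, and it assumes integer scores for exactly five dimensions.

```python
from typing import Optional, Sequence

NUM_DIMENSIONS = 5  # the framework scores five dimensions


def classify(scores: Optional[Sequence[int]], passed_gate: bool = True) -> str:
    """Map dimension scores to a traffic light classification.

    Hypothetical helper: the name and signature are assumptions made
    for illustration, not part of the published framework.
    """
    # A gate-check failure is RED and the document is never dimension-scored.
    if not passed_gate:
        return "RED"
    if scores is None or len(scores) != NUM_DIMENSIONS:
        raise ValueError(f"expected {NUM_DIMENSIONS} dimension scores")
    if any(not 1 <= s <= 5 for s in scores):
        raise ValueError("each score must be between 1 and 5")

    # The classification depends only on the weakest dimension.
    lowest = min(scores)
    if lowest >= 4:
        return "GREEN"
    if lowest == 3:
        return "AMBER"
    return "RED"


print(classify([4, 5, 4, 4, 5]))           # GREEN
print(classify([4, 3, 5, 4, 4]))           # AMBER
print(classify([5, 5, 2, 5, 5]))           # RED
print(classify(None, passed_gate=False))   # RED
```

Because the result depends only on the lowest score, a single weak dimension caps the overall classification regardless of how well the others score.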
Inter-Rater Reliability
To ensure consistent, defensible results, a proportion of documents are assessed by multiple evaluators independently. Agreement is measured using Cohen’s kappa, with a target threshold of ≥0.6. Evaluators participate in an initial calibration session to align interpretation of the scoring rubric before the main evaluation window opens.
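For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement between two raters, p_o, with the agreement expected by chance, p_e, as kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for one pair of evaluators. It assumes, purely for illustration, that agreement is measured on the traffic light classification. The page does not say whether kappa is computed on classifications or on individual dimension scores. The function name and the sample ratings are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over all labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is
        # formally undefined, so report perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up ratings for ten documents (illustrative data only).
evaluator_1 = ["GREEN"] * 4 + ["AMBER"] * 3 + ["RED"] * 3
evaluator_2 = ["GREEN"] * 3 + ["AMBER"] * 3 + ["RED"] * 4

kappa = cohens_kappa(evaluator_1, evaluator_2)
print(f"kappa = {kappa:.2f}, meets 0.6 target: {kappa >= 0.6}")
```

In this made-up example the evaluators agree on 8 of 10 documents (p_o = 0.80) and chance agreement is 0.33, giving a kappa of about 0.70, which clears the 0.6 target. The unweighted form treats every disagreement equally; a weighted variant that penalises GREEN versus RED more than GREEN versus AMBER is a common alternative for ordered categories.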
Evaluation Scorecard
A redacted sample evaluation scorecard will appear here before launch.
Ready to Apply This Framework?
We are recruiting a curated cohort of qualified lawyers to put this framework into practice.