Methodology

A Two-Tier Evaluation Framework

Every AI-agent-generated document submitted to TheLegalBench undergoes a structured, two-tier assessment. The framework was developed in consultation with practising lawyers and is designed to capture the full spectrum of quality considerations relevant to legal work in private-practice, in-house, and commercial contexts.

Tier 1

Gate Check — Pass / Fail

Before detailed scoring begins, each document is screened against three critical failure criteria. A failure at this stage results in automatic disqualification from dimension scoring.

Gate Check: Criteria
Hallucination: Does the document contain fabricated legal authorities (invented case law, non-existent statutes, or fictitious legal principles)?
Critical Legal Error: Does the document contain an error so fundamental that it would expose a client to serious legal risk if relied upon?
Wrong Document Type: Has the AI produced a document type that does not match the instruction given?
Tier 2

Dimension Scoring — 1–5 Scale

Documents that pass the gate check are assessed across five dimensions, each scored independently on a 1–5 scale with detailed rubric descriptors.

Dimension: What It Measures
Accuracy & Appropriateness: Factual correctness, jurisdictional awareness, and calibration of scope to the instruction given.
Legal Soundness & Risk Management: Legal validity, enforceability of provisions, internal consistency, and adequacy of risk allocation.
Fit for Purpose: Whether the document addresses the stated brief, is contextually appropriate, and is practically usable.
Quality & Professionalism: Drafting quality, structural coherence, formatting, and appropriateness of language register.
Transparency & Communication: Clarity of reasoning, inclusion of appropriate caveats, and acknowledgement of limitations.
Classification

Traffic Light Classification

Dimension scores are mapped to an overall classification:

GREEN: Meets professional standard

All dimensions score 4 or above. The document meets a professional standard and could be used with minimal modification.

AMBER: Reasonable starting point

At least one dimension scores 3, and no dimension scores below 3. The document has identifiable weaknesses but is a reasonable starting point that requires editing.

RED: Fundamental problems

Any dimension scores 2 or below, or the document failed a Tier 1 gate check. The document has fundamental problems that make it unsuitable for use.
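
The mapping from gate checks and dimension scores to a classification can be sketched in a few lines of Python. This is an illustrative sketch only, not TheLegalBench's actual tooling: the names `DIMENSIONS`, `TrafficLight`, and `classify` are hypothetical, and the code assumes whole-number scores, which the rubric's 1–5 scale suggests but does not state outright.

```python
from enum import Enum

# Dimension names as listed in the Tier 2 table.
DIMENSIONS = (
    "Accuracy & Appropriateness",
    "Legal Soundness & Risk Management",
    "Fit for Purpose",
    "Quality & Professionalism",
    "Transparency & Communication",
)


class TrafficLight(Enum):
    GREEN = "Meets professional standard"
    AMBER = "Reasonable starting point"
    RED = "Fundamental problems"


def classify(scores, gate_failures=()):
    """Map Tier 1 and Tier 2 results to a traffic-light classification.

    scores: dict of dimension name -> integer score from 1 to 5 (assumed).
    gate_failures: names of any failed Tier 1 gate checks.
    """
    # A failed gate check means the document is never dimension-scored.
    if gate_failures:
        return TrafficLight.RED

    if set(scores) != set(DIMENSIONS):
        raise ValueError("scores must cover exactly the five dimensions")
    for name, score in scores.items():
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{name}: score must be an integer from 1 to 5")

    # The lowest dimension score alone decides the classification.
    lowest = min(scores.values())
    if lowest <= 2:
        return TrafficLight.RED
    if lowest == 3:
        return TrafficLight.AMBER
    return TrafficLight.GREEN
```

Because the classification depends only on the weakest dimension, a document scoring 5 on four dimensions and 3 on the fifth is still AMBER. High scores elsewhere do not offset a weak dimension.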

Rigour

Inter-Rater Reliability

To ensure consistent, defensible results, a proportion of documents are assessed by multiple evaluators independently. Agreement is measured using Cohen’s kappa, with a target threshold of ≥0.6. Evaluators participate in an initial calibration session to align interpretation of the scoring rubric before the main evaluation window opens.
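
For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes the unweighted form for two evaluators. The function name `cohens_kappa` and the sample labels are hypothetical. The framework does not specify whether kappa is calculated on traffic-light classifications or on individual dimension scores, or whether a weighted variant is used, so treat those choices as assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")

    n = len(rater_a)

    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")

    return (observed - expected) / (1 - expected)


# Made-up labels for eight documents, for illustration only.
evaluator_1 = ["GREEN", "AMBER", "RED", "AMBER", "GREEN", "RED", "AMBER", "GREEN"]
evaluator_2 = ["GREEN", "AMBER", "RED", "GREEN", "GREEN", "RED", "AMBER", "AMBER"]

kappa = cohens_kappa(evaluator_1, evaluator_2)
print(f"kappa = {kappa:.3f}, meets 0.6 target: {kappa >= 0.6}")
```

In this made-up sample the evaluators agree on six of eight documents (p_o = 0.75) and chance agreement is 22/64, giving a kappa of about 0.619, just above the 0.6 target.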

Sample

Evaluation Scorecard

A redacted sample evaluation scorecard will appear here before launch.

Apply

Ready to Apply This Framework?

We are recruiting a curated cohort of qualified lawyers to put this framework into practice.
