Methodology

A Two-Tier Evaluation Framework

Every AI-agent-generated document submitted to TheLegalBench undergoes a structured, two-tier assessment. The framework was developed in consultation with practising lawyers and is designed to capture the quality considerations that matter for legal work in private practice, in-house, and commercial contexts.

Tier 1

Gate Check — Pass / Fail

Before detailed scoring begins, each document is screened against three critical failure criteria. A document that fails any one of them is automatically disqualified from dimension scoring and classified RED.

Gate Check: Criteria

Hallucination: Does the document contain fabricated legal authorities (invented case law, non-existent statutes, or fictitious legal principles)?

Critical Legal Error: Does the document contain an error so fundamental that it would expose a client to serious legal risk if relied upon?

Wrong Document Type: Has the AI produced a document type that does not match the instruction given?
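
As an illustration only, the gate check can be sketched as a record of three yes/no flags, where any raised flag means the document fails. The class and field names below are hypothetical and are not part of the published framework.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GateCheckResult:
    """Hypothetical record of the three Tier 1 findings (True = problem found)."""

    hallucination: bool
    critical_legal_error: bool
    wrong_document_type: bool

    def failures(self) -> list[str]:
        # Names of the criteria the document failed, in declaration order.
        return [f.name for f in fields(self) if getattr(self, f.name)]

    @property
    def passed(self) -> bool:
        # The document proceeds to Tier 2 only if no criterion was failed.
        return not self.failures()


result = GateCheckResult(
    hallucination=False,
    critical_legal_error=True,
    wrong_document_type=False,
)
print(result.passed, result.failures())
```

Keeping the three findings separate, rather than storing a single pass/fail flag, preserves the reason a document was disqualified.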

Tier 2

Dimension Scoring — 1–5 Scale

Documents that pass the gate check are assessed across five dimensions, each scored independently on a 1–5 scale with detailed rubric descriptors.

Dimension: What It Measures

Accuracy & Appropriateness: Factual correctness, jurisdictional awareness, and calibration of scope to the instruction given.

Legal Soundness & Risk Management: Legal validity, enforceability of provisions, internal consistency, and adequacy of risk allocation.

Fit for Purpose: Whether the document addresses the stated brief, is contextually appropriate, and is practically usable.

Quality & Professionalism: Drafting quality, structural coherence, formatting, and appropriateness of language register.

Transparency & Communication: Clarity of reasoning, inclusion of appropriate caveats, and acknowledgement of limitations.

Classification

Traffic Light Classification

Dimension scores are mapped to an overall classification:

GREEN — Meets professional standard

All dimensions score 4 or above. The document meets a professional standard and could be used with minimal modification.

AMBER — Reasonable starting point

At least one dimension scores 3 and no dimension scores below 3. The document has identifiable weaknesses but is a reasonable starting point that requires editing.

RED — Fundamental problems

At least one dimension scores 2 or below, or the document failed a Tier 1 gate check. The document has fundamental problems that make it unsuitable for use.
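
The mapping above reduces to a rule on the lowest dimension score. The following is a minimal sketch, assuming whole-number scores from 1 to 5 and a single boolean for the Tier 1 outcome; the function name `classify` and its input shape are illustrative assumptions, not a published API.

```python
from typing import Mapping, Optional

# The five Tier 2 dimensions named in the framework.
DIMENSIONS = (
    "Accuracy & Appropriateness",
    "Legal Soundness & Risk Management",
    "Fit for Purpose",
    "Quality & Professionalism",
    "Transparency & Communication",
)


def classify(gate_passed: bool, scores: Optional[Mapping[str, int]] = None) -> str:
    """Return "GREEN", "AMBER", or "RED" for one document (hypothetical helper)."""
    if not gate_passed:
        # A Tier 1 failure is RED regardless of any dimension scores.
        return "RED"
    if scores is None or set(scores) != set(DIMENSIONS):
        raise ValueError("expected exactly one score per dimension")
    if any(not isinstance(s, int) or not 1 <= s <= 5 for s in scores.values()):
        raise ValueError("scores are assumed to be integers from 1 to 5")

    lowest = min(scores.values())
    if lowest >= 4:
        return "GREEN"  # all dimensions score 4 or above
    if lowest == 3:
        return "AMBER"  # at least one 3, nothing below 3
    return "RED"        # at least one dimension scores 2 or below


example = {name: 4 for name in DIMENSIONS}
example["Fit for Purpose"] = 3
print(classify(True, example))
```

Because the classification depends only on the weakest dimension, a single low score cannot be offset by high scores elsewhere.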

Rigour

Inter-Rater Reliability

To ensure consistent, defensible results, a proportion of documents are assessed by multiple evaluators independently. Agreement is measured using Cohen’s kappa, with a target threshold of ≥0.6. Evaluators participate in an initial calibration session to align interpretation of the scoring rubric before the main evaluation window opens.
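
For reference, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance. The sketch below computes the unweighted statistic for one pair of evaluators. Treating the comparison as pairwise and applying it to traffic-light labels are assumptions, since the framework does not state how more than two evaluators are combined or whether a weighted variant is used for the 1–5 scores.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same documents."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of documents")

    n = len(rater_a)
    # Observed agreement: share of documents given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout, so kappa is undefined;
        # this sketch reports it as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only, not real evaluation data.
evaluator_1 = ["GREEN", "GREEN", "AMBER", "AMBER"]
evaluator_2 = ["GREEN", "GREEN", "AMBER", "RED"]
kappa = cohens_kappa(evaluator_1, evaluator_2)
print(f"kappa = {kappa:.2f}, meets target: {kappa >= 0.6}")
```

In this made-up example the raters agree on three of four documents (observed 0.75) against a chance agreement of 0.375, which gives a kappa of 0.6.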

Apply

Ready to Apply This Framework?

We are recruiting a curated cohort of qualified lawyers to put this framework into practice.
