A partner asks two associates to independently research whether a non-compete clause is enforceable. Each queries the firm’s AI tool. One associate receives case law supporting enforcement, and the other finds authorities that suggest the clause is overbroad. Both memos land on the partner’s desk with opposite conclusions.
This scenario is neither a one-time glitch nor a user error. New empirical data suggest that AI output inconsistency is prevalent in current, state-of-the-art GenAI technology.
Legal AI adoption is surging. From document review to drafting to legal research, legal professionals are integrating AI into everyday workflows. Among this technology’s most compelling promises is consistency: the machine that treats like cases alike, free from the variability that plagues human judgment. Indeed, consistency is a core concept in law, and enhancing consistency contributes to fundamental values of equality, predictability, and legitimacy.
Many have written about AI’s potential to reinforce consistency in legal tasks, touting AI tools as “impeccably consistent” arbiters that eliminate the psychological and physical factors—such as fatigue and mood swings—that affect humans.1 The intuition is understandable. AI is software, and software is deterministic; typing 2+2 into a calculator always returns 4. Unlike a human, whose decisions on identical inputs might shift with time of day or cognitive load, an algorithm should deliver identical outputs every time.
But does it? A systematic empirical investigation reveals a different reality, showing that the AI consistency assumption rests on unstable ground.
Putting Consistency to the Test
A systematic, large-scale empirical investigation conducted by the author tested the consistency of four top-tier large language models: GPT-4o, GPT-4.1, Claude Sonnet 4, and Claude Opus 4. These models, run by two leading AI companies—OpenAI and Anthropic—are among the most popular AI models.2 Crucially, they also serve as the foundational, under-the-hood engines powering many legal AI platforms.3 The study administered legal knowledge, legal analysis, and legal research tasks spanning constitutional law, contracts, civil and criminal procedure, intellectual property, and other domains. Each task was run for 1,000 iterations under strict configurations designed to maximize output consistency.
Importantly, in most cases, legal practice does not require verbatim identical outputs, only semantically consistent ones. For instance, if asked “Is this contract clause enforceable?”, the responses “No” and “Unenforceable” use different words but convey the same answer. The study’s methodology accounts for this distinction, measuring semantic consistency rather than mere lexical identity.
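To make the repeated-run design concrete, the following is a minimal Python sketch of what such a consistency check might look like. It is not the study’s code: it assumes the OpenAI Python SDK with an API key in the environment, and the normalize helper is a hypothetical, deliberately crude stand-in for the study’s more rigorous semantic-consistency measure, collapsing wording variants such as “No” and “Unenforceable” into a single label.

# Minimal sketch of a repeated-run consistency check (illustrative, not the study's code).
# Assumes the OpenAI Python SDK (openai>=1.0) and an API key in the environment.
from collections import Counter
from openai import OpenAI

client = OpenAI()
PROMPT = "Is the attached non-compete clause enforceable? Answer briefly."

def normalize(answer: str) -> str:
    # Crude stand-in for semantic matching: collapse wording variants into one label.
    text = answer.strip().lower()
    if "unenforceable" in text or text.startswith("no"):
        return "unenforceable"
    if "enforceable" in text or text.startswith("yes"):
        return "enforceable"
    return "unclear"

def run_trials(n: int = 50) -> Counter:
    # Send the identical prompt n times with settings chosen to maximize determinism.
    results = Counter()
    for _ in range(n):
        reply = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": PROMPT}],
            temperature=0,
        )
        results[normalize(reply.choices[0].message.content)] += 1
    return results

if __name__ == "__main__":
    print(run_trials())  # tally of normalized answers; more than one key signals inconsistency

Any tally containing more than one label across identical runs is, by definition, the kind of inconsistency the study measures at scale.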
The core finding is surprising: even under controlled conditions, identical prompts produced divergent legal conclusions. The top-performing model (Sonnet 4) achieved only 57% consistency—meaning both reasoning and outcome aligned—on complex legal tasks. Models demonstrated much higher consistency on low-complexity tasks; however, even under these simplified conditions, the best performer (Opus 4) still produced misaligned conclusions in roughly one of every fifteen attempts.
Among the key patterns, the data reveal a striking task-type hierarchy. Knowledge tasks, structured as multiple-choice questions similar to bar examination formats, achieved near-perfect consistency rates of 97–100% (varying by model). Legal analysis tasks, applying legal principles to novel fact patterns, dropped to 17–61% consistency. Research tasks, requiring identification and citation of relevant authorities, performed worst: all models scored below 26% consistency. For the same task types at lower complexity (i.e., fewer and shorter subtasks), consistency ranged from 97–100% for knowledge, 62–92% for analysis, and 21–75% for research.
Interestingly and perhaps ironically, AI consistency excels precisely where lawyers need the least assistance (rote legal recall) and plunges dramatically where support is most needed (analysis and research).
Why do AI models generate inconsistent outputs even when given the exact same prompt? And why don’t AI companies fix this issue? Inconsistency in language models stems from fundamental architectural and computational features of how these models operate: they generate text by sampling from probability distributions over possible next words, and even when that randomness is dialed down, factors such as hardware-level floating-point variation and server-side processing can still nudge outputs in different directions. While the technical details are complex, the point is that under the current ecosystem, it is virtually impossible to completely eliminate inconsistency from top-tier language models. Moreover, while consistency is essential in legal settings, it may be undesirable in other fields, like creative content generation, that benefit from output variation. Put differently, achieving complete consistency, even if it were possible, is not necessarily an objective for many AI companies in the first place, as they serve a wide array of users, not only from the legal field.
Navigating AI Inconsistency
These findings extend beyond direct interaction with general-purpose models like ChatGPT or Claude. Many specialized legal AI platforms are built on these foundational models. In other words, inconsistency in the base models, such as those tested in this research, cascades into the outputs of the specialized legal AI tools built on top of them.
The practical implications and risks are significant. Two partners at the same firm could receive different outputs and accordingly reach opposing conclusions. If AI is used to support judicial decisions, two highly similar cases may be decided differently. For compliance officers and regulatory counsel, such variation introduces a risk that many practitioners have not yet accounted for in their workflows.
One might correctly respond that human decision-makers are inconsistent, too. Studies have found, for instance, correlations between judges’ political views and their decision-making patterns. So, isn’t AI inconsistency simply par for the course? Not quite. The study compares AI and human inconsistency and finds that these two ostensibly similar behaviors are fundamentally different phenomena, with distinct causes and characteristics. In brief, human inconsistency tends to follow identifiable patterns tied to physical and psychological factors. AI inconsistency, by contrast, manifests as random noise, i.e., stochastic variation without discernible triggers. Integrating AI into legal workflows therefore does not simply add more of the same problem; it diversifies the types of inconsistency practitioners must learn to recognize and manage, making the challenge more complex rather than merely larger in scale.
Several strategies can help mitigate AI inconsistency. The first is task decomposition: the data show dramatic gains when moving from high-complexity to low-complexity conditions, so breaking long queries with multiple subtasks into shorter, single-task prompts should significantly improve consistency. The second is multiple-run verification: running the same query several times and comparing the outputs can surface inconsistencies before they cause harm (a simple sketch of such a check appears below). The third is keeping a human in the loop: treat AI outputs as preliminary drafts requiring professional verification rather than finished work product, a safeguard not only against inconsistent outputs but also against low-quality or fabricated responses.
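As a rough illustration of the second strategy, the following Python sketch wraps an AI query in a rerun-and-compare guardrail. It is a sketch, not a vendor feature: ask_model is a hypothetical placeholder for whatever tool or API a firm actually uses, and the verbatim comparison would in practice be replaced with a more forgiving semantic check.

# Illustrative multiple-run verification guardrail (a sketch, not a vendor API).
from typing import Callable, Dict, List

def verify_by_rerun(ask_model: Callable[[str], str], prompt: str, runs: int = 5) -> Dict:
    # Ask the same question several times and flag divergence for human review.
    answers: List[str] = [ask_model(prompt) for _ in range(runs)]
    distinct = {a.strip().lower() for a in answers}
    return {
        "answers": answers,
        "consistent": len(distinct) == 1,   # every run agreed after trivial normalization
        "needs_human_review": len(distinct) > 1,
    }

When the flag is raised, the divergence itself is useful information: it signals that the question sits in the zone where the model’s answer cannot be taken at face value, and the matter should go to a lawyer rather than to whichever output happened to arrive first.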
None of these solutions comes without cost. Task decomposition and human-in-the-loop review add workflow complexity, and multiple runs increase latency and expense. Moreover, these strategies mitigate inconsistency; they do not eliminate it. Practitioners should calibrate their expectations accordingly.
AI technology offers genuine value to the legal domain in efficiency and access to justice. However, as the data show, this technology is not perfect. The findings do not mean AI is useless; rather, they counsel informed adoption: inconsistency is a complication practitioners should account for and manage when using AI. By showing where AI performs reliably, the study enables skillful, responsible use. Awareness of AI’s actual limitations is key to smart integration.
Please follow this link to view the full study.
Endnotes
1. Jaya Ramji-Nogales, Andrew I. Schoenholtz & Philip G. Schrag, Refugee Roulette: Disparities in Asylum Adjudication, 60 Stan. L. Rev. 295 (2007); James M. Anderson, Jeffrey R. Kling & Kate Stith, Measuring Inter-Judge Sentencing Disparity: Before and After the Federal Sentencing Guidelines, 42 J. L. & Econ. 271 (1999) (“The results show that the expected difference between two typical judges in the average sentence length was about 17 percent (or 4.9 months) in 1986-87”); Alma Cohen & Crystal S. Yang, Judicial Politics and Sentencing Decisions, 11 Am. Econ. J.: Econ. Pol’y 160 (2019).
2. 2024 Artificial Intelligence TechReport, Am. Bar Ass’n, https://www.americanbar.org/groups/law_practice/resources/tech-report/2024/2024-artificial-intelligence-techreport/ (last visited Nov. 30, 2025).
3. Thomson Reuters CoCounsel Tests Custom LLM from OpenAI, Broadening its Multi-Model Product Strategy, PR Newswire (Nov. 25, 2024), https://www.prnewswire.com/news-releases/thomson-reuters-cocounsel-tests-custom-llm-from-openai-broadening-its-multi-model-product-strategy-302314877.html; Bob Ambrogi, LexisNexis Enters the Generative AI Fray with Limited Release of New Lexis+ AI, Using GPT and other LLMs, LawNext (May 2023), https://www.lawnext.com/2023/05/lexisnexis-enters-the-generative-ai-fray-with-limited-release-of-new-lexis-ai-using-gpt-and-other-llms.html; LexisNexis and OpenAI Announce Plan to Deliver Custom AI Technology for Legal Professionals, LexisNexis Legal & Prof’l (2024), https://www.lexisnexis.com/community/pressroom/b/news/posts/lexisnexis-and-openai-announce-plan-to-deliver-custom-ai-technology-for-legal-professionals; Harvey transforms legal work with Claude, Anthropic, https://www.anthropic.com/customers/harvey (last visited Nov. 30, 2025); Customizing models for legal professionals, OpenAI, https://openai.com/index/harvey/ (last visited Nov. 30, 2025).

