METR

Model Evaluation and Threat Research (METR) (MEE-tər) is a nonprofit research institute based in Berkeley, California,^[1] that evaluates the capabilities of frontier AI models to carry out long-horizon, agentic tasks that some researchers argue could pose catastrophic risks to society.^[2]^[3] METR has worked with leading AI companies to conduct pre-deployment model evaluations and contribute to system cards, including those for OpenAI's o3, o4-mini, GPT-4o and GPT-4.5, and Anthropic's Claude models.^[3]^[4]^[5]^[6]^[7]

METR
Formation	2022
Founder	Beth Barnes
Type	Nonprofit research institute
Legal status	501(c)(3) tax exempt charity
Purpose	AI safety research and model evaluation
Website	metr.org

METR's CEO and founder is Beth Barnes, a former alignment researcher at OpenAI who left in 2022 to form ARC Evals, the evaluation division of Paul Christiano's Alignment Research Center. In December 2023, ARC Evals was spun off into an independent 501(c)(3) nonprofit and renamed METR.^[8]^[9]^[10]

Research

Much of METR's research focuses on evaluating how well AI systems can themselves carry out AI research and development. This work includes RE-Bench, a benchmark designed to test whether AIs can "solve research engineering tasks and accelerate AI R&D".^[11]^[12]

Doubling time estimates

In March 2025, METR published a paper reporting that the length of software engineering tasks that the leading AI model could complete had doubled roughly every 7 months between 2019 and 2024.^[14]

In January 2026, METR released a new version of its time horizon estimation model (Time Horizon 1.1). According to the updated model, the rate of progress of AI capabilities has increased since 2023, with a post-2023 doubling time estimated at 130.8 days (about 4.3 months). Progress is thus estimated to be 20% more rapid.^[15]
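
To make the doubling-time figures concrete, the following minimal sketch converts a doubling time into a growth factor over a given period. It assumes simple exponential growth with a constant doubling time and an average month length of 365.25 / 12 days; the function names are illustrative and do not come from METR.

```python
# Average month length in days (assumption used for unit conversion only).
DAYS_PER_MONTH = 365.25 / 12


def days_to_months(days: float) -> float:
    """Convert a duration in days to average-length months."""
    return days / DAYS_PER_MONTH


def growth_factor(doubling_time_days: float, elapsed_days: float) -> float:
    """Multiplicative growth after elapsed_days, assuming a constant doubling time."""
    return 2.0 ** (elapsed_days / doubling_time_days)


# Doubling times quoted in the text above.
TH11_DOUBLING_DAYS = 130.8                    # post-2023 estimate (Time Horizon 1.1)
SEVEN_MONTH_DOUBLING_DAYS = 7 * DAYS_PER_MONTH  # 2019-2024 estimate

print(f"{days_to_months(TH11_DOUBLING_DAYS):.1f} months per doubling")
print(f"x{growth_factor(TH11_DOUBLING_DAYS, 365.25):.1f} per year at 130.8 days")
print(f"x{growth_factor(SEVEN_MONTH_DOUBLING_DAYS, 365.25):.1f} per year at 7 months")
```

Under these assumptions, a 130.8-day doubling time corresponds to a little under a sevenfold increase in task length per year, compared with a little over threefold for a 7-month doubling time.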

Time horizon measurements

METR releases a "task-completion time horizon" for analysed AI models. This measures the "task duration (measured by human expert completion time) at which an AI agent is predicted to succeed with a given level of reliability."^[16] The metric is reported in two variants: the 50%-time horizon, which gives the task duration at which an AI model is estimated to succeed 50% of the time, and the 80%-time horizon, which gives the task duration at which an AI model is estimated to succeed 80% of the time.^[16] METR has published two versions of the underlying model: Time Horizon 1.0 and Time Horizon 1.1, the latter introduced in January 2026.^[16]
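
The relationship between the two variants can be illustrated with a short sketch. It assumes that success probability follows a logistic curve in the base-2 logarithm of the human completion time, which is one plausible way to model such a curve; the `INTERCEPT` and `SLOPE` values are hypothetical and are not METR's fitted parameters.

```python
import math


def success_probability(minutes: float, intercept: float, slope: float) -> float:
    """Assumed logistic model: success falls as log2(task duration) grows."""
    z = intercept - slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(reliability: float, intercept: float, slope: float) -> float:
    """Task duration (minutes) at which the model succeeds with the given reliability.

    Solves intercept - slope * log2(t) = logit(reliability) for t.
    """
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((intercept - logit) / slope)


# Hypothetical curve parameters, chosen only for illustration.
INTERCEPT, SLOPE = 3.0, 0.6

h50 = time_horizon(0.5, INTERCEPT, SLOPE)
h80 = time_horizon(0.8, INTERCEPT, SLOPE)
print(f"50% horizon: {h50:.1f} min, 80% horizon: {h80:.1f} min")
```

Because a higher reliability requirement moves the threshold to easier (shorter) tasks, the 80%-time horizon is always shorter than the 50%-time horizon for the same model under this kind of curve.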

As of 9 May 2026, the best-performing model is an early version of Claude Mythos Preview, with a 50%-time horizon that is likely at least 16 hours and an 80%-time horizon of 3 hours and 6 minutes.^[16] METR notes that "[m]easurements above 16 [hours] are unreliable with [their] current task suite". The following table lists time horizon estimates, ordered by each model's release date:^[16]

Task duration (for humans)
Model	Release date	Time Horizon 1.1 (50%)	Time Horizon 1.1 (80%)	Time Horizon 1.0 (50%)	Time Horizon 1.0 (80%)
GPT-2	February 2019	—	—	2 seconds	0 seconds
GPT-3	May 2020	—	—	9 seconds	2 seconds
GPT-3.5	March 2022	—	—	36 seconds	10 seconds
GPT-4	March 2023	4 minutes	37 seconds	5 minutes	1 minute
GPT-4 (November 2023)	November 2023	4 minutes	34 seconds	9 minutes	1 minute
Claude 3 Opus	March 2024	4 minutes	29 seconds	6 minutes	1 minute
GPT-4 Turbo	April 2024	3 minutes	37 seconds	7 minutes	2 minutes
GPT-4o	May 2024	6 minutes	57 seconds	9 minutes	2 minutes
Qwen2-72B	June 2024	—	—	2 minutes	25 seconds
Claude 3.5 Sonnet (Old)	June 2024	11 minutes	1 minute	19 minutes	3 minutes
Qwen2.5-72B	September 2024	—	—	5 minutes	56 seconds
o1-preview	September 2024	19 minutes	3 minutes	22 minutes	5 minutes
Claude 3.5 Sonnet (New)	October 2024	20 minutes	2 minutes	30 minutes	5 minutes
DeepSeek-V3	December 2024	—	—	18 minutes	4 minutes
o1	December 2024	38 minutes	6 minutes	41 minutes	6 minutes
Claude 3.7 Sonnet	February 2025	1 hour	10 minutes	56 minutes	15 minutes
o3	April 2025	2 hours 1 minute	24 minutes	1 hour 34 minutes	21 minutes
o4-mini	April 2025	—	—	1 hour 19 minutes	16 minutes
Claude Opus 4	May 2025	1 hour 41 minutes	17 minutes	1 hour 26 minutes	21 minutes
DeepSeek-R1-0528	May 2025	—	—	32 minutes	4 minutes
Gemini 2.5 Pro Preview	June 2025	—	—	40 minutes	9 minutes
Grok 4	July 2025	—	—	1 hour 49 minutes	15 minutes
Claude Opus 4.1	August 2025	1 hour 41 minutes	19 minutes	—	—
GPT-5	August 2025	3 hours 34 minutes	32 minutes	2 hours 18 minutes	27 minutes
gpt-oss-120b	August 2025	—	—	45 minutes	7 minutes
Claude Sonnet 4.5	September 2025	—	—	2 hours 2 minutes	21 minutes
Gemini 3 Pro	November 2025	3 hours 57 minutes	43 minutes	—	—
Claude Opus 4.5	November 2025	5 hours 20 minutes	42 minutes	4 hours 49 minutes	27 minutes
GPT-5.1-Codex-Max	November 2025	3 hours 57 minutes	41 minutes	2 hours 53 minutes	32 minutes
Kimi K2 Thinking (inference via Novita AI)	November 2025	—	—	58 minutes	12 minutes
GPT-5.2 (high)	December 2025	6 hours 34 minutes	55 minutes	—	—
Claude Opus 4.6	February 2026	11 hours 59 minutes	1 hour 10 minutes	—	—
GPT-5.3-Codex (high)	February 2026	6 hours 30 minutes	47 minutes	—	—
Gemini 3.1 Pro	March 2026	5 hours 50 minutes	1 hour 30 minutes	—	—
GPT-5.4 (xhigh)	March 2026	5 hours 42 minutes	54 minutes	—	—
Claude Mythos Preview (early)	April 2026	Likely at least 16 hours	3 hours 6 minutes	—	—
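
As a rough illustration of how a doubling time can be read off figures like these, the sketch below fits a straight line to log2(time horizon) against release date for four rows of the table (Time Horizon 1.1, 50% column). The choice of rows, the use of the first day of each release month as the date, and the plain least-squares fit are all assumptions made for this example; it is not METR's methodology and will not reproduce METR's published estimate.

```python
from datetime import date
import math

# (model, assumed release date, 50% time horizon in minutes under Time Horizon 1.1)
ROWS = [
    ("GPT-4", date(2023, 3, 1), 4),
    ("o1", date(2024, 12, 1), 38),
    ("GPT-5", date(2025, 8, 1), 3 * 60 + 34),
    ("Claude Opus 4.6", date(2026, 2, 1), 11 * 60 + 59),
]


def doubling_time_days(rows) -> float:
    """Least-squares slope of log2(horizon) vs. days, inverted to days per doubling."""
    start = rows[0][1]
    xs = [(released - start).days for _, released, _ in rows]
    ys = [math.log2(minutes) for _, _, minutes in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # doublings per day
    return 1.0 / slope


estimate = doubling_time_days(ROWS)
print(f"about {estimate:.0f} days per doubling on this subset")
```

The result depends heavily on which rows are included, so it should be read as a sanity check on the order of magnitude rather than as an estimate in its own right.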
