---
title: AI agent autonomously handles 30%+ of a typical knowledge-work job
status: draft
dimensions: ["labor","education","travel"]
horizon: medium
trigger: An AI agent autonomously completes ≥ 30% of tasks in a typical knowledge-work role (e.g. legal review / software dev / content / customer support) without human review on each step. 'Typical' = the modal task distribution of the role per O*NET or comparable taxonomy.
timeline: {"p10":2027,"p50":2029,"p90":2034}
confidence: medium
sub_gates: [{"slug":"metr-time-horizon-1-week","p50":2028,"why":"METR 50% time horizon crosses 40 working hours — agents can chain a typical week of work."},{"slug":"swe-bench-pro-75pct","p50":2027,"why":"Contamination-resistant SWE-bench Pro at 75%+ implies SWE agents handle the modal task autonomously."},{"slug":"tau2-bench-policy-90pct","p50":2028,"why":"Customer-service agents hit 90% on policy-adherent multi-turn flows — clears the compliance bar for unsupervised deployment."},{"slug":"agent-cost-per-task-sub-1usd","p50":2027,"why":"Frontier agent task cost drops below $1 for a typical 10-minute knowledge-work unit, enabling per-task economics."},{"slug":"long-context-1m-reliable","p50":2026,"why":"1M-token context with reliable recall across multi-day sessions — required for handling a real job's context window."},{"slug":"agent-orchestration-prod-maturity","p50":2027,"why":"Multi-agent orchestration (Claude Code subagents / Devin-style) ships GA at top-3 enterprise vendors."}]
cross_gate: [{"other":"humanoid-retail-20k","relation":"correlates","strength":"weak","note":"Both are 'automation gates' but in different domains; share macro labor-policy backlash but capability progress is largely independent."},{"other":"ai-tutor-k8-parity-20mo","relation":"enables","strength":"strong","note":"Same underlying agent stack — long-horizon, tool-using, multi-turn. If knowledge-work agents cross 30%, K-8 tutors at parity is essentially a packaging problem."},{"other":"construction-robot-40pct-labor","relation":"correlates","strength":"weak","note":"Embodiment is the binding constraint there, not cognition; agentic LLM progress helps planning layer only."},{"other":"autonomous-freight-delivery","relation":"correlates","strength":"medium","note":"Both depend on long-horizon reliability and regulatory acceptance of unsupervised autonomous action — common bottleneck."},{"other":"robotaxi-unit-economics-5-cities","relation":"correlates","strength":"weak","note":"Different perception/control stack but same regulator question: 'when do we let it act unsupervised?'"}]
external_calibration: {"metaculus":"https://www.metaculus.com/questions/11188/ai-as-a-competent-programmer-before-2030/","manifold":"https://manifold.markets/ahalekelly/will-ai-cause-the-us-unemployment-r","expert_consensus":"McKinsey Global Institute (2023, reaffirmed 2025): up to 30% of US work hours automatable by 2030 with genAI; Anthropic Economic Index (Mar 2026) shows API automation already dominant for coding/customer-service tasks; METR projects week-long autonomous tasks by 2027-2028."}
last_updated: "2026-05-13T00:00:00.000Z"
sources_count: 22
---

## TL;DR

I put the **P50 at 2029**, within 3 years of today (May 2026), for an AI agent autonomously completing ≥30% of tasks in at least one typical knowledge-work role, measured against an O*NET-style task inventory and without per-step human review. The headline thesis: customer support has already crossed this threshold inside Klarna-style enterprises (67% of conversations handled end-to-end), and software engineering is now visibly mid-cross, with Claude Opus 4.6 handling 14.5-hour solo tasks at 50% reliability per METR and Devin merging 67% of its PRs at thousands of companies. The remaining gap to the "modal knowledge-work job" is not capability. It is **integration friction**, **policy and liability**, and **the difference between completing a benchmark task and completing the median task an actual human does**, which usually involves messy context, organizational knowledge, and ambiguous requirements. **P10 = 2027** (one role at a fast-moving enterprise such as Klarna or a Cognition deployment, formally measured against O*NET); **P90 = 2034** (a deep stall in agent capability progress or a hard regulatory clamp pushes the threshold out by most of a decade). The METR 4-month doubling regime, if it continues, is the single most important quantitative driver: it puts week-long tasks at 50% reliability in ~mid-2027, and once frontier agents can chain a week of work, surpassing 30% of a typical role's task distribution is mechanical.
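
The three quantiles above can be turned into a rough "probability the gate has passed by year Y" curve. The sketch below assumes a piecewise-linear CDF between the stated quantiles, which is my simplification and not something the forecast specifies; the function name `prob_by_year` is hypothetical. Outside the P10 to P90 range the forecast says nothing, so the function clamps to the nearest stated quantile and does not extrapolate.

```python
from bisect import bisect_right

# Quantiles copied from the frontmatter `timeline` field.
TIMELINE = {0.10: 2027, 0.50: 2029, 0.90: 2034}


def prob_by_year(year: float, timeline: dict = TIMELINE) -> float:
    """Approximate P(gate has passed by `year`).

    Assumption: the CDF is piecewise linear between the stated quantiles.
    Below P10 the result is clamped to 0.10 (so it is an upper bound there);
    above P90 it is clamped to 0.90 (a lower bound).
    """
    points = sorted((y, p) for p, y in timeline.items())
    years = [y for y, _ in points]
    probs = [p for _, p in points]
    if year <= years[0]:
        return probs[0]
    if year >= years[-1]:
        return probs[-1]
    i = bisect_right(years, year)          # first quantile year strictly after `year`
    y0, y1 = years[i - 1], years[i]
    p0, p1 = probs[i - 1], probs[i]
    return p0 + (p1 - p0) * (year - y0) / (y1 - y0)


for y in (2028, 2029, 2030, 2032):
    print(y, round(prob_by_year(y), 2))
```

Under this interpolation, 2030 sits a little above the median, which is one way to compare the forecast with "by 2030" questions elsewhere in this note. A skewed parametric fit would give somewhat different mid-range values.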

## Current state (as of 2026-05-13)

The 30% threshold is already crossed in some specific deployments, but not in any agreed-upon "typical knowledge-work job" measured against O*NET. Three hard numbers anchor where we are:

- **METR time horizon**: Claude Opus 4.6 reaches a 50%-reliability time horizon of **14.5 hours** (up from ~50 minutes for Claude 3.7 Sonnet 14 months earlier). The doubling time since 2024 is **89 days** under TH 1.1, i.e., roughly 3 months; this note rounds that up and refers to it as the 4-month regime [1][2].
- **SWE-bench Verified**: GPT-5.5 leads at **88.7%** and Claude Opus 4.7 follows at **87.6%** as of April 2026, though the benchmark is now contamination-suspect; **SWE-bench Pro** (Scale's harder, contamination-resistant variant) tops out at ~**59%** (GPT-5.4 xHigh) with Claude Opus 4.6 at 51.9% [3][4]. **OSWorld** (computer-use): GPT-5.5 at **78.7%**, crossing the human baseline of ~72% [5].
- **Production deployment**: Cursor at **$2B ARR**; Cognition's Devin merging hundreds of thousands of PRs across thousands of customers, including Goldman Sachs and Infosys (in production at "millions of developers" scale via the Infosys partnership); Claude Code at **$2.5B run-rate revenue**; and Anthropic at $30B ARR overall, with enterprise the bulk of that growth [6][7][8][9]. Klarna automated **67% of customer-service conversations** in month one, though it later re-introduced humans for the ~5% of conversations that are complex edge cases [10].

So as of mid-2026, the capability bar for crossing 30% in at least one role looks reachable within 1–2 model generations (Claude Opus 5.0, GPT-6), and the deployment infrastructure is shipping fast. The lagging variables are **measurement** (nobody is formally benchmarking against O*NET) and **the long tail of tasks** that benchmarks don't represent but that matter in the real job: status meetings, navigating internal politics, picking up half-articulated requirements over Slack.

## Key uncertainties

1. **Does the METR doubling rate hold or revert to the 7-month historical mean?** If it reverts, week-long task capability slides from mid-2027 to ~early 2029, shifting the P50 by 1.5–2 years. The 4-month rate rests on only ~12 months of data and could be a regime artifact.
2. **What share of tasks in a "typical" job is within agent capability today but blocked by integration, context, or data access?** If 40%, then enterprise plumbing build-out alone (not capability progress) gets us to 30%-in-role by 2027. If 10%, capability is the binding constraint.
3. **Does the EU AI Act's December 2027 high-risk deadline functionally ban unsupervised agentic action in regulated domains (legal, finance, HR)?** A hard ban on autonomous decision-making would carve out 30–40% of knowledge-work roles in the EU and slow US adoption by precedent. The May 2026 omnibus delay suggests softening.
4. **Does training data run out in 2027–28 and stall scaling?** Synthetic data and RL-from-environment may compensate, but if not, the doubling rate breaks and P50 slides 3+ years.
5. **Will labor pushback (NLRB, state-level AI agent disclosure laws, sectoral unions) make 30%-autonomous a no-go in mainstream firms even if the tech works?** Microsoft, Cisco, and Meta are already moving ahead under an "AI-driven efficiency" framing, and pushback so far is rhetorical, not regulatory.

## Evidence synthesis

### Academic

The single most important academic anchor is METR's *Measuring AI Ability to Complete Long Software Tasks* (Kwa et al., arXiv:2503.14499) [11], which introduces the time-horizon metric: the duration of a human-expert task that the model can complete with 50% reliability. METR's longitudinal data shows a 7-month doubling time over 2019–2025; the series runs from ~9 seconds for GPT-3 in 2020, to ~50 minutes for Claude 3.7 in early 2025, to 14.5 hours for Claude Opus 4.6 in Feb 2026 under the TH 1.1 methodology. Since 2024, the doubling time has compressed to **~89 days** [2]. The implication: 1-week task capability arrives ~mid-2027 under continuation of the recent regime; ~mid-2028 under the long-run regime; 1-month task capability ~mid-2029. METR's extrapolation explicitly says "within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month" [11].
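
As a sanity check on the recent-regime figure, the doubling time implied by two of the readings above can be computed directly. This is a two-point estimate under a pure-exponential assumption, not METR's fitting method, and the elapsed time is my assumption: "early 2025" to "Feb 2026" is taken as roughly 365 days. The function name `implied_doubling_days` is hypothetical.

```python
import math


def implied_doubling_days(start_minutes: float, end_minutes: float, elapsed_days: float) -> float:
    """Doubling time implied by two time-horizon readings, assuming steady exponential growth."""
    doublings = math.log2(end_minutes / start_minutes)
    return elapsed_days / doublings


# Readings quoted in the text. The 365-day gap is an assumption, not a sourced figure.
start_minutes = 50.0        # Claude 3.7 Sonnet, ~50 minutes
end_minutes = 14.5 * 60     # Claude Opus 4.6, 14.5 hours
days = implied_doubling_days(start_minutes, end_minutes, elapsed_days=365)

print(f"{math.log2(end_minutes / start_minutes):.2f} doublings, about {days:.0f} days each")
```

With a 12-month gap the two-point estimate lands close to the ~89-day figure; a 14-month gap would stretch it to a bit over 100 days, so the result is sensitive to which release dates are used.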

On benchmark-specific results, the **SWE-bench family** (Princeton/OpenAI, agent code repair) and its harder variants (SWE-bench Pro [4], SWE-bench Multimodal) show the same exponential trend, but with explicit warnings that saturation on Verified is partly contamination [3]. The **GAIA benchmark** (Meta AI Research) for general AI assistants shows frontier models at 74.6% (Claude Sonnet 4.5, scaffolded) vs a human baseline of ~92% [12]. On **OSWorld**, the computer-use benchmark, GPT-5.5 has now crossed the human baseline (78.7% vs ~72%) [5]. **τ²-Bench** (Sierra) measures customer-service agents on policy-adherent multi-turn flows, a much harder bar than "task complete", and frontier models cluster in the 50–65% range as of April 2026 [13].

Citation-graph leaders in agent eval are Princeton (SWE-bench, HAL), Sierra (τ-bench), CMU+OSU (WebArena, OSWorld), and METR itself. The trajectory across all five benchmarks is consistent: rapid progress on narrow tasks (coding), slower on broad open-world tasks (GAIA, OSWorld), and the slowest on multi-turn-with-policy (τ-bench) — which is also the most predictive of real deployment success.

Anthropic's own **Project Vend** [14] is informative as a negative result: in mid-2025 Claude failed to profitably run a tiny vending business autonomously over a month; in Phase 2 (with multi-agent architecture) it improved meaningfully but the gap between "capable" and "completely robust" remained wide. This is good evidence that *agentic business operation* is harder than *agentic task completion*, and is why I don't think benchmark scores alone get us to a 30%-of-a-job claim.

### Industry / market

The deployment numbers from 2025–2026 are the strongest single piece of evidence that we are mid-crossing, not approaching, this gate. Five anchor points:

1. **Anthropic** crossed **$30B annualized revenue** by April 2026, up from $9B at end-2025 and $1B at end-2024. The source headline calls this 80x growth; the endpoints quoted here work out to 30x in 16 months [9]. Claude Code alone is at a $2.5B run-rate, with enterprise representing >50% of Claude Code revenue. More than 1,000 customers now spend >$1M annually with Anthropic, vs ~12 two years earlier.
2. **Cursor** hit **$2B ARR** by February 2026 from $100M one year earlier — used by ~70% of the Fortune 1000 [6]. Latest reporting puts Cursor in talks at a $50B valuation.
3. **Cognition / Devin**: PR merge rate up to **67%** in the 2025 performance review (from 34% the prior year); deployment at thousands of companies including Goldman Sachs, Santander, Dell, Cisco, and Palantir; and an Infosys partnership for a global financial-services rollout, described as "one of the largest agentic deployments to date" [7]. ARR went from $1M (Sep 2024) to $73M (Jun 2025) [8]. **Devin Review**, for automated code review, launched in January 2026.
4. **Replit** at $150M ARR (Sep 2025), on track for $1B run-rate by end-2026 [15]. **Lovable** at $200M ARR (Nov 2025), 8M users, 100K+ projects/day [15]. Both target the "non-engineer builds an app" segment which is itself a knowledge-work automation play.
5. **Klarna**: in customer support specifically, the AI assistant reached a 67% autonomous resolution rate in its launch month (Feb 2024); over 2 years the deployment brought a 40% drop in cost-per-transaction and replaced ~700 agents [10]. Note the 2025 walkback: Klarna re-introduced humans for the 5% of complex, emotional, or edge-case conversations. This is the **clearest existing example** of crossing the 30% threshold in a real role today, though customer support is the *easiest* knowledge-work role to automate.
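
To compare the revenue anchors above on one scale, the sketch below converts each pair of endpoints into an overall multiple and an implied compound monthly growth rate. The month counts are my approximate reading of the quoted dates, and `growth_summary` is a hypothetical helper; the dollar figures are the ones cited in the list.

```python
def growth_summary(start: float, end: float, months: float) -> tuple:
    """Return (overall multiple, implied compound monthly growth rate)."""
    multiple = end / start
    monthly_rate = multiple ** (1.0 / months) - 1.0
    return multiple, monthly_rate


# (start, end, months elapsed); revenue in $M as quoted above.
# Month counts are approximate readings of the quoted dates.
ANCHORS = {
    "Anthropic (end-2024 to Apr 2026)": (1_000, 30_000, 16),
    "Cursor (~12 months to Feb 2026)": (100, 2_000, 12),
    "Cognition (Sep 2024 to Jun 2025)": (1, 73, 9),
}

for name, (start, end, months) in ANCHORS.items():
    multiple, rate = growth_summary(start, end, months)
    print(f"{name}: {multiple:.0f}x overall, ~{rate:.0%} per month")
```

Monthly rates make the comparison fairer than raw multiples, since the three windows differ in length and the companies start from very different bases.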

The **Anthropic Economic Index (March 2026)** [16] is the best public dataset on actual usage patterns: API traffic (which is more agentic / automated) shows the share of "computer and mathematical" tasks growing **+14%** over 6 months, and customer service is flagged as the occupational category with highest automation exposure. New API automation categories emerging in late 2025 include "business sales and outreach automation" and "automated trading and market operations" — both >2x growth. Critically, Anthropic notes that **automation already dominates 1P API traffic**, meaning that for the workloads developers ship to customers, the median interaction is already minimal-human-loop.

McKinsey Global Institute's 2023 estimate — reaffirmed in 2025 — pegs **up to 30% of US work hours automatable by 2030** with genAI [17]. For STEM specifically, McKinsey raises the 2030 automation potential from 14% to 30% under genAI scenarios. Forrester's 2024 estimate is more conservative (6% of US jobs eliminated by 2030), reflecting the distinction between *task automation* and *job elimination* that explains why my gate is well-defined: 30% of tasks within a role is a much earlier milestone than 30% of jobs gone.

### Public sentiment

r/cscareerquestions in May 2026 is dominated by layoff posts and AI-displacement narratives. The top post, "4 engineers now doing the job of 12 at my friend's company because AI agents handle the rest", has 1,315 upvotes and 516 comments [18]. The framing in earnings calls at Microsoft, Meta, Cloudflare, Cisco, Coinbase, and PayPal is consistent: "AI-driven efficiencies" justifying headcount cuts. CS enrollment is dropping for the first time in 6 years while ME and EE rise 11% and 14%, respectively [18]. A senior FAANG manager's hopeful counter-narrative ("the golden age is coming, you're not cooked") got 1,354 upvotes, but the top comments push back hard. Sentiment among working software engineers is **directly observable as bearish on entry-level survival, bullish on senior-with-AI productivity**. The "first rung of the ladder turns into a button, eventually there's no ladder" meme has 1,602 upvotes.

r/LocalLLaMA shows a different signal: builders are skeptical of frontier-only narratives. Reports of Qwen 3.6 27B replacing Opus 4.7 for "85% of tasks" [19] suggest open-weight models are catching up fast, meaning the agent stack is commoditizing, not concentrating. This is bullish for "30% of role" timelines because it implies cost-per-task is falling fast enough to make agent deployment economically viable even for low-margin roles.

r/Lawyertalk is the most informative for legal-as-a-knowledge-work-role. Top post: clients showed up to an estate-planning consult wearing Meta camera glasses, had AI analyze the meeting, then sent back an AI-generated critique recommending an offshore trust and undercutting the lawyer's fees [20]. The DOJ-brief-clearly-written-by-AI post had 1,266 upvotes. Lawyers are simultaneously contemptuous of AI legal output quality *and* aware it's pricing into client expectations. California Bar's May 2026 rule proposal requires verification of every AI output [21] — this is exactly the friction that delays "30% autonomous" in regulated knowledge work even when the capability is there.

r/singularity is, predictably, on the bullish extreme — the modal post in April–May 2026 is robot manufacturing acceleration, Atlas tricks, half-marathon records broken by robots, GPT-5.4 solving 60-year-old Erdős problems [22]. Sam Altman publicly walked back UBI advocacy ("I no longer believe in universal basic income as much as I once did") — bullish signal that even OpenAI's CEO no longer thinks the labor displacement will be smoothed by cash transfers, which is itself an admission of expected speed.

### Prediction markets

The directly relevant **Metaculus** question is *AI as a Competent Programmer Before 2030* [23]. The community forecast implies a median probability close to **70–80%** of competent autonomous programming by end-2029 under reasonable interpretations. The associated Metaculus "AI 2027" tournament aggregates questions on automation of AI R&D, with community estimates that AI begins automating AI research by **2027**, which is directly upstream of my gate. The **Metaculus Labor Automation Forecasting Hub** [24] hosts multiple questions on hours automated; the modal community estimate is consistent with McKinsey's 30%-by-2030 number.

On **Manifold**, the cleanest question I found is *"Will AI cause the US Unemployment Rate to exceed 10% before 2030?"* [25] — current price around 15–20% probability (low for unemployment but high for the implied automation pressure). This is much lower than the "AI handles 30% of tasks" question because *task automation* doesn't translate 1:1 to *unemployment* — many displaced workers shift roles rather than become unemployed.

My P50 of 2029 sits **right at the median of the Metaculus AI-programmer-by-2030 implied resolution year**, slightly more aggressive than McKinsey's 30%-by-2030 (which is hours-weighted across all roles, not within-role), and substantially more bullish than Forrester's 6%-by-2030 (which is jobs-eliminated, a different metric). The market consensus has shifted bullish over the past 12 months as Devin/Cursor/Anthropic numbers have shipped — the gate I'm forecasting is now mostly priced in by 2030 in prediction markets but the date is still contested.

### Policy / regulation

The single most material near-term policy lever is the **EU AI Act**. The high-risk system compliance deadline was pushed from August 2026 to **December 2, 2027** by the May 2026 omnibus revision [26]. High-risk categories include: employment decisions, education, biometrics, critical infrastructure, migration/asylum, and credit scoring — i.e., a meaningful share of "knowledge-work decisions" that touch consequential outputs. The Act explicitly addresses **agentic AI**: agentic workflows must log events for risk identification, not just final outputs [26]. The functional effect for the gate I'm forecasting: a hard ceiling on what "autonomous without human review" can mean in EU regulated domains. A US firm can run Devin merging PRs unsupervised; a German bank cannot use the same Devin to make a credit decision unsupervised without breaking the AI Act.

The **California Bar** in May 2026 proposed a rule requiring lawyers to verify *every* AI output [21]. This is a direct constraint on the legal-AI variant of my gate — agents can be 30% of a lawyer's *task throughput* only if the rule allows it, which currently it does not without verification. ABA's task force in late 2025 declared AI "moved from experiment to infrastructure for the legal profession" [27], but their guidance is still oversight-heavy.

US executive orders under the current administration have been pro-deployment (the Biden EO's reporting requirements were rolled back in early 2025), so federal regulatory drag is low in the US. The bigger US risks are **state-level disclosure laws** (California, New York, Colorado) and NLRB rulings on automation-driven layoffs that could slow integration.

Professional associations beyond the bar are slower to formalize — accounting (AICPA), engineering (NSPE), medicine (AMA) are at the "guidance" stage, not binding rules. McKinsey itself is deploying thousands of AI agents internally [17], which functions as a market signal that consulting (a flagship knowledge-work role) is being automated by its own incumbent practitioners — bullish for fast crossing of the 30% threshold in white-collar professional services.

## Sub-gates (upstream)

The upstream dependencies that must be true for the gate to pass:

1. **METR 50%-reliability time horizon ≥ 1 week (40 hours)** — P50: 2028. METR is currently at 14.5h (Feb 2026), doubling every ~4 months. Reaching 40h takes about 1.5 doublings (log2(40/14.5) ≈ 1.46), so two doublings (~8 months) clear 1 week with margin. Slip risk: the doubling rate reverts to the 7-month historical mean, so two doublings take 14 months, which gives a mid-2027 baseline plus model release cadence.
2. **SWE-bench Pro ≥ 75%** — P50: 2027. The contamination-resistant benchmark currently tops out at ~59%. Two model generations should clear 75% under the current trajectory.
3. **τ²-Bench policy-adherent score ≥ 90%** — P50: 2028. Customer service is the only knowledge-work role where unsupervised deployment is already economically viable today, and τ-bench is the cleanest measure of "actually follows the rules without a human watching."
4. **Agent cost-per-task < $1 for a 10-minute knowledge-work unit** — P50: 2027. Frontier model inference cost is dropping ~10x per year. Sub-$1 task economics is the unlock for "every white-collar role gets a baseline agent" rather than only top-tier roles.
5. **Long-context 1M-token reliable recall** — P50: 2026. This is the closest to passing already (Gemini 2.5 Pro, Claude with 1M context). What's missing is *reliable* recall across multi-day session memory, which is more about agent architecture than raw context.
6. **Multi-agent orchestration in production** — P50: 2027. Claude Code subagents, Devin's multi-agent architecture, and Project Vend Phase 2 all show this is possible but not yet a default deployment pattern.
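
The arithmetic behind sub-gate 1 can be sketched as follows. This is raw curve extrapolation under a steady-doubling assumption, with the 14.5h and 40h figures taken from the list and the two doubling times (4 and 7 months) taken from the text; `months_to_horizon` is a hypothetical helper. It ignores release cadence, measurement lag, and any bend in the trend, which is presumably part of why the sub-gate P50 is set later than the curve alone would suggest.

```python
import math


def months_to_horizon(current_h: float, target_h: float, doubling_months: float) -> float:
    """Months until the time horizon reaches target_h, assuming steady exponential doubling."""
    if current_h >= target_h:
        return 0.0
    return math.log2(target_h / current_h) * doubling_months


CURRENT_H = 14.5   # hours, Claude Opus 4.6 (Feb 2026), as quoted above
WEEK_H = 40.0      # one working week, per sub-gate 1

for label, doubling in [("recent regime (~4 months)", 4.0), ("long-run mean (7 months)", 7.0)]:
    months = months_to_horizon(CURRENT_H, WEEK_H, doubling)
    print(f"{label}: {months:.1f} months to a 40-hour horizon")
```

The same helper can be pointed at a month-long target (for example 160 working hours) to see how sensitive the later milestones are to the choice of doubling time.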

## Cross-gate dependencies

The 30%-knowledge-work gate has the following non-trivial relationships with the other 10 gates in this set:

**Strongest dependency** — `ai-tutor-k8-parity-20mo`. Same underlying tech stack (long-horizon, tool-using, multi-turn LLM agents with retrieval). If a knowledge-work agent can handle 30% of a software engineer's tasks, a K-8 tutor at parity is fundamentally a packaging and safety-tuning problem, not a capability problem. **Relation: enables. Strength: strong.** A 6-month lag would be typical — knowledge-work agents reach 30% threshold, K-8 tutors at parity follow within a model generation.

**Medium correlation** — `autonomous-freight-delivery`. Both depend on long-horizon reliability and regulator acceptance of unsupervised autonomous action in consequential domains. The *capability* progress is mostly independent (one is mostly perception/control, one is mostly language/reasoning), but the *deployment timing* correlates because both run into the same compliance / liability infrastructure. **Relation: correlates. Strength: medium.**

**Weak correlation** — `robotaxi-unit-economics-5-cities` and `humanoid-retail-20k`. Different stacks technically, but both are *autonomous action gates* and the regulatory / labor-policy backlash applies broadly. If society broadly accepts "AI agents handling 30% of knowledge work," it's marginally easier to accept "robots stocking shelves" — but the binding constraints are very different. **Relation: correlates. Strength: weak.**

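**Weak correlation, with a possible substitution effect** — `construction-robot-40pct-labor`. Embodiment, not cognition, is the binding constraint there, so agentic LLM progress helps only the planning layer. If construction labor automation accelerates, the political and economic pressure to slow knowledge-work automation may rise as a labor-market protection response, though more likely both proceed in parallel. **Relation: correlates. Strength: weak.**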

**Unrelated** — `cell-meat-beef-parity`, `residential-solar-storage-0.04`, `metals-bom-30pct`, `evtol-1k-trips-major-city`, `smr-first-oecd-deployment`. These are physical-world cost-curve gates that don't share a meaningful capability or policy bottleneck with knowledge-work agent autonomy.
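
The relationships above mirror the `cross_gate` field in the frontmatter. A minimal sketch of how those entries could be grouped by strength for a summary view follows; the entries are copied from the frontmatter with notes omitted, and `group_by_strength` and `STRENGTH_ORDER` are hypothetical names, not part of any existing tooling for this note.

```python
from collections import defaultdict

# Entries copied from the frontmatter `cross_gate` field (notes omitted).
CROSS_GATE = [
    {"other": "humanoid-retail-20k", "relation": "correlates", "strength": "weak"},
    {"other": "ai-tutor-k8-parity-20mo", "relation": "enables", "strength": "strong"},
    {"other": "construction-robot-40pct-labor", "relation": "correlates", "strength": "weak"},
    {"other": "autonomous-freight-delivery", "relation": "correlates", "strength": "medium"},
    {"other": "robotaxi-unit-economics-5-cities", "relation": "correlates", "strength": "weak"},
]

STRENGTH_ORDER = ["strong", "medium", "weak"]


def group_by_strength(entries):
    """Group cross-gate entries by strength, strongest first."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["strength"]].append((entry["other"], entry["relation"]))
    return {s: grouped[s] for s in STRENGTH_ORDER if s in grouped}


for strength, gates in group_by_strength(CROSS_GATE).items():
    print(strength, gates)
```

The five gates listed as unrelated have no frontmatter entry, so they do not appear in this grouping.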

## Downstream impact essay

**Labor (primary).** The 30%-tasks-autonomous threshold passing in any one knowledge-work role triggers a phase change in that role's labor economics within 24–36 months. Customer support has already crossed (Klarna, 67%), and the consequences have been ~50% headcount reduction in supported teams, role redefinition toward "AI-assisted escalation specialist," and falling wages for the residual humans. Software engineering is mid-cross today: the Reddit signal of "4 engineers doing the work of 12" matches the Microsoft/Meta/Cloudflare/Cisco/Coinbase/PayPal/Fidelity wave of 2026-Q2 layoffs framed as "AI-driven efficiency." If P50 = 2029 is right, by 2032 expect: (a) entry-level white-collar hiring at large firms dropping 40–60% from 2024 levels, (b) wage compression in the bottom two quintiles of knowledge work (back-office, junior analyst, paralegal, content moderator, basic legal review), (c) wage expansion in the top quintile for humans who can effectively orchestrate teams of agents. This is not "mass unemployment"; it is "the bottom rung of the career ladder evaporating" while senior workers get more leverage. The CS enrollment drop in 2026 is the early demographic signal. Political response: UBI is dead (Altman has walked back his advocacy). What's likely instead is some mix of (i) reskilling tax credits, (ii) sector-specific job guarantees, (iii) "human in the loop" mandates in regulated industries.

**Education (secondary).** If by 2029 a typical software-dev / analyst / paralegal role is 30%+ automated, the K-12 → college pipeline reconfigures. The signal is already visible: CS enrollment drop, ME/EE rises. What kids actually need to learn changes: (a) **judgment and verification** — knowing when an agent's output is wrong, why, and how to fix it; (b) **agent orchestration** — being good at directing a team of agents is the new "being good at managing people"; (c) **deep domain knowledge** — generic prompt-engineering gets commoditized fast, but knowing a domain well enough to ask the right questions becomes the durable skill; (d) **physical / human-bound skills** — trades, healthcare, hands-on creative, hospitality. The kids who get a generic information-economy education in 2026 will graduate into 2030–2034, exactly when the 30%-threshold-in-mainstream-roles is biting hardest. Curricula should bias toward depth-in-domain + AI-leveraged output, not breadth in legacy white-collar skills.

**Travel (tertiary).** If knowledge-work agents handle 30% of tasks autonomously, the marginal value of a knowledge worker being physically co-located with their team falls — they're managing agents that operate 24/7 from anywhere, and the agent doesn't care which time zone the human is in. This *entrenches* remote work and weakens RTO mandates economically — though RTO is now driven by real-estate sunk cost and managerial preference, not productivity. The labor-market consequence: location decisions for skilled remote workers become tax-arbitrage decisions (Israel → Portugal, US → Mexico City, NYC → Miami). For Tel Aviv specifically, the Israeli tech sector becomes *more* attractive to globally-distributed talent because (a) remote-work normalization, (b) Israeli engineers were already accustomed to working with US clients at distance, (c) lower COL than US. The 2nd-order effect on travel: business travel for routine knowledge-work coordination drops further; but high-stakes deal-making, conferences, and leadership offsites get *more* valuable because they're the parts of work that agents can't do.

## Decision implications for Tamir

**At P10 (2027)**: the gate passes in customer support and at least one other role (likely SWE at certain enterprises). For Tamir specifically: the trading bot / Fiverr seller / arc-scout / smart-home portfolio is *already* the right bet. These are AI-leveraged products that benefit from cheaper and better agents and don't require Tamir's labor to scale. The implied move: lean harder into "product founder using agents as the engineering team"; the multiplier from Cursor / Claude Code / Devin-style tooling means a solo founder can now run what was a 5-person company in 2024. The kids (6–10 in 2026) are 7–11 by 2027, so their education choices haven't crystallized yet, and the early signal should bias toward (a) keeping them deeply curious and adaptable, (b) NOT pushing them down the "become a paralegal / junior accountant / content writer" path, (c) emphasizing hands-on skills (instrument, sport, language, building things) that compound regardless of which knowledge-work roles automate. Wife's career: if she's in a knowledge-work role, the question is whether she's in the bottom-50% (automation-exposed) or top-30% (agent-orchestrator) tier of her field, and whether she can shift within the next 24 months.

**At P50 (2029)**: this is the planning scenario. The kids are 9–13. Plan as if by 2030 most entry-level white-collar paths are 50% smaller than today. Their best career bets at age 18 will look like: (a) trades that pay $80K+ and aren't automatable for a decade after (electrician, plumber, healthcare tech), (b) creative / human-bound work (therapy, hospitality, performance), (c) **agent-orchestrator roles in domains they deeply know**, which requires building domain depth as teenagers, and that means picking *something specific* and going deep on it. Tamir's own career: by 2029, the AI-product founder leverage is enormous but so is the competition (every solo founder has the same agents). The differentiator will be (a) taste / product judgment, (b) distribution, (c) a moat that isn't "we have engineers." Savings/investments: under-allocate to traditional white-collar-employer-dependent assets (US tech equities are mostly fine, but watch the labor-share narrative); over-allocate to compute infrastructure (NVDA, AMD, hyperscaler picks), the agent/AI-tools layer (Anthropic-adjacent investments if accessible), and real assets in cities that capture knowledge-work-remote-emigration flows (Israel, Portugal, parts of Mexico, Miami). Skills Tamir should invest in: agent orchestration depth (you're already doing this), system design judgment (since you're now responsible for what 5 engineers used to handle), and one durable physical-world domain: pick one of healthcare, real estate, hospitality, or hardware where AI doesn't yet fully bite.

**At P90 (2034)**: a deep stall in capability or a hard regulatory clamp. In this world, the kids' education choices look more like 2010–2020 — knowledge work is still safe-ish, college matters, white-collar entry roles exist. This is the *safer* world to plan for in some respects but the *less likely* one given the current trajectory. The conservative hedge here is: keep enough optionality that if 2034 looks like "extended business as usual," the kids and you haven't catastrophically over-rotated. Concretely: don't pull kids out of school into trades at 14; do keep them in mainstream education while supplementing with the deep-domain and human-bound skills.

The most useful single move from this analysis: **act as if P50 = 2029 in your decisions about your own products and career**, but **don't lock the kids' education into the P10 scenario**; leave them options. The cost of being wrong is bigger if you over-rotate them toward the trades and AI capability stalls than if you keep them in mainstream education and AI capability accelerates (in the latter case, agent orchestration and judgment skills are learnable in a 12-month bootcamp at 22 and don't need a decade of teenage prep).

## Sources

1. [METR, *Measuring AI Ability to Complete Long Tasks*](https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/) — original 2025 paper introducing the 50%-time-horizon metric and 7-month doubling rate; Claude 3.7 Sonnet at ~50 min. Accessed 2026-05-13.
2. [METR, *Time Horizon 1.1*](https://metr.org/blog/2026-1-29-time-horizon-1-1/) — January 2026 update: Claude Opus 4.5 at 320 min, GPT-5 at 214 min, doubling since 2024 of 89 days. Accessed 2026-05-13.
3. [LLM-Stats SWE-Bench Verified leaderboard](https://llm-stats.com/benchmarks/swe-bench-verified) — GPT-5.5 88.7%, Claude Opus 4.7 87.6% as of April 2026; OpenAI's contamination warning. Accessed 2026-05-13.
4. [Scale AI SWE-Bench Pro public leaderboard](https://labs.scale.com/leaderboard/swe_bench_pro_public) — contamination-resistant variant; GPT-5.4 xHigh leads at 59.10%, Claude Opus 4.6 thinking at 51.9%. Accessed 2026-05-13.
5. [Coasty Blog OSWorld benchmark results 2026](https://coasty.ai/blog/osworld-benchmark-results-2026-computer-use-ranked) — GPT-5.5 78.7%, Claude Opus 4.6 72.7%, Claude Sonnet 4.6 72.5%, human baseline ~72%. Accessed 2026-05-13.
6. [TechBuzz, Cursor Hits $2B ARR](https://www.techbuzz.ai/articles/cursor-hits-2b-arr-doubles-revenue-in-just-3-months) — $100M Jan 2025 → $2B Feb 2026; 70% of Fortune 1000 customers; 1M+ DAU. Accessed 2026-05-13.
7. [Cognition, *Devin's 2025 Performance Review*](https://cognition.ai/blog/devin-annual-performance-review-2025) — 67% PR merge rate (up from 34%); deployment at Goldman, Santander, Dell, Cisco; Infosys partnership for global deployment. Accessed 2026-05-13.
8. [SiliconANGLE, Cognition $25B valuation talks](https://siliconangle.com/2026/04/23/cognition-creator-ai-software-engineer-devin-talks-raise-hundreds-millions-25b-valuation/) — ARR $1M Sep 2024 → $73M Jun 2025; product expanded to enterprise IDE + code review. Accessed 2026-05-13.
9. [VentureBeat, Anthropic $30B revenue run-rate](https://venturebeat.com/technology/anthropic-says-it-hit-a-30-billion-revenue-run-rate-after-crazy-80x-growth) — Claude Code $2.5B run-rate, >1,000 customers spending >$1M annually, 80x growth. Accessed 2026-05-13.
10. [Klarna press release, *AI assistant handles two-thirds of customer service chats in its first month*](https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/) — 67% autonomous resolution, 2.3M chats in month one, $40M/yr saved, with later partial walkback to human-hybrid. Accessed 2026-05-13.
11. [Kwa et al., arXiv:2503.14499, *Measuring AI Ability to Complete Long Software Tasks*](https://arxiv.org/abs/2503.14499) — methodology, 7-month doubling, 5-year extrapolation to month-long tasks. Accessed 2026-05-13.
12. [HAL Princeton GAIA leaderboard](https://hal.cs.princeton.edu/gaia) — Claude Sonnet 4.5 at 74.6% (scaffolded); human baseline ~92%; Anthropic sweeps top 6. Accessed 2026-05-13.
13. [Sierra, τ²-Bench leaderboard via Artificial Analysis](https://artificialanalysis.ai/evaluations/tau2-bench) — policy-adherent customer-service eval; frontier models cluster 50–65% as of April 2026. Accessed 2026-05-13.
14. [Anthropic, *Project Vend Phase 2*](https://www.anthropic.com/research/project-vend-2) — Claude running an actual vending shop; Phase 1 failed economically, Phase 2 improved with multi-agent architecture but "gap between capable and completely robust remains wide." Accessed 2026-05-13.
15. [Sacra, Cursor / Replit / Lovable revenue tracking](https://sacra.com/c/cursor/) — Replit $150M ARR Sep 2025 → $1B target end-2026; Lovable $200M ARR Nov 2025 from $100M in 8 months. Accessed 2026-05-13.
16. [Anthropic Economic Index, March 2026 report](https://www.anthropic.com/research/economic-index-march-2026-report) — automation dominant in 1P API traffic; +14% computer/math tasks over 6 months; customer service highest exposure; new "automated trading" and "sales outreach" categories 2x+. Accessed 2026-05-13.
17. [McKinsey Global Institute, *Generative AI and the Future of Work in America*](https://www.mckinsey.com/mgi/our-research/generative-ai-and-the-future-of-work-in-america) — up to 30% of US hours automatable by 2030 with genAI; STEM jumps from 14% to 30%; 12M job switches. Accessed 2026-05-13.
18. [r/cscareerquestions top posts, May 2026](https://www.reddit.com/r/cscareerquestions/comments/1tb026z/4_engineers_now_doing_the_job_of_12_at_my_friends/) — representative post "4 engineers doing the work of 12"; CS enrollment drop; Microsoft/Cisco/Meta layoff wave framed as AI efficiency. Accessed 2026-05-13.
19. [r/LocalLLaMA, *Kimi K2.6 is a legit Opus 4.7 replacement*](https://www.reddit.com/r/LocalLLaMA/comments/1sr8p49/kimi_k26_is_a_legit_opus_47_replacement/) — open-weight catching up on ~85% of frontier tasks; commoditization of agent stack accelerating. Accessed 2026-05-13.
20. [r/Lawyertalk, *Clients wore meta camera glasses to our consult then had AI analyze it*](https://www.reddit.com/r/Lawyertalk/comments/1t6ssgo/clients_wore_meta_camera_glasses_to_our_consult/) — AI analysis is being priced into client expectations for legal services. Accessed 2026-05-13.
21. [LawSites, *California Bar Proposes Rule Requiring Lawyers to Verify Every AI Output*](https://www.lawnext.com/2026/05/california-bar-proposes-rule-requiring-lawyers-to-verify-every-ai-output-and-five-other-ai-focused-ethics-changes.html) — May 2026 ethics rule explicitly addressing agentic AI; verification mandatory. Accessed 2026-05-13.
22. [r/singularity top posts, April–May 2026](https://www.reddit.com/r/singularity/comments/1t14fpg/sam_altman_no_longer_believes_in_universal_basic/) — Altman walks back UBI, half-marathon broken by robot, Figure AI 24x production scale; bullish-sentiment baseline. Accessed 2026-05-13.
23. [Metaculus, *AI as a Competent Programmer Before 2030*](https://www.metaculus.com/questions/11188/ai-as-a-competent-programmer-before-2030/) — community forecast implies a roughly 70–80% probability of resolution by 2030; closely related to, and upstream of, this gate. Accessed 2026-05-13.
24. [Metaculus Labor Automation Forecasting Hub](https://www.metaculus.com/labor-hub/) — collection of related forecasting questions on hours automated, jobs displaced; modal community estimate aligns with McKinsey 30%-by-2030. Accessed 2026-05-13.
25. [Manifold, *Will AI cause the US Unemployment Rate to exceed 10% before 2030?*](https://manifold.markets/ahalekelly/will-ai-cause-the-us-unemployment-r) — current price 15–20% probability; reflects market view that task automation ≠ mass unemployment. Accessed 2026-05-13.
26. [Trilateral Research, *EU AI Act Compliance Timeline 2025–2027*](https://trilateralresearch.com/responsible-ai/eu-ai-act-implementation-timeline-mapping-your-models-to-the-new-risk-tiers) — high-risk deadline pushed to Dec 2, 2027; agentic AI logging requirements. Accessed 2026-05-13.
27. [LawSites, *ABA Task Force: AI Has Moved From Experiment to Infrastructure*](https://www.lawnext.com/2025/12/aba-task-force-ai-has-moved-from-experiment-to-infrastructure-for-the-legal-profession.html) — late-2025 ABA report on legal AI institutional status. Accessed 2026-05-13.