Writing Humanity’s Last Exam

When every frontier AI model can pass your tests, how do you figure out which model is best? You write a harder test.

That was the idea behind Humanity’s Last Exam, an effort by Scale AI and the Center for AI Safety to develop a large database of PhD-level questions that the best AI models still get wrong.

The effort has proven popular: the paper summarizing it has already been cited 91 times since its release on March 31st, and the main AI labs have been testing their new models on the exam. xAI announced today that its new Grok 4 model has the highest score yet on the exam, 44.4%.

Current leaderboard on the Humanity’s Last Exam site, not yet showing Grok 4

The process of creating the dataset is a fascinating example of a distributed academic mega-project, a model that is becoming a trend and has also been important in efforts to replicate previous research. The organizers of Humanity’s Last Exam let anyone submit a question for their dataset, offering co-authorship to anyone whose question they accepted, and cash prizes to those who had the best questions accepted. In the end they wound up with just over 1,000 coauthors on the paper (including yours truly as one very minor contributor), and gave out $500,000 to contributors of the very best questions (not me), which seemed incredibly generous until Scale AI sold a 49% stake in their company to Meta for $14.8 billion in June.

Source: Figure 4 of the paper

Here’s what I learned in the process of trying to stump the AIs and get questions accepted into this dataset:

  1. The AIs were harder to stump than I expected, because the exam tested frontier models rather than the free-tier models I was used to using on my own. If you think AI can’t answer your question, try a newer model.
  2. It was common for me to try a question that several models would get wrong, but at least one would still get right. For me this was annoying, because questions could only be accepted if every model got them wrong. But of course if you want to get a correct answer, this means trying more models is good, even if they are all in the same tier. If you can’t tell what a correct answer looks like and your question is important, make sure to try several models and see if they give different answers.
  3. Top models are now quite good at interpreting regression results, even when you try to give them unusually tricky tables.
  4. AI still has weird weaknesses and blind spots; it can outperform PhDs in the relevant field on one question, then do worse than 3rd graders on the next. This exam specifically wanted PhD-level questions, where a typical undergrad not only couldn’t answer the question, but probably couldn’t even understand what was being asked. But it specifically excluded “simple trick questions”, “straightforward calculation/computation questions”, and questions “easily answerable by everyday people”, even if all the AIs got them wrong. My son had the idea to ask them to calculate hyperfactorials (see the sketch after this list); we found some relatively low numbers that stumped all the AI models, but the human judges ruled that our question was too simple to count. On a question I did get accepted, I included an explanation for the human judges of why I thought it wasn’t too simple.
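For anyone who wants to play along at home: the hyperfactorial of n is H(n) = 1^1 × 2^2 × ⋯ × n^n, which grows explosively. Here’s a minimal Python sketch of the computation (an illustration of the concept, not our actual submitted question):

```python
# Hyperfactorial: H(n) = 1^1 * 2^2 * ... * n^n
def hyperfactorial(n: int) -> int:
    result = 1
    for k in range(1, n + 1):
        result *= k ** k
    return result

print(hyperfactorial(4))   # 1 * 4 * 27 * 256 = 27648
print(hyperfactorial(10))  # already a 45-digit integer
```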

I found this to be a great opportunity to observe the strengths and weaknesses of frontier models, and to get my name on an important paper. While the AI field is being driven primarily by the people with the chops to code frontier models, economists still have a lot we can contribute here, as Joy has shown. Any economist looking for the next way to contribute should check out Anthropic’s new Economic Futures Program.

The Ugly Gray Rhino Gathers Speed

A black swan is a crisis that comes out of nowhere. A gray rhino, by contrast, is a problem we have known about for a long time but can’t or won’t stop, and that will at some point crash into a full-blown crisis.

The US national debt is a classic gray rhino. The problem has slowly been getting worse for 25 years, but the crisis still seems far enough off that almost no one wants to incur real costs today to solve the problem. During the 2007-2009 financial crisis and the 2020-2021 Covid pandemic we had good reasons to run deficits. But we’ve ignored the Keynesian prescription of offsetting the deficits incurred in bad times with surpluses in good times.

We are currently in reasonably good economic times, but about to pass a mega-spending bill that blows up the deficit from its already-too-high level. At a time when we should be running a surplus, we are instead running a deficit of around 6% of GDP:

Source: Congressional Budget Office

Our ‘primary deficit’, which excludes interest payments on the debt, is a more manageable 3% of GDP. But if interest rates go higher, either for structural reasons or because of a loss of confidence in the US government’s willingness to pay its debts, the total deficit could spiral higher rapidly. The CBO optimistically assumes that the interest rate on 10-year Treasuries will fall below 4% in the 2030s, from 4.3% today:

Source: Congressional Budget Office

But their scoring of H.R. 1 (“One Big Beautiful Bill Act”) shows it adding $3 trillion to the debt over the next 10 years, increasing the deficit by ~1% of GDP per year.
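The arithmetic: $3 trillion over 10 years is roughly $300 billion per year, or about 1% of a roughly $29 trillion GDP. And because unpaid interest compounds, where the debt ends up is sensitive to the interest rate. Here’s a minimal Python sketch of that mechanic, using round numbers of my own rather than CBO projections:

```python
# Illustrative debt dynamics (my round numbers, not CBO projections):
# total deficit = primary deficit + interest on existing debt.
debt = 29.0        # debt held by the public, $ trillions (assumed)
gdp = 29.0         # nominal GDP, $ trillions (assumed)
primary = 0.03     # primary deficit as a share of GDP
growth = 0.04      # nominal GDP growth rate (assumed)

for rate in (0.03, 0.05):          # average interest rate on the debt
    d, g = debt, gdp
    for _ in range(10):
        d += primary * g + rate * d   # the total deficit adds to the debt
        g *= 1 + growth
    print(f"rate {rate:.0%}: debt/GDP after 10 years = {d / g:.0%}")
```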

I already suspected this gray rhino would eventually cause a crisis, but this bill and the milieu that produced it turn that into a near guarantee: nothing stops the deficit train until we hit a full-blown crisis. That crisis is no longer just a long-term issue for your kids and grandkids to worry about; you will see it in 7 years or so. Unfortunately, that is still far enough away that current politicians have no incentive to take costly steps to avoid it. In fact, deficits will probably make the economy stronger for a year or two before they start making things worse, which is convenient for all the Congresspeople up for election in less than 2 years.

Here are the ways I see this playing out, from most to least likely:

  1. By around 2032, either the slowly aging population or a sudden spike in interest rates forces the government to touch at least one of the third rails of American politics: cutting Social Security, cutting Medicare, or substantially raising taxes on the middle class (explicitly or through inflation).
  2. We get bailed out again by God’s Special Providence for fools, drunks, and the United States of America. AI brings productivity miracles bigger than those of computers and the internet, letting GDP grow faster than our debts.
  3. We default on the national debt (but this is a risky option because we will still want to run big deficits, and lenders will only lend if they expect to get paid back).
  4. We do all the smart policy reforms that economists recommend in time to head off the crisis and stop the rhino. Medical spending falls without important services being cut thanks to supply-side reforms or cheap miracle drugs (GLP-1s going off patent?).

I’m hoping of course for numbers 2 and 4, but after this bill I’m expecting the rhino.

Excluding “Non-Excludable” Goods

Intro microeconomics classes teach that some goods are “non-excludable”, meaning that people who don’t pay for them can’t be stopped from using them. This can lead to a “tragedy of the commons”, where the good gets overused because people don’t personally bear the cost of using it and don’t care about the costs they impose on others. Overgrazing land and overfishing the seas are classic examples.

Source: Microeconomics, by Michael Parkin
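To see the mechanism concretely, consider a toy fishery where each boat keeps the full value of its catch but bears almost none of the cost of a depleted stock. This little simulation is my own illustration, not something from the textbook:

```python
# Toy common-pool fishery: under open access the stock collapses;
# under an enforced (i.e. excludable) quota, yield is sustained.
def simulate(years, boats, catch_per_boat, quota=None):
    stock = 100.0                       # initial fish stock, arbitrary units
    total_catch = 0.0
    for _ in range(years):
        desired = boats * catch_per_boat
        if quota is not None:
            desired = min(desired, quota)   # an enforced limit on total catch
        catch = min(desired, stock)
        stock -= catch
        stock += 0.3 * stock * (1 - stock / 100.0)   # logistic regrowth
        total_catch += catch
    return round(stock, 1), round(total_catch, 1)

print(simulate(30, boats=10, catch_per_boat=2.0))           # open access: collapse
print(simulate(30, boats=10, catch_per_boat=2.0, quota=7))  # quota: sustained yield
```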

Students sometimes get the impression that “excludability” is an inherent property of a good. But in fact, which goods are excludable is a function of laws, customs, and technologies, and these can change over time. Land might be legally non-excludable (and so over-grazed) when it is held in common, but become excludable when the land is privatized or when barbed wire makes enclosing it cheap. Over time, such changes have turned over-grazing into a relatively minor issue.

Overfishing remains a major problem, but this could be starting to change. Legal and technological changes have allowed for enclosed, private aquaculture on some coasts, which now provides a large and growing share of all fish eaten by humans. Permitting systems put limits on catches in many countries’ waters, though the high seas remain a true tragedy of the commons for now.

While countries have tried to enforce limits on catches in their national waters, monitoring how many fish every boat is taking has been challenging, so illegal overfishing has remained widespread. But technology is in the process of changing this. For instance, ThayerMahan is developing hydrophone arrays that use sound to track boats:

Technologies like hydrophones and satellites, if used well, will increasingly make public waters more “excludable” and reduce “tragedy of the commons” overfishing.

LIFE Survey Comes Alive

Last year I posted that the Philly Fed had started a new quarterly survey on Labor, Income, Finances, and Expectations (LIFE). I thought it looked promising but had yet to achieve its potential:

It will be interesting to see if this ends up taking a place in the set of Fed surveys that are always driving economic discussions, like the Survey of Consumer Finances and the Survey of Professional Forecasters. If they keep it up and start putting out some graphics to summarize it, I think it will. My quick impression (not yet having spoken to Fed people about it) is that it will be the “quick hit” version of the Survey of Consumer Finances. It asks a smaller set of questions on somewhat similar topics, but is released quickly after each quarter instead of slowly after each year. If they stick with the survey it will get more useful over time, as there is more of a baseline to compare to.

But a year later the survey now has what I hoped for: a solid baseline for comparisons, and pre-made graphics to summarize the results. It continues to show complex and mixed economic performance in the US. People think the economy is getting worse:

They are cutting discretionary (but not necessity) spending at record levels:

They are worried about losing their jobs at record levels:

But key areas like housing, childcare, and transportation are stabilizing:

Overall I think we can synthesize these seemingly contradictory pictures by saying that Americans’ finances are fine now, but they are quite worried that things are about to get worse, perhaps due to the tariffs taking effect. You can find the rest of the LIFE survey results (including all the non-record-setting ones) here.

The End of Easy Student Loans

The Senate Health, Education, Labor and Pensions Committee is proposing to cut off student loans for programs whose graduates earn less than the median high school graduate. The House proposed a risk-sharing model where colleges would partly pay back the federal government when their students fail to pay back loans themselves. Both the House and Senate propose to cap how much students can borrow in graduate loans. Both would reduce federal spending on higher ed by about $30-$35 billion per year, cutting the size of the $700 billion higher ed sector by 4-5%. I expected that something like this would happen eventually, especially after the student loan forgiveness proposals of 2022:

While we aren’t getting real reform now, I do think forgiveness makes it more likely that we’ll see reform in the next few years. What could that look like?

The Department of Education should raise its standards and stop offering loans to programs with high default rates or bad student outcomes. This should include not just fly-by-night colleges, but sketchy masters degree programs at prestigious schools.

Colleges should also share responsibility when they consistently saddle students with debt but don’t actually improve students’ prospects enough to be able to pay it back. Economists have put a lot of thought into how to do this in a manner that doesn’t penalize colleges simply for trying to teach less-prepared students.

I’d bet that some reform along these lines happens in the 2020s, just like the bank bailouts of 2008 led to the Dodd-Frank reform of 2010 to try to prevent future bailouts. The big question is: will this be a pragmatic bipartisan reform to curb the worst offenders, or a Republican effort to substantially reduce the amount of money flowing to a higher ed sector they increasingly dislike?

Of course, there is a lot riding on the details. How exactly do you calculate the income of graduates of a program compared to high school grads? The Senate proposal explains its approach starting on page 58. It would compare the median income of a program’s former students who are working, 4 years after leaving the program (whether they graduated or dropped out, but exempting those in grad school), to the median income of those with only a high school diploma who are age 25-34, working, and not in school.

Nationally I calculate that this would make for a floor of $31,000. That is, the median student who is 4 years out from your program and is working should be earning at least $31k. In practice the bill would implement a different number for each state. This seems like a low bar in general, though you could certainly quibble with it. For instance, those 4 years out from a program may be closer to age 25 than age 34, but income typically rises with age during those years. If you compare them to 26-year-old high school grads, the national bar would be just $28k.
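If you want to replicate the calculation, here’s the shape of it in pandas. The file and column names are hypothetical stand-ins for a Census-style microdata extract; the filters follow the Senate proposal’s definition as I read it:

```python
import pandas as pd

# Hypothetical ACS-style microdata extract; the file and column names
# are my stand-ins, not anything specified in the bill.
df = pd.read_csv("acs_extract.csv")

hs_only = df[
    (df["educ"] == "high school diploma")   # no college
    & df["age"].between(25, 34)
    & (df["employed"] == 1)
    & (df["in_school"] == 0)
]
print(f"Earnings floor: ${hs_only['income'].median():,.0f}")  # ~$31k nationally

# The narrower comparison mentioned above: 26-year-old HS grads (~$28k)
print(hs_only.loc[hs_only["age"] == 26, "income"].median())
```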

What sorts of programs have graduates making less than $31k per year?

Continue reading

The Average Teaching Load of US Professors

“One of the closest guarded secrets in American higher education is the average teaching loads of faculty.” -Richard Vedder

I saw this quote in a recent piece arguing that US professors should teach more. I thought it sounded extreme, but as I looked into it, it turns out to be surprisingly difficult to find data on this compared to other things like salaries:

Since 1996, for instance, the University of Delaware has administered the annual National Study of Instructional Costs and Productivity, surveying faculty and teaching assistants about course loads and enrollment. The data, though, are “only available to four-year, non-profit institutions of higher education.”[7] This secrecy, needless to say, is not the norm for surveys collected by publicly supported institutions. Tellingly, this study is being discontinued because the number of participating institutions “has slowly declined to unsustainable levels.”[8] *

There are some decent older studies that are public, like this 2005 survey of top liberal arts colleges showing that almost all have teaching loads between 4 and 6 courses per year. But in terms of recent data that is publicly available, the best I’ve found is the Faculty Survey of Student Engagement. It still isn’t great, since their 2024 survey covers only 54 of the 2,000+ bachelor’s-degree-granting colleges in the US, and their tables show that these 54 aren’t especially representative. They make nice graphics though:

The graphics show exact percentages if you hover over them on the original Tableau site. Doing this shows that the median professor teaches 4 undergraduate courses per year. Knowing the full distribution would require the underlying data, which they don’t share, but from these graphics we can at least compute a rough average (rounding 4+ graduate courses to 4 and 9+ undergraduate courses to 9).

This shows that the average professor teaches 4.43 undergraduate courses and 0.75 graduate courses, for a total of 5.18 courses per academic year. If I restrict the data to full-time tenured or tenure-track professors, they teach an average of 4.72 undergraduate courses and 0.91 graduate courses, for a total of 5.63 courses per academic year.
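The calculation is just a weighted average over the histogram buckets. Here’s a Python sketch; the bucket shares below are placeholders rather than the actual FSSE percentages, which you’d have to read off their Tableau graphics:

```python
# Rough average course load as a weighted average over histogram buckets,
# rounding "9+" undergrad courses to 9 and "4+" grad courses to 4.
# The shares below are placeholders, not the actual FSSE percentages.
undergrad = {0: 0.05, 1: 0.06, 2: 0.10, 3: 0.12, 4: 0.18,
             5: 0.15, 6: 0.14, 7: 0.08, 8: 0.06, 9: 0.06}   # 9 = "9+"
grad = {0: 0.55, 1: 0.20, 2: 0.12, 3: 0.08, 4: 0.05}        # 4 = "4+"

avg_ug = sum(courses * share for courses, share in undergrad.items())
avg_gr = sum(courses * share for courses, share in grad.items())
print(f"{avg_ug:.2f} undergrad + {avg_gr:.2f} grad = {avg_ug + avg_gr:.2f}/year")
```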

Overall these loads are higher than I expected, especially since the survey sample is skewed towards research schools. But it’s still lower than the standard 3-3 load at my own institution, and low enough that it makes for a great job, especially compared to teaching K-12.

Overall, though, I don’t know why we need to rely on one-off surveys to get data on teaching loads; this seems like data the US Department of Education should collect from all accredited schools and share publicly.

*The Delaware Cost Study is not just discontinuing new surveys; they plan to pull down existing data by December 15th, 2025. Only schools that participate in their survey get access, so I can’t get the data, but perhaps some of you can.

Queens 2060: Where Upzoning Matters Most

Most US cities make it hard for housing supply to meet demand because of rules that prevent large apartment buildings. Usually cities do this with zoning rules that limit the number of homes per parcel, often to as low as 1. New York City relies more on rules about Floor Area Ratio (the ratio of a building’s floor area to the area of its parcel). But how binding are these rules? If we relaxed or repealed them, how much new construction would we see, and where would we see it?
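To make the FAR constraint concrete, here’s a tiny worked example; the lot size and FAR values are illustrative, not actual NYC zoning parameters:

```python
# Floor Area Ratio: max buildable floor area = FAR * lot area.
# Values are illustrative, not actual NYC zoning parameters.
lot_area = 2500                      # square feet
for far in (1.0, 2.0, 10.0):
    # At full lot coverage, FAR is roughly the number of stories allowed.
    print(f"FAR {far:>4}: up to {far * lot_area:,.0f} sq ft of floor space")
```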

MIT PhD student Vincent Rollet has calculated this for New York City:

I build a dynamic general equilibrium model of the supply and demand of floorspace in a city, which I estimate using a novel parcel-level panel dataset of land use and zoning in New York City. I validate the model using quasi-experimental variation from recent zoning reforms and use it to simulate the effects of zoning changes on construction and prices.

He finds that eliminating these rules in NYC would lead to a construction boom, with a 79% increase in the amount of floor space available by 2060. This would allow many more people to live in New York, with a 52% increase in population; but many of the benefits would go to existing NYC residents, with more floor space per person and modestly lower rents leading to higher wellbeing:

Where exactly would we see the building boom? Not Manhattan, but Brooklyn and Queens. The intuition is that zoning is most binding in places where housing prices are currently high but where the buildings are currently small; this is where there is the biggest incentive to tear down existing buildings and build taller if you are allowed to.

Corporate Debt by Industry Sector

A reporter recently told me she thought there is a national trend toward hospitals issuing more bonds. I tried to verify this and found it surprisingly hard to do with publicly available data. But once I had spent an hour digging through private Compustat data to find the answer, I figured I should share some results. Here’s the average debt, in millions of dollars, of companies in each sector:

Source: My graph made from Compustat North American Fundamentals Annual data collapsed by Standard Industrial Classification code into the Fama-French 10 sectors

This shows that health care is actually the least-indebted sector, and telecommunications the most indebted, followed by utilities and “other” (a broad category that actually covers most firms in the Fama-French 10). But are health care firms really more conservative about debt, or are they just smaller? Let’s scale the debt by showing it as a share of revenue:

Source: My graph made from Compustat North American Fundamentals Annual data collapsed by SIC code into the Fama-French 10 sectors (dltt/revt)

It appears that health care firms are the most indebted relative to revenue since 2023. But which parts of health care are driving this?

Hospitals in 2023, followed by specialty outpatient in 2024. However, seeing how much the numbers bounce around from year to year, I suspect they are driven by small numbers of outlier firms. This could be because Compustat North America data covers only publicly traded firms, while many sectors of health care are dominated by private corporations or non-profits.
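For anyone who wants to reproduce this, here’s the shape of the calculation in pandas. The file name is a hypothetical stand-in for a Compustat extract, and the SIC-to-sector mapping is stylized and partial; the real Fama-French 10 definitions are on Ken French’s data library site:

```python
import pandas as pd

# Hypothetical Compustat extract with columns gvkey, fyear, sic,
# dltt (long-term debt), and revt (revenue).
comp = pd.read_csv("compustat_annual.csv")

def sic_to_ff10(sic):
    # Stylized, partial mapping; see Ken French's site for the real one.
    if 8000 <= sic <= 8099:
        return "Hlth"    # health services (one slice of the Hlth sector)
    if 4800 <= sic <= 4899:
        return "Telcm"   # telecommunications
    return "Other"       # remaining sectors omitted here for brevity

comp["sector"] = comp["sic"].apply(sic_to_ff10)

# Average debt by sector-year (first chart), then debt scaled by revenue:
avg_debt = comp.groupby(["sector", "fyear"])["dltt"].mean()
comp["debt_to_rev"] = comp["dltt"] / comp["revt"]
avg_ratio = comp.groupby(["sector", "fyear"])["debt_to_rev"].mean()
print(avg_ratio.unstack("fyear").round(2))
```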

I welcome suggestions for datasets on the bond-market side of things that are able to do industry splits including private companies, or suggestions for other breakdowns you’d like to see me do with Compustat.

The Most Regulated States

The Mercatus Center has put together a page of “Snapshots of State Regulation” using data from their State RegData project. Their latest data suggests that population is still a big predictor of state-level regulation, on top of the red/blue dynamics people expect:

They also made pages with much more detail on each state, like what the most regulated industries in each state are and how each one compares to the national average:

You can find your state here.

“How Can the US Manufacture More” Is a Reasonable Question That Deserves Reasonable Answers

Many regular Americans and policymakers say they want the US to manufacture more things domestically. But when they ask economists how to accomplish this, I find that our most common response is to question their premise: to say the US already manufactures plenty, or that there is nothing special about manufacturing. It’s easy for people to round off this answer to ‘your question is dumb and you are dumb’, then go ask someone else who will give them a real answer, even if that real answer is wrong.

Economists tell our students in intro classes that we focus on positive economics, not normative: we won’t tell you what your goals should be, just how best to accomplish them. But then we seem to forget all that when it comes to manufacturing. Normally we would take even unreasonable questions seriously; and I think wondering how to increase manufacturing output is reasonable given the national defense externalities.

So if you had to increase the value of total US manufacturing output- if you were going to be paid a fraction of real US manufacturing output 10 years from now- how would you do it?

I haven’t made a deep study of this, but here are my thoughts. Better ideas at the top, ‘costly but would increase manufacturing output’ ideas at the bottom:

Continue reading