GDP Predictions: Pretty Good!

Last week I wrote about the GDP predictions from Kalshi and the GDPNow model. Both were showing 2.4% for Q2 of 2025 last week, and both had edged up by yesterday, to 2.8% and 2.9% respectively. The final result (technically, the “advance” estimate, but the final one for purposes of this comparison) was 2.97%. The Atlanta Fed GDPNow model continues to be a top performer, and you can’t do much better than averaging these two estimates. You can also pretty consistently do better than the median result from the WSJ/Dow Jones survey of economists.
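
To make the comparison concrete, here’s a minimal sketch, using only the numbers quoted above, of how each forecast and their simple average stack up against the advance estimate:

```python
# Comparing each forecast, and their simple average, to the advance
# GDP estimate, using the numbers quoted in this post.
gdpnow = 2.8   # Atlanta Fed GDPNow, day before release
kalshi = 2.9   # Kalshi market-implied forecast
actual = 2.97  # BEA advance estimate, Q2 2025

average = (gdpnow + kalshi) / 2  # 2.85

for name, forecast in [("GDPNow", gdpnow), ("Kalshi", kalshi), ("Average", average)]:
    print(f"{name}: off by {abs(actual - forecast):.2f} percentage points")
```

Note the average isn’t guaranteed to win any single quarter (Kalshi alone was closer this time); the claim is about performance across many quarters.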

Here’s the updated table:

And here is the original post explaining the data.

Is A Music Major Worth It?

Our new paper concludes that the answer is a resounding “It Depends”.

It depends on your answer to the following questions:

  1. If you didn’t major in music, would you major in something else, or not finish college?
  2. How dead set are you on a career in music?
Source: Figure 1 of Bailey and Smith (2025)

We found that

  1. Music majors earn more than people who didn’t graduate from college, even if they don’t end up working as musicians
  2. Among musicians, music majors earn more than other majors
  3. But among non-musicians, other majors earn much more than music majors

So on average, a music major means higher income if you would have been a musician anyway, or if you wouldn’t have gone to college for another major, but lower income than if you had majored in something else and worked outside of music. The exact amounts depend on what you control for; this gets complex, but this table gives the basic averages before controls:

Source: Table 2 of Bailey and Smith (2025), showing wage plus business income for respondents to the 2018-2022 American Community Survey

For better or worse, a music major also means you are much more likely to be a musician: 113 times more likely, in fact (this is just the correlation; we’re not randomizing people into the major). Despite that incredible correlation, only 9.8% of music majors report being professional musicians, and only 22.3% of working musicians were music majors.
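
Those three numbers fit together via Bayes’ rule. As a back-of-the-envelope consistency check (the music-major share of workers below is implied by the other figures, not a number from the paper), a quick sketch:

```python
# Consistency check on the three figures in the paragraph above.
# p is the (unknown) share of workers who majored in music; we solve
# for the p implied by the other numbers.
p_mus_given_major = 0.098   # 9.8% of music majors work as musicians
p_major_given_mus = 0.223   # 22.3% of musicians were music majors
relative_risk = 113         # music majors 113x more likely to be musicians

# Bayes: P(musician)/P(major) = P(musician|major)/P(major|musician)
musician_per_major = p_mus_given_major / p_major_given_mus  # ~0.44

# P(musician & non-major) = (musician_per_major - p_mus_given_major) * p,
# so RR = p_mus_given_major / [that / (1 - p)]. Solving RR = 113 for p:
odds = relative_risk * (musician_per_major - p_mus_given_major) / p_mus_given_major
p = 1 / (1 + odds)
print(f"Implied music-major share of workers: {p:.2%}")  # roughly a quarter percent
```

A music-major share of workers of roughly 0.25% is plausible, so the three quoted figures hang together.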

Sean Smith had the idea for this paper and wrote the first draft in my Economics Senior Capstone class in 2024. After he graduated I joined the paper as a coauthor to get it ready for journals, and it was accepted at SN Social Sciences last week. We share the data and code for the paper here.

Continue reading

Second Quarter GDP Predictions

Back in April I wrote about 4 different estimates of GDP growth and how well they have performed since 2023. With the 2nd quarter of 2025 GDP data coming out next week, what do the best performing predictors currently say?

In that last post, I showed that the Atlanta Fed GDPNow model and the Kalshi betting market were generally the best performers. And furthermore, averaging these two improves the predictive power a little more. As of today, the GDPNow model is predicting 2.4% growth and Kalshi is… also predicting 2.4%!

There will be a few more updates to GDPNow over the next week, and of course Kalshi is constantly updating as more people bet. But as of right now, 2.4% growth seems like a reasonable prediction. That may surprise some people, especially given all of the pessimism surrounding tariffs and policy uncertainty generally. But despite all of this, the US economy appears to be just continuing to chug along.

Inflation Is Stuck

Here’s a somewhat niche measure of inflation: 6-month CPI excluding food, shelter, and energy. It might seem like a weird measure, as it excludes over half of the CPI. But there is a logic to at least considering it along with other measures.

Food and energy are both volatile, so they can give us a lot of noise. That’s why “core CPI” and other core measures are followed closely by the Fed and inflation watchers. But excluding shelter might also make sense, because increasing housing prices are largely due to supply constraints and will move independently of monetary policy to some extent. Using the 6-month change also gives a more timely measure than the 12-month change behind the headline number.
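
Mechanically, the measure is just a 6-month percent change in the relevant price index. A minimal sketch, with made-up index values standing in for the actual BLS series:

```python
# 6-month percent change in a price index, and the annualized
# equivalent. The index values here are hypothetical placeholders
# for CPI excluding food, shelter, and energy.
index = {"2024-12": 100.0, "2025-06": 101.0}

six_month = (index["2025-06"] / index["2024-12"] - 1) * 100
annualized = ((index["2025-06"] / index["2024-12"]) ** 2 - 1) * 100

print(f"6-month inflation: {six_month:.2f}%")  # 1.00%
print(f"Annualized:        {annualized:.2f}%")  # 2.01%
```

This is why a 6-month rate around 1% corresponds to roughly 2% annually: compounding two 6-month periods slightly more than doubles the rate.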

As you can see in the chart above, this niche measure of inflation has been stuck for two and a half years. It has oscillated between about 0.5% and 1.5% since December 2022, and right now it’s almost exactly in the middle of that range. It has come down from 6 months ago, but is higher than 1 year ago.

As you can see in the pre-2020 years, it generally oscillated between 0% and 1%. So 6-month inflation is stuck about 0.5 percentage points higher than we had become used to, which translates into roughly 1 percentage point higher annually.

In the grand scheme of things, 1% higher inflation isn’t the end of the world. But we do seem to be stuck at a slightly elevated rate of inflation relative to the decade before 2020.

Writing Humanity’s Last Exam

When every frontier AI model can pass your tests, how do you figure out which model is best? You write a harder test.

That was the idea behind Humanity’s Last Exam, an effort by Scale AI and the Center for AI Safety to develop a large database of PhD-level questions that the best AI models still get wrong.

The effort has proven popular: the paper summarizing it has already been cited 91 times since its release on March 31st, and the main AI labs have been testing their new models on the exam. xAI announced today that its new Grok 4 model has the highest score yet on the exam, 44.4%.

Current leaderboard on the Humanity’s Last Exam site, not yet showing Grok 4

The process of creating the dataset is a fascinating example of a distributed academic mega-project, part of a growing trend that has also been important in efforts to replicate previous research. The organizers of Humanity’s Last Exam let anyone submit a question for their dataset, offering co-authorship to anyone whose question they accepted, and cash prizes to those who had the best questions accepted. In the end they wound up with just over 1000 coauthors on the paper (including yours truly as one very minor contributor), and gave out $500,000 to contributors of the very best questions (not me), which seemed incredibly generous until Scale AI sold a 49% stake in their company to Meta for $14.8 billion in June.

Source: Figure 4 of the paper

Here’s what I learned in the process of trying to stump the AIs and get questions accepted into this dataset:

  1. The AIs were harder than I expected to stump because they used frontier models rather than the free-tier models I was used to using on my own. If you think AI can’t answer your question, try a newer model
  2. It was common for me to try a question that several models would get wrong, but at least one would still get right. For me this was annoying because questions could only be accepted if every model got them wrong. But of course if you want to get a correct answer, this means trying more models is good, even if they are all in the same tier. If you can’t tell what a correct answer looks like and your question is important, make sure to try several models and see if they give different answers
  3. Top models are now quite good at interpreting regression results, even when you try to give them unusually tricky tables
  4. AI still has weird weaknesses and blind spots; it can outperform PhDs in the relevant field on one question, then do worse than 3rd graders on the next. This exam specifically wanted PhD-level questions, where a typical undergrad not only couldn’t answer the question, but probably couldn’t even understand what was being asked. But it specifically excluded “simple trick questions”, “straightforward calculation/computation questions”, and questions “easily answerable by everyday people”, even if all the AIs got them wrong. My son had the idea to ask them to calculate hyperfactorials; we found some relatively low numbers that stumped all the AI models, but the human judges ruled that our question was too simple to count. On a question I did get accepted, I included an explanation for the human judges of why I thought it wasn’t too simple.
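
The hyperfactorial mentioned in point 4 is simple to define: H(n) = 1^1 · 2^2 · … · n^n. A short sketch (the specific values that stumped the models aren’t recorded here):

```python
def hyperfactorial(n: int) -> int:
    """H(n) = 1^1 * 2^2 * ... * n^n."""
    result = 1
    for k in range(1, n + 1):
        result *= k ** k
    return result

# Grows explosively: exact values are trivial for code,
# but awkward for a chat model to compute step by step.
print(hyperfactorial(3))  # 108
print(hyperfactorial(4))  # 27648
print(hyperfactorial(5))  # 86400000
```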

I found this to be a great opportunity to observe the strengths and weaknesses of frontier models, and to get my name on an important paper. While the AI field is being driven primarily by the people with the chops to code frontier models, economists still have a lot we can contribute, as Joy has shown. Any economist looking for the next way to contribute should check out Anthropic’s new Economic Futures Program.

23 MSAs Produce Half of US GDP

The 23 blue-shaded MSAs in this map produce half of US GDP:

You might be tempted to think this map, like so many maps, is just a map of US population. It kind of is, but not completely. These 23 MSAs have 133 million people (as of the 2020 Census), or about 40% of the US population. That’s a lot, but it’s much less than the half of GDP they account for. In other words, these MSAs also tend to have above-average per capita income.
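
The arithmetic behind “above-average”: producing 50% of GDP with 40% of the population implies these MSAs run at 1.25 times the national per capita GDP, and 1.5 times that of the rest of the country. A quick check:

```python
# Per capita GDP implied by the shares quoted above.
gdp_share = 0.50  # these 23 MSAs' share of US GDP
pop_share = 0.40  # their share of US population (133M of ~331M)

vs_national = gdp_share / pop_share  # 1.25x the US average
vs_rest = (gdp_share / pop_share) / ((1 - gdp_share) / (1 - pop_share))  # 1.5x the rest

print(f"Per capita GDP vs US average:    {vs_national:.2f}x")
print(f"Per capita GDP vs rest of US:    {vs_rest:.2f}x")
```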

The three largest MSAs by population (NY, LA, Chicago) are also the three largest by GDP. But after the first three there are some interesting discrepancies. The San Francisco MSA is the 4th largest by GDP, but only the 12th largest by population — San Fran has a population similar to the Phoenix MSA, but almost double the GDP. San Francisco MSA has a very high GDP per capita (the third highest).

The San Jose MSA is also among these 23 largest MSAs for GDP, and also sticks out — it is the 13th largest by total GDP, but only the 36th largest by population. San Jose has a population similar to Cleveland and Nashville, but well over double the GDP of either. In fact, there are 12 MSAs larger in population than San Jose that aren’t among these 23 MSAs producing half of US GDP: places like St. Louis, Orlando, San Antonio, Pittsburgh, and Columbus. Silicon Valley really pulls up San Jose: it has the 2nd largest GDP per capita among MSAs, beaten only by much smaller Midland, Texas and its oil income.

Here is the full list of those 23 MSAs:

  1. New York-Newark-Jersey City, NY-NJ-PA
  2. Los Angeles-Long Beach-Anaheim, CA
  3. Chicago-Naperville-Elgin, IL-IN-WI
  4. San Francisco-Oakland-Berkeley, CA
  5. Dallas-Fort Worth-Arlington, TX
  6. Washington-Arlington-Alexandria, DC-VA-MD-WV
  7. Houston-The Woodlands-Sugar Land, TX
  8. Boston-Cambridge-Newton, MA-NH
  9. Seattle-Tacoma-Bellevue, WA
  10. Atlanta-Sandy Springs-Alpharetta, GA
  11. Philadelphia-Camden-Wilmington, PA-NJ-DE-MD
  12. Miami-Fort Lauderdale-Pompano Beach, FL
  13. San Jose-Sunnyvale-Santa Clara, CA
  14. Phoenix-Mesa-Chandler, AZ
  15. Minneapolis-St. Paul-Bloomington, MN-WI
  16. Detroit-Warren-Dearborn, MI
  17. San Diego-Chula Vista-Carlsbad, CA
  18. Denver-Aurora-Lakewood, CO
  19. Baltimore-Columbia-Towson, MD
  20. Austin-Round Rock-Georgetown, TX
  21. Charlotte-Concord-Gastonia, NC-SC
  22. Riverside-San Bernardino-Ontario, CA
  23. Tampa-St. Petersburg-Clearwater, FL

Counting Hallucinations by Web-Enabled LLMs

In 2023, we gathered the data for what became “ChatGPT Hallucinates Nonexistent Citations: Evidence from Economics.” Since then, LLM use has increased. A 2025 survey from Elon University estimates that half of Americans now use LLMs. In the Spring of 2025, we used the same prompts, based on the JEL categories, to obtain a comprehensive set of responses from LLMs about topics in economics.

Our new report on the state of citations is available at SSRN: “LLM Hallucination of Citations in Economics Persists with Web-Enabled Models”.

What did we find? Would you expect the models to have improved since 2023? LLMs have gotten better and are passing ever more of what used to be considered difficult tests. (Remember the Turing Test? Anyone?) ChatGPT can pass the bar exam for new lawyers. And yet, if you ask ChatGPT to write a document in the capacity of a lawyer, it will keep making the mistake of hallucinating fake references. Hence, we keep seeing headlines like “A Utah lawyer was punished for filing a brief with ‘fake precedent’ made up by artificial intelligence.”

What we call GPT-4o WS (Web Search) in the figure below was queried in April 2025. This “web-enabled” language model is enhanced with real-time internet access, allowing it to retrieve up-to-date information rather than relying solely on static training data. This means it can answer questions about current events, verify facts, and provide live data—something traditional models, which are limited to their last training cutoff, cannot do. While standard models generate responses based on patterns learned from past data, web-enabled models can supplement that with fresh, sourced content from the web, improving accuracy for time-sensitive or niche topics.

At least one-third of the references provided by GPT-4o WS were not real! Performance has not improved to the point where AI can write our papers with properly incorporated attribution of ideas. We also found that the web-enabled model would pull from lower-quality sources like Investopedia even when we explicitly stated in the prompt, “include citations from published papers. Provide the citations in a separate list, with author, year in parentheses, and journal for each citation.” Even some of the sources that were not journal articles were cited incorrectly. We provide specific examples in our paper.
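
To illustrate how such output might be screened (a hypothetical first-pass check, not the method used in the paper), one could flag returned citations that don’t even match the requested author-year-journal shape before attempting any real verification:

```python
import re

# Hypothetical first-pass screen: does a returned citation at least
# look like "Author (Year). Journal ..."? This catches obvious
# non-citations, though real verification still requires checking
# that the reference actually exists.
pattern = re.compile(r"^[A-Z][\w.\-', &]+ \(\d{4}\)[,.] .+")

citations = [
    "Smith, J. (2019). Journal of Economic Perspectives.",  # made-up example
    "See Investopedia's article on inflation.",
]
flagged = [c for c in citations if not pattern.match(c)]
print(flagged)  # only the Investopedia line fails the shape check
```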

In closing, consider this quote from an interview with Jack Clark, co-founder of Anthropic:

The best they had was a 60 percent success rate. If I have my baby, and I give her a robot butler that has a 60 percent accuracy rate at holding things, including the baby, I’m not buying the butler.

US State Growth Statistics 2005-2024

In macroeconomics we have basic tools to help us talk about economic growth, which is simply the percent change in RGDP per capita. What causes growth? Lots of things. All else constant, if more people are employed, then more will be produced. But the productivity of those workers matters too. That’s why we calculate average labor productivity (ALP), which is GDP per worker. This tells us how much each worker produces. All else constant, more ALP means more GDP.*
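
The identity tying these pieces together is GDP per capita = ALP × the employment-population ratio. A minimal sketch with hypothetical numbers:

```python
# GDP/pop = (GDP/workers) * (workers/pop)
# Hypothetical numbers, purely for illustration:
rgdp = 1_000_000.0   # real GDP
workers = 10_000
population = 20_000

alp = rgdp / workers                  # output per worker: 100.0
emp_pop_ratio = workers / population  # 0.5
gdp_per_capita = alp * emp_pop_ratio  # 50.0

# The decomposition is exact:
assert gdp_per_capita == rgdp / population
print(alp, emp_pop_ratio, gdp_per_capita)
```

So a state can raise RGDP per capita either by employing a larger share of its people or by making each worker more productive; the stats below separate the two.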

What affects ALP? Nearly everything: Technology, demographics, health, culture, and public policy. Most of these have long-term effects. So, it’s better to think in terms of regimes. After all, incurring debt now can result in a lot of investment and production, but there’s no guarantee that it can be sustained year after year. This is why I don’t get terribly excited about individual good or bad policies at any moment. There’s a lot of ruin in a nation. I care more about the long-run policy regime that is fostered over time.

Given the variety of inputs to economic growth, there’s always plenty of room for complaint about policy – even if the economy is doing well. In this post, I’m inspired by a YouTube video that a student shared with me. The OP laments poor policy in Massachusetts. But compared to some other nearby states, MA is doing just fine economically. This is not the same as saying that the OP is wrong about poor policies. Rather, a regime of policy, technology, interests, etc. is built over time, and there can be a lot wrong even in growing economies.

In the interest of being comprehensive, this post includes basic growth stats for all states from 2005 through 2024 (the years of state GDP data available in FRED).** First, let’s start with the basic building blocks of population, employment, and RGDP. Institutions matter. Policy affects whether people migrate to or from the state, fertility, how many people are employed, and what they can produce.

People like to talk about migration and the flocking to Texas & Florida. But that fails to capture the people who choose to stay in their state. Utah is 43% more populous than it was 20 years ago. But you don’t hear much clamoring for its state policies. Idaho and Nevada also beat Florida in terms of percent change. Where are the calls to be like Idaho? Employment largely tracks population, though not perfectly. The RGDP numbers can change quickly with commodity prices, reflected in the performance of North Dakota. But remember, these numbers cover a 20-year span. So, any one blockbuster or dour year won’t move the rankings much.

Of course, these figures just set the stage. What about the employment-population ratio, ALP, and RGDP per capita? Read on.

Continue reading

Understanding Data: A Chart on Gen Z and Alcohol

Have you seen this chart?

The chart originates from Statista, as you can see from the label in the image. But it is very frequently shared on social media, Reddit, and elsewhere (often with the Statista label clipped), occasionally generating millions of views and lots of heated comments.

But it’s a bad graph. In so many ways. Let’s break them down.

The data comes from BLS’s Consumer Expenditure Survey. I use this data frequently, as regular readers probably know. The data in the viral chart is from 2021 (more on that in a moment), but if I create a similar chart using the most recent data (2023) and also include spending by those older than Baby Boomers (primarily the Silent Generation), you will notice a curious thing:

Continue reading