Illusions of Illusions of Reasoning

Even since Scott’s post on Tuesday of this week, a new response has been launched titled “The Illusion of the Illusion of the Illusion of Thinking.”

Abstract (emphasis added by me): A recent paper by Shojaee et al. (2025), The Illusion of Thinking, presented evidence of an “accuracy collapse” in Large Reasoning Models (LRMs), suggesting fundamental limitations in their reasoning capabilities when faced with planning puzzles of increasing complexity. A compelling critique by Opus and Lawsen (2025), The Illusion of the Illusion of Thinking, argued these findings are not evidence of reasoning failure but rather artifacts of flawed experimental design, such as token limits and the use of unsolvable problems. This paper provides a tertiary analysis, arguing that while Opus and Lawsen correctly identify critical methodological flaws that invalidate the most severe claims of the original paper, their own counter-evidence and conclusions may oversimplify the nature of model limitations. By shifting the evaluation from sequential execution to algorithmic generation, their work illuminates a different, albeit important, capability. We conclude that the original “collapse” was indeed an illusion created by experimental constraints, but that Shojaee et al.’s underlying observations hint at a more subtle, yet real, challenge for LRMs: a brittleness in sustained, high-fidelity, step-by-step execution. The true illusion is the belief that any single evaluation paradigm can definitively distinguish between reasoning, knowledge retrieval, and pattern execution.

As I am writing a new manuscript about hallucination of web-enabled models, this is close to what I am working on. Conjuring up fake academic references might point to a lack of true reasoning ability.

Do Pro and Dantas believe that LLMs can reason? What they are saying, at least, is that evaluating AI reasoning is difficult. In their words, the whole back-and-forth “highlights a key challenge in evaluation: distinguishing true, generalizable reasoning from sophisticated pattern matching of familiar problems…”

The fact that the first sentence of the paper contains the bigram “true reasoning” is interesting in itself. No one doubts that LLMs are reasoning anymore, at least within their own sandboxes. Hence there have been Champagne jokes going around of this sort:

If you’d like to read a response coming from o3 itself, Tyler pointed me to this:

Salty SALT in the OBBB

The Republicans hold a majority in both chambers of Congress and they are the party of the president. They want to use that opportunity to pass substantial legislation that addresses their priorities. Hence, the One, Big, Beautiful Bill (OBBB). But, just like the Democratic party, Republican congressmen are a coalition with various and sometimes divergent policy agendas. There are ‘Trump’ Republicans, who want tariffs, executive orders, and deportations. There are more liberal members who want more free markets. You can also find the odd ‘crypto bro’, blue-state representatives, and deficit hawks. Given the slim majority in the House of Representatives, they all have to get something out of the legislation. Put them together, and what have you got?* You get a signature piece of legislation that no one is happy about but everyone touts.

One example of such compromise is the State and Local Tax federal income tax deduction, or SALT deduction. The idea behind it is that income shouldn’t be taxed twice. If you pay a part of your income to your state government in the form of taxes, then the argument goes that you shouldn’t be taxed on that part of your income because you never actually saw it in your bank account. The state took it and effectively lowered your income. The state and local taxes get deducted from the taxable income that you report to the federal government.  The reasoning is that you shouldn’t need to pay taxes on your taxes.

Paying taxes on your taxes sounds bad. And plenty of people don’t like one tax, much less two. The Tax Foundation has done a lot of good work to cut through the chaff and has published many pieces on the SALT deduction over the years.**

Cut and Dry SALT Deduction Facts:

  • It’s a tax cut
  • It reduces federal tax revenue
  • It adds tax code complication
  • It is used by people who itemize rather than take the standard income tax deduction
  • Prior to the 2017 Tax Cuts and Jobs Act, there was no limit on the SALT deduction. After, the limit was $10k.
  • The current OBBB increases the SALT deduction limit.

Those are the basics. Everything else is analysis. The Grover Norquist Republicans never see a tax cut that looks bad, so they’d like to see the SALT limit raised or disappear. Tax think tanks that like simplicity don’t like the SALT deduction because it adds complication. Plenty of others say they don’t like complication, but often change their mind when it comes to the details (much like cutting government waste). Think tanks tend to be a bit lonely on this point.
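To make the mechanics of the deduction and its cap concrete, here is a minimal Python sketch. The function name, the household, and the $40k alternative cap are hypothetical and chosen only for illustration. The sketch ignores the standard deduction and every other itemized deduction.

```python
def federal_taxable_income(income, salt_paid, salt_cap=None):
    """Income left for federal taxation after the SALT deduction.

    salt_cap=None models the pre-2017 rule (no limit).
    Simplification: no other deductions are considered.
    """
    deduction = salt_paid if salt_cap is None else min(salt_paid, salt_cap)
    return income - deduction


# Hypothetical itemizing household: $200k income, $25k in state and local taxes.
income, salt_paid = 200_000, 25_000

uncapped = federal_taxable_income(income, salt_paid)              # no limit
capped_10k = federal_taxable_income(income, salt_paid, 10_000)    # $10k limit
capped_40k = federal_taxable_income(income, salt_paid, 40_000)    # hypothetical higher limit

print(uncapped, capped_10k, capped_40k)
```

In this example the $10k cap leaves $15,000 more of the household's income exposed to federal tax than the uncapped rule does. A cap above what the household actually pays changes nothing for them, which is why raising the limit mostly matters to people with large state and local tax bills.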

People mostly care about the SALT deduction due to the distributional effects. Who ends up benefiting from the deduction? The short answer is people who 1) itemize & 2) have heavy state and local tax bills. Who is that? Rich people of course! They have high incomes and lots of wealth and real estate – on which they pay taxes. But not all rich people pay loads of state taxes. So the SALT deduction is a tax cut that primarily benefits rich people who live in high tax districts. Where’s that? See below.


LIFE Survey Comes Alive

Last year I posted that the Philly Fed had started a new quarterly survey on Labor, Income, Finances, and Expectations (LIFE). I thought it looked promising but had yet to achieve its potential:

It will be interesting to see if this ends up taking a place in the set of Fed surveys that are always driving economic discussions, like the Survey of Consumer Finances and the Survey of Professional Forecasters. If they keep it up and start putting out some graphics to summarize it, I think it will. My quick impression (not yet having spoken to Fed people about it) is that it will be the “quick hit” version of the Survey of Consumer Finances. It asks a smaller set of questions on somewhat similar topics, but is released quickly after each quarter instead of slowly after each year. If they stick with the survey it will get more useful over time, as there is more of a baseline to compare to.

But a year later the survey now has what I hoped for: a solid baseline for comparisons, and pre-made graphics to summarize the results. It continues to show complex and mixed economic performance in the US. People think the economy is getting worse:

They are cutting discretionary (but not necessity) spending at record levels:

They are worried about losing their jobs at record levels:

But key areas like housing, childcare, and transportation are stabilizing:

Overall I think we can synthesize these seemingly contradictory pictures by saying that Americans’ finances are fine now, but they are quite worried that things are about to get worse, perhaps due to the tariffs taking effect. You can find the rest of the LIFE survey results (including all the non-record-setting ones) here.

Household Formation and Generational Wealth

Last week I tried to address whether rising wealth for younger generations was primarily driven by rising home values. My analysis suggested that it was a cause, but not the only cause. Here’s another chart on that topic, showing median net worth excluding home equity for recent generations:

Two things are notable in the chart. For millennials, even excluding home equity they are well ahead of past generations, though of course their net worth is much smaller excluding this category of wealth (the total median net worth for millennials in 2022 was $93,800). But for Gen X in 2022 (last data in that chart), they are slightly behind Boomers, never having recovered from the decline in wealth after 2007 (primarily from the stock market decline, since we’re excluding housing).

But today I want to address another general objection to the wealth data found in the Fed’s SCF and DFA programs. That objection has to do with household formation. Specifically, these surveys are calculated for households, and the age/generation indicators are for the household head (or “householder” as it is now called). And we know that household formation has been declining over time, as more young people live with parents, with roommates, etc. So the Millennial data we see in the chart above is excluding any Millennials that have not yet formed their own household.

Here’s a general picture of the decline, which has been happening gradually since about 1980. Note: I use the age group 26-41, because this is the age of Millennials in 2022 (the most recent SCF survey year). The highlighted years on the chart are when the Silent, Baby Boomer, Gen X, and Millennial generations were about the same age (26-41).

What this means is that when we are looking at households in these wealth surveys (or any survey that focuses on households) we aren’t quite comparing apples to apples. Does this mean the surveys are worthless? No! With the microdata in the SCF, we can look at not only the median value, but the entire distribution. Since the household formation rate has fallen by about 11 percentage points between Boomers in 1989 and Millennials in 2022, one solution is to look up or down the distribution for a rough comparison.

For example, if we assume all of the 11 percent of non-householders among Millennials have wealth below the median, we can make a rough correction by looking at the 39th percentile for Millennials — the 39th percentile would be the median if you included all of those 11 percent of non-householders as households. Similarly, for Gen X we would move down 5 percentage points in the distribution to the 45th percentile in 2007.

The household-formation-adjusted chart does paint a more pessimistic picture than just looking at the median for each generation: the 39th percentile Millennial has about 20% less wealth than the median Boomer did at roughly the same age. Seems like generational decline! Is there any silver lining?

First, you should interpret the chart above as a worst case scenario for Millennial wealth. It assumes all non-householders have low wealth. But likely not all of them do. If instead we use the 43rd percentile of Millennials in 2022, their net worth is $61,000, slightly above Boomers at the same age. (The household formation problem isn’t going away anytime soon as generations age — even if we look at Gen Xers, with a median age of 50 in 2022, their household formation is still 6 percentage points behind Boomers at that age.)

Second, my worst case scenario almost certainly overstates the problem. If all of those 11 percent fewer Millennials not yet forming households were to get married to other millennials, it would only add half of that many households to the aggregate distribution (when two non-householders get married, it becomes one household). So instead of moving down 11 percentage points to the 39th percentile, we should only move down 5 or 6 percentiles. The 44th percentile of Millennial net worth in 2022 was $63,060 — again, compare this to Boomers in the chart above.
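The percentile adjustments above follow a simple rule of thumb, which can be sketched in a few lines of Python. The function and parameter names are hypothetical. The logic is just the subtraction described in the text, not a formal reweighting of the survey.

```python
def adjusted_percentile(formation_gap_pp, households_per_missing_person=1.0, baseline=50.0):
    """Percentile to compare against an earlier generation's median.

    formation_gap_pp: drop in the household formation rate, in percentage points.
    households_per_missing_person: 1.0 if every non-householder would form a
        separate household; 0.5 if they would all pair up with each other.
    Assumes every missing household would fall below the median (worst case).
    """
    return baseline - formation_gap_pp * households_per_missing_person


print(adjusted_percentile(11))        # Millennials, worst case
print(adjusted_percentile(5))         # Gen X, worst case
print(adjusted_percentile(11, 0.5))   # Millennials, if non-householders pair up
```

The three calls give 39, 45, and 44.5, matching the 39th and 45th percentiles used for the worst case and the "5 or 6 percentiles" adjustment in the marriage scenario.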

Finally, if we combine both of the adjustments discussed in this post, looking at wealth excluding home equity and also adjusting for the decline in household formation, we get the following chart (here I once again use the 39th percentile for Millennials and the 45th percentile for Gen X, i.e., the worst case scenario):

With this final adjustment, we get a slightly different picture. The wealth of these three generations is roughly the same at the same age. No increase in wealth, but no decline either. You could read this as pessimistic, if your assumption is that wealth should rise over time, but the general vibes out there are that young people are worse off than in the past. This wealth data suggests, once again, that the kids are doing all right.

Did Apple’s Recent “Illusion of Thinking” Study Expose Fatal Shortcomings in Using LLMs for Artificial General Intelligence?

Researchers at Apple last week published a paper with the provocative title, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.” This paper has generated uproar in the AI world. Having “The Illusion of Thinking” right there in the title is pretty in-your-face.

Traditional Large Language Model (LLM) artificial intelligence programs like ChatGPT train on massive amounts of human-generated text to be able to mimic human outputs when given prompts. A recent trend (mainly starting in 2024) has been the incorporation of more formal reasoning capabilities into these models. The enhanced models are termed Large Reasoning Models (LRMs). Now some leading LLMs like OpenAI’s GPT, Claude, and the Chinese DeepSeek exist both in regular LLM form and also as LRM versions.

The authors applied both the regular (LLM) and “thinking” LRM versions of Claude 3.7 Sonnet and DeepSeek to a number of mathematical type puzzles. OpenAI’s o-series models were used to a lesser extent. An advantage of these puzzles is that researchers can, while keeping the basic form of the puzzle, dial in more or less complexity.

They found, among other things, that the LRMs did well up to a certain point, then suffered “complete collapse” as complexity was increased. Also, at low complexities, LLMs actually outperform LRMs. And (perhaps the most vivid evidence of lack of actual understanding on the part of these programs), when they were explicitly offered an efficient direct solution algorithm in the prompt, the programs did not take advantage of it, but instead just kept grinding away in their usual fashion.

As might be expected, AI skeptics were all over the blogosphere, saying, I told you so, LLMs are just massive exercises in pattern matching, and cannot extrapolate outside of their training set. This has massive implications for what we can expect in the near or intermediate future. Among other things, the optimism about AI progress is largely what is fueling the stock market, and also capital investment in this area: Companies like Meta and Google are spending ginormous sums trying to develop artificial “general” intelligence, paying for ginormous amounts of compute power, with those dollars flowing to firms like Microsoft and Amazon building out data centers and buying chips from Nvidia. If the AGI emperor has no clothes, all this spending might come to a screeching crashing halt.

Ars Technica published a fairly balanced account of the controversy, concluding that, “Even elaborate pattern-matching machines can be useful in performing labor-saving tasks for the people that use them… especially for coding and brainstorming and writing.”

Comments on this article included one like:

LLMs do not even know what the task is, all it knows is statistical relationships between words.   I feel like I am going insane. An entire industry’s worth of engineers and scientists are desperate to convince themselves a fancy Markov chain trained on all known human texts is actually thinking through problems and not just rolling the dice on what words it can link together.

And

if we equate combinatorial play and pattern matching with genuinely “generative/general” intelligence, then we’re missing a key fact here. What’s missing from all the LLM hubris and enthusiasm is a reflexive consciousness of the limits of language, of the aspects of experience that exceed its reach and are also, paradoxically, the source of its actual innovations. [This is profound, he means that mere words, even billions of them, cannot capture some key aspects of human experience]

However, the AI bulls have mounted various come-backs to the Apple paper. The most effective I know of so far was published by Alex Lawsen, a researcher at Open Philanthropy. Lawsen’s rebuttal, titled “The Illusion of the Illusion of Thinking,” was summarized by Marcus Mendes. To summarize the summary, Lawsen claimed that the models did not in general “collapse” in some crazy way. Rather, the models in many cases recognized that they would not be able to solve the puzzles given the constraints input by the Apple researchers. Therefore, they (rather intelligently) did not try to waste compute power by grinding away to a necessarily incomplete solution, but just stopped. Lawsen further showed that the ways Apple ran the LRM models did not allow them to perform as well as they could. When he made a modest, reasonable change in the operation of the LRMs,

Models like Claude, Gemini, and OpenAI’s o3 had no trouble producing algorithmically correct solutions for 15-disk Hanoi problems, far beyond the complexity where Apple reported zero success.

Lawsen’s conclusion: When you remove artificial output constraints, LRMs seem perfectly capable of reasoning about high-complexity tasks. At least in terms of algorithm generation.
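To see why output limits matter for this puzzle, consider a minimal Python sketch (this is not Lawsen's code, and the function name is hypothetical). Solving an n-disk Tower of Hanoi takes 2^n - 1 moves, so writing out every move of the 15-disk case means listing 32,767 of them, while a program that generates those moves fits in a few lines.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield (disk, from_peg, to_peg) moves that solve an n-disk Tower of Hanoi."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, onto the spare peg.
    yield from hanoi_moves(n - 1, source, spare, target)
    # Move the largest disk to its destination.
    yield (n, source, target)
    # Move the n-1 smaller disks from the spare peg onto the largest disk.
    yield from hanoi_moves(n - 1, spare, target, source)


move_count_15 = sum(1 for _ in hanoi_moves(15))
print(move_count_15)  # 2**15 - 1 = 32767
```

Asking for the short recursive program rather than the full move list is the kind of change in output format that separates "can state the algorithm" from "can execute tens of thousands of steps without a slip."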

And so, the great debate over the prospects of artificial general intelligence will continue.

Kayfabe in the political marketplace

Kayfabe: the tacit agreement to behave as if something is real, sincere, or genuine when it is not

The term comes from wrestling, which is fitting because for years I’ve been stealing Dana Gould’s line “Politics is just professional wrestling in suits.” There are limits to my casual theft, though. I will not pretend that I am the first to observe that politics has deeply internalized the kayfabe code of vehemently declaring beliefs or expectations in no way actually held while simultaneously understanding that your rivals are doing the same.

I am curious whether you can undermine your own political agenda or influence by going too far, by committing too much to a fictional worldview. Much like Serpico, if you go too deep you can lose track of the real you. Grandstanding in front of the cameras is one thing, but if you want to trade in the political marketplace, you need to be able to credibly do so in good faith. I can’t help but wonder if part of the inability of Congress to assert its constitutional power in the face of Executive overreach is a political market failure. Has the information and signal quality between representatives been so eroded that prices and, in turn, exchanges can’t emerge?

On a more optimistic note, however, I expect that at some point a political party will find advantage in having better norms of credible signaling, if only because they will have an easier time solving their own collective action problem. The question remains, though, at what point will the political advantage of superior collective action dominate the electoral advantage of earnestly lying to voters?

Waymo and Pictures Online

Memes on Twitter are at least as important today as political cartoons of the 19th and 20th centuries. Two memes went viral this week, surrounding the events of protests in Los Angeles.  

The first meme reflects how Lord of the Rings is deeply embedded in American culture. Two million views means that most people don’t need the joke explained. (Part of the Lord of the Rings story is that the wise and powerful elves abandon chaotic Middle Earth for the safety of the Grey Havens. Similarly, the Waymo cars quietly drove themselves to safety away from Los Angeles after several had been vandalized and burned.)

I worry that The Lord of the Rings has made us too optimistic. Americans rarely tolerate stories that do not have happy endings. Has that made it hard for us to understand global events, or impaired our ability to accurately predict how most battles will end?

The next viral meme about the Waymos has two tweets to track. The originator of the joke got half a million views. Someone who added AI-generated images to the original text got half a million views as well.

The four quadrants form is a common meme format. Starting in the upper left, perhaps the best way to explain this is that a rigid socialist might resent Waymo as a symbol of Big Tech. At the very least, a socialist who wields state control might want to nationalize Waymo if there are going to be autonomous cars.

“Protect the Waymos” voices support for the police and traditional property rules. This is a joke, keep in mind, so it’s not meant to make perfect sense. The Libertarian Right might be more likely to support individuals protecting themselves with their own weapons as opposed to relying on the police state.

Compare the meme world to the news. I feel like this might become quaint soon, so here’s what it looks like to use the Google News search function.

The Los Angeles Times reports

The autonomous ride-hailing service Waymo has suspended operations in downtown Los Angeles after several of its vehicles were set on fire during protests against immigration raids in the area.

At least five Waymo vehicles were destroyed over the weekend, the company said. Waymo removed its vehicles from downtown but continues to operate in other parts of Los Angeles.

The flaming Waymo image is all over the news and internet. For one thing, people are interested in it because it’s real. This was supposed to be the year of deepfakes, and yet it’s mostly real images and real gaffes that are still making the news.

At this moment in time, most Americans do not yet have Waymo in their cities. Oddly enough, losing an inventory of several cars might be well worth the publicity the company is receiving. Two million viewers of the video of the little driverless cars making an orderly exit from Los Angeles might come away thinking driverless cars are safe and sensible.

Here are more posts from my long-standing interest in cartoons and internet culture.

Kyla Scanlon also felt like the little cars on fire were worth a newsletter post. She wrote

We scroll past the burning car to see arguments about whether the burning was justified, then scroll past those to see memes about the arguments, then scroll past those to see counter-memes. The cycle feeds itself!

Retiring for $100 per Month

Everybody follows a different path. Sometimes that path includes a late start on saving for retirement. Say that you have $0 in your retirement account right now. Is it too late? What can you get as a result of contributing $100 per month? Maybe more than you think.

Let’s start with an annuity equation that tells us our balance at retirement with some assumptions baked in. Let’s assume that we have zero dollars saved and contribute $100 per month. What rate of return do we earn? The S&P 500 earns an average of 10% per year, which may not keep happening. We can conservatively assume 7.5%, but there are other concerns. Taxes and inflation will both eat away at that. Let’s subtract 2.5% for inflation with the Fisher approximation, leaving a real rate of return of 5%. We’ll chop off 20% due to taxes*. Below is the annuity equation that tells us the balance at retirement, depending on how many years from now you retire.

Assuming that you retire at 65 years of age, the graph below describes your balance at retirement depending on the age at which you started saving $100 per month. Of course, it’s not the balance that most people are worried about. Rather, we care about the implied monthly retirement check. The graph describes that on the right axis too, assuming that constant real payments will be made forever as perpetuity payments. We can see that getting started early matters a lot. But starting at age 40 still gets you real monthly retirement payments that are just shy of $200. That’s not too shabby.
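As a rough sketch of the calculation described above, here is the arithmetic in Python. Two things are assumptions of this sketch rather than statements from the post: returns compound monthly at the 5% real rate, and the 20% tax haircut is applied to the retirement payments rather than to the rate of return. The function names are hypothetical.

```python
def balance_at_retirement(monthly_contribution, annual_real_rate, years_saving):
    """Future value of level end-of-month contributions (ordinary annuity)."""
    r = annual_real_rate / 12            # assumed: monthly compounding
    n = years_saving * 12                # number of monthly contributions
    return monthly_contribution * ((1 + r) ** n - 1) / r


def monthly_perpetuity_payment(balance, annual_real_rate, tax_rate):
    """Constant real monthly payment that never draws down the balance, after tax."""
    r = annual_real_rate / 12
    return balance * r * (1 - tax_rate)  # assumed: tax applies to withdrawals


RETIREMENT_AGE = 65
for start_age in (25, 30, 40, 50):
    balance = balance_at_retirement(100, 0.05, RETIREMENT_AGE - start_age)
    payment = monthly_perpetuity_payment(balance, 0.05, 0.20)
    print(f"start at {start_age}: balance ${balance:,.0f}, payment ${payment:,.0f}/month")
```

Under these assumptions, starting at age 40 produces a payment a little under $200 per month, in line with the figure quoted above. Applying the tax to the rate of return instead would give a lower number.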

Of course, nobody receives all of the perpetuity payments.


The End of Easy Student Loans

The Senate Health, Education, Labor and Pensions Committee is proposing to cut off student loans for programs whose graduates earn less than the median high school graduate. The House proposed a risk-sharing model where colleges would partly pay back the federal government when their students fail to pay back loans themselves. Both the House and Senate propose to cap how much students can borrow for graduate loans. Both would reduce federal spending on higher ed by about $30-$35 billion per year, cutting the size of the $700 billion higher ed sector by 4-5%. I expected that something like this would happen eventually, especially after the student loan forgiveness proposals of 2022:

While we aren’t getting real reform now, I do think forgiveness makes it more likely that we’ll see reform in the next few years. What could that look like?

The Department of Education should raise its standards and stop offering loans to programs with high default rates or bad student outcomes. This should include not just fly-by-night colleges, but sketchy masters degree programs at prestigious schools.

Colleges should also share responsibility when they consistently saddle students with debt but don’t actually improve students’ prospects enough to be able to pay it back. Economists have put a lot of thought into how to do this in a manner that doesn’t penalize colleges simply for trying to teach less-prepared students.

I’d bet that some reform along these lines happens in the 2020’s, just like the bank bailouts of 2008 led to the Dodd-Frank reform of 2010 to try to prevent future bailouts. The big question is, will this be a pragmatic bipartisan reform to curb the worst offenders, or a Republican effort to substantially reduce the amount of money flowing to a higher ed sector they increasingly dislike?

Of course, there is a lot riding on the details. How exactly do you calculate the income of graduates of a program compared to high school grads? The Senate proposal explains their approach starting on page 58. They want to compare the median income of working students 4 years after leaving their program (whether they graduated or dropped out, but exempting those in grad school) to the median income of those with only a high school diploma who are age 25-34, working, and not in school.

Nationally I calculate that this would make for a floor of $31,000. That is, the median student who is 4 years out from your program and is working should be earning at least $31k. In practice the bill would implement a different number for each state. This seems like a low bar in general, though you could certainly quibble with it. For instance, those 4 years out from a program may be closer to age 25 than age 34, but income typically rises with age during those years. If you compare them to 26-year-old high school grads, the national bar would be just $28k.
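The comparison described above can be sketched in Python. This is only an illustration of the logic as summarized here, not the bill's actual procedure. The record fields, function names, and the example cohort are all hypothetical, and the $31,000 floor is the national estimate from this post rather than a state-level figure.

```python
from statistics import median

NATIONAL_FLOOR = 31_000  # this post's national estimate; the bill would vary it by state


def program_median_earnings(former_students):
    """Median earnings of former students who are working and not in grad school."""
    eligible = [
        s["earnings"]
        for s in former_students
        if s["working"] and not s["in_grad_school"]
    ]
    return median(eligible)


def keeps_loan_eligibility(former_students, floor=NATIONAL_FLOOR):
    """True if the program's median working former student earns at least the floor."""
    return program_median_earnings(former_students) >= floor


# Hypothetical cohort, measured 4 years after leaving the program.
cohort = [
    {"earnings": 24_000, "working": True, "in_grad_school": False},
    {"earnings": 29_000, "working": True, "in_grad_school": False},
    {"earnings": 35_000, "working": True, "in_grad_school": False},
    {"earnings": 0, "working": False, "in_grad_school": False},   # not working: excluded
    {"earnings": 12_000, "working": True, "in_grad_school": True},  # in grad school: excluded
]

print(program_median_earnings(cohort), keeps_loan_eligibility(cohort))
print(keeps_loan_eligibility(cohort, floor=28_000))
```

The example cohort has a median of $29,000, so it misses the $31k floor but would clear the $28k bar that comes from comparing against 26-year-old high school grads. That shows how much the choice of comparison group can matter for programs near the line.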

What sorts of programs have graduates making less than $31k per year?


The Growth in Wealth is Not Primarily Driven by Rising Home Prices

As I have discussed in many previous blog posts, young people today have a lot more wealth than past generations at the same point in their life. But we also know that housing prices have increased dramatically in recent years, and that for most families their home is their largest source of wealth.

Does this imply that the increase in wealth young Americans have seen is primarily driven by increased housing prices? If so, this would paint a less optimistic picture of the wealth of young people today, since the value of your home is something that you usually can’t easily convert into other consumption.

If we look at the past 5 years (2019Q4 to 2024Q4), the total wealth of US households under the age of 40 increased by $5 trillion, in nominal terms. That’s not adjusted for inflation, but we don’t need to do so because we can look at how much each asset class increased in nominal terms as well. The total value of assets for households under age 40 increased by $5.86 trillion.

Here’s how the various classes of assets have increased since 2019Q4:
