Does GPT-4 Know How High the Alps Are?

I’m getting ready to give some public local talks about AI. Last week I shared some pictures that I think might help people understand ChatGPT, specifically:

My first thought was that GPT-4 gave incorrect estimates of the heights of these mountains because it does not actually “know” the correct elevations. But then a nagging question came to mind.

GPT has a “creativity parameter.” Sometimes it intentionally does not select the top-rated next word in a sentence, for example, in order to avoid being stiff and boring. Could GPT-4 know the exact elevation of these mountains and just be intentionally “creative” in this case?

I do not want to stand up in front of the local Rotary Club and say something wrong. So, I went to a true expert, Lenny Bogdonoff, to ask for help. Here is his reply:

Not quite. It’s not that it knows or doesn’t know, but based on the prompt, it’s likely unable to parse the specific details and is outputting results respectively. There is a component of stochastic behavior based on what part of the model weights are activated.

One common practice to help avoid this and see what the model does grasp, is to ask it to think step by step, and explain its reasoning. When doing this, you can see the fault in logic.

All that being said, the vision model is actually faulty in being able to grasp the relative position of information, so this kind of task will be more likely to hallucinate.

There are better vision models that aren’t OpenAI based. For example, Qwen-VL-Max, from the Chinese company Alibaba, is very good. Another is LLaVA, which uses different baselines of open source language models to add vision capabilities.

Depending on what you are needing vision for, models can be spiky in capability. Good at OCR but bad at relative positioning. Good at classifying a specific UI element, but bad at detecting plants, etc etc. 

Joy: So, I think I can tell the Rotary Club that GPT was “wrong” as opposed to “intentionally creative.” I think, as I originally concluded, you should not make ChatGPT the pilot of your airplane and go to sleep when approaching the Alps. ChatGPT should be used for what it is good at, such as writing the rough draft of a cover letter. (We have great “autopilot” software for flying planes, already, without involving large language models.)

Another expert, Gavin Leech, also weighed in with some helpful background information:

  • the creativity parameter is known as temperature. But you can actually radically change the output (intelligence, style, creativity) by using more complicated sampling schemes. The best analogy for changing the sampling scheme is that you’re giving it a psychiatric drug. Changing the prompt, conversely, is like CBT or one of those cute mindset interventions.
  • For each real-name model (e.g. “gpt-4-0613”), there’s 3 versions: the base model (which now no one except highly vetted researchers have access to), the instruction-tuned model, and the RLHF (or rather RLAIF) model. The base model is wildly creative, unhinged, but the RLHF one (which the linked researchers use) is heavily electroshocked into not intentionally making things up (as Lenny says).
  • It’s currently not usually possible to diagnose an error – the proverbial black box. My friends are working on this though
  • For more, note OpenAI admitting the “laziness” of their own models. The Turbo model line is intended to fix this.
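To make the temperature idea concrete, here is a minimal sketch of temperature-scaled sampling over a handful of candidate next words. The candidate words and their scores are made up for illustration, and `sample_with_temperature` is a hypothetical helper, not anything from OpenAI's code. A temperature of zero is treated here as always picking the top-scored word, which is a common convention but still an assumption.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Pick an index from raw scores after dividing them by the temperature."""
    if temperature <= 0:
        # Assumption: treat zero temperature as greedy decoding (top score wins).
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Hypothetical candidates and scores (not real model output).
candidates = ["word_a", "word_b", "word_c"]
logits = [2.0, 1.5, 0.2]

rng = random.Random(0)
greedy_pick = candidates[sample_with_temperature(logits, 0, rng)]
warm_samples = [candidates[sample_with_temperature(logits, 1.0, rng)]
                for _ in range(10)]
print("temperature 0:", greedy_pick)
print("temperature 1:", warm_samples)
```

At a low temperature the top-scored word wins almost every time; at a higher temperature the lower-scored words show up more often. Nothing in this mechanism checks whether a word is factually right, which fits Lenny's point that the model's wrong elevations are not a matter of it being deliberately “creative.”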

Thank you, Lenny and Gavin, for donating your insights.

A Measure of Dissimilarity

I recently learned about an interesting statistic for social scientists. It’s called the “Dissimilarity Index”. It allows you to compare the categorical distribution of two sets.

Many of us already know how to compare two distributions that have only 2 possible values. It’s easy because if you know the proportion of a group who are in category 1, then you know that 1-p will be in category 2. We can conveniently denote these with values of zero and one, and then conduct standard t-tests or z-tests to discover whether they are statistically different. But what about distributions across more than two possible categories?
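As a rough illustration, here is a sketch of the index in its standard textbook form: half the sum of the absolute differences between the two groups' category shares. The rest of this post is not shown here, so treat the formula as an assumption about the definition being used. The function name and the counts are hypothetical.

```python
def dissimilarity_index(counts_a, counts_b):
    """Half the sum of absolute differences in category shares.

    Returns 0.0 for identical distributions and 1.0 when the two groups
    share no categories at all.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both groups must use the same list of categories")
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each group needs at least one observation")
    return 0.5 * sum(abs(a / total_a - b / total_b)
                     for a, b in zip(counts_a, counts_b))

# Hypothetical counts of two groups across three categories.
group_a = [50, 30, 20]
group_b = [20, 30, 50]
d = dissimilarity_index(group_a, group_b)
print(f"dissimilarity index: {d:.2f}")
```

One common reading of the result is the share of either group that would have to switch categories for the two distributions to match.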


Historical State GDP Data

Data on Gross State Product prior to 2017 has disappeared from the main page of the Bureau of Economic Analysis. It is also gone from some third party hosts like FRED. It turns out BEA is in the middle of revising how they calculate state GDP; they have the new version done back to 2017, and took down the older inconsistent estimates until they can recalculate them. After that, they tell me they will repost pre-2017 state Gross Domestic Product:

In the meantime, they offer some messy and seemingly incomplete versions of pre-2017 GDP here, and you can find 1980-2021 state GDP (along with many other nice variables) in a nice panel from the University of Kentucky Center for Poverty Research’s National Welfare Data.

You can find more details on the actual changes BEA is making to how they calculate GDP here. Most changes seem relatively minor for states, but might have more impact on the measured relative size of industries. For instance, “equity REITs will be reclassified from the funds, trusts, and other financial vehicles industry to the real estate industry, while mortgage REITs will remain classified as funds, trusts, and other financial vehicles”.

Why Was Federal Tax Revenue Down in 2023?

The year 2023 was a pretty good one for the economy, whether judged by the labor market or economic growth. Despite this good economic growth, total receipts of the federal government were down about 7 percent from 2022 (note: I’m using calendar years, rather than fiscal years). Here’s a chart (note: in NOMINAL dollars) of total federal revenue since 2009:

I want to stress that these are nominal dollars (there, I’ve said it three times, hopefully there is no confusion). Nominal dollars are usually not the best way to look at historical data, but for purposes of looking at recent government budgets, sometimes it is. Especially when revenue is declining: if I adjusted this for inflation, the decline in 2023 would be even larger!
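A quick sketch of the arithmetic behind that last point. The 7 percent nominal decline is the figure from above; the inflation rate is a placeholder assumption chosen only to show the direction of the adjustment, not an official number.

```python
nominal_change = -0.07      # about a 7 percent nominal decline, from the text
assumed_inflation = 0.04    # hypothetical inflation rate, for illustration only

# Real change: deflate the nominal growth factor by the price-level growth factor.
real_change = (1 + nominal_change) / (1 + assumed_inflation) - 1

print(f"nominal change: {nominal_change:.1%}")
print(f"real change:    {real_change:.1%}")
```

With any positive inflation rate, the real decline comes out larger than the nominal one.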

You’ll notice also that the decline in 2023 is even larger than the decline in 2020, the height of the pandemic when many people were out of work due to government regulations and changes in consumer behavior. The 2023 decline is big!

So, what the heck is going on with federal revenue in 2023?


Why don’t Americans migrate anymore?

Inter- and intrastate migration has collapsed within the United States over the last 60 years.

From “Understanding migration aversion using elicited counterfactual choice probabilities” Kosar, Ransom, and van der Klauw, Journal of Econometrics (2022).

This observation has been lurking in the not-completely-the-background of labor economics for 20 years. The best summary of the literature, by Jia et al, came out in the JEL last year. It’s a very useful and relatively complete treatment. I think a lot about migration these days, mostly asking about the determinants of our reservation wage of migration i.e. how much of a wage increase would someone have to offer you to pick up and move to an entirely new community.

If we take the above figure at face value, it would appear the reservation wage of migration has increased for Americans. Quite a bit, actually. Of course, it is also possible that Americans no longer have to migrate to find a better job or wage, but that hardly seems a universal phenomenon over the last 50 years. Work from home has only had traction for a decade now at best. Labor markets aren’t as concentrated as they perhaps once were, loosening the monopsonistic lid a bit in a lot of markets, but at the same time the labor share of revenues hasn’t increased; in fact, it’s decreased on average. So why aren’t people moving? I have some narrow, testable answers that I am pursuing as research projects, but I also have a broad hypothesis that seems supremely untestable for the moment, sitting as it were in that thinkpiece uncanny valley between narrow research and podcast cheaptalk.

The concept of diminishing returns is an easy intuition to adopt. We all know the first bite of dessert is the best, the last the one most encumbered with regret. Economic growth remains a miracle, but it’s still true that nothing gained at the margin of the modern developed world will ever compete with those first steps out of subsistence. Very nearly every form of household capital and consumption in the modern world is characterized by some amount of diminishing returns, but that doesn’t mean they are diminishing at the same rate.

There are some goods for which there are few, if any, substitutes. Demand for these special goods can be quite inelastic – we’re willing to sacrifice a lot to maintain a certain level of consumption. There are also goods that are quite complementary with one another, their value to us increasing as they are bundled together. Complements are powerful; putting together the right mix of consumption is quite literally the recipe for a better life. Complements, however, are often part and parcel of a cautionary tale. If something is a powerful complement to everything else in our lives, we might find ourselves saying things like “I can’t live without it.” No amount of wealth in the world can survive being multiplied by zero.

There are few, if any, substitutes for friends, family, and our broader social network. Some goods cannot be consumed alone. If the sociological literature is to be believed, Americans are lonelier than ever. Both our deepest and most casual friendships have diminished in number. Relationships with neighbors are nonexistent, our ties to communities at a premium. That might, at first blush, make it sound like moving should be less costly: why stick around to maintain relationships that you don’t have? But on the other hand, if new relationships are harder to form than previously, then the relationships you already have are worth more than ever, to be protected jealously. Your only friend is, by definition, your best friend. No one wants to move away from their best friend.

Putting it all together, if personal relationships are an inelastic demand good that is complementary with a large chunk of our consumption bundle, then the price, the shadow price, we are willing to pay for it is going to go through the roof in the face of a negative supply shock. In a world where relationships are suddenly at a premium, you will be willing to forgo a lot of additional income in order to preserve a small network in which you have a lot of social capital.

In two weeks half the country is going to be watching the Super Bowl with 3 or 4 friends, maybe 10 or 20. Thanks to innovation and economic growth, most people will be watching it on a 55 inch high definition television with decent food and beverages. What about the game would change if the host got a 20% raise? The TV might get a little bigger, the snacks less fried, the beer more imported. What if the host moved two years ago? Would they have friends close enough to invite over? $15,000 worth of catering is a poor substitute for having someone to high-five.

The more I think about our lives and how little economic pressure, survival pressure, there is to find a 10% higher wage in the modern developed world, the more I’m surprised anyone migrates at all.

How ChatGPT works from geography and Stephen Wolfram

By now, everyone should consider using ChatGPT and be familiar with how it works. I’m going to highlight resources for that.

My paper about how ChatGPT generates academic citations should be useful to academics as a way to quickly grasp the strengths and weaknesses of ChatGPT. ChatGPT often works well, but sometimes fails. It’s important to anticipate how it fails. Our paper is so short and simple that your undergraduates could read it before using ChatGPT for their writing assignments.

A paper that does this in a different domain is “GPT4GEO: How a Language Model Sees the World’s Geography.” (Again, consider showing it to your undergrads because of the neat pictures, but probably walk through it together in class instead of assigning it as reading.) They describe their project: “To characterise what GPT-4 knows about the world, we devise a set of progressively more challenging experiments…”

For example, they asked ChatGPT about the populations of countries and found that: “For populations, GPT-4 performs relatively well with a mean relative error (MRE) of 3.61%. However, significantly higher errors [occur] … for less populated countries.”
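For readers who have not met the metric, here is a small sketch of how a mean relative error can be computed. The definition used (absolute error divided by the true value, averaged across items) is the usual one, but the paper's exact formula is not quoted here, so take it as an assumption. The numbers are invented and are not country populations from the paper.

```python
def mean_relative_error(estimates, actuals):
    """Average of |estimate - actual| / |actual| across all items."""
    if len(estimates) != len(actuals) or not actuals:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(e - a) / abs(a)
               for e, a in zip(estimates, actuals)) / len(actuals)

# Hypothetical true values and model guesses, in arbitrary units.
actuals = [100.0, 200.0, 50.0]
estimates = [102.0, 190.0, 60.0]

mre = mean_relative_error(estimates, actuals)
print(f"MRE: {mre:.1%}")
```

Because each error is scaled by the true value, a miss of 10 units on a small item counts for much more than the same miss on a large one.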

ChatGPT will often say SOMETHING, if prompted correctly. It is often, at least slightly, wrong. This graph shows that most estimates of national populations were not correct and the performance was worse on countries that are less well-known. That’s exactly what we found in our paper on citations. We found that very famous books are often cited correctly, because ChatGPT is mimicking other documents that correctly cite those books. However, if there are not many documents to train on, then ChatGPT will make things up.

I love this figure from the geography paper showing how ChatGPT estimates the elevations of mountains. This visual should be all over Twitter.

There are 3 lines because they did the prompt three times. ChatGPT threw out three different wrong mountains. Is that kind of work good enough for your tasks? Often it is. The shaded area in the graph is the actual topography of the earth in those places. ChatGPT “knows” that this area of the world is a mountain. But it will just put out incorrect estimates of the exact elevation, instead of stating that it does not know the exact elevation of those areas of the world.

Another free (long, advanced) resource with great pictures is Stephen Wolfram’s 2023 blog article “What Is ChatGPT Doing … and Why Does It Work?” (YouTube version)

The first thing to explain is that what ChatGPT is always fundamentally trying to do is to produce a “reasonable continuation” of whatever text it’s got so far, where by “reasonable” we mean “what one might expect someone to write after seeing what people have written on billions of webpages, etc.”

If you feel like you already are proficient with using ChatGPT, then I would recommend Wolfram’s blog because you will learn a lot about math and computers.

Scott wrote “Generative AI Nano-Tutorial” here, which has the advantage of being much shorter than Wolfram’s blog.

EDIT: New 2023 overview paper (link from Lenny): “A Survey of Large Language Models”

Government Purchases and How Markets Avoid Messes

The government is unique among economic institutions insofar as it can use coercion legally. But not all activities are coercive. Clearly, taxation is overwhelmingly coercive. Some people say that they are happy to pay taxes, but the voluntary gifts to the US Treasury are itsy-bitsy (just over $1m for FY 2023). Most regulations also include the threat of fines or jail time for non-compliance.

But once the government has the money in their coffers, there is plenty that they can do consensually. Once they have the resources, they are often just another potential transactor in the markets for goods and services. While the government can transact as well as anyone else, there is a fundamental theoretical difference for how we should interpret those transactions. Specifically, there is a principal-agent problem such that we can’t quite identify the welfare that is enjoyed by consumers when the government makes purchases. We really have very little idea.

Garett Jones uses the analogy of the government confiscating potatoes. The worst use would be for the government to throw the valuable resources into the river. Those resources help no one. Improved welfare would be yielded if the government just transferred those potatoes back to people. Sure, there’s the transaction cost of administration, but people get their potatoes back. Finally, the great hope is that the government takes the potatoes and makes tasty potato fritas such that they return to the public something more valuable than they took. These might be things that fall into the public goods category or solving collective action problems generally.

The above examples illustrate that how the government spends matters a lot for the welfare implications of the newly purchased government resources. But we need to recall that there is an entire private segment of the market that is affected by the government transactions.

Short-Run Analysis

In a competitive market, firms face increasing marginal costs and make decisions about their levels of output. When the government makes purchases, it’s simply acting as another demander. How does the entry of a larger demander affect everyone else in the market? See the below GIF.
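One way to sketch the short-run logic is a toy linear market. This is an illustration with made-up numbers, not the model behind the post: private demand is `a - b * price`, supply is `c + d * price` (upward sloping because of increasing marginal costs), and the government is assumed to buy a fixed quantity regardless of price.

```python
def equilibrium(a, b, c, d, government_qty=0.0):
    """Solve a - b*P + G = c + d*P for the market-clearing price P."""
    price = (a - c + government_qty) / (b + d)
    private_qty = a - b * price   # what private buyers still purchase
    total_qty = c + d * price     # what firms supply in total
    return price, private_qty, total_qty

# Hypothetical demand and supply parameters.
before = equilibrium(a=100, b=2, c=10, d=1)
after = equilibrium(a=100, b=2, c=10, d=1, government_qty=15)

print("without government purchases:", before)
print("with government purchases:   ", after)
```

In this toy market the government's 15 units raise the price from 30 to 35, total output rises from 40 to 45, and private buyers cut back from 40 to 30. Two thirds of the government's purchase comes out of what private buyers would otherwise have bought.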


Does More Health Spending Buy Better Outcomes for States?

When you look across countries, it appears that the first $1000 per person per year spent on health buys a lot; spending beyond that buys a little, and eventually nothing. The US spends the most in the world on health care, but doesn’t appear to get much for it. A classic story of diminishing returns:

Source: https://twitter.com/MaxCRoser/status/810077744075866112/photo/1

This might tempt you to go full Robin Hanson and say the US should spend dramatically less on health care. But when you look at the same measures across US states, it seems like health care spending helps after all:

Source: My calculations from 2019 IHME Life Expectancy and 2019 KFF Health Spending Per Capita

Last week though, I showed how health spending across states looks a lot different if we measure it as a share of GDP instead of in dollars per capita. When measured this way, the correlation of health spending and life expectancy turns sharply negative:

Source: My calculations from 2019 IHME life expectancy, Gross State Product, and NHEA provider spending

Does this mean states should be drastically cutting health care spending? Not necessarily; as we saw before, states spending more dollars per person on health is associated with longer lives. States having a high share of health spending does seem to be bad, but this is more because it means the rest of their economy is too small, rather than health care being too big. Having a larger GDP per capita doesn’t just mean people are materially better off, it also predicts longer life expectancy:

Source: My calculations from 2019 IHME life expectancy and 2019 Gross State Product

As you can see, higher GDP per capita predicts longer lives even more strongly than higher health spending per capita. Here’s what happens when we put them into a horse race in the same regression:

The effect of health spending goes negative and insignificant, while GDP per capita remains positive and strongly significant. The coefficient looks small because it is measured in dollars, but what it means is that a $10,000 increase in GDP per capita in a state is associated with 1.13 years more life expectancy.
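The unit conversion is easy to check. The sketch below backs out the per-dollar coefficient implied by the 1.13-years-per-$10,000 figure above, then scales it back up. The variable names are hypothetical, and the coefficient is inferred from that sentence rather than read from the regression output.

```python
years_per_10k = 1.13                        # from the text: years per $10,000
coef_per_dollar = years_per_10k / 10_000    # implied coefficient per $1

def predicted_gain(delta_gdp_per_capita):
    """Years of life expectancy associated with a GDP per capita gap."""
    return coef_per_dollar * delta_gdp_per_capita

print(f"coefficient per dollar: {coef_per_dollar:.6f}")
print(f"$10,000 gap: {predicted_gain(10_000):.2f} years")
print(f"$20,000 gap: {predicted_gain(20_000):.2f} years")
```

This is why the coefficient looks tiny in a regression table even though the association is sizable. It describes a correlation across states, not the effect of handing any one state more GDP.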

My guess is that the correlation of GDP and life expectancy across states is real but mostly not caused by GDP itself; rather, various third factors cause both. I think the lack of effect of health spending across states is real, between diminishing returns to spending and the fact that health is mostly not about health care. Perhaps Robin Hanson is right after all to suggest cutting medicine in half.

Young People Have a Lot More Wealth Than We Thought

I’ve written numerous times about generational wealth on this blog. My biggest post was one comparing different generations using the Fed’s Distributional Financial Accounts back in September 2021. I’ve posted several updates to that post as new quarterly data was released, but this post contains a major update. I’ll explain the updates in great detail below, but first let me present the latest version of the chart (through 2023q3):

Regular readers will notice a few differences compared with past charts. The big one is that young people have a lot more wealth than it appeared in past versions of this chart! You’ll also notice that I have relabeled this line “Millennials & Gen Z (18+)” and shifted that line over to the left a few years to account for the fact that this isn’t just the wealth of Millennials, and therefore the median age of this group is lower than in my past charts. The two dollar figures I highlighted are at the median age of 30 for these age cohorts (unfortunately we don’t have data for Boomers at that age).
