Rat to Research Discourse

I made this Decreasing Marginal Utility rat picture when I was an undergraduate, and it caught on. A textbook publisher asked me for permission to print it.

This week on Twitter (X.com), someone said it was their favorite graph. When I replied, I learned that he had used it for teaching. It’s fun to know one of your ideas is out in the world helping people.

For real? I absolutely HOWLED when I found it on a google image search! Bravo! I taught HS Econ for many years and this was the kind of stuff that kept kids awake!

https://twitter.com/arburnside/status/1702690454884487495

Blogger privilege is being able to manifest a new conversation right here. If one of my research articles were ever to achieve the same level of influence as the stuffed rat, people might tweet something along the following lines:

“An Experiment on Protecting Intellectual Property,” (2014), with Bart Wilson. Experimental Economics, 17:4, 691-716.

This project, original in both methodology and subject, is one of the first controlled experiments on intellectual property protection and has inspired subsequent lab work on the issue. We present a color cube mechanism that gives subjects a creative task, allowing us to study creative output in the lab. The results indicate that IP protection alone does not cause people to become inventors, although it does encourage entrepreneurs to specialize.

“Smile, Dictator, You’re On Camera,” (2017), with Matthew McMahon, Matthew Simpson and Bart Wilson. Southern Economic Journal, 84:1, 52-65.

The dictator game (DG) is attractive because of its simplicity. Of the thousands of DG replications, ours is probably the controlled experiment that reduces “social distance” to the farthest extreme possible while maintaining the key element of anonymity between the dictator and their receiver counterpart. In our experiment the dictator knows they are being watched, the opposite of the famous “double-blind” manipulation that removed even the experimenter’s view. As we predicted, people are more generous when they are being watched. Anyone teaching about DGs in the classroom should show our entertaining video of dictators making decisions in public: https://www.youtube.com/watch?v=vZHN8xyp6Y0&t=22s

“My Reference Point, Not Yours,” (2020) Journal of Economic Behavior and Organization, 171: 297-311.

There is a lot of talk about reference points. No matter how you feel about “behavioral” economics, I don’t think anyone would deny that reference-dependent behavior explains some choices, even very big ones like when to sell your house. Given how important reference points are, can people conceive that different people have different reference points, shaped by their different life experiences? The results of this study imply that I tend to assume everyone else has my own reference point, which biases my beliefs about what others will do. Because this paper is short and simple, it would make a good assignment for students in either an experimental or econometrics class. I have a blog post on how to turn this paper into an assignment for students just learning about regression for the first time.

“If Wages Fell During a Recession,” (2022) with Daniel Houser, Journal of Economic Behavior and Organization, Vol. 200, 1141-1159.

The title comes from Truman Bewley’s book Why Wages Don’t Fall during a Recession. First, I’ll take some lines directly from his book summary:

A deep question in economics is why wages and salaries don’t fall during recessions. This is not true of other prices, which adjust relatively quickly to reflect changes in demand and supply. Although economists have posited many theories to account for wage rigidity, none is satisfactory. Eschewing “top-down” theorizing, Truman Bewley explored the puzzle by interviewing—during the recession of the early 1990s—over three hundred business executives and labor leaders as well as professional recruiters and advisors to the unemployed.

By taking this approach, gaining the confidence of his interlocutors and asking them detailed questions in a nonstructured way, he was able to uncover empirically the circumstances that give rise to wage rigidity. He found that the executives were averse to cutting wages of either current employees or new hires, even during the economic downturn when demand for their products fell sharply. They believed that cutting wages would hurt morale, which they felt was critical in gaining the cooperation of their employees and in convincing them to internalize the managers’ objectives for the company.

We are one of the first to take this important question to the laboratory. The nice thing about an experiment is that you can measure shirking precisely and you can get observations on wage cuts, which are rare in the naturally occurring American economy.

We find support for the morale theory, but a new puzzle emerged along the way. Many of our subjects in the role of the employer cut their counterpart’s wages, which probably lowered the employers’ own payments. Why didn’t they anticipate the retaliation against wage cuts? That question inspired the paper “My Reference Point, Not Yours.”

“Other people’s money: preferences for equality in groups,” (2022) with Gavin Roberts, European Journal of Political Economy, Vol. 73.

Andreoni & Miller (2002) have been cited over 2,500 times for their experiment showing that demand curves for altruism slope down; economic theory is not broken by generosity. We extend their work to show that demand curves for equality also slope down. Individuals don’t love inequality, but they also don’t love parting with their own money. There is higher demand for reducing inequality with other people’s money than with one’s own income.

“Willingness to be Paid: Who Trains for Tech Jobs?” (2022), Labour Economics, Vol. 79, 102267.

This is the last paper I’ll cover here. At this point, readers probably would like a funny animal picture. Here’s a meme about the difficult life of computer programmers:

For decades, tech skills have had a high return in the labor market. There is very little empirical work on why more people do not try to become computer programmers, although there are policy discussions about confidence and encouragement.

I ran an experiment to measure something that is important and underexplored. One thing I found is that attempts to increase confidence, if not carefully evaluated, might backfire.

Which would you predict matters more for a potential worker: having taken a class in programming, or reporting that they enjoy programming? My results imply that we should be doing more to understand both the causes and effects of subjective preferences (enjoyment) for tech work.

A few more decades to go here… I will try to top the stuffed rat picture.

Cool the Schools

A short post today, because I’m busy watching my kids, whose school was canceled because of excessive heat, like many schools in Rhode Island today.

I thought this was a ridiculous decision until my son told me he heard from his teacher that his elementary school is the only one in town with air conditioning in every classroom. Given that, the decision to cancel is at least reasonable, but the lack of AC is not.

It’s not just that hot classrooms are unpleasant for students and staff, or that sudden cancellations like this are a major burden for parents. Several economics papers have found that air conditioning significantly improves students’ learning as measured by test scores (though some find no effect). Park et al. (2020 AEJ: EP) find that:

Student fixed effects models using 10 million students who retook the PSATs show that hotter school days in the years before the test was taken reduce scores, with extreme heat being particularly damaging. Weekend and summer temperatures have little impact, suggesting heat directly disrupts learning time. New nationwide, school-level measures of air conditioning penetration suggest patterns consistent with such infrastructure largely offsetting heat’s effects. Without air conditioning, a 1°F hotter school year reduces that year’s learning by 1 percent.

This can actually be a bigger issue in somewhat northern places like Rhode Island: we’re far enough south to get some quite hot days, but far enough north that AC is not ubiquitous. Data from the Park paper shows that New York and New England are actually some of the worst places for hot schools:

This is because of the lack of AC in the North:

The days are only getting hotter… it’s time to cool the schools.

Long Covid is Real in the Claims Data… But so is “Early Covid”?

I’ve seen plenty of investigations of “Long Covid” based on surveys (ask people about their symptoms) or labs (x-ray the lungs, test the blood). But I just ran across a paper that uses insurance claims data instead, to test what happens to people’s use of medical care and their health spending in the months following a Covid diagnosis. The authors create some nice graphics showing that Long Covid is real and significant, in the sense that on average people use more health care for at least 6 months post-Covid compared to their pre-Covid baseline:

Source: Figure 5 of “Long-haul COVID: healthcare utilization and medical expenditures 6 months post-diagnosis“, BMC Health Services Research 2022, by Antonios M. Koumpias, David Schwartzman & Owen Fleming

The graph is a bit odd in that it scales health spending relative to the month after people are diagnosed with Covid. Spending in that month is obviously high, so every other month winds up negative, meaning only that people spent less than in the month they had Covid. But the key question is: how much less? At the baseline 6 months prior, it was over $1,000/month less. In the second month after the Covid diagnosis it was about $800 less: a big drop from the Covid month, but still $200+/month more than baseline. Each month afterwards the “recovery” continues, but even by month 6 it’s not quite back to baseline. Figure 4 of the paper shows the same pattern for usage of health care services; I’m not posting it because it looks the same. By these measures, Long Covid is both statistically and economically significant and can last at least 6 months, though worried people should know that it tends to get better each month.
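If it helps, here is a minimal sketch of the re-centering arithmetic, using the approximate values I read off Figure 5 (not the authors’ exact estimates):

```python
# Re-center the event-study estimates on the 6-months-prior baseline
# instead of the Covid month. Numbers are approximate readings off
# Figure 5, not the authors' exact estimates.

# Monthly spending relative to the Covid-diagnosis month, in dollars.
spending_vs_covid_month = {
    -6: -1000,  # baseline: 6 months before diagnosis
    -2: -850,   # two months before diagnosis
    +2: -800,   # two months after diagnosis
}

baseline = spending_vs_covid_month[-6]
spending_vs_baseline = {
    month: value - baseline
    for month, value in spending_vs_covid_month.items()
}
print(spending_vs_baseline)  # {-6: 0, -2: 150, 2: 200}
```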

I was somewhat surprised at the size of this “post-Covid” effect, but much more surprised at the size of the “pre-Covid” or “early Covid” effect: the run-up in spending in the months before a Covid diagnosis. For the month immediately before, the authors have a good explanation, the same one I had thought of, namely that people are often sick with Covid for a couple of days before they get tested and diagnosed:

There is a lead-up of healthcare utilization to the diagnosis date as illustrated by the relatively high utilization levels 30–1 days before diagnosis. This may be attributed to healthcare visits only days prior to the lab-confirmed infection to assess symptoms before the manifestation or clinical detection of COVID-19.

But what about the second month prior to diagnosis? People are spending almost $150/month more than at the 6-month-prior baseline and it is clearly statistically significant (confidence intervals of months t-6 and t-2 don’t overlap). The authors appear not to discuss this at all in the paper, but to me ignoring this lead-up is burying the lede. What is going on here that looks like “Early Covid”?

My guess is that people were getting sick with other conditions, and something about those illnesses (a weakened immune system, more time in hospitals near Covid patients) made them more likely to catch Covid. But I’d love to hear actual evidence about this or other theories. The authors, or someone else using the same data, could test whether the types of health care people use more of 2 months pre-diagnosis differ from the types they use more of 2 months post-diagnosis. Doctors could weigh in on the immunological plausibility of the “weakened immune system” idea. Researchers could check whether similar pre-trends / “Early Covid” patterns appear in other claims/utilization data; probably some have, but if these pre-trends hold up they seem worthy of a full paper.
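For anyone with access to similar claims data, the first comparison could be as simple as tabulating spending by service category in the two windows. A minimal sketch, assuming a hypothetical claims table with member_id, service_category, paid_amount, and months_from_diagnosis columns (these names are illustrative, not from the paper):

```python
import pandas as pd

# Hypothetical claims table: one row per claim, with the month of the
# claim expressed relative to each member's Covid diagnosis date.
claims = pd.read_csv("claims.csv")  # member_id, service_category,
                                    # paid_amount, months_from_diagnosis

# Total spending per service category, two months before vs. two
# months after diagnosis.
window = claims[claims["months_from_diagnosis"].isin([-2, 2])]
mix = (window
       .groupby(["months_from_diagnosis", "service_category"])["paid_amount"]
       .sum()
       .unstack(0))

# Categories elevated pre-diagnosis but not post-diagnosis would point
# toward an "Early Covid" story distinct from Long Covid.
print(mix.assign(pre_share=lambda d: d[-2] / d[-2].sum(),
                 post_share=lambda d: d[2] / d[2].sum()))
```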

Comprehensive Cancer Centers: Expensive But Fast

An article I coauthored, “Comparing hospital costs and length of stay for cancer patients in New York State Comprehensive Cancer Centers versus nondesignated academic centers and community hospitals“, was just published in Health Services Research. We find that:

Inpatient costs were 27% higher (95% CI 0.252, 0.285), but length of stay was 12% shorter (95% CI −0.131, −0.100), in Comprehensive Cancer Centers relative to community hospitals.

In other words, these cutting-edge hospitals that tend to treat complex cases are more expensive, as you would expect; but despite getting tough cases they actually manage a shorter average length of stay. We can’t nail down the mechanism for this but our guess is that they simply provide higher-quality care and make fewer errors, which lets people get well faster.
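For readers curious what such an estimate looks like mechanically, here is a minimal sketch of a log-linear cost regression; this is a generic illustration with hypothetical variable names, not our exact specification (that is in the code on OSF, linked below):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical discharge-level data: cost, hospital type, and
# patient controls.
df = pd.read_csv("discharges.csv")

# Log-linear model: coefficients on the hospital-type dummies are
# roughly percent differences relative to community hospitals.
model = smf.ols(
    "np.log(total_cost) ~ C(hospital_type, Treatment('community'))"
    " + age + C(diagnosis_group)",
    data=df,
).fit()
print(model.summary())

# For a coefficient b, the exact percent difference is exp(b) - 1.
```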

What are Comprehensive Cancer Centers? Here’s what the National Cancer Institute says:

The NCI Cancer Centers Program was created as part of the National Cancer Act of 1971 and is one of the anchors of the nation’s cancer research effort. Through this program, NCI recognizes centers around the country that meet rigorous standards for transdisciplinary, state-of-the-art research focused on developing new and better approaches to preventing, diagnosing, and treating cancer.

Our paper focuses on New York State because of its excellent data, the New York State Statewide Planning and Research Cooperative System Hospital Inpatient Discharges dataset, which lets us track essentially all hospital patients in the state:

We use data on patient demographics, total treatment costs, and lengths of stay for patients discharged from New York hospitals with cancer-related diagnoses between 2017 and 2019.

You know I’m all about sharing data; you can find our data and code for the paper on my OSF page here.

My coauthor on this paper is Ryan Fodero, who wrote the initial draft in my Economics Senior Capstone class last fall. He is deservedly first author: he had the idea, found the data, and wrote the first draft; I just offered comments, cleaned things up for publication, and dealt with the journal. I’ve published with undergraduates several times before, but this is the first time one of my undergrads has hit anything close to a top field journal. You can find a profile of Ryan here; I suspect it won’t be the last you hear of him.

Mortgage Fraud Is Surprisingly Common Among Real Estate Investors

That is the conclusion of a recent Philadelphia Fed working paper by Ronel Elul, Aaron Payne, and Sebastian Tilson. The fraud is that investors buy properties to flip or rent out but claim they are buying them to live in, in order to get cheaper mortgages:

We identify occupancy fraud — borrowers who misrepresent their occupancy status as owner-occupants rather than investors — in residential mortgage originations. Unlike previous work, we show that fraud was prevalent in originations not just during the housing bubble, but also persists through more recent times. We also demonstrate that fraud is broad-based and appears in government-sponsored enterprise and bank portfolio loans, not just in private securitization; these fraudulent borrowers make up one-third of the effective investor population. Occupancy fraud allows riskier borrowers to obtain credit at lower interest rates. 

One third of all investors is a lot of fraud! The flip side of this is that real estate investors are much more prevalent than the official data says:

We argue that the fraudulent purchasers that we identify are very likely to be investors and that accounting for fraud increases the size of the effective investor population by nearly 50 percent.

Many people blame investors for making housing unaffordable for regular people. Economists tend to disagree, and one of our arguments has been to point out that investors are still a small fraction of home buyers. However, official statistics recently showed the investor share over 25% (though dropping fast), and apparently that may still be an understatement. If investors are a problem, there are enough of them to be a big problem.

Of course, there are other reasons economists aren’t so concerned about real estate investors. One is that they can provide the valuable service of renting out homes to people who couldn’t qualify for a mortgage themselves (especially after 2010, when Dodd-Frank made it difficult for people without great credit to qualify). Another is that many investors seem to be surprisingly bad at flipping homes for higher prices. The panic over “iBuyers” that would buy houses sight unseen based on algorithms abated when it turned out those companies lost a ton of money, saw their stock prices plunge, and gave up.

The mortgage fraud paper also provides evidence of investors losing money. In particular, rather than fraudulent investors crowding out the good ones, they are actually more likely to end up defaulting on their purchases:

These fraudulent borrowers perform substantially worse than similar declared investors, defaulting at a 75 percent higher rate.

Still, such widespread fraud is concerning, and I hope lenders (especially the subsidized GSEs) find a way to crack down on it. Based on things I see people bragging about on social media, I’m guessing that tax fraud is also widespread in real estate investing, though I haven’t looked into the literature on it.

This mortgage fraud paper seems like a bombshell to me and I’m surprised it seems to have received no media attention; journalists take note. For everyone else, I suppose you read obscure econ blogs precisely to find out about the things that haven’t yet made the papers.

Replicating Research with Restricted Data

If a scientific finding is really true and important, we should be able to reproduce it: different researchers can investigate and confirm it, rather than just taking one researcher at their word.

Economics has not traditionally been very good at this, but we’re moving in the right direction. It is becoming increasingly common for researchers to voluntarily post their data and code, and for journals (like the AEA journals) to require it:

Source: This talk by Tim Errington

This has certainly been the trend with my own research; if you look at my first 10 papers (all published prior to 2018) I don’t currently share data for any of them, though I hope to go back and add it some day. But of my most recent 10 empirical papers, half share data.

This sharing allows other researchers to easily go back and check that the work is accurate. That could mean simply checking that it is “reproducible”, i.e., that running the original code on the original data produces the results the authors reported. Or it could mean the more ambitious “replicability”, i.e., tackling the same question with different data and still finding basically the same answer. Economics generally does well at reproducibility when code is shared, but just OK at replication.

Of course, even when data and code are shared, you still need people to actually do the double-checking research; this remains relatively rare because it is harder to publish replications than original research. But more replication journals are opening, and there are now several projects funding replications. The trends are all in the right direction for establishing real, robust findings, with one exception: the rise of restricted data.

Traditionally, most economics research has been done using publicly available datasets like the Current Population Survey. But an increasing proportion, perhaps a majority of research at top journals, is now done using restricted datasets (there’s a great graph on this I can’t find, but see section 3.3 here). These datasets legally can’t be shared publicly, due to privacy concerns, licensing agreements, or both. But journals almost always still publish these articles and grant them an exemption from the data-sharing requirement. On the one hand, it makes sense not to ignore this potentially valuable research when there are solid legal reasons the data can’t be shared. But it does mean we can’t be as confident that the data has been analyzed correctly, or even that it really exists.

One potential solution is to find people who have access to the same restricted dataset and have them do a replication study. This is what the Institute for Replication just started doing. They posted a list of 100+ papers that use restricted data that they would like to see replicated. They are offering $5,000 for replications of most of the papers, so I think it is worthwhile for academics to check whether they already have access to relevant datasets, or study similar enough topics that it is worth jumping through the hoops to get access.

For everyone else, this is just one more reason not to put too much trust in any one paper you read now, but to recognize that the field as a whole is getting better and more trustworthy over time. We will be more likely to catch the mistakes, purge the frauds, and put forward more robust results that at least bear a passing resemblance to what science can and should be.

Wives Slightly Out-earning Husbands Is No Longer Weird

As we have gone through our education and training and changed jobs, my wife and I have been in every sort of relative income situation, with each one sometimes vastly or slightly out-earning the other. Currently she slightly out-earns me, which I thought was unusual, as I remembered this graph from Bertrand, Kamenica and Pan in the QJE 2015:

Ungated source: Bertrand Pan Kamenica 2013

The paper argues that the big jump down at 50% is driven by gender norms:

this pattern is best explained by gender identity norms, which induce an aversion to a situation where the wife earns more than her husband. We present evidence that this aversion also impacts marriage formation, the wife’s labor force participation, the wife’s income conditional on working, marriage satisfaction, likelihood of divorce, and the division of home production. Within marriage markets, when a randomly chosen woman becomes more likely to earn more than a randomly chosen man, marriage rates decline. In couples where the wife’s potential income is likely to exceed the husband’s, the wife is less likely to be in the labor force and earns less than her potential if she does work. In couples where the wife earns more than the husband, the wife spends more time on household chores; moreover, those couples are less satisfied with their marriage and are more likely to divorce.

But when I went to look up the paper to show my wife the figures, I found that the effect it highlights may no longer be so large. Natalia Zinovyeva and Maryna Tverdostup show in their 2021 AEJ paper that the jump down at the 50% mark is quite small, and is largely driven by couples who share the same industry and occupation:

They created the figure above using SIPP/SSA/IRS Completed Gold Standard Files, 1990–2004. I’d be interested in an analysis with more recent data. Much of their paper uses more detailed Finnish data to test the mechanism for the remaining jump down at 50%. They conclude that gender norms are not a major driver of the discontinuity:

We argue that the discontinuity to the right of 0.5 can emerge if some couples tend toward earnings equalization or convergence. To test this hypothesis, we exploit the rich employer-employee–linked data from Finland. We find overwhelming support in favor of the idea that the discontinuity is caused by earnings equalization in self-employed couples and earnings convergence among spouses working together. We show that the discontinuity is not generated by selective couple formation or separation and it arises only among self-employed and coworking couples, who account for 15 percent of the population.

Self-employed couples are responsible for most observations with spouses reporting identical earnings. When couples start being self-employed, both sides of the distribution tend to equalize earnings, perhaps because earnings equalization helps couples to reduce income tax payments, facilitate accounting, or avoid unnecessary within-family negotiations. Large spikes emerge not only at 0.5 but also at other round shares signaling the prevalence of ad hoc rules for entrepreneurial income sharing in couples. Self-employment is associated with a fall of household earnings below the level predicted by individuals’ predetermined characteristics, but this drop is mainly due to a decrease in male earnings, with women being relatively better off.

In the case of couples who work together in the same firm, there is a compression of the earnings distribution toward 0.5 both on the right and on the left of 0.5. As a result, there is an increase both in the share of couples where men slightly outearn their wives and in the share of couples where women slightly outearn their husbands. Since the former group is larger, earnings compression leads to a detection of a discontinuity.

So, concerns about relative earnings aren’t causing trouble for women in the labor market. But do they cause trouble at home? Perhaps yes, but if so it’s not in a gendered way and not driven by the 50% threshold:

Separation rates do not exhibit any discontinuity around the 0.5 threshold of relative earnings. Instead, the relationship between the probability of separation and the relative earnings distribution exhibits a U-shape, with higher separation rates among couples with large earnings differentials either in favor of the husband or in favor of the wife.

You Cannot Cut Nominal Wages: Weavers in 1738

I’m reading The Fabric of Civilization (see my AdamSmithWorks piece on specialization). Here is a fascinating story about cloth and markets:

In November 1738, clothier Henry Coulthurst informed weavers that he was cutting their piecework rates and would henceforth pay them in goods rather than cash. Needless to say, they were upset. Food prices were rising, and lower wages meant hunger and want.

Over three days in December, the weavers rioted. They smashed Coulthurst’s mill, wrecked his home, and “drank, carried out, and spilt, all the Beer, Wine and Brandy in the cellars.” They returned the following day to demolish Coulthurst’s house…

Wow. Our paper on cutting nominal wages is called “If Wages Fell During a Recession.” We ran an experiment in which workers could retaliate if they experienced a nominal wage cut. They did! They couldn’t smash their employer’s house, but some of the slighted workers dropped their effort to the minimum, which meant that their employer made no more money in the experiment.

In my talk at IUE (show notes here and YouTube video), I connect the wage cut paper to another experiment on beliefs. One wonders, considering how serious the consequences turned out to be for Henry Coulthurst, why he was not able to anticipate the backlash against wage cuts. Being wrong was costly for him.

People are not always good at appreciating how strongly others have become attached to their own reference points. That’s why the paper on beliefs is called “My Reference Point, Not Yours.”

Historical Price to Earnings Ratios By Industry

Getting long-run historical PE ratios of US stocks by industry seems like the kind of thing that should be easy, but is not. At least, I searched for an hour on Google, ChatGPT, and Bing AI to no avail.

I eventually got monthly median PEs for the Fama-French 49 industries back to 1970 from a proprietary database. I share two key stats here: the average of the monthly median industry PE over 1970-2022, and the most recent data point, from late 2022.

Industry               Long-Run Mean   End 2022
AERO                   12.14           19.49
AGRIC                  10.75            9.64
AUTOS                   9.65           17.52
BANKS                  10.38           10.46
BEER                   15.23           35.70
BLDMT                  12.00           15.41
BOOKS                  12.95           17.60
BOXES                  12.18           10.69
BUSSV                  12.07           13.03
CHEMS                  12.40           19.26
CHIPS                  10.48           17.47
CLTHS                  11.45           10.94
CNSTR                   8.98            4.58
COAL                    8.04            2.92
DRUGS                   1.14            8.01
ELCEQ                  10.78           17.85
FABPR                  10.28           19.40
FIN                    11.16           12.97
FOOD                   14.30           25.03
FUN                     9.10           21.06
GOLD                    3.18           -5.95
GUNS                   11.50            5.05
HARDW                   7.96           19.16
HLTH                   11.91            6.09
HSHLD                  12.60           20.15
INSUR                  10.95           16.33
LABEQ                  13.46           25.18
MACH                   12.51           20.27
MEALS                  13.83           19.19
MEDEQ                   6.81           27.64
MINES                   8.06           16.27
OIL                     6.96            9.00
OTHER                  12.20           27.68
PAPER                  12.50           16.69
PERSV                  12.86           -0.65
RLEST                   8.13           -0.30
RTAIL                  12.26            8.58
RUBBR                  12.11           12.81
SHIPS                   9.79           17.42
SMOKE                  11.74           17.79
SODA                   12.38           32.09
SOFTW                   8.21           -2.85
STEEL                   8.18            4.30
TELCM                   6.75            9.58
TOYS                    9.18           -1.32
TRANS                  11.25           13.11
TXTLS                   9.43          -49.00
UTIL                   12.34           17.41
WHLSL                  11.08           13.13
Mean Industry Median   10.52           12.73

One obvious idea for what to do with this is to invest in industries whose PE is well below its historical average, and avoid industries above it (not investment advice). Looking just at current PEs is OK, but a stock with a PE of 8 isn’t necessarily a good value if it’s in an industry that typically has PEs of 6.

By this metric, what looks overvalued? Money-losing industries (negative current earnings): Gold, Personal Services, Real Estate, Software, Toys, and Textiles. Making money but with valuations 19+ points above the historical average: Medical Equipment, Beer, Soda. Most undervalued relative to history: Guns, Health, Coal, Construction, Steel, Retail (all 3+ points below the historical average).
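If you want to run this screen yourself, the arithmetic is just the gap between the current PE and the long-run mean. A minimal sketch in Python, using a few rows from the table above:

```python
import pandas as pd

# A few rows from the table above: long-run mean of the monthly median
# industry PE (1970-2022) and the late-2022 value.
pe = pd.DataFrame(
    {"long_run": [11.50, 11.91, 8.04, 15.23, 6.81, 12.38],
     "end_2022": [5.05, 6.09, 2.92, 35.70, 27.64, 32.09]},
    index=["GUNS", "HLTH", "COAL", "BEER", "MEDEQ", "SODA"],
)

# Negative earnings make PE meaningless, so drop industries with a
# negative current PE before screening.
pe = pe[pe["end_2022"] > 0]
pe["gap"] = pe["end_2022"] - pe["long_run"]
print(pe.sort_values("gap"))  # most "undervalued" first (not advice)
```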

Of course, I don’t recommend blindly investing in these “undervalued” industries, not just for legal reasons, but because sometimes the market prices them low for a reason: earnings are expected to fall. The industry may be in secular decline due to new types of competition (coal, steel, retail). Or investors may expect it to get hit by a big cyclical decline in an upcoming recession, or by the rotation from the Covid goods/manufacturing economy back to services (guns, construction, steel, retail). Health services (as opposed to drugs and medical equipment) stands out as the sector where I don’t see what is driving it to trade at barely half of its usual PE.

I’d still like to get data on long-run market-cap-weighted mean PE by industry, as opposed to the medians I show here. The best public page I found is Aswath Damodaran’s data page, which has a wide variety of statistics back to about 1999. Some of the current PEs he calculates are quite different from those in my source, another reason to tread carefully here. I’m not sure how much of the difference is mean vs. median and how much is driven by different classifications of which stocks fit in which industry category.

This gets at a big question for anyone trying to actually trade on this: do you buy single stocks, or industry ETFs? Industry ETFs make sense in principle (since we’re talking about industry-level PEs overall) and also add built-in diversification. But the PE of an ETF’s basket of stocks likely differs from that of the industry as a whole. It would make more sense to compare the ETF’s current PE to its own historical PE, but most industry ETFs have very short track records (nothing close to the 53 years I show here). PE is also far from the only valuation metric worth considering.

All this gets complex fast, but I hope these historical PE ratios by industry make for a helpful start.

New Paper with Evidence that ChatGPT Hallucinates Nonexistent Citations

I posted a new working paper with systematic evidence of false citations when ChatGPT (GPT-3.5) writes about the academic literature.

Buchanan, Joy and Shapoval, Olga, GPT-3.5 Hallucinates Nonexistent Citations: Evidence from Economics (June 3, 2023). Available at SSRN: https://ssrn.com/abstract=4467968 or http://dx.doi.org/10.2139/ssrn.4467968

Abstract: We create a set of prompts from every Journal of Economic Literature (JEL) topic to test the ability of a GPT-3.5 large language model (LLM) to write about economic concepts. For general summaries, ChatGPT can perform well. However, more than 30% of the citations suggested by ChatGPT do not exist. Furthermore, we demonstrate that the ability of the LLM to deliver accurate information declines as the question becomes more specific. This paper provides evidence that, although GPT has become a useful input to research production, fact-checking the output remains important.

Figure 2 in the paper shows that the proportion of real citations goes down as the prompt becomes more specific. This has been noticed by other people, but I don’t think it had been documented quantitatively before.

We asked ChatGPT to cover a wide range of topics within economics. For every JEL category, we constructed three prompts with increasing specificity (see the sketch after this list):

Level 1: The first prompt, using A here as an example, was “Please provide a summary of work in JEL category A, in less than 10 sentences, and include citations from published papers.”

Level 2: The second prompt was about a topic within the JEL category that was well-known. An example for JEL category Q is, “In less than 10 sentences, summarize the work related to the Technological Change in developing countries in economics, and include citations from published papers.”

Level 3: We used the word “explain” instead of “summarize” in the prompt, asking about a more specific topic related to the JEL category. For L we asked, “In less than 10 sentences, explain the change in the car industry with the rising supply of electric vehicles and include citations from published papers as a list. include author, year in parentheses, and journal for the citations.”
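For anyone who wants to replicate the pattern, here is a minimal sketch of how such prompts could be generated and sent to GPT-3.5 using the 2023-era openai Python client; the topic strings below are only illustrative stand-ins for our full prompt list:

```python
import openai  # the 2023-era client (openai<1.0), which has ChatCompletion

openai.api_key = "sk-..."  # your API key

def build_prompts(jel: str, topic: str, specific_topic: str) -> list[str]:
    """Three prompts of increasing specificity for one JEL category."""
    return [
        # Level 1: the whole JEL category.
        f"Please provide a summary of work in JEL category {jel}, in less "
        "than 10 sentences, and include citations from published papers.",
        # Level 2: a well-known topic within the category.
        f"In less than 10 sentences, summarize the work related to {topic} "
        "in economics, and include citations from published papers.",
        # Level 3: "explain" a narrower topic.
        f"In less than 10 sentences, explain {specific_topic} and include "
        "citations from published papers as a list. include author, year "
        "in parentheses, and journal for the citations.",
    ]

# Illustrative topics only; the paper pairs each JEL category with its
# own Level 2 and Level 3 topics.
for prompt in build_prompts(
    "Q",
    "the Technological Change in developing countries",
    "the change in the car industry with the rising supply of electric vehicles",
):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response["choices"][0]["message"]["content"])
```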

The paper is only 5 pages long, but we include over 30 pages of the GPT responses to our prompts in the appendix. If you are an economist who has not yet played with ChatGPT, you might find it useful to scan this appendix to get a sense of what GPT “knows” about various fields of economics.

If SSRN isn’t working for you, here is also a Google Drive link to the working paper: https://drive.google.com/file/d/1Ly23RMBlim58a7CbmLwNL_odHSNRjC1L/view?usp=sharing

Previous iterations of this idea on EWED:

https://economistwritingeveryday.com/2023/04/17/chatgpt-as-intern/ Mike’s thoughts on what the critter is good for.

https://economistwritingeveryday.com/2023/01/21/chatgpt-cites-economics-papers-that-do-not-exist/ This is one of our top posts for traffic in 2023, since the topic interests the public. That was January 2023, and here we are in June. It’s very possible that this problem will be fixed soon; we can log the bug now to serve as a benchmark of progress.

A check-in and comparison with Bing: