excel – Economist Writing Every Day

State Demographics 1962-2024

July 31, 2025July 30, 2025James BaileyLeave a comment

I provide a simple, clean panel dataset of the historical demographics of US states here.

I made this state-year level dataset from the individual-level responses in the Current Population Survey‘s Annual Social and Economic Supplement. It includes key demographic variables commonly used as controls in regressions: age, race, sex, marital status, income, education, health insurance.

This was no great feat, in that I’m sure hundreds of researchers have generated very similar datasets before- but as far as I can tell, none of them share it publicly like this. I hope that posting this will save the time of researchers who would otherwise spend a couple hours re-doing the same work themselves, and offer easy-to-use Stata and Excel files for students who might not be able to make one themselves.

I’ve often seen my own students start an empirical project where they have state-level data on their dependent and independent variables of interest, then realize they need some control variables and find that they are surprisingly hard to get. Census’ website is difficult to navigate and mostly offers its data one year at a time. IPUMS is great for getting individual-level responses; they have a tool that can export state-level data directly, but it is not built or advertised in a way that students end up actually using it for that, and it is limited to exporting 8000 cells of data at a time (the dataset I share has 53,856). The National Welfare Data is what I have been recommending to students for state-level controls, but its year coverage is more limited (1980-2023), and it contains welfare variables that are mostly extraneous for this. County-level data is available back to 1969, which could be aggregated to the state level, but it doesn’t have income or education.

Disclaimers: prior to 1977, some states are missing or are combined with nearby states (the data makes clear which ones). Some variables change how they are coded by Census / IPUMS over time. Age in particular sees several big changes to its universe and its top-coding, and race gets recoded in 2003. Therefore, this is not always a good dataset for measuring national trends in a variable over time, but it should still work well for making comparisons across states in a given year. If you are using this data in a regression, I strongly recommend controlling for year fixed effects to mitigate this issue. If you want to double-check my cleaning code, it is available here.

Once again, the state demographic data is available here. If you think there is a better source out there for historical state demographics, let us know below. See my data page for more cleaned-up versions of public datasets.

Excel’s Weird (In)Convenience: COUNTIF, AVERAGEIF, & STDEVIF

December 20, 2024December 23, 2024Zachary Bartsch1 Comment

Excel is an attractive tool for those who consider themselves ‘not a math person’. In particular, it visually organizes information and has many built-in functions that can make your life easier. You can use math if you want, but there are functions that can help even the non-math folks

If you are a moderate Excel user, then you likely already know about the AVERAGE and COUNT functions. If you’re a little but statistically inclined, then you might also know about the STDEV.S function (STDEV is deprecated). All of these functions are super easy and only have one argument. You just enter the cells (array) that you want to describe, and you’re done. Below is an example with the ‘code’ for convenience.

=COUNT(A2:A21)
=AVERAGE(A2:A21)
=STDEV.S(A2:A21)

If you do some slightly more sophisticated data analysis, then you may know about the “IF” function. It’s relatively simple; if a proposition is true (such as a cell value condition), then it returns a value. If the proposition is false, then it returns another value. You can even create nested “IF”s in which a condition being satisfied results in another tested proposition. Back when excel had more limited functions, we had to think creatively because there was a limit to the number of nested “IF” functions that were permitted in a single cell. Prior to 2007, a maximum of seven “IF” functions were permitted. Now the maximum is 64 nested “IF”s. If you’re using that many “IF”s, then you might have bigger problems than the “IF” limitations.

Another improvement that Excel introduced in 2019 was easier array arguments. In prior versions of Excel, there was some mild complication in how array functions must be entered (curly brackets: {}). But now, Excel is usually smart enough to handle the arrays without special instructions. Subsequently, Excel has introduced functions that combine the array features with the “IF” functions to save people keystrokes and brainpower.

Looking at the example data we see that there is an identifier that marks the values as “A” or “B”. Say that you want to describe these subgroups. Historically, if you weren’t already a sophisticated user, then you’d need to sort the data and then calculate the functions for each subgroup’s array. That’s no big deal for small sets of data and two possible ID values, but it’s a more time-consuming task for many possible ID values and multiple ID categories.

The early “IF” statements allowed users to analyze certain values of the data, such as those that were greater than, less than, or equal to a particular value. But, what if you want to describe the data according to criteria in another column (such as ID)? That’s where Excel has some more sophisticated functions for convenience. However, as a general matter of user interface, it will be clear why these are somewhat… awkward.

Continue reading →

Cheers to Sumproduct!

October 28, 2022October 27, 2022Zachary Bartsch2 Comments

I teach macroeconomics, finance, and other things.

Often, I use Excel to complete repetitive calculations for my students. The version that I show them is different from the version that I use. They see a lot more mathematical steps displayed in different cells, usually with a label describing what it is. But when I create an answer calculator or work on my own, I usually try to be as concise as possible, squeezing what I can into a single cell or many fewer cells. That’s what brings me to to the sumproduct excel function that I recently learned. It’s super useful I’ll illustrate it with two examples.

Example 1) NGDP

One way to calculate NGDP is to sum all of the expenditures on the different products during a time period. The expenditures on a good is simply the price of the good times the quantity that was purchased during the time period. The below image illustrates an example with the values on the left, and the equations that I used on the right. That’s the student version. There is an equation for each good which calculates the total expenditure on the individual goods. Then, there is a final equation which sums the spending to get total expenditures, or NGDP.

Continue reading →

Everyone’s an Expert: Easy Data Maps in Excel

June 24, 2022June 23, 2022Zachary Bartsch1 Comment

I love data, I love maps, and I love data visualizations.

While we tend not to remember entire data sets, we often remember some patterns related to rank. Speaking for myself anyway, I usually remember a handful of values that are pertinent to me. If I have a list of data by state, then I might take special note of the relative ranking of Florida (where I live), the populous states, Kentucky (where my parents’ families live), and Virginia (where my wife’s family lives). I might also take special note of the top rank and the bottom rank. See the below table of liquor taxes by State. You can easily find any state that you care about because the states are listed alphabetically.

A ranking is useful. It helps the reader to organize the data in their mind. But rankings are ordinal. It’s cool that Florida has a lower liquor tax than Virginia and Kentucky, but I really care about the actual tax rates. Is the difference big or small? Like, should I be buying my liquor in one of the other states in the southeast instead of Florida? Without knowing the tax rates, I can’t make the economic calculation of whether the extra stop in Georgia is worth the time and hassle. So, the most useful small data sets will have both the ranking and the raw data. Maybe we’re more interested in the rankings, such as in the below table.

But, tables take time to consume. A reader might immediately take note of the bottom and top values. And given that the data is not in alphabetical order, they might be able to quickly pick out the state that they’re accustomed to seeing in print. But otherwise, it will be difficult to scan the list for particular values of interest.

Continue reading →