Triumph of the Data Hoarders 2: The Institutions

Datasets can be pulled offline for all sorts of reasons. As I wrote in February, this shows the value of being a data hoarder, that is, of downloading now any data you think you might want later:

Several major datasets produced by the federal government went offline this week…. This serves as a reminder of the value of redundancy: keeping datasets on multiple sites as well as in local storage. You never really know when one site will go down, whether due to ideological changes, mistakes, natural disasters, or key personnel moving on.

The US federal government shutdown this month provides another reminder of this. So far most datasets are still up, but I’ve seen some availability issues:

The good news is that a number of institutions have stepped up in 2025 to host at-risk datasets (joining those like IPUMS, NBER, and Archive.org, which have been hosting datasets for many years but are scaling up to meet the moment):

  • Restore CDC hosts all CDC data as it was in January 2025.
  • The Data Rescue Project provides tools and suggestions for how other institutions can save data at scale, plus links to other projects.

Triumph of the Data Hoarders

Several major datasets produced by the federal government went offline this week. Some, like the Behavioral Risk Factor Surveillance System and the American Community Survey, are now back online; most others will probably join them soon. But some datasets that the current administration considers too DEI-inflected could stay down indefinitely.

This serves as a reminder of the value of redundancy: keeping datasets on multiple sites as well as in local storage. You never really know when one site will go down, whether due to ideological changes, mistakes, natural disasters, or key personnel moving on.

External hard drives are an affordable option for anyone who wants to build up their own local data hoard going forward. The Open Science Framework (OSF) site allows you to upload datasets up to 50 GB to share publicly; that’s how I’ve been sharing cleaned-up versions of the BRFSS, state-level NSDUH, National Health Expenditure Accounts, Statistics of US Business, and more. If you have a dataset that isn’t online anywhere, or one that you’ve cleaned or improved to the point that it is better than the versions currently online, I encourage you to post it on OSF.
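For anyone starting a local hoard, here is a minimal sketch in Python of one way to do it: save a copy of a file and log where it came from, when, and its SHA-256 checksum, so that copies kept in different places can later be checked against each other. The function name `hoard`, the `manifest.json` layout, and the URL in the usage note are my own illustrative assumptions, not part of any tool mentioned in this post.

```python
import hashlib
import json
import urllib.request
from datetime import datetime, timezone
from pathlib import Path


def hoard(url: str, dest_dir: str, manifest_name: str = "manifest.json") -> dict:
    """Download `url` into `dest_dir` and append a record to a JSON manifest.

    Hypothetical helper for illustration only. The file name is taken from
    the last path segment of the URL, which is a simplification (query
    strings, if any, would end up in the name).
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    filename = url.rstrip("/").rsplit("/", 1)[-1] or "dataset"
    target = dest / filename

    sha256 = hashlib.sha256()
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        # Stream in 1 MiB chunks so large files never sit fully in memory.
        while chunk := response.read(1024 * 1024):
            sha256.update(chunk)
            out.write(chunk)

    entry = {
        "url": url,
        "file": filename,
        "sha256": sha256.hexdigest(),
        "bytes": target.stat().st_size,
        "downloaded_utc": datetime.now(timezone.utc).isoformat(),
    }

    # Keep one manifest per folder: a JSON list with one entry per download.
    manifest_path = dest / manifest_name
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else []
    manifest.append(entry)
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return entry
```

Calling something like `hoard("https://example.org/data/survey.csv", "/mnt/external/hoard")` (a made-up URL and path) would save the file and add one line to the manifest. Comparing the recorded checksum against a copy held elsewhere is a cheap way to confirm that two "redundant" copies really are the same file.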

If you are currently looking for a federal dataset that got taken down, some good places to check are IPUMS, NBER, Archive.org, or my data page. PolicyMap has posted some of the federal datasets that seem particularly likely to stay down; if you know of other pages hosting federal datasets that have been taken down, please share them in the comments.

IPUMS Data Intensive Workshop & Conference

I just returned from the Full Count IPUMS data workshop at the Data-Intensive Research Conference that was hosted by the Network on Data Intensive Research on Aging and IPUMS. The theme of this conference was “Linking Records”.

It was the best workshop and conference that I’ve ever attended. I’d attended the conference remotely in the past, but attending the workshop was exceptional. About 20 other people and I were flown to the Minnesota Population Center and put up in a hotel during our stay, which made the conference a low-stress affair. The whole workshop was well organized, the speakers built on one another’s content, and there was a hands-on lab for us to complete. I felt my human capital growing by the hour.


Almost Observable Human Capital

I’ve written about IPUMS before. It’s great. Among the individual-level details it records are each person’s occupation and the industry of that occupation. That’s convenient because we can observe how technology spread across America by observing employment in those industries. We can also identify whether demographic subgroups differed by occupation. There are plenty of ways to slice the data: sex, race, age, nativity, etc.

But what do we know about historical occupations and what they entailed? At first blush, we just have our intuition. But it turns out that we have more. There is a super boring 1949 report published by the Department of Labor called the “Dictionary of Occupational Titles”. The title says it all. But the DOL published another report in 1956 that’s conceptually more interesting, called “Estimates of Worker Trait Requirements for 4,000 Jobs as Defined in the Dictionary of Occupational Titles: An Alphabetical Index”. The report lists thousands of occupations and identifies typical worker aptitudes, worker temperaments, worker interests, worker physical capacities, and working conditions. Below is a sample of how the table is organized:


Most Improved Data

The US government is great at collecting data, but not so good at sharing it in easy-to-use ways. When people try to access these datasets they either get discouraged and give up, or spend hours getting the data into a usable form. One of the crazy things about this is all the duplicated effort: hundreds of people might end up spending hours cleaning the data in mostly the same way. Ideally the government would just post a better version of the data on its official page. But barring that, researchers and other “data heroes” can provide a huge public service by publicly posting datasets that they have already cleaned up, and some have done so.

I just added a data page to my website that highlights some of these “most improved datasets”:

  • The IPUMS versions of the American Community Survey, Current Population Survey, and Medical Expenditure Panel Survey
  • The County Business Patterns Database, harmonized by Fabian Eckert, Teresa C. Fort, Peter K. Schott, and Natalie J. Yang
  • Code for accessing the Quarterly Census of Employment and Wages by Gabriel Chodorow-Reich
  • The merged Statistics of US Business, my own attempt to contribute

I hope to keep adding to this page as I find other good sources of unofficial/improved data, and as I create them (one of my post-tenure goals). See the page for more detail on these datasets, and comment here if you know of existing improved datasets worth adding, or if you know of needlessly terrible datasets you think someone should clean up.