Kaggle Wins for Data Sharing

I like to take existing datasets, clean them up, and share them in easier-to-use formats. When I started doing this back in 2022, my strategy was to host the datasets with the Open Science Framework (OSF) and share the links here and on my personal website.

OSF is great for allowing large uploads and complex projects, but not great for discovery. I saw several of my students struggle to navigate its pages to find the appropriate data files, and the site seems to have poor SEO. OSF’s own analytics show that my data files there get few views, and most of the views they do get come from people who were already on the OSF site.

This year I decided to upload my new projects like County Demographics data to Kaggle.com in addition to OSF, and so far Kaggle is the clear winner. My datasets are getting more downloads on Kaggle than views on OSF. I’ve noticed that Kaggle pages tend to rank highly on Google and especially on Google Dataset Search. I think Kaggle also gets more internal referrals, since they host popular machine learning competitions.

Kaggle has its own problems, of course: one of its prominent download buttons downloads only the first 10 columns of CSV or XLSX files by default. But it is the best tool I have found so far for getting datasets into the hands of people who will find them useful. Let me know if you’ve found a better one.

Michigan Consumer Surveys: Individual-Response Data

I’ve now posted individual-level responses to the 1978-2025 Michigan Consumer Surveys to Kaggle in CSV and Stata formats. The University of Michigan’s Consumer Surveys are a widely followed source for data on consumer confidence and inflation expectations:

Their official site is good if you just want summary tables or charts like this:

But what if you want detailed crosstabs to see how sentiment differs for different groups, or microdata so that you can run regressions? With enough clicks you can get this from what UMich calls their “cross-section archive”. But it is pretty well hidden: my student looking into this thought they just didn’t offer individual-level data. And even once you get their data, it is in an unlabelled CSV file with hard-to-understand variable names and codes. So I wanted to make it clear that the full data with all responses for all years is available, and if you use my Stata version it is even reasonably easy to understand (the code I adapted for labelling it is on OSF). Then you can run your regressions, or make charts like this:

The College-Only Covid Recovery
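For readers who would rather start outside Stata, here is a minimal sketch of the kind of group comparison described above: a weighted mean of a sentiment measure by year and education group. The rows and the column names (`year`, `college`, `sentiment`, `weight`) are made-up placeholders, not the actual variable names in the Michigan files, so check the codebook before adapting it.

```python
from collections import defaultdict

# Made-up example rows; the real microdata uses its own variable names and codes.
ROWS = [
    {"year": 2019, "college": 1, "sentiment": 110.0, "weight": 1.0},
    {"year": 2019, "college": 0, "sentiment": 90.0, "weight": 2.0},
    {"year": 2019, "college": 0, "sentiment": 96.0, "weight": 1.0},
    {"year": 2020, "college": 1, "sentiment": 100.0, "weight": 1.0},
    {"year": 2020, "college": 1, "sentiment": 80.0, "weight": 3.0},
    {"year": 2020, "college": 0, "sentiment": 70.0, "weight": 1.0},
]


def weighted_mean_by_group(rows, group_keys, value_key, weight_key):
    """Return {group tuple: weighted mean of value_key} for each group."""
    totals = defaultdict(lambda: [0.0, 0.0])  # [sum of weight*value, sum of weights]
    for row in rows:
        key = tuple(row[k] for k in group_keys)
        weight = row[weight_key]
        totals[key][0] += weight * row[value_key]
        totals[key][1] += weight
    # Skip groups whose weights sum to zero to avoid dividing by zero.
    return {key: s / w for key, (s, w) in sorted(totals.items()) if w > 0}


table = weighted_mean_by_group(ROWS, ("year", "college"), "sentiment", "weight")
for (year, college), mean in table.items():
    print(year, "college" if college else "no college", round(mean, 1))
```

With real data you would read the rows from the CSV with `csv.DictReader` and convert the numeric fields before passing them in.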

If you’re new here, a reminder that you can find other cleaned-up versions of popular datasets on my data page.

Real Time Crime Index

If you want to know how many pigs were killed in the United States yesterday, the USDA has the answer. But if you want to know how many humans were killed in the US this month, the FBI is going to need a year or two to figure it out. The new Real Time Crime Index, though, can tell you much sooner, by putting together the faster local agency reports:

Trends currently look good, though murders still aren’t quite back to pre-2020 levels.

In addition to graphing top-line state and national trends, the Real Time Crime Index also offers the option to download a CSV with city-level data going back to 2018. This seems like a great resource for researchers, worthy of adding to my page of most-improved datasets.

National Health Expenditure Accounts Historical State Data: Cleaned, Merged, Inflation Adjusted

The government continues to be great at collecting data but not so good at sharing it in easy-to-use ways. That’s why I’ve been on a quest to highlight when independent researchers clean up government datasets and make them easier to use, and to clean up such datasets myself when I see no one else doing it; see previous posts on State Life Expectancy Data and the Behavioral Risk Factor Surveillance System.

Today I want to share an improved version of the National Health Expenditure Accounts Historical State Data.

National Health Expenditure Accounts Historical State Data: The original data from the Centers for Medicare and Medicaid Services on health spending by state and type of provider are actually pretty good as government datasets go: they offer all years (1980-2020) together in a reasonable format (CSV). But they come in separate files for overall spending, Medicare spending, and Medicaid spending. I merge the variables from all three into a single file, transform it from a “wide format” to a “long format” that is easier to analyze in Stata, and in the “enhanced” version I offer inflation-adjusted versions of all spending variables. Excel and Stata versions of these files, together with the code I used to generate them, are here.
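To make the three steps concrete (merge, reshape from wide to long, adjust for inflation), here is a rough Python sketch on toy data. It is not the code I used for the posted files, which is in Stata. The column names (`state`, `Y2019`, `Y2020`), the spending figures, and the price index values are all assumptions made up for illustration.

```python
import csv
import io

# Toy stand-ins for the three source files; all values are made up.
TOTAL_CSV = "state,Y2019,Y2020\nAlabama,100,110\nAlaska,20,22\n"
MEDICARE_CSV = "state,Y2019,Y2020\nAlabama,30,33\nAlaska,5,6\n"
MEDICAID_CSV = "state,Y2019,Y2020\nAlabama,25,28\nAlaska,7,8\n"

# Hypothetical price index (2020 = 100); not real CPI values.
PRICE_INDEX = {2019: 98.0, 2020: 100.0}


def wide_to_long(csv_text):
    """Turn one row per state with Y#### columns into {(state, year): value}."""
    out = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for col, val in row.items():
            if col.startswith("Y") and col[1:].isdigit():
                out[(row["state"], int(col[1:]))] = float(val)
    return out


def merge_long(sources):
    """Combine several {(state, year): value} dicts into long-format rows."""
    keys = sorted(set().union(*(d.keys() for d in sources.values())))
    rows = []
    for state, year in keys:
        row = {"state": state, "year": year}
        for name, data in sources.items():
            row[name] = data.get((state, year))  # None if a file lacks this cell
        rows.append(row)
    return rows


def add_real_values(rows, index, base_year):
    """Add <name>_real columns expressed in base_year dollars."""
    for row in rows:
        factor = index[base_year] / index[row["year"]]
        for name in [k for k in row if k not in ("state", "year")]:
            if row[name] is not None:
                row[name + "_real"] = round(row[name] * factor, 2)
    return rows


sources = {
    "total": wide_to_long(TOTAL_CSV),
    "medicare": wide_to_long(MEDICARE_CSV),
    "medicaid": wide_to_long(MEDICAID_CSV),
}
long_rows = add_real_values(merge_long(sources), PRICE_INDEX, base_year=2020)
```

Each element of `long_rows` is one state-year with nominal and inflation-adjusted columns side by side, which is the shape that panel regressions usually expect.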

A warning to everyone using the data, since it messed me up for a while: in the documentation provided by CMS, Table 3 provides incorrect codes for most variables. I emailed them about this but who knows when it will get fixed. My version of the data should be correct now, but please let me know if you find otherwise. You can find several other improved datasets, from myself and others, on my data page.

Behavioral Risk Factor Surveillance System Survey: Now in Stata and CSV formats

The BRFSS Annual Survey is now available in Stata DTA and Excel-friendly CSV formats at my Open Science Framework page.

The US government is great at collecting data, but not so good at sharing it in easy-to-use ways. When people try to access these datasets they either get discouraged and give up, or spend hours getting the data into a usable form. One of the crazy things about this is all the duplicated effort: hundreds of people might end up spending hours cleaning the data in mostly the same way. Ideally the government would just post a better version of the data on their official page. But barring that, researchers and other “data heroes” can provide a huge public service by publicly posting datasets that they have already cleaned up, and some have done so.

That’s what I said in December when I added a data page to my website that highlights some of these “most improved datasets”. Now I’m adding the Behavioral Risk Factor Surveillance System. The BRFSS has been collected by the Centers for Disease Control since the 1980s. It now surveys 400,000 Americans each year on health-related topics including alcohol and drug use, health status, chronic disease, health care use, height and weight, diet, and exercise, along with demographics and geography. It’s a great survey that is underused because the CDC only offers it in XPT and ASC formats. So I offer it in Stata DTA and Excel CSV formats here.
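If you are curious what such a conversion involves, here is a minimal Python sketch. It assumes the ASC file is fixed-width text, where each variable occupies a known range of columns on every line. The three-variable layout and the sample lines are hypothetical; the real column positions have to come from the CDC’s documentation for each survey year. This is an illustration, not the process I used to build the posted files.

```python
import csv
import io

# Hypothetical layout: (variable name, 1-based start column, width).
LAYOUT = [
    ("state_fips", 1, 2),
    ("interview_month", 3, 2),
    ("general_health", 5, 1),
]


def parse_fixed_width(lines, layout):
    """Slice each line into named fields; blank fields become None."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        record = {}
        for name, start, width in layout:
            raw = line[start - 1 : start - 1 + width].strip()
            record[name] = raw if raw else None
        records.append(record)
    return records


def to_csv_text(records, layout):
    """Write the records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=[name for name, _, _ in layout], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# Made-up sample lines standing in for a real ASC file.
SAMPLE = ["01023\n", "0211 \n", "56122\n"]
records = parse_fixed_width(SAMPLE, LAYOUT)
csv_text = to_csv_text(records, LAYOUT)
print(csv_text)
```

Values are kept as strings on purpose, so codes with leading zeros (like state FIPS codes) survive the trip into CSV.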

Let me know what dataset you’d like to see improved next.

Most Improved Data

The US government is great at collecting data, but not so good at sharing it in easy-to-use ways. When people try to access these datasets they either get discouraged and give up, or spend hours getting the data into a usable form. One of the crazy things about this is all the duplicated effort: hundreds of people might end up spending hours cleaning the data in mostly the same way. Ideally the government would just post a better version of the data on their official page. But barring that, researchers and other “data heroes” can provide a huge public service by publicly posting datasets that they have already cleaned up, and some have done so.

I just added a data page to my website that highlights some of these “most improved datasets”:

  • The IPUMS versions of the American Community Survey, Current Population Survey, and Medical Expenditure Panel Survey
  • The County Business Patterns Database, harmonized by Fabian Eckert, Teresa C. Fort, Peter K. Schott, and Natalie J. Yang
  • Code for accessing the Quarterly Census of Employment and Wages by Gabriel Chodorow-Reich
  • The merged Statistics of US Business, my own attempt to contribute

I hope to keep adding to this page as I find other good sources of unofficial/improved data, and as I create them (one of my post-tenure goals). See the page for more detail on these datasets, and comment here if you know of existing improved datasets worth adding, or if you know of needlessly terrible datasets you think someone should clean up.