Kaggle Wins for Data Sharing

I like to take existing datasets, clean them up, and share them in easier-to-use formats. When I started doing this back in 2022, my strategy was to host the datasets with the Open Science Framework (OSF) and share the links here and on my personal website.

OSF is great for allowing large uploads and complex projects, but not great for discovery. I saw several of my students struggle to navigate OSF project pages to find the appropriate data files, and the site seems to have poor SEO. OSF's analytics show that my data files there get few views, and most of the views they do get come from people who were already on the OSF site.

This year I decided to upload my new projects, like the County Demographics data, to Kaggle.com in addition to OSF, and so far Kaggle is the clear winner. My datasets are getting more downloads on Kaggle than views on OSF. I’ve noticed that Kaggle pages tend to rank highly on Google and especially on Google Dataset Search. I think Kaggle also gets more internal referrals, since it hosts popular machine learning competitions.

Kaggle has its own problems, of course: one of its prominent download buttons downloads only the first 10 columns of CSV or XLSX files by default. But it is the best tool I have found so far for getting datasets into the hands of people who will find them useful. Let me know if you’ve found a better one.

County Demographic Data: A Clean Panel 1969-2023

Whenever researchers conduct studies using state- or county-level data, we usually want some standard demographic variables to serve as controls: things like total population, average age, and gender and race breakdowns. If the dataset for our main variables of interest doesn’t already have these, we go looking for a separate dataset of demographic controls to merge in, but it has always been surprisingly hard to find a clean, easy-to-use one. For states, I’ve found the University of Kentucky’s National Welfare Database to be the best bet. But what about counties?

I had no good answer, and the best suggestion I got from others was the CDC SEER data. As is so often the case, the government collected this impressively comprehensive dataset but only releases it in an unusable format: in this case, only as txt files that look like this:

I cleaned and reformatted the CDC SEER data into a neat panel of county demographics that looks like this:

I posted my code and data files (CSV, XLSX, and DTA) on OSF and my data page as usual. I also posted the data files on Kaggle, which seems to be more user-friendly and turns up better in searches. I welcome suggestions for any other data repositories or file formats you would like to see me post.
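For anyone who wants to use a panel like this as controls, here is a minimal sketch of the merge step in Python with pandas. The column names (fips, year, population, share_female) and all of the values are hypothetical placeholders I made up for illustration, not the actual variable names in my files, so adjust them to match whatever you download.

```python
import pandas as pd

# Hypothetical main dataset: one row per county-year with an outcome of interest.
outcomes = pd.DataFrame({
    "fips": ["01001", "01001", "01003"],
    "year": [2020, 2021, 2020],
    "outcome": [1.2, 1.5, 0.9],
})

# Hypothetical slice of a county demographics panel (values are made up).
demographics = pd.DataFrame({
    "fips": ["01001", "01001", "01003", "01003"],
    "year": [2020, 2021, 2020, 2021],
    "population": [100, 110, 200, 210],
    "share_female": [0.51, 0.51, 0.52, 0.52],
})

# Keep county FIPS codes as zero-padded 5-character strings so that
# leading zeros (e.g. Alabama's "01") are not lost.
for df in (outcomes, demographics):
    df["fips"] = df["fips"].astype(str).str.zfill(5)

# Left merge keeps every row of the main dataset. validate= raises an error
# if the panel has duplicate county-years; indicator= adds a "_merge" column
# recording whether each row found a match.
merged = outcomes.merge(
    demographics,
    on=["fips", "year"],
    how="left",
    validate="many_to_one",
    indicator=True,
)

unmatched = merged[merged["_merge"] != "both"]
print(f"{len(unmatched)} of {len(merged)} rows lack demographic controls")
```

If you load the panel from CSV instead, passing something like dtype={"fips": str} to pd.read_csv should keep the leading zeros intact, again assuming the county identifier column is named that way.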

HT: Kabir Dasgupta