Kaggle Wins for Data Sharing

I like to take existing datasets, clean them up, and share them in easier-to-use formats. When I started doing this back in 2022, my strategy was to host the datasets with the Open Science Framework (OSF) and share the links here and on my personal website.

OSF is great for allowing large uploads and complex projects, but not great for discovery. I saw several of my students struggle to navigate OSF pages to find the appropriate data files, and the site seems to have poor SEO. OSF analytics show that my data files there get few views, and most of the views they do get come from people who were already on the OSF site.

This year I decided to upload my new projects like County Demographics data to Kaggle.com in addition to OSF, and so far Kaggle is the clear winner. My datasets are getting more downloads on Kaggle than views on OSF. I’ve noticed that Kaggle pages tend to rank highly on Google and especially on Google Dataset Search. I think Kaggle also gets more internal referrals, since they host popular machine learning competitions.

Kaggle has its own problems of course: for example, one of its prominent download buttons only downloads the first 10 columns of CSV or XLSX files by default. But it is the best tool I have found so far for getting datasets into the hands of people who will find them useful. Let me know if you’ve found a better one.

National Survey of Children’s Health Backup

The NSCH is the latest casualty of the new administration taking down major datasets from government websites. Between Archive.org and what I had downloaded for old projects, I was able to get all the 2016-2023 topical NSCH files and post them on an Open Science Framework page.

I took this as a chance to improve the data. The government previously only made the topical Public Use Files available in SAS and Stata formats, one year at a time, so I added a merged version covering all available years in both Stata and Excel formats.
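Merging the years mostly means stacking the yearly files and recording which year each row came from. Here is a minimal sketch of that logic in Python, using only the standard library and plain rows rather than real Stata files. The function name `stack_years`, the added `year` column, and the sample variable names are all illustrative assumptions, not actual NSCH variables and not necessarily how the posted files were built:

```python
def stack_years(yearly_rows):
    """Stack per-year rows into one table, tagging each row with its year.

    yearly_rows maps a survey year to a list of dict rows. Columns that
    are missing in some years are filled with an empty string.
    Returns (columns, merged_rows).
    """
    # Collect the union of column names, keeping first-seen order.
    columns = ["year"]
    for rows in yearly_rows.values():
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)

    # Emit rows in ascending year order, one consistent set of columns.
    merged = []
    for year in sorted(yearly_rows):
        for row in yearly_rows[year]:
            record = {name: row.get(name, "") for name in columns}
            record["year"] = year
            merged.append(record)
    return columns, merged


# Hypothetical sample data: the 2016 file lacks a column that 2017 has.
sample = {
    2017: [{"id": "a", "age": "5"}],
    2016: [{"id": "b"}],
}
columns, merged = stack_years(sample)
```

Survey questions get added and dropped between years, so the sketch takes the union of columns instead of assuming every year has the same layout. In practice a dataframe library that reads Stata files directly would do the same job with less code.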

I hope and expect that the National Survey of Children’s Health will be back up at official websites soon. But I expect that other datasets will be taken down permanently, so now is the time to download what you think you might need and add it to your data hoard, especially if you want anything from the Department of Education.

Triumph of the Data Hoarders

Several major datasets produced by the federal government went offline this week. Some, like the Behavioral Risk Factor Surveillance System and the American Community Survey, are now back online; most of the others will probably join them soon. But some datasets that the current administration considers too DEI-inflected could stay down indefinitely.

This serves as a reminder of the value of redundancy: keeping datasets on multiple sites as well as in local storage. You never really know when one site will go down, whether due to ideological changes, mistakes, natural disasters, or key personnel moving on.

External hard drives are an affordable option for anyone who wants to build up their own local data hoard going forward. The Open Science Framework site allows you to upload datasets up to 50 GB to share publicly; that’s how I’ve been sharing cleaned-up versions of the BRFSS, state-level NSDUH, National Health Expenditure Accounts, Statistics of US Business, and more. If you have a dataset that isn’t online anywhere, or one that you’ve cleaned or improved to the point it is better than the versions currently online, I encourage you to post it on OSF.

If you are currently looking for a federal dataset that got taken down, some good places to check are IPUMS, NBER, Archive.org, or my data page. PolicyMap has posted some of the federal datasets that seem particularly likely to stay down; if you know of other pages hosting federal datasets that have been taken down, please share them in the comments.

Behavioral Risk Factor Surveillance System Survey: Now in Stata and CSV formats

The BRFSS Annual Survey is now available in Stata DTA and Excel-friendly CSV formats at my Open Science Framework page.

The US government is great at collecting data, but not so good at sharing it in easy-to-use ways. When people try to access these datasets they either get discouraged and give up, or spend hours getting the data into a usable form. One of the crazy things about this is all the duplicated effort: hundreds of people might end up spending hours cleaning the data in mostly the same way. Ideally the government would just post a better version of the data on their official page. But barring that, researchers and other “data heroes” can provide a huge public service by publicly posting datasets that they have already cleaned up, and some have done so.

That’s what I said in December when I added a data page to my website that highlights some of these “most improved datasets”. Now I’m adding the Behavioral Risk Factor Surveillance System. The BRFSS has been collected by the Centers for Disease Control since the 1980s. It now surveys 400,000 Americans each year on health-related topics including alcohol and drug use, health status, chronic disease, health care use, height and weight, diet, and exercise, along with demographics and geography. It’s a great survey that is underused because the CDC only offers it in XPT and ASC formats. So I offer it in Stata DTA and Excel CSV formats here.
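To give a sense of what converting an ASC file involves, here is a minimal Python sketch. It assumes the file is fixed-width text, where each variable occupies set character positions listed in a codebook. The layout, column names, and sample lines below are made up for illustration; they are not the real BRFSS positions, and this is not the exact process behind the posted files:

```python
import csv
import io  # only used for the in-memory demo at the bottom

# Hypothetical layout: (column name, start position, width), with
# 1-indexed start positions as in a typical codebook.
# These are NOT the real BRFSS positions.
LAYOUT = [("state", 1, 2), ("sex", 3, 1), ("weight_lb", 4, 3)]


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of stripped strings."""
    return {
        name: line[start - 1:start - 1 + width].strip()
        for name, start, width in layout
    }


def fixed_width_to_csv(lines, out, layout=LAYOUT):
    """Write fixed-width lines to the file-like `out` as CSV with a header."""
    writer = csv.DictWriter(out, fieldnames=[name for name, _, _ in layout])
    writer.writeheader()
    for line in lines:
        if line.strip():  # skip blank lines
            writer.writerow(parse_fixed_width(line.rstrip("\n"), layout))


# Demo on two made-up records, written to an in-memory buffer.
demo = io.StringIO()
fixed_width_to_csv(["091150\n", "362 98\n"], demo)
print(demo.getvalue())
```

With a real file you would pass an open file handle in place of the list of lines and fill in `LAYOUT` from the codebook for that survey year, since the positions can change from year to year.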

Let me know what dataset you’d like to see improved next.