I'm getting geeked up about my new role at the Fred Hutchinson Cancer Research Center. One part of the new job the nerd in me is particularly excited about is figuring out how to make data governance easier/better/more useful at an institutional scale. As anyone who has spent time in genomics - or really any other field hit by the data hurricane over the last decade or so - knows, the amount of data we are generating and storing just keeps going up over time.
One interesting problem that sits at the intersection of my two fields (data and statistics) is the question:
What data should we store forever?
The first time I thought about this question, it was purely about cost. In my genomics work, a typical "raw" sequencing data file for a deep RNA-sequencing run might be (these days) 20GB. The equivalent data storing just the read counts and the junction counts might be only 100MB. So in the space of one "raw" data file we could store 200 or so "processed" data files. That's a pretty dramatic savings in storage space and cost!
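To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch. The per-sample sizes come from the example above; the cohort size and the per-gigabyte monthly price are placeholder assumptions, not a quote from any particular storage provider.

```python
# Back-of-the-envelope storage comparison for a sequencing cohort.
# File sizes follow the example above; the monthly price per GB is a
# placeholder assumption, not any specific provider's rate.

RAW_GB_PER_SAMPLE = 20.0        # deep RNA-seq "raw" file, roughly
PROCESSED_GB_PER_SAMPLE = 0.1   # read counts + junction counts, roughly
PRICE_PER_GB_MONTH = 0.02       # hypothetical object-storage price (USD)

def monthly_cost(n_samples: int, gb_per_sample: float) -> float:
    """Monthly storage cost for a cohort at the assumed price."""
    return n_samples * gb_per_sample * PRICE_PER_GB_MONTH

n = 10_000  # hypothetical cohort size
raw = monthly_cost(n, RAW_GB_PER_SAMPLE)
processed = monthly_cost(n, PROCESSED_GB_PER_SAMPLE)
print(f"raw:       ${raw:,.0f}/month")
print(f"processed: ${processed:,.0f}/month")
print(f"savings factor: {raw / processed:.0f}x")
```

At these assumed numbers the summaries are about 200 times cheaper to keep around, the same ratio as the single-file comparison above.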
The question is, do the processed data represent "all of the information" we would want to keep about the sequencing data? For some applications, they definitely are all you need. Moreover, if that's all you need, the processed data are *much* easier to work with. This insight is what inspired the recount project. If you summarize the data, the summaries can be both easier to store and easier to use - as long as they capture all the information you want.
The statistical term for a summary that captures "all of the relevant information" from a data source for a particular goal is a sufficient statistic. Sufficient statistics have typically been defined in terms of a single parameter in a statistical model. But I've always liked to abuse the term a bit and think about the "sufficient statistic" for a class of analyses, which might include several different models.
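For readers who want the textbook version: a statistic $T(X)$ is sufficient for a parameter $\theta$ if the conditional distribution of the data given $T(X)$ does not depend on $\theta$,

$$P(X = x \mid T(X) = t, \theta) = P(X = x \mid T(X) = t).$$

The classic example is $n$ coin flips $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$: the total number of heads $T = \sum_{i=1}^{n} X_i$ is sufficient for $p$, so you can throw away the individual flips and keep only the count without losing any information about $p$. In the sequencing example above, the read counts are playing the role of $T$ for analyses that only need counts.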
I've been thinking more and more about this idea, not just in the context of genomics and not just in the context of cost savings. If we want to use, say, electronic health records to help identify new patient populations or treatment effects - what is the "sufficient statistic" that would allow us to maximally share the information we need to answer those questions, while still protecting patient privacy? You could both save on cost and improve privacy if you get the "sufficient statistic" right.
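As a toy illustration of that idea, suppose the question is whether a treatment is associated with an outcome in an EHR cohort. For that specific comparison, the four cell counts of a 2x2 table carry the information needed to estimate the effect, so a site could share just those aggregates rather than patient-level records. The counts below are made up, and real de-identification takes more care than this sketch implies (small cells can still be disclosive).

```python
import math

# Hypothetical aggregate counts shared instead of patient-level records.
# Rows: treated / untreated; columns: outcome / no outcome.
a, b = 120, 880    # treated:   outcome, no outcome
c, d = 200, 800    # untreated: outcome, no outcome

# Odds ratio and a 95% Wald confidence interval from the aggregates alone.
odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"odds ratio: {odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```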
I haven't formalized my definition of a "sufficient statistic for a category of analyses" yet (grad student project alert!), but our previous work has been really instructive. There is no way we could have afforded to share all of the raw data from hundreds of thousands of sequencing experiments (or done so while respecting privacy concerns). But we *can* share the summarized metrics in a responsible and cost-effective way.
It is particularly important to think through new ideas like this in light of the NIH's data sharing guidance and the push to cloud computing. We need a way to share data that (a) makes it possible to accelerate scientific discoveries, (b) isn't onerous in cost to either create or maintain, and (c) can protect privacy while supporting open scientific inquiry.
These trends could give a whole new life to sufficient statistics - they could end up being a great way to think about long-term data management on the cloud. I'm excited to see how we can scale this to support institutional data sharing policies!