I was at OpenDataCamp’s Open Data Cafe (#odcafe) this lunchtime and, whilst short, it managed to squeeze in a lot of ideas worthy of further consideration. One session I attended discussed the utility of open data and whether or not it was worth continuing to do. The session was clearly focused on open data published by public bodies (the session proposer was a civil servant), but it made me start thinking about the environmental costs of open data, especially data which is only likely to be of interest and/or use to small niche groups, e.g. a lot of data generated in academic research, and at what point the cost of making the data fully open outweighs the ideals of the open movement.
[What follows was written quickly and likely contains some errors and/or only partly thought through thoughts!]
One of the central tenets of open data is that it is available on demand, i.e. “data available on request” is not open data. For data to be on demand it must be stored on a server that is connected to the network 24/7, and this uses energy. Not all open datasets are in heavy demand; many, I suspect, get downloaded at most a handful of times in their lifetime. Is there a point at which the energy use of making these data open becomes ethically unacceptable? Would it be better to instead retain an open catalogue record of the dataset, store the dataset itself on an offline server, and make it available on request only? In the larger ethical framework, would the energy savings trump the lack of immediacy for the end user?
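To put a (very rough) shape on that trade-off, here’s a back-of-envelope sketch in Python. Every number in it is an illustrative guess, not a measurement: the idle power draw, the number of datasets sharing a server, and the request rate would all depend entirely on the actual hosting setup.

```python
# Back-of-envelope comparison: always-online hosting vs. offline storage
# with on-request retrieval. ALL figures below are illustrative guesses,
# not measurements of any real repository.

HOURS_PER_YEAR = 24 * 365

# Hypothetical always-on server drawing 100 W at idle, shared between
# 10,000 small datasets, so each dataset "owns" a slice of that draw.
server_idle_watts = 100
datasets_on_server = 10_000
per_dataset_kwh = server_idle_watts / datasets_on_server * HOURS_PER_YEAR / 1000

# Hypothetical on-request model: the archive machine is powered up for
# 1 hour per fulfilled request, and the dataset gets 2 requests a year.
requests_per_year = 2
hours_per_request = 1
on_request_kwh = server_idle_watts * hours_per_request * requests_per_year / 1000

print(f"Always-online share per dataset: {per_dataset_kwh:.2f} kWh/year")
print(f"On-request retrieval:            {on_request_kwh:.2f} kWh/year")
```

Interestingly, with these made-up numbers the always-on share comes out tiny, because the idle draw is amortised over thousands of datasets, while spinning a whole machine up per request is comparatively expensive. Which is really just another way of saying the answer hinges on real usage and hosting figures, the very statistics I go looking for further down this post.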
Large reference libraries already do this with their physical holdings. If you go into the National Library of Scotland [NLS] there are shelves you can browse, taking any book off them for further study. But there are far, far more books kept behind the scenes. Their details are in the catalogue so you can find out about them, but if you want to actually read one of these books you must make a request and someone will go and fetch it for you. There are a number of reasons for keeping the bulk of the holdings backstage, the main one simply being that keeping them all on the public shelves would require an infeasibly large building, with all the infeasibly large costs associated with it. As more and more digital data is created, and more and more of it is made open, is this reference library model one that needs to be considered for Open Data?
There are some issues with this that immediately spring to mind:
- Access requests going unanswered. This is already a known anecdotal issue with data access statements in academic papers: ask any group of researchers and at least one will likely have a story about emailing an author to ask for data that a paper has claimed is available on request and simply never getting a reply. To go back to the reference library analogy, the NLS has a policy about the maximum length of time you will have to wait for a book in storage. An Open Data repository could have a similar policy, although this would obviously require some staffing, potentially more staff than they currently have.
- Another selling point of Open Data is that it is (theoretically) anonymous: you shouldn’t have to register and supply any identifying details to download an open dataset. But would it not be possible to create some sort of buffer layer between the data holder and the requester, to ensure that the former never finds out the identity of the latter? (I genuinely don’t know the answer to this, not being a tech developer; maybe it’s not? I’ve had a go at sketching what I mean below.)
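For what it’s worth, here is that sketch: a hypothetical broker sitting between requester and holder, keeping the only copy of the mapping from an opaque ticket to the requester’s identity. None of this is a real system; the names and the flow are all invented just to test whether the idea hangs together.

```python
# Hypothetical sketch of an anonymising "buffer layer" between a data
# requester and a data holder. Only the broker can link a ticket to a
# requester; the holder only ever sees an opaque ticket ID.
import uuid

class RequestBroker:
    def __init__(self):
        self._tickets = {}  # ticket_id -> requester contact details

    def submit_request(self, dataset_id: str, requester_email: str) -> str:
        """Requester asks for a dataset and gets back an opaque ticket ID."""
        ticket_id = uuid.uuid4().hex
        self._tickets[ticket_id] = requester_email
        # Forward only (ticket_id, dataset_id) to the data holder --
        # the holder never learns who asked.
        self._notify_holder(ticket_id, dataset_id)
        return ticket_id

    def _notify_holder(self, ticket_id: str, dataset_id: str):
        print(f"To holder: please stage dataset {dataset_id} for ticket {ticket_id}")

    def fulfil(self, ticket_id: str, download_url: str):
        """Holder stages the data; the broker passes the link to the requester."""
        requester = self._tickets.pop(ticket_id)
        print(f"To {requester}: your requested data is at {download_url}")

# Usage: the broker is the only party that can tie a ticket to a person.
broker = RequestBroker()
ticket = broker.submit_request("dataset-042", "researcher@example.ac.uk")
broker.fulfil(ticket, "https://archive.example.org/staged/dataset-042.zip")
```

The obvious catch is that this just moves the trust problem onto the broker, who can now link requests to people; but it would at least stop the data holder building up a picture of who is asking for what.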
One of the ways I tried to check my thinking here was to see if I could find any usage statistics from research data repositories; these would hopefully tell me if I’m off the mark in assuming most datasets are rarely downloaded. Ironically, I didn’t have much luck. The most I could find was through the JISC IRUS-UK service (which I could only access as I work in a UK HEI), and that just shows monthly totals for what I’m assuming are downloads, but it’s really not clear! Even if I’m assuming correctly, it doesn’t tell me whether these downloads are evenly spread across a repository’s holdings (I doubt it) or heavily skewed by one or more big hitters (I strongly suspect so). I’d hope the repositories themselves collect these statistics, but if they do they’re not being made open (if anyone knows of any that are open please let me know, I only had a quick search as I wanted to get this online).
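If per-dataset download counts ever were made open, the skew question would be trivial to check. Something like the following is all it would take; the counts here are invented purely for illustration.

```python
# Quick check of how skewed downloads are across a repository's holdings.
# The download counts below are made up for illustration only.
downloads = [5400, 950, 210, 40, 12, 9, 5, 3, 2, 1, 1, 0, 0, 0, 0]

downloads.sort(reverse=True)
total = sum(downloads)

# Share of all downloads accounted for by the top 10% of datasets.
top_10_percent = downloads[: max(1, len(downloads) // 10)]
share = sum(top_10_percent) / total

never_downloaded = sum(1 for d in downloads if d == 0)

print(f"Top 10% of datasets account for {share:.0%} of all downloads")
print(f"{never_downloaded} of {len(downloads)} datasets were never downloaded")
```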
I realised as I was writing this that I wrote one of my LIS PgDip essays on some related issues (energy use in digital libraries and archives), so it’s clearly something that’s been on my mind for a couple of years at least. I also realise that posting a blog post that will be read by at most a handful of people, and will then live on consuming space and energy on the WordPress servers, is potentially hypocritical of me given the issue I’m raising. But maybe that’s the answer to my question: that the small amount of energy used to keep a small dataset (or blog post) online and accessible on demand is worth paying if even one person finds it useful? I really don’t know!