The big deal with big data

The university has been discussing big data, which is defined as…well, it depends who you ask. Forbes magazine lists 12 different definitions, including the best one I’ve seen so far:

(#10) The merger of Madame Olympe Maxime and Lieutenant Commander Data.

Ho-hum. I’m afraid, in the absence of anything better, I’m going to go with Wikipedia’s definition, as this seems to be the one that fits with what we understood as ‘big data’ at last week’s conference:

Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate.

Having said that, it wasn’t always entirely clear how big data was being defined at the conference, and a lot of what was talked about could have applied to data in general, not just massive data sets. It was interesting to hear what was said in the light of having attended the CPD25 event on engaging and supporting researchers, where Glenn Cumiskey from the British Museum talked about digital preservation. He mentioned the 5 Vs of data, which were also part of Maria Kalli’s paper on big data and the undergraduate curriculum. She defined big data as:

a data set with characteristics that for a particular process at a given point in time cannot be effectively [perused] using traditional [analysis methods]

Which is interesting in that she doesn’t mention the size of the data set at all, only that it has problematic characteristics.

Dr Ali Swanson from the Zooniverse citizen science project run by the University of Oxford kicked off the show with a fascinating and engaging talk about the work of Zooniverse, where scientists enlist members of the public to help with the gathering or analysis of very large datasets. There are some amazing projects going on, including Penguin Watch, which I heard several people say they were going to take a look at after the conference!


Dr Swanson was keen to emphasise that these methods are not to do with education and engagement, but are a research tool (although people are educated and engaged as a result of participating as researchers). It was interesting to learn about how the scientists check that data is correct. This is done by aggregating data into “consensus” using algorithms – using this method it has been shown that ‘the public’ are correct 97% of the time when compared to answers given by experts, which is pretty good as far as I can tell.
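For the curious, here is a minimal Python sketch of what aggregating answers into a “consensus” could look like. To be clear, this is my own illustration using a plain majority vote, not the algorithm Zooniverse actually uses (the talk didn’t go into that level of detail, and the real thing is presumably more sophisticated). The function names, the `min_share` threshold and the sample labels are all made up for the example.

```python
from collections import Counter


def consensus(labels, min_share=0.5):
    """Majority vote over one image's volunteer labels.

    Returns (winning_label, share_of_votes). The label is None when no
    answer gets more than min_share of the votes (an assumed threshold).
    """
    if not labels:
        raise ValueError("need at least one label")
    winner, votes = Counter(labels).most_common(1)[0]
    share = votes / len(labels)
    return (winner if share > min_share else None, share)


def accuracy_against_experts(volunteer_labels, expert_labels):
    """Fraction of images where the volunteer consensus matches the expert."""
    decided = [(consensus(v)[0], e) for v, e in zip(volunteer_labels, expert_labels)]
    matches = sum(1 for c, e in decided if c == e)
    return matches / len(decided)


# Hypothetical data: three volunteers label each of two images.
volunteers = [["penguin", "penguin", "rock"], ["rock", "rock", "penguin"]]
experts = ["penguin", "penguin"]
print(consensus(volunteers[0]))
print(accuracy_against_experts(volunteers, experts))
```

The idea is simply that one volunteer’s mistake gets outvoted by the others, and the agreement rate with experts is then measured on the consensus answers rather than on individual ones.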

And now for a little break to look at… [image not available]

That’s what we should all be doing with big data – making it into lovely graphics!

But back to reality. Maria Kalli was up next. As well as trying to define big data, she made a good case for putting statistics and data at the heart of science-based undergraduate curricula, particularly as there are currently (and are likely to be in the future) lots of jobs in big data. She also talked about the importance of students being able to extract and filter information, something very relevant to the work of library services – perhaps this is an area in which we can become more involved in the Business School curriculum? (I realise Maria was probably talking about mathematical and statistical information, but the principles are similar – it might be a way of ‘selling’ information literacy to business students?)

Dr Kalli was followed by Mr Jonny Greatrex, a Proper Northerner, who spoke about participatory journalism – asking people what they want to know about. This style of journalism has been pioneered by (e.g.) Jennifer Brandel of Hearken fame. Interesting, yet counter-intuitive: it consists of asking the audience to suggest questions they want answered, then asking them to vote on which question they most want answered, and then involving the person who asked the question in the reporting process. ‘News’ content produced using this approach generates more page views and greater engagement time.

I thought that this sort of approach could be really useful in terms of our engagement with library users – it could/should help us give them the services they actually need, rather than those we think they need. It’s not just a question of asking them, it’s also engaging them in the process of deciding upon and even in the implementation and/or delivery of those services.

Dr Dan Donoghue spoke about principal components analysis as an approach to handling big data, followed by Professor Jim Griffin from the University of Kent, with his paper on big data and statistics. He argued that (usually) the more data you have, the better data modelling you can do, and the shorter the interval between observations, the more information you have overall, but the increases in information become smaller the shorter the interval. My example of this [so if you think it’s wrong don’t blame Professor Griffin] is that if you watch the news once a day you will see less news than if you watch it every 5 minutes, but the information you gain will be about the same [unless something massive happens when you’re not watching]. The more data you have, the more complicated the structure of your analysis… is what I think he said, but it was getting towards lunchtime and statistics are not my strong point.
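For anyone who wants to see the diminishing-returns point in numbers, here is a small Python sketch. It is my own illustration, not Professor Griffin’s model: I am assuming you watch something for a fixed length of time, that readings taken close together are correlated, and that the correlation dies away exponentially with the gap between them. The function name and the `correlation_time` parameter are made up for the example. It uses the standard formula for how correlation inflates the variance of an average to work out how many independent readings your observations are really “worth”.

```python
import math


def effective_sample_size(window, interval, correlation_time=1.0):
    """Number of independent readings that n correlated readings are worth.

    Assumes readings taken `interval` apart over a period of length `window`,
    with correlation between readings k steps apart equal to rho**k, where
    rho = exp(-interval / correlation_time).
    """
    n = int(round(window / interval))
    rho = math.exp(-interval / correlation_time)
    # Variance inflation factor for the mean of n correlated readings.
    inflation = 1.0 + 2.0 * sum((1 - k / n) * rho**k for k in range(1, n))
    return n / inflation


# Same observation period, ever shorter gaps between readings.
for interval in (5, 2, 1, 0.1, 0.01):
    n = int(round(10 / interval))
    print(interval, n, round(effective_sample_size(10, interval), 2))
```

Under these assumptions, going from 2 readings to 10 makes a real difference, but going from 100 to 1,000 changes the effective number hardly at all: lots more data, very little more information.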

After lunch, I took a break to do some work and then went back to the conference for the panel discussion at the end of the afternoon. It was very interesting to hear the questions that had been sent in, and of course the responses from the panel, which showed a mix of views about data (big and otherwise), what it is and how it could be used across the university. This is definitely something we as library staff need to get involved in, if only to ensure that those in charge know what services we offer and the kinds of knowledge and skills we can contribute – and of course to enhance the services provided to students and staff across the university.

I was particularly thinking about research data and the possible implementation of a CRIS (Current Research Information System) at some point in the mid-future, but this wasn’t specifically mentioned at the conference. I would imagine that we will have to think seriously about what to do with our research data at some point soon(ish), and it will be interesting to see what part library services has to play in this (if any). Hopefully, it will be something we can contribute to, if only in terms of the CRIS’s relationship with the repository.

All in all, it was a worthwhile, if slightly brain-tiring, day. If only we agreed on what big data actually is…

And remember: [image not available]

But that’s a whole other blog post.

