You are browsing the archive for definition.

What is content mining?

- March 27, 2014 in Research

It’s simple really – you can break it down into its two constituent parts:

  1. Content

In this context, content can be text, numerical data, static images such as photographs, videos, audio, metadata or any digital information, and/or a combination of them all. It is a deliberately vague term, encompassing all types of information. Do not confuse content for the medium by which content is delivered; content is independent of medium in the digital sphere.

  1. Mining

In this context, mining refers to the large-scale of information extraction from your target content. If you extract information from just one or two items of content – that’s ‘data extraction’. But if you extract information from thousands of separate items of content – that’s ‘mining’.

content mining can involve multiple types of content!

It is important to emphasise that the phrase ‘text & data mining’ refers only to the mining a subset of the types of content one may wish to mine: text & data. Content mining is thus a more useful generic phrase that encompasses all the types of content one may wish to mine.

For my postdoc I’m using content mining to extract phylogenetic information and associated metadata from the academic literature. To be specific, the content I’m mining is text AND images.

The text of an academic paper contains much of the metadata about a phylogenetic analysis I want to extract, whilst the actual result of a phylogenetic analysis is unfortunately mostly only published as an image in papers (completely non-textual) – thus I need to leverage mining techniques for multiple content types.

Most mining projects tend to just mine one type of content, with text mining being perhaps the most common. An example question one might attempt to answer using text mining is: How many academic journal articles acknowledge funding support from the Wellcome Trust?

Peter Murray-Rust has many more different possible uses for content mining on his blog: http://blogs.ch.cam.ac.uk/pmr/2014/02/27/101-uses-for-content-mining/

In general, content mining is still an emerging and under-utilized technique in the research sphere – there is much work still to be done and billions of questions still to be answered. It isn’t the one-stop solution to everything, but for appropriate questions it can be the most powerful & comprehensive approach available. Get mining now!

Some suggested tutorials & resources you might want to start with:

* Mining with Weka tools – http://sentimentmining.net/weka/
* Mining with R tools – http://www.r-bloggers.com/text-mining-the-complete-works-of-william-shakespeare/
* Open resources – you can download ALL of the English language Wikipedia (or any other language) here: http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia

* See also Project Gutenberg for literary resources to mine: http://www.gutenberg.org/