Data & data linkage
According to the Oxford English Dictionary, data is
‘facts and statistics collected together for reference or analysis’.
What is ‘Big Data’?
Advances in storage and analytics mean we can now capture, store and work with many different types of data all at once.
‘Big data’ just means that the files are too big to process in a normal spreadsheet or database. We need to use a combination of maths, statistics and computer science to get answers from these large, complex datasets.
You can see a list of data sources on our ‘Access to data’ page.
Every time you access a health service, information (or ‘data’) is created. This information is confidential and controlled by strict privacy laws. However, once your personal information (name, full address, date of birth etc.) has been removed, this patient data is de-identified and can no longer be traced back to you. Some people have volunteered to share this de-identified data with researchers.
(See our What about Security and Privacy page for more details)
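The de-identification step described above can be pictured with a minimal Python sketch. The field names (`name`, `full_address`, `date_of_birth`) and the sample record are hypothetical and chosen only for illustration. Real de-identification involves more safeguards than dropping a few fields, so this is a simplified picture and not the actual process used.

```python
# Hypothetical field names; real health records will look different.
DIRECT_IDENTIFIERS = {"name", "full_address", "date_of_birth"}


def de_identify(record: dict) -> dict:
    """Return a copy of the record with direct identifiers removed.

    Assumption: date_of_birth is an ISO string such as "1980-05-17".
    Only the year is kept, as a coarser value that is less identifying.
    """
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    dob = record.get("date_of_birth")
    if dob:
        cleaned["year_of_birth"] = int(dob[:4])
    return cleaned


# Made-up example record, not real data.
record = {
    "name": "Jane Example",
    "full_address": "1 Example Street",
    "date_of_birth": "1980-05-17",
    "length_of_stay": 3,
}
cleaned_record = de_identify(record)
```

The original record is left unchanged; only the cleaned copy would be passed on.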
DNA is another key source of data
By comparing the genetic information of hundreds of thousands of people, we hope to gain insights into the overlap between common mental health conditions and cardiovascular disease.
Many thousands of volunteers have kindly donated their DNA data to medical research. (Thank you).
Other data sources
Other data may come from sources such as census records, birth, death and marriage records, and other research projects that have signed up to be our partners.
Complex legal contracts control all of this data sharing
(See our What about Security and Privacy page for more details).
The theory is that the more data you have, the more you know. So, by comparing ever more data points, relationships that were previously hidden may now be revealed.
Data linkage allows our researchers to bring together information from a wide variety of sources (see above), to create a new, richer dataset.
Data linkage is done by assigning a number to each person and storing a set of links to all their records. Strict privacy rules ensure the security and confidentiality of the data. Only the links are stored; the actual data is never brought together in one place. (See our What about Security and Privacy page for more details.)
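As a rough illustration of the idea of storing links and not data, here is a minimal Python sketch. The source names, record identifiers and the `matched_groups` list are all hypothetical. How records are matched to the same person is assumed to have happened already and is not shown.

```python
from itertools import count

# Assumption: an earlier matching step has grouped the record references
# believed to belong to the same person. Each reference is a pair of
# (source name, record id within that source). All values are made up.
matched_groups = [
    [("hospital", "H-001"), ("census", "C-77")],
    [("hospital", "H-002")],
]


def build_link_table(groups):
    """Assign a person number to each group and store only the links."""
    person_numbers = count(1)
    return {next(person_numbers): list(group) for group in groups}


links = build_link_table(matched_groups)
# The link table holds references only; the records themselves stay
# in their original sources.
```

The table maps each person number to a list of pointers. Nothing about the person's health or census details is copied into it.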
Example of a linked data set:
PPID stands for Project Person Identifier – the number assigned to each person in the data.
Researchers receive the minimum amount of data possible, to allow them to complete their research.
| Dataset | PPID | Year of Birth | Gender | Year Admission | Length of stay | Postcode | Primary Diagnosis | Additional Diagnosis | Procedure Code | ARDRG |
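The principle that researchers receive only the minimum data they need can be sketched as selecting a subset of columns. The column names come from the example table; the row values and the `minimal_extract` helper are made up for illustration.

```python
# Made-up rows using some of the column names from the example table.
linked_rows = [
    {"PPID": 1, "Year of Birth": 1980, "Gender": "F",
     "Year Admission": 2015, "Length of stay": 3, "Postcode": "0000"},
    {"PPID": 2, "Year of Birth": 1975, "Gender": "M",
     "Year Admission": 2016, "Length of stay": 1, "Postcode": "0000"},
]


def minimal_extract(rows, needed_columns):
    """Keep only the columns a particular study actually needs."""
    return [{col: row[col] for col in needed_columns} for row in rows]


# A hypothetical study of length of stay by year needs three columns only.
extract = minimal_extract(linked_rows, ["PPID", "Year Admission", "Length of stay"])
```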
The linked datasets we receive are hundreds of columns wide and hundreds of thousands of rows long. It would not be possible to look at this data and make sense of it by hand.
Instead, our researchers use complex statistical programs and machine learning techniques, which spot patterns much more quickly and reliably than humans ever could.
What is Machine Learning?
Machine learning means that computers use the data they are given to teach themselves how to do tasks, how to recognise patterns and how to make decisions. Machine learning makes it possible for computing systems to become ‘smarter’ as they encounter additional data.
For our researchers, this means they give the computer an example of the data as a starting point (training data). Once the computer has found patterns in this training data, it will know what to look for in any similar dataset it is given.
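To make the idea of training data concrete, here is a toy Python sketch using a nearest-centroid classifier. This technique is chosen here only because it is short and easy to follow; it is not a description of the methods our researchers use. The numbers and the labels `group A` and `group B` are invented.

```python
from math import dist
from statistics import mean


def train(examples):
    """Learn one 'average point' (centroid) per label from training data.

    examples is a list of (features, label) pairs.
    """
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(centroids, features):
    """Pick the label whose centroid is closest to the new data point."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Invented training data: two numeric features per example.
training_data = [
    ((1.0, 1.0), "group A"),
    ((2.0, 1.0), "group A"),
    ((9.0, 4.0), "group B"),
    ((10.0, 5.0), "group B"),
]
centroids = train(training_data)

# A new, unseen data point is assigned to the closest learned pattern.
prediction = predict(centroids, (8.0, 4.0))
```

Here the 'pattern' the computer learns is simply the average position of each group. Real machine learning models learn far richer patterns, but the two steps are the same: learn from examples, then apply what was learned to new data.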
Our researchers then examine and interpret these data patterns by comparing them with what is currently known about the health condition being studied and with patterns found by other research methods (e.g. data linkage).