On data linkage: interview with Joseph Sakshaug

With the increasing availability of data from multiple sources, new research challenges appear. Two key ones are how to ask for consent to data linkage and how to do probabilistic matching of cases from different data sources. Here to talk about these topics is Professor Joseph Sakshaug, a leading expert in this field.

Joe is Acting Head of the Statistical Methods Research Department at the IAB and Professor in the School of Social Sciences at the University of Mannheim. He previously received a PhD in survey methodology from the University of Michigan and worked at the University of Mannheim and the University of Manchester.

What is record linkage and why should we care?

Record linkage is the process of bringing together records that correspond to the same unit across different data sources. Depending on the application, a unit can refer to an individual, family, household, or business. Record linkage is used for administrative purposes (e.g. identifying and removing duplicate entries in a database or sampling frame) as well as for research-related purposes (e.g. increasing the number of data elements for a set of units). In survey research, record linkage is often used to link interview data with administrative data sources (e.g. social security records, medical records), enabling researchers to answer complex research questions which would otherwise be difficult (or impossible) to answer using interview data alone.
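To make the two uses above concrete, here is a minimal Python sketch of the simplest case: removing duplicate entries and then linking interview data to administrative data on a shared, error-free identifier. All record contents, field names (`person_id`, `job_satisfaction`, `annual_earnings`) and helper functions are hypothetical and invented for illustration; real applications often lack such a clean identifier.

```python
# Hypothetical toy data: every ID and value below is invented for illustration.
survey = [
    {"person_id": "A1", "job_satisfaction": 4},
    {"person_id": "B2", "job_satisfaction": 2},
    {"person_id": "C3", "job_satisfaction": 5},
]
admin = [
    {"person_id": "A1", "annual_earnings": 41000},
    {"person_id": "C3", "annual_earnings": 58000},
    {"person_id": "C3", "annual_earnings": 58000},  # duplicate entry
    {"person_id": "D4", "annual_earnings": 36500},
]


def deduplicate(records, key):
    """Keep the first record seen for each value of `key`."""
    seen = {}
    for rec in records:
        seen.setdefault(rec[key], rec)
    return list(seen.values())


def link_exact(left, right, key):
    """Join two record lists on an identical key; unmatched left records are dropped."""
    index = {rec[key]: rec for rec in right}
    return [{**rec, **index[rec[key]]} for rec in left if rec[key] in index]


admin_clean = deduplicate(admin, "person_id")          # administrative use: remove duplicates
linked = link_exact(survey, admin_clean, "person_id")  # research use: add data elements

for row in linked:
    print(row)
```

In this sketch, respondents without an administrative record simply drop out of the linked file, which is one way a linked sample can come to differ from the original survey sample.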

There is great demand among researchers for linked-data products and several high-profile committees and organizations, including the US National Academies of Sciences, Engineering, and Medicine and the US Commission on Evidence-Based Policymaking, have endorsed record linkage as a way to improve official statistics and meet future demands of policymakers.

How/why did you start working in this area?

I’ve always been interested in bringing together multiple data sources in order to study survey methodological and data quality issues. Record linkage is particularly useful for studying error sources in the Total Survey Error framework, e.g. nonresponse and measurement error.

During my graduate studies, I got involved in a record check study of university alumni who answered questions about their academic experience that could be validated through university records that were linked to the full sample. I found it to be a fascinating way of using linked data to study nonresponse and measurement error simultaneously. But it also got me thinking about the quality of linked data and the possible ways in which errors can arise during the linkage process. For example, not all linkage applications have access to a unique identifier as we did in the alumni study and instead rely on error-prone identifiers (e.g. names, addresses) to match units. Also, not all respondents give permission (or consent) to the linkage of their data. Each of these issues can affect the quality of the final linked-data product.

Why is linkage consent important in surveys? How is it different from other types of consent?

It’s important that respondents are given the opportunity to make an informed decision on how their information will be used, shared, and protected. Depending on the jurisdiction and type of data being shared and linked, linkage consent may be required by law. In Europe, where I am based, the new EU General Data Protection Regulation has prompted survey organizations, especially in the private sector, to re-evaluate how they obtain informed consent and whether consent is necessary for specific linkage and data sharing applications. Beyond the legal regulations, informed consent may be necessary for ethical reasons, as decided by an ethics committee or IRB.

Linkage consent is similar to other types of consent (e.g. participation in a research study, collection of biomeasures) in the sense that it is a mechanism that allows for the collection of additional information on a given unit that would otherwise not be possible to collect for legal, ethical, or access reasons. But consent can be collected in a variety of different ways (e.g. via signature, verbal agreement, provision of an ID number) depending on the specific application and the type of information being collected.

You have done research in the area of data linkage consent. What are the current recommendations regarding asking for data linkage consent in surveys?

The empirical evidence suggests that there are things survey designers can do to optimize linkage consent rates. I will mention only a few here. In general, experiments have found that the placement of the linkage consent question and, to a lesser extent, the wording of the question can influence consent rates. Regarding placement, the recommendation is to ask for linkage consent as early as possible in the questionnaire. While most surveys ask for linkage consent towards the end of the interview, this tends to be suboptimal from a consent rate perspective (see e.g. here, here, and here). Wording effects tend to be less consistent and less important than optimal placement.

Framing the linkage request in terms of losses instead of gains has been shown to yield modest improvements in the consent rate (see e.g. here), though this effect may vary by placement and mode as we found in a study forthcoming in Public Opinion Quarterly’s special issue in honor of Eleanor Singer. It’s also worth noting that linkage consent rates are generally much lower in self-administered (versus interviewer-administered) survey modes (see e.g. here and here).

What are some big open research topics in the field of data linkage?

Privacy and disclosure risk are important issues in data linkage research. While data linkage can be very useful for answering complex research questions, it also increases the risk of reidentification. Methods which ameliorate these risks without harming data quality are in high demand.

Procedures for sharing and linking data sources belonging to different agencies with different data sharing policies are also an active area of research. As noted earlier, informed consent is important in many data linkage applications, but it is unclear how truly informed respondents are about the linkage process. An open area of research is conveying the linkage consent request in a way that ensures respondents are adequately informed about relevant aspects of the linkage. Addressing this issue in the context of linking interview data with digital trace data is another burgeoning area of research.

What are some key skills researchers need to know in order to do record linkage?

For the beginner, it helps to have knowledge of basic statistical concepts as well as a basic understanding of statistical software. We use R in our introductory record linkage course offered through the International Program in Survey and Data Science. In the course, we cover many key skills for record linkage, including obtaining linkage consent, pre-processing routines to improve the quality of linkage identifiers, blocking methods, specific linkage techniques (e.g. rule-based, distance-based, probabilistic record linkage), and additional software tools. Documentation skills are also useful for purposes of reproducibility as the record linkage process typically entails many intermediate decisions.
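As an illustration of how several of these steps fit together, the following Python sketch chains pre-processing, blocking, a distance-based name comparison, and a probabilistic match weight in the style of the Fellegi-Sunter model. It is not taken from the course materials (which use R). The records, the 0.85 similarity cut-off, the m- and u-probabilities, and the link threshold are all assumptions chosen for illustration; in practice the m- and u-probabilities are usually estimated from the data rather than fixed by hand.

```python
import math
from difflib import SequenceMatcher
from itertools import product

# Hypothetical toy records; all names and values are invented.
file_a = [
    {"id": "a1", "name": "Anna Müller ", "birth_year": 1980, "zip": "68159"},
    {"id": "a2", "name": "JOHN SMITH", "birth_year": 1975, "zip": "48104"},
]
file_b = [
    {"id": "b1", "name": "anna mueller", "birth_year": 1980, "zip": "68159"},
    {"id": "b2", "name": "Jon Smith", "birth_year": 1975, "zip": "48104"},
    {"id": "b3", "name": "Joan Smythe", "birth_year": 1991, "zip": "48104"},
]

# Assumed (m, u) per field: m = P(agree | true match), u = P(agree | non-match).
M_U = {"name": (0.90, 0.01), "birth_year": (0.95, 0.05)}
NAME_CUTOFF = 0.85     # assumed similarity needed to count names as agreeing
LINK_THRESHOLD = 5.0   # assumed total weight needed to declare a link


def clean_name(name):
    """Pre-processing: lowercase, transliterate umlauts, collapse whitespace."""
    name = name.lower()
    for src, dst in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")):
        name = name.replace(src, dst)
    return " ".join(name.split())


def candidate_pairs(a, b, block_key):
    """Blocking: only compare records that share the same value of block_key."""
    return [(ra, rb) for ra, rb in product(a, b) if ra[block_key] == rb[block_key]]


def agreement(ra, rb):
    """Distance-based comparison for names, exact comparison for birth year."""
    sim = SequenceMatcher(None, clean_name(ra["name"]), clean_name(rb["name"])).ratio()
    return {"name": sim >= NAME_CUTOFF, "birth_year": ra["birth_year"] == rb["birth_year"]}


def match_weight(agree):
    """Sum of log2 agreement or disagreement weights over the compared fields."""
    total = 0.0
    for field, agrees in agree.items():
        m, u = M_U[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total


pairs = candidate_pairs(file_a, file_b, "zip")
links = [
    (ra["id"], rb["id"])
    for ra, rb in pairs
    if match_weight(agreement(ra, rb)) >= LINK_THRESHOLD
]
print(links)
```

Blocking on `zip` reduces the six possible pairs to three candidate pairs here. Each of the hand-picked constants above is also an example of the kind of intermediate decision that is worth documenting for reproducibility.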

Could you share 3 key references to start learning about record linkage?

I’ve included links to relevant references throughout this interview. Additional references I would recommend include the book: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection by Peter Christen, which is the primary textbook we use in our IPSDS course. For an overview of data linkage quality and informed consent issues I would refer to our chapter in the Total Survey Error in Practice edited volume. For step-by-step descriptions of actual record linkage applications, the following IAB case studies might be useful: here, here, and here.

Where can people go to find out more about your work?

My IAB profile is usually up-to-date. I’m also on ResearchGate and have a personal website.
