What you can and can't export¶
Rules applied to all transfer requests¶
- All relevant details of the files to be transferred must be provided with every request.
- You must be a member of a registered project and must provide the RR number of that registered project when making an Airlock request.
- All files transferred may be checked by Genomics England to ensure compliance with the relevant policies. You will be notified of any files rejected along with the reason for the rejection.
- All files transferred will be checked for viruses and malware and those failing this test will be rejected. It is your responsibility to resolve such issues before re-submitting the file for transfer.
- Files requested for transfer are assessed using the following criteria:
  - whether the request aligns with your Access Review Committee (ARC) approval
  - whether the request can clearly be demonstrated to be aligned with a registered project in the Research Environment. Please note that Research Network members with no registered project cannot export any data via the Airlock. Commercial researchers who have been approved for pre-research but do not have an ARC-approved project will be heavily limited in what they can export via the Airlock.
  - whether the associated project has been registered in the Research Registry for a minimum of three months, though exceptions are sometimes granted
  - any data security implications
  - any disclosure risks
  - the technical feasibility and associated cost of the request
  - when importing data, its scientific value to the community of researchers within the Research Environment, and when and how it will be shared
  - when importing data, whether you own the data and hold the correct consents and approvals
Genomics England will inspect analysed results to ensure they cannot be used to disclose the identity of the participant. Checking of statistical output by the Airlock Review Team will be governed by a generalisable set of principles that will guide individual decisions and ensure flexible evaluation of the Genomics England dataset. By using a principles-based approach where each case is assessed individually the security of the dataset is maintained by exporting only ‘safe’ data.
The Airlock process is governed by the Airlock Policy, which defines both the process and its governance.
A set of Airlock Policy Guidelines presents the rules-of-thumb/principles that will be referenced by both the researcher (during preparation of analysis results) and the output checker (during output-checking).
Summary data still carry a risk of participant identification, a risk that is considerably higher when the data in question are in the public domain. Accordingly, transfer requests that result in public sharing or publication of data (in this context, publication in journal articles and conference abstracts, posters or presentations) will be checked more stringently and must reference the Project within the Research Registry to which they relate. Any approved Airlock export can only be used for the specific use detailed in the original export.
Binary files will not be considered for Airlock export, and we would prefer that any HTML files be converted to PDF or another suitable format before submission.
Participant and Sample IDs¶
Please remove all participant, sample and platekey IDs from files that are intended for export. The way these IDs are constructed makes them potentially identifying, so we do not generally allow them to be exported. If it is essential to your work to organise the data proposed for export by the patient it originated from, we recommend you replace the actual patient/sample IDs with your own pseudonyms, such as "Patient/Sample A, Patient/Sample B".
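As a rough illustration of the pseudonym approach, the sketch below maps each distinct ID to a neutral label in order of first appearance. The function name `pseudonymise` and the placeholder IDs are hypothetical and do not reflect the real ID format. The mapping itself still contains the genuine IDs, so it would need to stay inside the Research Environment.

```python
from string import ascii_uppercase


def letter_label(index):
    """Turn 0, 1, ..., 25, 26, 27 into A, B, ..., Z, AA, AB (spreadsheet style)."""
    label = ""
    n = index
    while n >= 0:
        label = ascii_uppercase[n % 26] + label
        n = n // 26 - 1
    return label


def pseudonymise(ids, prefix="Sample"):
    """Map each distinct ID to a neutral label, in order of first appearance.

    Hypothetical helper for illustration only; keep the returned mapping
    inside the Research Environment because it links back to real IDs.
    """
    mapping = {}
    for original in ids:
        if original not in mapping:
            mapping[original] = f"{prefix} {letter_label(len(mapping))}"
    return mapping


# Placeholder IDs, not the real ID format.
mapping = pseudonymise(["id-x", "id-y", "id-x", "id-z"])
export_labels = [mapping[i] for i in ["id-x", "id-y", "id-x", "id-z"]]
```

Only `export_labels` (or the data relabelled with it) would be a candidate for export; repeated IDs keep the same pseudonym, so per-patient grouping is preserved.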
A frequency table is made up of a number of categories across one, two or more dimensions, with the content of the table being the number of unweighted responses that fit within the joint categories defined by each.
An example two dimensional table:
| | Cat1 | Cat2 |
|---|---|---|
| Row 1 | X1 | X2 |
| Row 2 | X3 | X4 |
| Row 3 | X5 | X6 |
| Total | X1 + X3 + X5 | X2 + X4 + X6 |
Each X is the number of respondents in the joint category. For example, if columns are gender and the rows are genotypes, then X1 could be the unweighted count of males who are homozygous for the common allele.
Issues with a frequency table are:
- A count of 1 in any cell could directly reveal a participant's information. Depending on the nature of the data and categories, it may be possible for someone to identify that individual based on the singleton response. The participant could certainly identify themselves.
- A count of 1 in any cell could reveal information about the participation of an individual in the study. For example, if we know an individual is in the study we could use that information later to determine some confidential information about that individual.
- Low counts are considered risky as they could potentially lead to counts of 1 by combination with external sources. For example, if X1 is a count of three, and I know two of the participants, X1 has revealed the existence of a third. As the count increases, the risk of this happening decreases but never reaches zero. We suggest a rule-of-thumb threshold of 5 to account for this (as suggested by HES).
- If one cell in a table accounts for the majority of participants in the row or column, then this could potentially reveal information. This is called group disclosure. For example, if Cat1 is 'has mutation in Gene1' and Cat2 is 'has mutation in Gene2', and X1 is 1000 and X2 is 0, then we know that all people in the study have a mutation in Gene1. If those genes are not the primary focus of the study, then the table is providing information about all the participants in the study. If the numbers are 990 and 10, then in all likelihood any given study participant has a mutation in Gene1. Whether or not this is problematic depends on the categories.
- If the participants are grouped hierarchically, then it is possible that frequency tables can provide information about an entity at a higher level in the hierarchy (despite avoiding all the issues above). For example, if we tabulated participants by hospital, where all counts are in excess of 10 and no group disclosure exists, we are still revealing information about GMCs, or geographical regions, that may be disclosive in other linked tables.
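The first four issues above lend themselves to a simple pre-submission screen. The following is a minimal sketch, not an official check: the function name `screen_frequency_table` is hypothetical, and the `max_share` cut-off of 0.9 is an assumption chosen for illustration, since the guidance gives no numeric threshold for group disclosure. Hierarchical disclosure cannot be detected from a single table and is not covered.

```python
def screen_frequency_table(table, min_count=5, max_share=0.9):
    """Return (row, col, reason) warnings for a 2-D table of counts.

    table: list of rows, each a list of non-negative integer counts.
    min_count: the rule-of-thumb threshold of 5 from the guidance.
    max_share: ASSUMED cut-off for a cell that dominates its row or column.
    """
    warnings = []
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count < min_count:
                warnings.append((i, j, "low count"))
            if row_totals[i] and count / row_totals[i] >= max_share:
                warnings.append((i, j, "dominates row"))
            if col_totals[j] and count / col_totals[j] >= max_share:
                warnings.append((i, j, "dominates column"))
    return warnings


# Illustrative counts only (rows x two columns, e.g. Cat1 and Cat2).
example = [
    [990, 10],
    [480, 520],
    [3, 497],
]
flags = screen_frequency_table(example)
```

A flagged cell is only a prompt to look again: as noted above, whether group disclosure matters depends on what the categories represent.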
Reducing disclosure risk¶
Some measures can be used to reduce the disclosure risk of a frequency table:
- Redesigning the table to avoid low numbers in any of the cells, and reducing the risk of group disclosure.
- Transforming the data to percentages; it may be necessary to round the figures to avoid the counts being inferred.
- Using small number suppression: replacing any count below 5 with '<5', and replacing any complementary cell from which that count could be inferred with '>total-5'.
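Small number suppression with a complementary cell can be sketched for the simplest case, a two-category split whose total is also reported. The helper name `suppress_two_way_split` is hypothetical, and larger tables would need a more careful treatment of which cells allow a suppressed count to be inferred.

```python
def suppress_two_way_split(counts, threshold=5):
    """Mask a two-category split whose total is also being reported.

    counts: dict mapping exactly two category labels to integer counts.
    Returns a dict of strings suitable for the exported table.
    """
    if len(counts) != 2:
        raise ValueError("this sketch only handles a two-category split")
    total = sum(counts.values())
    small = [label for label, n in counts.items() if n < threshold]
    masked = {label: str(n) for label, n in counts.items()}
    if len(small) == 2:
        # Both cells are small: mask both (the total itself may need review).
        for label in counts:
            masked[label] = f"<{threshold}"
    elif len(small) == 1:
        # Mask the small cell, and the other cell that would reveal it.
        for label in counts:
            if label in small:
                masked[label] = f"<{threshold}"
            else:
                masked[label] = f">{total - threshold}"
    return masked


masked = suppress_two_way_split({"female": 98, "male": 2})
```

With a total of 100 split 98 to 2, the large cell is reported as ">95" and the small cell as "<5", so neither exact count can be recovered from the other.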
Private sharing: Frequency tables will generally be considered safe for export where they will be privately shared with colleagues. Where there are an excessive number of cells with counts below 5, the disclosure risk may need to be reduced using small number suppression as detailed above.
Publication: Frequency tables will generally be considered unsafe for export where the information will be made public. This is due to their inherent risk of disclosing confidential information on participants, either by themselves or in combination with other data that may be in the public domain. Genomics England will consider requests for publication of frequency tables if the researcher is able to reduce the disclosure risk using the measures outlined above.
A magnitude table has a similar construction to a frequency table, except that the contents of the table (the X's above) comprise some numerical characteristic of the participants – for example, total or average length of hospital stay. Underlying each magnitude table is an associated frequency table that provides the number of participants whose characteristics have contributed to the magnitude table.
The issues with a magnitude table are:
- A magnitude table can be constructed from individual participants, hence revealing their confidential information. When provided for clearance, a magnitude table should always have the associated frequency table provided so that low counts can be checked for.
- Magnitude tables suffer from exactly the same issues as frequency tables above, particularly where the same magnitude information is known through other sources, for example statistics on a cohort that has previously been published elsewhere.
- Dominance is a particular issue with magnitude tables. Dominance occurs when one respondent's value is much larger (in terms of the characteristic being reported) than the rest. Adding the values up provides virtually no disguise for the largest contributor's information. We suggest a threshold of 50% for the largest contributor (i.e. no single individual should contribute more than 50% to the cell value, a suggestion from the EU Data without Borders paper) to ensure that the second largest contributor cannot ascertain the largest contributor's information using knowledge of their own contribution.
- Instances of dominance will not be immediately apparent to output-checkers from the information provided to them. Therefore the onus is on the researchers to identify when it occurs and restructure the table to avoid it.
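Because output-checkers cannot see dominance from the table alone, researchers may find it useful to test each cell before submission. The sketch below applies the suggested 50% threshold to the individual contributions behind one cell; the function names and the length-of-stay figures are illustrative assumptions, and contributions are assumed to be non-negative.

```python
def dominant_share(contributions):
    """Return the largest contributor's share of the cell total (0.0 to 1.0)."""
    total = sum(contributions)
    if total == 0:
        return 0.0
    return max(contributions) / total


def passes_dominance_rule(contributions, max_share=0.5):
    """True if no single individual contributes more than max_share of the cell."""
    return dominant_share(contributions) <= max_share


# Hypothetical lengths of hospital stay (days) behind two cells.
skewed_cell = [120, 4, 3, 2, 1]    # one long stay dominates the total of 130
balanced_cell = [10, 9, 8, 7, 6]   # largest share is 10 / 40

skewed_ok = passes_dominance_rule(skewed_cell)
balanced_ok = passes_dominance_rule(balanced_cell)
```

Here the skewed cell fails (the largest contributor accounts for about 92% of the total) and would need the table to be restructured, while the balanced cell passes.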
Reducing disclosure risk¶
The measures suggested for frequency tables above can be applied to magnitude tables to reduce their disclosure risk.
Private sharing: Magnitude tables will generally be considered safe for export where they will be privately shared with colleagues, but any request should include the underlying frequency table. Where an excessive proportion (>20%) of cells have counts below 5, those cells may need to be left blank.
Publication: Magnitude tables will generally be considered safe for export where the information will be made public, but any request should include the underlying frequency table. The table should be redesigned so that no single cell represents fewer than 5 individuals and no single individual contributes more than 50% to a single cell value.
Maxima, minima, percentiles¶
A maximum is the value of a particular variable that is the largest within the sample.
A minimum is the value of a particular variable that is the smallest within the sample.
A percentile (or centile) is the value of a variable below which a certain percentage of observations fall.
Thus, each of these statistics could relate to an individual record and hence be disclosive.
Issues for maxima include:
- In some cases, many contributors will share the largest value. This may be disclosive if the class of people is identifiable. For example, if everyone within a particular team is known to have the same salary and we publish the maximum salary by team at that organisation, then we have revealed the salary of an entire team of people.
Issues for minima include:
- Generally minima are less disclosive than maxima, as usually larger things are more noteworthy, and minima tend to be zero in a lot of cases.
- A non-zero minimum can be problematic. For example, if we released information from our sample that the minimum number of surgical interventions in the last year was one, then it would reveal that all participants had undergone surgery (information which was given to us in confidence).
Issues for percentiles include:
- Generally percentiles are less disclosive than either minima or maxima as their location is within the body of participants which provides some disguise. Notwithstanding this disguise, a percentile is still a piece of information given in confidence by a single respondent, and therefore caution is needed before permitting it to be released.
- For smooth distributions with a large number of respondents, percentiles will be generally safe.
- If the number of respondents is below 10, then a percentile will be an individual record (or a linear combination of two records) and in all likelihood can be identified by someone who knows about the rank ordering of the few respondents. These should not be released.
- If the distribution is bi-modal, or very highly skewed, then it is possible that a percentile could provide information about a respondent that could be identified.
- A median is the 50th percentile, hence the issues of the median include those of percentiles.
Private sharing: Maxima and minima will generally be considered safe for export. Percentiles will generally be considered safe for export provided they are derived from a smooth distribution of more than ten individuals.
Publication: Maxima and minima will generally be considered unsafe for export given their inherent disclosure risk. Percentiles will generally be considered safe for export provided they are derived from a smooth distribution of a large number of individuals (a suggestion is that each percentile band comprise at least five individuals).
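The numeric parts of the percentile guidance can be approximated with a quick check. This is a sketch under a stated assumption: a "percentile band" is interpreted here as the group of individuals falling between two consecutive requested percentiles (with 0 and 100 as the outer edges). The function names are hypothetical, and the check says nothing about whether the distribution is smooth, bi-modal or skewed, which still has to be judged separately.

```python
def percentile_band_sizes(n, percentiles):
    """Approximate number of individuals between consecutive percentile cuts."""
    cuts = [0] + sorted(percentiles) + [100]
    return [n * (hi - lo) / 100 for lo, hi in zip(cuts, cuts[1:])]


def percentiles_look_releasable(n, percentiles, min_n=10, min_band=5):
    """Apply the 'fewer than 10 respondents' and 'five per band' rules of thumb."""
    if n < min_n:
        return False
    return all(size >= min_band for size in percentile_band_sizes(n, percentiles))


quartiles_20 = percentiles_look_releasable(20, [25, 50, 75])   # bands of 5 each
quartiles_19 = percentiles_look_releasable(19, [25, 50, 75])   # bands of 4.75
extremes_100 = percentiles_look_releasable(100, [1, 99])       # outer bands of 1
```

Under this interpretation, quartiles on 20 individuals just meet the suggestion, while the 1st and 99th percentiles of 100 individuals do not, because each outer band would contain a single person.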
The mode is the value that appears most often in a set of data. As it is the most frequent value, it is also likely to be the least disclosive.
Issues for mode include:
- If all the data points have the same value, then releasing the mode could result in a group disclosure.
- If the mode is based on a small number of observations, then it could be disclosive in the same way that frequency data are disclosive.
Private sharing: Modes will generally be considered safe for export.
Publication: Modes will generally be considered safe for export provided they are derived from a sufficiently large number of observations and there is no concern over group disclosure. The number of observations used to calculate the mode should be provided within the request.
Means, indices, ratios, indicators¶
Indicators are statistics derived from the data. As such, they are not the confidential data provided by participants, but potentially such confidential data could be derived from them. Means, indices, and ratios can be considered special cases of indicators.
Issues for indicators include:
- The disclosure risk of the indicator will be dependent on the complexity of the function, the population size used, and whether any of the arguments within the function are publicly available. Generally, complex functions have lower disclosure risk than simple functions (consider what inferences could be made from a range value).
- Care should be taken with dichotomous variables as they can be disclosive. For example consider a dichotomous variable with value of 0,1. A mean of 0.7 in a population of 10 reveals that three participants have a value of 0 for that variable, posing a potential disclosure risk.
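The dichotomous example can be made concrete: when a variable only takes the values 0 and 1, the mean and the population size together determine the exact counts. The function name below is a hypothetical illustration.

```python
def counts_from_binary_mean(mean, n):
    """Recover (zeros, ones) for a 0/1 variable from its mean and population size."""
    ones = round(mean * n)
    zeros = n - ones
    return zeros, ones


# A mean of 0.7 across 10 participants implies 7 ones and 3 zeros.
zeros, ones = counts_from_binary_mean(0.7, 10)
```

In effect, such a mean is a frequency table in disguise, so the low-count considerations for frequency tables apply to it as well.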
Private sharing: Means, indices, ratios and indicators will generally be considered safe for export provided sufficient detail of the function used is provided.
Publication: Means, indices, ratios and indicators will generally be considered safe for export, provided full details of the function and the population size used are given and the population is suitably large.
Graphs are often used for showing and visualising developments or trends. They can be used for graphical interpretation or to illustrate coefficients in a statistical analysis.
Adhering to the following guidelines will make it more likely that your graph is considered safe for export.
- There should be no significant outliers.
- The scales used should not be so detailed that nearly exact individual data points can be read off the graph.
Regression models and residuals¶
Complete regression models will generally be considered safe for export for all uses provided they have at least five degrees of freedom.
Residuals will generally be considered unsafe for export as in all likelihood the regression model will be released along with them providing an easy route to calculation of the individual data points.
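To see why residuals are risky, consider a simple linear model: each observed value is just the fitted value plus its residual. The sketch below assumes a one-predictor model and assumes the recipient also knows each individual's predictor value (for example, age); the function name and all numbers are made up for illustration.

```python
def reconstruct_observations(intercept, slope, xs, residuals):
    """Rebuild each original outcome as fitted value + residual."""
    return [intercept + slope * x + r for x, r in zip(xs, residuals)]


# Hypothetical released model (y = 2.0 + 0.5 * x) and released residuals.
recovered = reconstruct_observations(
    intercept=2.0,
    slope=0.5,
    xs=[10, 20, 30],
    residuals=[0.5, -1.0, 0.25],
)
```

The values in `recovered` are the individual-level outcomes themselves, which is exactly the disclosure the guidance is trying to prevent.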
Summary and test statistics¶
All summary and test statistics will generally be considered safe for export for all uses provided they are the result of a calculation on at least five data points.
Nucleotide sequence will generally be considered unsafe for export for all uses. In some exceptional cases it may be considered for export but would likely be considered identifiable data and would therefore require express consent from the affected participant(s).
Frequent reasons for request rejection¶
The most common and easily fixable reasons for rejection or delay of Airlock requests:
- Participant and Sample IDs: The participant and sample IDs used to label samples within the RE are not allowed for export from the RE. If you must use some kind of ID system in your file(s), replace the genuine IDs with something like “Sample 1, Sample 2, etc”.
- “Less than 5 rule”: Cells containing phenotype data counts of less than 5 will generally not be allowed for export and should be masked (by changing them to something like “<5”). This also applies to counts that would imply counts <5 for other cells - e.g. if a cohort consists of males and females, an overall count of 100 is given, and the actual counts are 98 females and 2 males, these would need to be represented as “>95” females and “<5” males. Please note: this rule does not apply to genotypic data – if you are exporting counts of, for example, variants in a certain gene in a cohort, you do not need to mask counts below 5.
- Insufficient explanation of why the data needs to be exported and what the file shows: When requesting the export of data, please make sure you give a full explanation of why the data is being exported. If the data is intended for further analysis, this must include a full explanation of why this analysis cannot reasonably be performed in the RE. If there is ambiguous information in the file (e.g. unlabelled columns, information in a non-standard format such that the average researcher in the field of genetics would not know what was being represented), please make sure the request form explains what is actually being represented.
Note on references¶
The development of these principles has been influenced by, and in some cases directly taken from, the following documents:
- Guidelines for the checking of output based on microdata research, by Steve Bond, Maurice Brandt, and Peter-Paul de Wolf.
- Self-study material for the users of Eurostat microdata sets, European Commission.
- Guidelines for the checking of output based on microdata research, by Maurice Brandt, Luisa Franconi, Christopher Guerke, Anco Hundepool, Maurizio Lucarelli, Jan Mol, Felix Ritchie, Giovanni Seri, and Richard Welpton.
- Principles- versus rules-based output statistical disclosure control in remote access environments, by Felix Ritchie and Mark Elliot.