Transparency and open science are crucial at every stage of research. This includes prospective registration of studies, code and data sharing, and high-quality reporting. Open science practices help to prevent questionable research practices such as ‘p-hacking’, HARKing (hypothesising after the results are known) and, at the most basic level, selective reporting, i.e. picking the ‘best’ results to include in a paper. With recent high-profile examples of research fraud and data manipulation, resulting in potentially millions of dollars wasted, the need for transparency in research is becoming even more important. Preregistration is one important method used to reduce questionable research practices and is typically strongly recommended by the open science community. There are also pushes for preregistration of studies to be mandated, currently focused on clinical trials (as they should be). The focus of this post is the role of study preregistration in observational research using real-world data, an area that is rapidly growing in popularity and is being used as evidence by decision-makers to inform policy.
Why do we preregister studies?
At its core, study preregistration is simply the act of specifying your research plan before you start and sharing it on a register (e.g. clinicaltrials.gov, OSF). The benefits of preregistration have been clearly demonstrated for research with prospective data collection, such as clinical trials. Because of this, I will only touch on them briefly.
For clinical trials, the International Committee of Medical Journal Editors (ICMJE) has recommended that all trials be preregistered. This has slowly been adopted in journal policies, but it is far from universal. The reason for the recommendation is that preregistration allows people (ideally in peer review) to check the submitted manuscript against its registration for deviations, making selective reporting, p-hacking and HARKing more visible and so reducing their prevalence. Selective reporting is probably the most common of these and occurs when results that are not sexy or positive are omitted from the publication. This can make the conclusions of studies misleading and leaves a lot of research unpublished, contributing to estimates that up to 85% of research is wasted. Preregistration allows such omissions to be noticed and questioned, hopefully reducing the occurrence of selective reporting. Essentially, preregistration limits researchers’ ability to change their methodology after seeing their results, so the results are less able to guide the methods.
A common concern is that preregistration locks authors into a plan when the research process should change as more information is gained. This isn’t the case. Nothing mandates that the protocol be rigid or that it be completed exactly as intended. Instead, preregistration provides a baseline against which changes can be explained, which requires authors to justify any changes and makes their research more trustworthy.
Clearly, preregistration is important for clinical trials. The ICMJE, however, made an important distinction and exempted observational research from preregistration. A typical ‘open science bro’ might say everyone should preregister everything because it is always going to reduce questionable research practices, but that may not be the best solution.
As discussed above, we preregister studies to reduce selective reporting, p-hacking, and HARKing. However, preregistration only reduces these practices if data collection has to occur after registration and this timing can be empirically verified. Such verification is becoming more common in clinical trials but is not possible for observational studies.
Preregistration in observational studies
Preregistration is (relatively) easily enforced in clinical trials because of their requirements and prospective data collection. But what about when the data already exist? There are two examples that I think show why preregistration does not benefit observational research much. Let’s say the authors of a paper are a bunch of researcher-clinicians who work at a hospital. They think, “We should have a look at the effect of drug X.” They get a waiver of consent from their ethical review committee and collate all the data. They have a look through it, run a few analyses and come up with an interesting result.
How would preregistration benefit us here?
Perhaps preregistering their observational study might make them think more about their study design, but they also don’t yet know what data they actually have. They have a broad idea but feel it is not good enough to write up a study plan listing everything the study requires. Basically, they don’t want to shoot themselves in the foot by specifying that they need a certain confounder measured before knowing that it is indeed measured in their data. You could say this is no excuse: if you can’t do science transparently, or don’t have the data to conduct the desired study, don’t do it at all. This is definitely true. Having less, but higher-quality, research would benefit everyone, but that is not what happens in reality. Our authors decide to go ahead with the study and hold off ‘preregistering’ it until they can figure out what has been measured in the data. They then register the study, knowing that nobody can verify whether they had access to the data before preregistering. In this scenario, preregistration is useless.
Another example, which is becoming increasingly common, is the use of ‘big data’ in research. To access these data, which may include electronic health records, claims or other linked data, researchers typically require approval from the data custodians. Currently, for the large custodians of these sources (the many Nordic registries, the US Veterans Affairs health database, or the Medicare claims database) to release data, researchers must submit a research proposal, which is assessed and approved before access is given. I am not aware of any custodian that requires this proposal to be publicly registered. If it were required, preregistration would have the same beneficial effect for observational research using ‘real-world data’ as it does for clinical trials, because it would be verifiable that registration occurred before exposure to the data. However, even in these most structured cases, where data custodians control access, this is not what happens. If an observational study is preregistered, we simply have to trust that the researchers did register it before having access to the data, and we are left in the same position as in the first example: preregistration relies on trusting that researchers are telling the truth, an assumption that is increasingly hard to believe.
So the goal of preregistration is to reduce selective reporting and HARKing, where results are chosen after the fact based on how well they support the researcher’s agenda or how much they will help with publication. In clinical trials, preregistration is an excellent mechanism for reducing HARKing, but it is not very effective for observational studies: the timing of data access is neither recorded nor public, so it is impossible to verify the ‘pre’ in preregistration. You could still argue that most researchers will act honestly and that, on average, preregistration will make authors more thoughtful in designing their studies and will reduce HARKing. But what if preregistration also had negative effects?
Every decision requires weighing up its benefits and harms. Something not often discussed is the potential ‘harm’ of preregistering studies. These harms could include dichotomising evidence into preregistered or not, resulting in studies that are not preregistered being dismissed as inferior, or as being at higher risk of bias than their preregistered counterparts. That dismissal is only justified if preregistration does reduce the risk of bias, which is unclear in observational studies. Preregistration may well be associated with higher study quality, as researchers who are more likely to preregister may also be more likely to conduct a rigorous study, but that does not mean preregistration causes them to conduct a better study. Preregistration then becomes an erroneous measure of quality, which could lead to ignoring high-quality evidence that is not preregistered and overstating the quality of low-quality evidence that claims to be preregistered.
In summary, there are many benefits to preregistration in clinical trials as it is easier to verify that registration occurred prior to data collection. Currently, no mechanisms exist in the infrastructure of observational data to verify preregistration, diminishing its beneficial properties. Further, as preregistration may not improve the quality of observational studies, it becomes useless as an indicator of quality, and potentially harmful if used in evidence synthesis or decision-making.
Is there anything that could be used instead of preregistration to reduce selective reporting or HARKing?
Yes, well, at least partially. It appears that specifying a target trial, i.e. a hypothetical randomised trial that would be conducted to answer a causal question, and then emulating that trial as closely as possible with observational data, improves causal inference in observational studies. This is emerging as an important methodology in observational research and is being used to provide evidence to inform decision-making. Like preregistration, emulating a target trial cannot itself reduce selective reporting or HARKing, but it makes them more visible, which by extension may reduce authors’ willingness to undertake these practices. In observational studies, HARKing or p-hacking could be done by developing a specific combination of eligibility criteria or outcome definitions. Importantly, this selection could be made after testing many different criteria, with the final one chosen based on the results. When the target trial and its emulation are clearly specified, these unusually specific criteria can be noticed in peer review or by others, and the quality of the study can be more easily appraised. Alternatively, as in clinical trials, data custodians could begin to require public registration of the target trial protocol before giving access to the data. Even this has limitations: many studies can often come from one dataset, so it may not be feasible to require authors to register protocols for every potential study from that dataset. Another opportunity is to share data more freely so that others can reproduce and verify results. Although the barriers to this are dissolving over time, it remains a rare practice when using ‘big data’.
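To make the point about results-guided eligibility criteria concrete, here is a minimal simulation sketch. It is an illustration only, not a method from any particular study: the cohort size, the age windows, the simple z-test and every function name are assumptions made up for this example. Each simulated cohort has no true effect of the drug, so any ‘significant’ result is a false positive. The sketch compares an analysis with eligibility criteria fixed in advance against one that tries several age windows and keeps the smallest p-value.

```python
import random
import statistics
from math import erf, sqrt

# Hypothetical eligibility windows (min age, max age) chosen for illustration.
# The first one plays the role of the criteria specified in advance.
CANDIDATE_AGE_WINDOWS = [(18, 90), (18, 50), (30, 60), (40, 90), (50, 70), (60, 90)]


def two_sided_p(treated, control):
    """Approximate two-sided p-value from a z-test on the difference in means."""
    se = sqrt(statistics.variance(treated) / len(treated)
              + statistics.variance(control) / len(control))
    z = (statistics.mean(treated) - statistics.mean(control)) / se
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - normal_cdf)


def simulate_cohort(rng, n=300):
    """One cohort in which treatment has no true effect on the outcome."""
    return [{"age": rng.randint(18, 90),
             "treated": rng.random() < 0.5,
             "outcome": rng.gauss(0, 1)} for _ in range(n)]


def p_for_criteria(cohort, min_age, max_age):
    """Apply one set of eligibility criteria, then compare treated vs control."""
    eligible = [p for p in cohort if min_age <= p["age"] <= max_age]
    treated = [p["outcome"] for p in eligible if p["treated"]]
    control = [p["outcome"] for p in eligible if not p["treated"]]
    return two_sided_p(treated, control)


def false_positive_rates(n_studies=300, seed=1, alpha=0.05):
    """Return (prespecified rate, results-guided rate) of false positives."""
    rng = random.Random(seed)
    prespecified_hits = 0
    hacked_hits = 0
    for _ in range(n_studies):
        cohort = simulate_cohort(rng)
        p_values = [p_for_criteria(cohort, lo, hi)
                    for lo, hi in CANDIDATE_AGE_WINDOWS]
        # Criteria fixed before seeing the data: only the first window counts.
        prespecified_hits += int(p_values[0] < alpha)
        # Criteria chosen after seeing results: the 'best' window counts.
        hacked_hits += int(min(p_values) < alpha)
    return prespecified_hits / n_studies, hacked_hits / n_studies


if __name__ == "__main__":
    prespecified, hacked = false_positive_rates()
    print(f"Prespecified criteria: {prespecified:.1%} false positives")
    print(f"Best of {len(CANDIDATE_AGE_WINDOWS)} criteria: {hacked:.1%} false positives")
```

Because the results-guided analysis always includes the prespecified window among its options, its false positive rate can never be lower, and in practice it should be noticeably higher. A clearly specified target trial does not prevent this kind of search, but it puts the chosen criteria in plain view, where an oddly narrow window is easier to question.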
Conclusion
Ultimately, increasing transparency in research is clearly important, but preregistration of all study types, without regard to what it actually achieves, should not be recommended in the current research infrastructure. The benefits of preregistration in clinical trials are unlikely to extend to observational studies, as it is currently impossible to verify whether a study is indeed ‘pre’ registered. However, explicitly emulating a target trial, and reporting both the target trial protocol and the protocol of its emulation, may reduce selective reporting by making it more easily visible.

