Signal transduction by members of the nuclear receptor (NR) superfamily of transcription factors encompasses interactions with small molecule ligands and coregulators that control cell- and tissue-specific transcriptomes in a wide variety of developmental and physiological contexts (McKenna and O’Malley, 2002, Mangelsdorf et al., 1995). Basic and clinical researchers in this field frequently pose many fundamental questions that directly relate to unappreciated or undeveloped aspects of NR signaling biology. What NR pathways regulate my gene of interest? What genes are most consistently regulated by a given NR pathway, and how do these targets differ between different tissues? What NR pathways impact my cellular process of interest in different tissues? Although the NR signaling field has generated a large number of expression profiling datasets involving perturbations of NR signaling pathways, numerous factors combine to complicate re-use of these datasets to answer these and other biological questions. Datasets are not consistently archived (Ochsner et al., 2008) and those that are archived in repositories such as the US National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) (Barrett et al., 2009) and European Bioinformatics Institute (EBI) ArrayExpress (Kolesnikov et al., 2015) are frequently under-annotated and poorly exposed for discovery by researchers. The informatic isolation that results limits the reuse of these datasets as a biological continuum, preventing the routine generation or validation of research hypotheses by the NR signaling community. In many cases, researchers must invest time, effort and money in designing and carrying out an experiment for which relevant data points have already been published. 
Apart from the wasted effort, the current period of financial austerity in research funding makes a strong case for the development of tools that will provide for more effective and efficient use of already existing, but currently peripheral, data points.
The recently elaborated FAIR (findable, accessible, interoperable, re-usable) principles on data stewardship (Wilkinson et al., 2016) articulate an ideal biomedical research enterprise in which the major research roles, such as bench researchers, data repositories and publishers, interact in a way that promotes improved efficiency and re-use of research assets. The Nuclear Receptor Signaling Atlas (NURSA) was formed in 2002 with a mandate of improving the discoverability, accessibility and re-use of datasets for the NR signaling community and related research constituencies (Becnel et al., 2015). For a domain-specific biocuration and dataset repository such as NURSA, the FAIR principles have a high degree of relevance to the mission of enabling researchers in this field to make greater use of the abundant ‘omics-scale datasets that it has generated. This paper describes our implementation of the FAIR principles as they relate to biomedical data stewardship, setting out the problems we have encountered and our FAIR-aligned solutions to those obstacles. We illustrate our approach with reference to specific examples and use cases, including an interoperability collaboration between NURSA and the Pharmacogenomics Knowledgebase (PharmGKB) that connects research communities with common interests in transcriptomic datasets relevant to nuclear receptor signaling.
The methodologies supporting the data stewardship approaches described in this paper are discussed in detail in a recent publication (Darlington et al., 2016). Briefly, landing page fields for the biocurated datasets were aligned with recommendations of the Joint Declaration on Data Citation Principles (Martone, 2014) and DataCite (DataCite, 2015) organizations. Integration of the third party reference manager-integrated dataset citation widget was based upon the RIS file format standard. Automated integration between NURSA and the Pharmacogenomics Knowledgebase (PharmGKB) resource was achieved through the development of a RESTful application programming interface (API) (Darlington et al., 2016), permitting gene calls using Entrez Gene ID or approved gene symbol or, for small molecules, PubChem ID. Deposition of dataset metadata with dataset indexing services and search engines was supported by use of the Open Archives Initiative Protocol for Metadata Harvesting standard (Archives, 2016).
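As an illustration of the metadata harvesting step, the following minimal sketch builds an Open Archives Initiative Protocol for Metadata Harvesting request of the kind an indexing service might issue. The base URL and the function name are assumptions made for illustration only; the verb and argument names (ListRecords, metadataPrefix, from, set) are those defined by the protocol itself.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the actual repository base URL is not given in the text.
BASE_URL = "https://example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    metadata_prefix defaults to Dublin Core (oai_dc), which every
    OAI-PMH repository is required to support.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # harvest only records changed since this date
    if set_spec is not None:
        params["set"] = set_spec  # restrict the harvest to one set of records
    return f"{base_url}?{urlencode(params)}"


# Example: an incremental harvest of records modified since a given date.
print(list_records_url(BASE_URL, from_date="2016-01-01"))
```

A harvester would issue this request periodically and follow any resumption tokens in the response; that loop is omitted here for brevity.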
At the inception of NURSA’s third funding cycle in 2012, we embarked upon an extensive re-appraisal of NURSA website content and user interface. To tailor our modifications in scientific scope and web design principles to the needs of our end users, we carried out a survey of the NURSA user community. Although end users expressed satisfaction with the site overall, they requested an increased emphasis on ‘omics dataset integration and analysis tooling. Based upon this response, and given that transcriptomic datasets represent the most abundant ‘omics modality in the field of nuclear receptor signaling, we set out on a systematic effort to enhance their re-use in the field. Here we highlight key aspects of our biocuration and web development approach as they relate to each of the four components of FAIR in turn, discussing for each the problems confronted and the solutions adopted.
1.A. The issue
Multiple points of failure in the existing model of academic scientific funding, research and publishing have contributed to poor findability of transcriptomic datasets. Reluctant to impose additional administrative burdens on authors, many publishers default to a laissez-faire disposition towards dataset archiving, such that only journals with the highest submission volume and rejection rates (the Nature Publishing Group and Cell Press families, for example) can afford to mandate and actively enforce deposition of discovery scale datasets as a condition of publication. To compound matters, journal editorial staff lack the technical expertise to oversee dataset deposition, and manuscript reviewers are primarily preoccupied with the scientific content of the manuscript rather than the deposition status of the associated dataset. As a consequence, for the vast majority of journals we have encountered during our biocuration activities, deposition of a high quality, well annotated dataset, and inclusion of its accession number in the final published version of the article, are entirely at the author’s discretion. Due to a lack of academic incentivization and unwillingness or inability on the part of many principal investigators to commit time and resources, many ‘omics scale datasets are absent from public repositories (Ochsner et al., 2008, Witwer, 2013). Among the more ironic manifestations of this situation are those articles that are supported by unarchived transcriptomic datasets, but that validate their own findings using publicly deposited transcriptomic datasets (Huber-Keener et al., 2012). Although the NIH Genomic Data Sharing Policy (Health, 2016) goes some way to addressing the problem of failure of researchers to archive datasets, it can mandate only that accession numbers be provided at the time of grant renewal (typically once every five years), rather than at the time of publication of the associated article.
The lack of active engagement on the part of many publishers with respect to dataset deposition has negative repercussions for the integration between GEO, ArrayExpress and NCBI’s public bibliographic database, PubMed. Given that many GEO accession numbers are absent from final accepted versions of articles, and consequently unavailable to PubMed curators, integration of GEO records with the corresponding PubMed record is patchy and inconsistent (Neveol et al., 2012). As a result, GEO identifiers are inconsistently annotated in the full indexed PubMed record for a given article and, to compound matters, neither GEO nor MEDLINE supports searching using the other’s unique identifiers (GSE accession numbers and PubMed IDs (PMIDs), respectively). To further complicate the situation, PMID is not a required field when datasets are submitted to GEO, nor are authors required by journals to add PMIDs to their GEO records upon manuscript acceptance. A direct consequence of this situation is that a considerable percentage of the GEO records encountered during our biocuration activities, which we refer to as “orphaned” datasets (Figure 1), are not mapped to PMIDs. As a result, updating GEO records with their corresponding PMIDs has become a routine component of our biocuration standard operating procedure.
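The notion of an “orphaned” dataset can be made concrete with a short sketch. The record layout, accession placeholders and function names below are hypothetical and are not drawn from GEO or from our pipeline; the code simply shows how records lacking a PMID might be identified and then completed from a curator-supplied mapping.

```python
def find_orphaned(records):
    """Return the accessions of records that carry no PubMed ID."""
    return [r["accession"] for r in records if not r.get("pmid")]


def attach_pmids(records, curated_pmids):
    """Return copies of the records with missing PMIDs filled in.

    curated_pmids maps accession -> PMID, as established manually by a
    biocurator; existing PMIDs are never overwritten.
    """
    updated = []
    for r in records:
        pmid = r.get("pmid") or curated_pmids.get(r["accession"])
        updated.append({**r, "pmid": pmid})
    return updated


# Placeholder records (not real accessions or PMIDs).
records = [
    {"accession": "GSE_A", "pmid": "111"},
    {"accession": "GSE_B", "pmid": None},
    {"accession": "GSE_C"},
]
print(find_orphaned(records))
print(find_orphaned(attach_pmids(records, {"GSE_B": "222"})))
```

Records that remain orphaned after this step are those for which no associated article could be identified and which therefore require further curator attention.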
A final problem with regard to findability of transcriptomic datasets in the current academic research model relates to their visibility in searches. Whether in archived CEL files or supplementary PDFs, transcriptomic data points are largely opaque to popular search engines such as Google. Such search engines perform excellently in text retrieval, but are comparatively inadequate in the retrieval of experimental gene regulation data points, generating large amounts of noisy search results that require considerable parsing on the part of the user with little guarantee of gleaning meaningful information.
1.B. Our solution
The cornerstone of our strategy to improve data findability is the creation of secondary versions of the primary archived datasets and the establishment of persistent linkages between their landing pages and key relevant nodes in the digital biomedical research ecosystem (Darlington et al., 2016). Given its well-established infrastructure, broad community adoption and familiarity to researchers, the DOI standard was selected to support the creation of these linkages. NURSA DOIs support bidirectional links between NURSA dataset landing pages and publisher partner articles (Figure 2A) (Darlington et al., 2016) and, since few publishers support retroactive journal-database record linkages, PubMed records (Figure 2B). The cultivation of alliances with strong publishing brands such as Elsevier and Public Library of Science has the added benefit of strengthening NURSA’s own brand in the community and has resulted in their facilitating proactive contact of our biocuration team with authors of accepted manuscripts. This direct contact with authors addresses the largest obstacle to dataset biocuration archiving – lack of access to the original authors of the datasets – and results in more rapid deposition of more complete, comprehensively annotated datasets. Findability is also enhanced by providing for indexing of dataset metadata by dataset search engines such as bioCADDIE DataMed (Figure 2C) and Thomson Reuters Web of Science Data Citation Index (Darlington et al., 2016). To increase the visibility of NURSA transcriptomic data assets to researchers in disparate communities with no explicit connection to NR signaling pathways, we have established access points to the NURSA datasets and/or the Transcriptomine search engine from databases that curate information on small molecules (ChEBI, Figure 2D) and their genomic targets (NCBI Entrez Gene LinkOut). 
The benefit of NURSA’s more intensive biocuration is demonstrated in Figure 2D, which shows Gene Expression studies mapped to the ChEBI record for the GR agonist dexamethasone. Not only are NURSA records more numerous than ArrayExpress records (28 vs 21), they are more accurately mapped: note the number of Arabidopsis studies inappropriately mapped to dexamethasone. Depositors often assign names for primary dataset depositions using the title of the associated article, which frequently gives no indication as to the nature or design of the underlying dataset (Swindell et al., 2014). To mitigate this, the NURSA biocuration team assigns names to curated datasets that unambiguously declare the essential regulatory parameters and the design of the dataset: compare the name for the GEO dataset cited above and that of its NURSA-curated derivative (Darlington et al., 2016).
Journal research articles and their associated reference lists represent rich environments for the discovery of unfamiliar science. By providing for citation of their datasets as scholarly works alongside citations of research articles in article reference lists, repositories can tap into this infrastructure to greatly enhance the findability of their dataset records. Indeed, the importance of citing datasets in articles and research proposals to their status as first class research objects is widely accepted (Borgman, 2011, Goodman et al., 2012, Margolis et al., 2014, Martone, 2014). Accordingly, we have made provision for one-click downloading of a JDDCP/FAIR-compliant dataset metadata record in a format compatible with the four major reference managers of choice. In addition to supporting their discovery by article readers, citation of datasets in this way ensures accreditation to original authors, which in turn incentivizes their deposition of future datasets to complete the cycle of discovery and re-use (Darlington et al., 2016).
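To illustrate what a reference manager-compatible dataset record might look like, the sketch below serializes dataset metadata in the RIS tagged format, using the DATA reference type. The function name, field selection and example values (including the DOI and URL) are placeholders chosen for illustration and do not reproduce the exact record served by the NURSA citation widget.

```python
def dataset_to_ris(title, authors, year, doi, url):
    """Serialize dataset metadata as a single RIS record.

    Each RIS line is a two-letter tag, two spaces, a hyphen and a space,
    followed by the value; a record opens with TY and closes with ER.
    """
    lines = ["TY  - DATA", f"TI  - {title}"]
    lines += [f"AU  - {author}" for author in authors]  # one AU line per author
    lines += [f"PY  - {year}", f"DO  - {doi}", f"UR  - {url}", "ER  - "]
    return "\n".join(lines) + "\n"


# Placeholder metadata, for illustration only.
print(dataset_to_ris(
    title="Example transcriptomic dataset",
    authors=["Author, A.", "Author, B."],
    year=2016,
    doi="10.0000/example",
    url="https://example.org/datasets/example",
))
```

Because RIS is plain text, the same record can be offered as a one-click download and imported without modification by any reference manager that supports the format.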
2.A. The issue
The quality of archived datasets and their associated metadata is a major determinant of their accessibility. The often superficial or non-existent peer-review of datasets during the manuscript review process however, and the lack of rigorous oversight of dataset deposition, give rise to recurring problems that at best complicate, and at worst completely prevent, their downstream re-use. Problems at the dataset level that we have encountered during our biocuration activities include incomplete information on data normalization and absence of replicates (Notas et al., 2010, Sharova et al., 2007) and failure to deposit all the required data files (Boyer et al., 2005). In some more extreme cases (Ajj et al., 2013), technical details on the transcriptomic datasets are entirely absent from the materials and methods and supplementary material sections, even though they contribute in large part to the entire basis for a study, and are discussed in the results. In other cases, articles do not contain any fold changes from the arrays and as a consequence our curators have been unable to validate the processed data files against author-reported data points (Liu et al., 2015). Even many properly archived datasets are only nominally accessible to researchers lacking informatics expertise: many of the file types involved are unfamiliar to most bench researchers, and the computational investment involved in generating relative abundances and associated measures of statistical significance across multiple datasets is prohibitive. Although the GEO2R feature in GEO alleviates some of this burden (NCBI, 2016), it is limited to a single dataset, and assumes that the user is sufficiently conversant with the experimental design to set up meaningful experimental contrasts.
Transcriptomic and ChIP-Seq datasets reflect highly specific spatiotemporal contexts and accordingly, if they are to be meaningfully interpreted and compared with each other, they must be associated with a substantial amount of detailed, accurately mapped metadata. Unfortunately, the lack of familiarity with metadata standards of many of the laboratory personnel tasked with their deposition, and the limited domain expertise on the part of primary data repository curators, have given rise in many cases to cursory and uneven annotation. Rather than being mapped to community-endorsed universal identifiers for regulatory small molecules and genes or biosamples, these critical experimental parameters are often encoded in a cryptic and ad hoc shorthand. One consequence of this is that related datasets are not organized into biologically meaningful categories in public repository user interfaces, requiring users to identify and retrieve “like” datasets using free text queries with often noisy or unpredictable search results. An additional consequence is that our own biocurators are required to expend considerable time and effort in retrospectively parsing metadata from the related publications.
2.B. Our solution
As the number of datasets in our resource grows, so the need increases to organize biologically related datasets for routine retrieval by specific research constituencies: some researchers are interested in all datasets related to a specific organ, for example, whereas others might be focused on datasets related to a given signaling pathway across all organs. The two primary descriptors that define a transcriptomic dataset are the biosample from which the RNA was generated, and the regulatory molecule that defines the primary experimental variable (Becnel et al., 2016). To enhance the accessibility of datasets to users of our resource, our biocuration approach incorporates a step that maps all datasets to hierarchical “catch-all” terms that group datasets related by pathway and biosource. In this way, users are not required to run multiple iterative queries to retrieve all datasets related to a given signaling pathway, or physiological system and organ (Becnel et al., 2016). Instead, they can filter datasets in the directory according to their pathway or biosample of interest using intuitive drop-down menus (Figure 3). An additional benefit of this approach is that when researchers arrive at a dataset landing page from an external site, they can identify datasets related by regulatory molecule or by biosource (Darlington et al., 2016).
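The effect of mapping datasets to hierarchical “catch-all” terms can be sketched as follows. The small hierarchy, dataset names and function names are invented for illustration and do not reflect the controlled vocabulary actually used in our resource; the point is only that a single query on a parent term retrieves every dataset annotated with any of its descendants.

```python
# Hypothetical child -> parent relationships for biosample terms.
PARENT = {
    "white adipose tissue": "adipose",
    "brown adipose tissue": "adipose",
    "adipose": "metabolic",
    "liver": "metabolic",
}


def lineage(term):
    """Return the term followed by all of its ancestors, most specific first."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def filter_by_term(datasets, term):
    """Return names of datasets whose biosample is the term or a descendant of it."""
    return [d["name"] for d in datasets if term in lineage(d["biosample"])]


# Placeholder datasets annotated at different levels of the hierarchy.
datasets = [
    {"name": "dataset 1", "biosample": "white adipose tissue"},
    {"name": "dataset 2", "biosample": "liver"},
    {"name": "dataset 3", "biosample": "prostate"},
]
print(filter_by_term(datasets, "adipose"))
print(filter_by_term(datasets, "metabolic"))
```

The same pattern applies to the pathway axis, with regulatory molecules mapped upward to receptors and signaling pathway families.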
2.B.2. Quality Control
Once an accession number is issued by a primary data repository and the associated paper is published, opportunities to correct the data record are greatly limited. There is little enthusiasm either on the part of authors to correct problems with archived datasets, or on the part of publishers of articles associated with such datasets to mandate such retrospective action from the authors. Given that the time and effort required on the part of our biocuration group to troubleshoot flawed depositions would exclude timely curation of properly-archived datasets, our role with respect to depositions with irretrievable deficits is necessarily limited to flagging them for exclusion from our database. One notable success story, and a validation of our model of retrospective biocuration of primary datasets, is the instance of a mismatch between relative abundance values generated during our biocuration and those reported by the authors in the article, which resulted in a correction to the published article (Mamrosh et al., 2015).
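A consistency check of the kind that exposes such mismatches can be sketched in a few lines. This is a simplified illustration rather than our production procedure: the function name, the tolerance value and the gene labels are assumptions, and fold changes are assumed to be positive ratios so that they can be compared on a log2 scale.

```python
import math


def find_mismatches(curated, reported, tolerance=0.5):
    """Return genes whose curated and author-reported fold changes disagree.

    Both arguments map gene symbol -> fold change (a positive ratio).
    A gene is flagged when the absolute difference between the two values
    on a log2 scale exceeds the tolerance. Genes that the authors report
    but that are absent from the curated data are skipped here.
    """
    flagged = []
    for gene, reported_fc in reported.items():
        if gene not in curated:
            continue
        difference = abs(math.log2(curated[gene]) - math.log2(reported_fc))
        if difference > tolerance:
            flagged.append(gene)
    return sorted(flagged)


# Placeholder values: GENE_B is induced in the article but repressed in the curated data.
curated = {"GENE_A": 4.0, "GENE_B": 0.5}
reported = {"GENE_A": 3.8, "GENE_B": 2.0}
print(find_mismatches(curated, reported))
```

Comparing on a log scale treats induction and repression symmetrically, so that a reversal of direction is flagged as readily as a discrepancy in magnitude.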
2.B.3. Web Development
Our web development approach places a strong emphasis on usability and on the amenability of the datasets to re-use by researchers: put another way, users are unlikely to make use of a resource that is not intuitively accessible. When a researcher lands on a dataset page, experimental contrasts and their associated transcript relative abundances and measures of statistical significance are all pre-defined and accessible using a drop-down menu (Figure 3B). Detailed metadata are available at both the experiment and dataset levels, and detailed regulation reports on individual transcripts across the entire universe of data points can be displayed as engaging scatterplots with the click of a mouse (Becnel et al., 2016). Given the widespread adoption of mobile devices, in particular cellular handsets, to access Internet content, we adhere to responsive principles in web design, ensuring as far as possible that the accessibility requirements of mobile device users are addressed. Finally, data and metadata can be downloaded in spreadsheet format for downstream analysis in a user’s third party software of choice.
3.A. The issue
The same biocuration deficits that beset findability of datasets in GEO and ArrayExpress impact their automated interoperability with external entities wishing to leverage the underlying data points to add value to their own resource. The problems posed are exemplified by the fact that, despite their being components of the same organization, biologically meaningful connections between GEO and other NCBI entities are yet to be established: the PubChem record for 17β-estradiol, for example, does not contain a listing of GEO datasets in which this small molecule is an experimental variable. This lack of integration represents a substantial missed opportunity to leverage transcriptomic datasets as knowledge mines to connect disparate research communities.
3.B. Our solution
To support interoperability with external resources, data points across all NURSA datasets are exposed via RESTful web services for retrieval using controlled vocabularies and regulatory molecule unique identifiers (Darlington et al., 2016). An example of the use of APIs to connect NURSA with resources curating content complementary to its own is an ongoing BD2K-funded interoperability project between NURSA and the Pharmacogenomics KnowledgeBase (PharmGKB, www.pharmgkb.org). PharmGKB is a comprehensive compendium documenting the effect of variations in the sequences of human genes on the response to drugs and is widely used by both clinicians and basic researchers (Thorn et al., 2013). Given that many drugs are small molecule regulators of NR function, and NURSA curates datasets in which these molecules are regulatory variables, the establishment of automated, programmatic connections between NURSA and PharmGKB would enhance each resource’s website content and the research experience of their respective user bases. The NURSA and PharmGKB biocuration standard operating procedures involve mapping records to universal molecule identifiers such as PubChem ID and Entrez Gene IDs, which are exposed through APIs that both groups have developed. Accordingly, API-supported interoperability was established between the two sites such that PharmGKB records for drugs that are NR ligands were linked to NURSA datasets in which these molecules were perturbants (Figure 4A) and, reciprocally, available PharmGKB-curated drug metabolism pathway and pharmacogenomic drug-gene variant interaction modules were displayed in appropriate NURSA ligand Molecule Pages (Figure 4B). The net result of this interoperability is to expose essential information on aspects of drug action that might not otherwise be readily accessible to, respectively, the pharmacogenomic and nuclear receptor signaling research communities.
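The principle underlying this linkage, namely that two independently curated resources can be joined wherever they share a universal identifier, can be sketched as below. The record structures, names and identifier values are placeholders and do not represent the actual NURSA or PharmGKB API responses; in practice each list would be retrieved from the respective web service.

```python
def cross_link(drug_records, dataset_records):
    """Map each drug name to the datasets that share its PubChem identifier.

    Both inputs are lists of dicts carrying a "pubchem_cid" key. Drugs with
    no matching dataset map to an empty list.
    """
    datasets_by_cid = {}
    for dataset in dataset_records:
        datasets_by_cid.setdefault(dataset["pubchem_cid"], []).append(dataset["name"])
    return {
        drug["name"]: datasets_by_cid.get(drug["pubchem_cid"], [])
        for drug in drug_records
    }


# Placeholder records; the identifiers are illustrative, not real PubChem CIDs.
drugs = [
    {"name": "drug X", "pubchem_cid": 1001},
    {"name": "drug Y", "pubchem_cid": 1002},
]
datasets = [
    {"name": "dataset 1", "pubchem_cid": 1001},
    {"name": "dataset 2", "pubchem_cid": 1001},
    {"name": "dataset 3", "pubchem_cid": 9999},
]
print(cross_link(drugs, datasets))
```

Because the join depends only on the shared identifier, neither resource needs knowledge of the other’s internal data model, which is what makes mapping to universal identifiers during biocuration a prerequisite for this kind of automated integration.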
4.A. The issue
The format of the journal research article, predicated upon lengthy exposition, interpretation and attribution, has remained substantially unchanged since its introduction in the mid-19th century. The structural and dimensional constraints of the research article dictate however that it is neither practical nor feasible to convey all the findings from global expression studies in detail. As a result, authors of articles containing ‘omics-scale datasets typically validate and interpret only those data points most relevant to their experimental hypothesis, consigning hundreds or thousands of potentially useful expression fold changes to unwieldy, patchily annotated spreadsheets or PDFs. Due to the problems with dataset findability and accessibility outlined above, pointing to primary dataset records from journal articles does little to advance the potential of these datasets for re-use.
4.B. Our solution
As described in the Findability section above, NURSA places a strong emphasis on cultivating relationships with publishers of scientific research journals. In addition to facilitating our biocuration efforts, the development of these relationships gives NURSA the opportunity to work with publisher development teams to embed links from research articles to their respective dataset landing pages on the NURSA website. By providing journal readers with one-click access to a universe of contextual data points that enrich and add value to the original research article, NURSA datasets make a significant contribution to the re-use of these data points for the discovery of unfamiliar or unappreciated biology. An example of the striking visual impact of the Transcriptomine Regulation Report is shown in Figure 5, which summarizes the NR signaling pathways impacting expression of the gene encoding fatty acid binding protein 4 (FABP4), a lipid transport protein present primarily in adipose tissue and macrophages. Firstly, numerous animal and cell model data points place expression of the gene in context, redundantly documenting induction of FABP4 in adipogenesis (Ochsner et al., 2009, Christian et al., 2005) and the correlation between its expression and fat whitening (Wu et al., 2012). Secondly, data points from numerous datasets illustrate the close regulation of FABP4 expression by the PPARA and PPARG signaling pathways, two very well characterized adipose regulatory paradigms (Szatmari et al., 2007, Zacharewski et al., 2013, Ochsner et al., 2009, Finck et al., 2005).
Other signaling modalities that are convincingly conveyed by the FABP4 Regulation Report are: repression by the AR/androgen pathway (Lin et al., 2009, Kazmin et al., 2006), consistent with androgenic suppression of adipogenesis (Chazenbalk et al., 2013, Singh et al., 2006); and induction by the GR/glucocorticoid pathway (Hoffman et al., 2005, James et al., 2007, lab et al., 2010), reflective of the widespread use of the GR agonist dexamethasone as an adipogenic stimulant in cultured cells (Scott et al., 2011).
The existing NURSA infrastructure, developed over 14 years of funding with support from six different NIH institutes, has to date focused on transcriptomics of nuclear receptor (NR) signaling pathways (Becnel et al., 2015). NURSA has developed an international userbase with a nearly 3,000 person-strong e-mail newsletter following. It is a J2EE web application with a D3.js-based interactive graphical user interface and a platform-aware design, as well as controlled terminologies for semantic interoperability and RESTful web services for data exchange and syntactic interoperability (Becnel et al., 2015). Our resource combines a number of innovative strategies and features that both distinguish it from, and complement, existing pathway resources such as Reactome (Fabregat et al., 2016) and Pathway Commons (Cerami et al., 2011). These include publisher alliances to link articles to biocurated datasets (Darlington et al., 2016), novel approaches to biocuration and mining of ‘omics datasets (Becnel et al., 2017), and provision for routine citation of datasets as first-class research objects (Darlington et al., 2016).
Public transcriptomic archives such as GEO and ArrayExpress are outstanding resources for long term deposition of raw files for transcriptomic datasets, and are deserving of continued financial support. Unfortunately, datasets in these archives have not been effectively integrated with an expanding volume of scientific research output, a state of affairs that substantially undermines their collective informatic potential. Such deficits relate primarily to the fact that these repositories operate under a passive model with respect to public dataset deposition. As such, the onus is on investigators to approach the repository, rather than on the repository to actively seek out publicly funded datasets associated with accepted manuscripts. In response to this we have combined proactive engagement of authors and publishers by our biocuration team with data analysis tool development to place data points from transcriptomic studies relevant to NR signaling pathways at the fingertips of research biologists (Darlington et al., 2016, Becnel et al., 2016). Future efforts will extend our model to other ‘omics modalities and pathway paradigms to provide basic and translational researchers with broader insights into the relationship between cellular signaling and human disease.
Although such direct feedback from the community is encouraging, our model does have a number of limitations. Principal among these is the uncertain extent to which the time and effort invested in biocuration is repaid in the form of added value for data re-users. Given the popularity of research resources and tools that do not incorporate a biocuration element, such as Galaxy (Afgan et al., 2016), it is far from clear that the value of biocuration is universally appreciated. Another problematic aspect of our approach is that by its very nature, interoperability between two resources establishes a dependence of each upon the other for full functionality. As more and more nodes are connected interoperably via web services, so the need for each resource to maintain and update its web services and documentation grows proportionately, which in turn adds to the workload for software architects and web developers. A third limitation is one that applies to the field of biocuration more generally, and that relates to the prevailing model of academic reward and recognition. Although their work is just as intellectually demanding as bench research, biocurators, at least in the field of cell signaling, do not enjoy the same potential for personal recognition and reward that is afforded to hypothesis-driven bench scientists. The change in biomedical research culture heralded by the BD2K initiative (Margolis et al., 2014) suggests however that on the part of NIH at least, there is a commitment to making biocuration attractive as a viable career option for a broader population of trained scientists.
The FAIR Principles were formally published in 2016 as the synthesis of extensive efforts and discussions across the data science and scholarly research stewardship communities (Wilkinson et al., 2016). Our intent in this paper was to convey the experiences of a biocuration and web development group in a specific area of research, nuclear receptor signaling, that has incorporated the FAIR principles into its standard operating procedures. This is not to imply that issues around accessibility and interoperability are limited to basic biology: poor deposition rates to clinical trials repositories such as ClinicalTrials.gov have also been reported, for example (Piller, 2015). This issue is compounded by partial reporting and bias toward the selective publication of positive findings (Rifai et al., 2014), despite the intrinsic value of data from failed trials in helping avoid duplication of expensive human and co-clinical trials. Metadata and data exchange standards exist from the Clinical Data Interchange Standards Consortium (CDISC), whose standard structures for tabulation and analysis datasets are now required for regulated clinical trial submissions to the Food and Drug Administration and Japan’s Pharmaceuticals and Medical Devices Agency (FDA, 2014). These initiatives notwithstanding however, no broad requirement for data standardization for non-regulated clinical protocols currently exists.
It can be quite reasonably pointed out that we have not provided here any objective, quantitative metrics on the efficacy of our approach. Given the fact that our FAIR biocuration strategy has only recently reached maturity however, we submit that any current analysis of the efficacy or impact of our efforts, or the FAIR principles in general, would be premature. That is not to diminish in any way the importance of such an appraisal taking place in the future however, and a reliable perspective on the true value of the FAIR principles can emerge only from a series of comprehensive and objective retrospective appraisals of their impact across a variety of scholarly fields. A number of models have been proposed for such evaluations, most notably the case study-based approach espoused by Darke (Darke et al., 1998), Yin (Yin, 2013) and colleagues. The experiences of biocuration groups such as our own, as well as of data re-users in the research community, will be of considerable value in furnishing data for such case studies, so that the lasting impact of the FAIR principles can be accurately, and fairly, gauged.