DATA POLICY

The first purpose of data policy should be to serve the objectives of the organization or project sponsoring the collection of the data. With research data, data policy should also serve the broader goals of advancing scientific and scholarly inquiry and society at large. This is especially true with government-funded data, which likely comprise the vast majority of research data. Data policy should address multiple issues, depending on the nature and objectives of the data. These issues include data access requirements, data preservation and stewardship requirements, standards and compliance mechanisms, data security issues, privacy and ethical concerns, and potentially even specific collection protocols and defined data flows. The specifics of different policies can vary dramatically, but all data policies need to address data access and preservation. Research data gain value with use and must therefore be accessible and preserved for future access. This article focuses on data access. While policy might address multiple issues, at a first level it must address where the data stand on what Lyon (2009) calls the continuum of openness. Making data as openly accessible as possible provides the greatest societal benefit, and a central purpose of data policy is to work toward ethically open data access. An open data regime not only maximizes the benefit of the data, it also simplifies most of the other issues around effective research data stewardship and infrastructure development.

• permits the creation of new data sets, information, and knowledge when data from multiple sources are combined,
• helps transfer factual information to and promote capacity building in developing countries,
• promotes interdisciplinary, inter-sectoral, inter-institutional, and international research, and
• generally helps to maximize the research potential of new digital technologies and networks, thereby providing greater returns from the public investment in research."
In the 2009 "Climategate" controversy, senior climatologists were accused of manipulating important global temperature data. This provides one stark example of the importance of open research data for the integrity of science. While multiple, high-level investigations cleared the researchers of any willful wrongdoing, the investigations also emphasized the need for data to be more openly available to ensure credibility and avoid future misguided controversy (House of Commons Science and Technology Committee, 2010; Oxburgh et al., 2010; Russel et al., 2010). Indeed, data transparency has proven to be an antidote to those rare instances of scientific fraud (Ekborn et al., 2006). Uhlir and Schröder (2007, Table 2) give multiple examples of successful open science networks. The openness of these networks enhances the interoperability of their data and permits the widest range of experimentation and development.
Research is now beginning to quantify the benefits of open data. Piwowar et al. (2007) find that open cancer trial data were significantly associated with a 69% increase in citations, independent of journal impact factor, date of publication, and author country of origin. Pienta et al. (2010) find that sharing social science data, especially sharing data through an archive, leads to many times more publications than not sharing data. Blumenthal et al. (2006) report that more than 80% of respondents in a large survey of life scientists had a positive experience when sharing data. On the flip side, Vogeli et al. (2006) report that more than half the respondents in a survey of young researchers found that data withholding had a negative effect on the progress of their research.
Data policy has evolved to increasingly emphasize open access to data. Historically, research data have been seen as the researcher's private property or perhaps as a national asset that should be protected or exploited as a commodity.
In the United States and several other countries, this attitude is changing toward viewing data as a common, public good. Uhlir and Schröder (2007) note several other limited cases where public interest might be served by limited proprietary restrictions, notably with public-private relationships. These restrictions should be rare, explicit, and well justified. A new Royal Society study in the United Kingdom explores this issue of "Science as a public enterprise" and asks "how scientific information should be managed to support innovative and productive research that reflects public values" (http://royalsociety.org/policy/sape/). Overall, we best serve science and society when we view data as a common, public good, but the public domain status of factual data is a complex legal subject (see, for example, Boyle, 2003; Reichman & Uhlir, 1999; Reichman & Uhlir, 2003). Nevertheless, past examples demonstrate the benefit of making data available in an open "commons" with very limited retention of intellectual property rights through permissive licenses or by formally asserting the data are part of the public domain. An intriguing approach has been proposed by Creative Commons in their Protocol for Implementing Open Access Data (http://sciencecommons.org/projects/publishing/open-access-data-protocol/). The idea is that researchers place their data as fully as possible in the public domain while also asserting expected norms (not legal requirements) of ethical behaviour for users and providers of the data. Examples of this approach include the interdisciplinary Polar Information Commons (http://polarcommons.org/) and the Personal Genome Project (http://www.personalgenomes.org/).
Casey, Pulsifer, and Tilmes
This approach of an information commons is likely to be the simplest, most useful, and sustainable approach to address issues of open data access. We must, however, recognize legitimate ethical restrictions to data access that protect privacy of human subjects, respect the rights of Indigenous knowledge holders, and avoid situations where data release could cause harm. (These sorts of ethical restrictions are well described in the data policy for the very large and interdisciplinary International Polar Year: http://classic.ipy.org/Subcommittees/final_ipy_data_policy.pdf) Wherever possible, necessary restrictions should be guided by norms of ethical research rather than by legal contracts.

TEN-YEAR VISION
In ten years, all research data are readily discoverable and the vast majority of data are open and in the public domain. Data are used ethically according to the norms of the research community, including fair attribution.

CURRENT CHALLENGES
Despite the clear benefits of open data and existing high-level, international policies and guidelines, implementation of the principle of full and open access is highly variable at the national level. In some nations, the national policy is entirely inconsistent with the principle. Research data collections, in particular, are often the most likely to be unavailable. A recent PARSE.Insight (Insight into issues of Permanent Access to the Records of Science in Europe) survey suggests that only 25% of research data across many disciplines are openly available (Kuipers & van der Hoeven, 2009). Furthermore, there is huge variability in attitudes toward data sharing across research disciplines (Key Perspectives Ltd, 2010). Some of this restriction might be a result of past practice that encouraged embargoing of data until formal publication. Some of it might be related to perceived commercial value of data and the role that lawyers play in promoting agreements for intellectual property rights. Reasons researchers cite for denying data access include concern over data misuse and ethical or legal breaches (i.e., a lack of trust) and simply the effort and cost necessary to make data available. Significant barriers to open access and wise data stewardship also arise from the lack of professional data management resources and training within research programs as well as inadequate support for data archives (Campbell et al., 2002; Hedstrom & Niu, 2008; Kuipers & van der Hoeven, 2009; Parsons et al., 2011). In some disciplines, patents and other restrictions are used as a purported way to provide financial incentive and to support future research, but these incentives often do not live up to their financial promise, especially in academic research where restriction imposes more cost than reward (Schofield et al., 2009).
More generally, it appears a major challenge to increased data sharing is rooted in the culture of science and the research academy. A researcher's merit is judged largely on the number and quality of peer-reviewed publications. This one-dimensional metric provides little incentive to compile and document data beyond the needs of the original research. Indeed, it creates an incentive to restrict data access in order to maximize the number of publications a researcher can produce from the data. There is a cost to a researcher to share data, but others receive most of the benefits. Data producers can be leery of sharing data with "outsiders". This results in project or disciplinary data silos that hinder interdisciplinary research. A recent symposium hosted by the US National Academy of Sciences notes that researchers in developing countries are affected most by these challenges, and that, in all countries, there is a lack of norms and traditions for open data sharing in collaborative research. Furthermore, governments in many developing countries treat publicly generated or publicly funded research data either as secret or commercial commodities (http://sites.nationalacademies.org/PGA/biso/PGA_061508).
Data Science Journal, Volume 12, 10 August 2013 GRDI45

RESEARCH DIRECTIONS PROPOSED
Achieving an open and ethical data regime will require both a top-down and bottom-up approach. From the top down, work is needed to harmonize policy across national jurisdictions in accordance with common principles of openness and ethical use. These policies then require support for implementation at the national level. From the bottom up, research communities need to define and develop norms of ethical, collaborative data sharing. While we can and should converge around common principles, the details of data policy will necessarily vary with discipline and research culture. Therefore, a crucial challenge is making policy formation a dynamic, interactive process involving all stakeholders. The challenge is not only to ensure that policies are met with concrete actions by practitioners, but also that the practical experiences and issues related to applying the policies are heard and fed back into the policy. This helps ensure "buy-in" and can motivate researchers to invest time in policy development and maintenance. In this light, the most sustainable and adaptable policy approach is one based on community-accepted norms of ethics and behaviour rather than rigorous enforcement of licenses and contracts. This norms-based approach has historically guided research ethics; and the institutions of academia, research sponsors, and publishers have upheld research integrity. The norms themselves (citation of original work, peer-review, etc.) are based on core research principles such as reproducibility, transparency of methods, and evidence-based assertion. Open data access with appropriate ethical restrictions can be viewed as a new core principle for developing a global research infrastructure.
While keeping a strong emphasis on open data, we must also recognize that there are great complexities in the details. An evolving legal and social science research agenda is needed to best balance society's need for open data against the need to protect people, heritage, endangered species, and cultures from misuse. Information science has begun to explore some of these issues in medicine (Berner & Moss, 2005), human rights (Cook, 2006), and locational privacy (Kisselburgh, 2008), but broader research focused on data rather than publications is needed. A related and important issue is trust, both in terms of trusting others with one's data sets, and trusting the data that one comes across in a journal article (how accurate or believable are the data and results?). More research is needed around issues of researcher identity and effective incentives to foster trust and motivate data sharing.
The research community and funding agencies must recognize the intellectual effort necessary to compile and document a good data set and provide the relevant incentives to produce and share good data. Encouraging formal data citation as a means of crediting data authors (Parsons et al., 2010) is one example of a soft incentive to encourage data sharing, but it is unclear how effective this incentive is. Often the strongest incentive to encourage sharing is when depositing data in an open archive is a requirement for ongoing research funding and when that requirement is supported by identified, funded archives (Parsons et al., 2011). A legitimate question for research, then, is whether these sorts of hard requirements lead to lower quality data being deposited. To address this concern, it is important that a good working relationship be established between the archive and the data provider.
In general, enhancing the relationship between data producer, archive, and data consumer can have multiple benefits (Parsons et al., 2011). This suggests, as has been stated elsewhere (ARL, 2007; ICSU, 2004; Moore & Anderson, 2010; NSB, 2005), that there will be an increasing need for informatics specialists working with social systems and improving human understanding as well as grappling with complex technical issues. Society will increasingly rely on these professionals, who act as translators, to make complex, distributed data accessible and useful. Training, development, and reward structures for these new professionals need to be top priorities. At the same time, researchers themselves need education through structured curricula in the fundamentals of data management.

RECOMMENDATIONS
At a first order, research funding agencies and foundations that fund data collection must assume responsibility for the management of those data and the development of the necessary infrastructure to support their preservation and use. To maximize the return on this investment and to enhance efficiency, it is in the sponsor's best interest to demand that the data collected be archived and be as openly accessible as possible. This means that a strong, high-level policy statement about openness and transparency is an important first step. It is critical that national governments and public research institutes work toward a more harmonized policy around the principles of open access and ethical use. To that end, the US National Science Board (NSB) Task Force on Data Policies has developed a draft statement of principles that, while preliminary, can help NSF and their stakeholders to frame and examine current and emerging issues associated with science and engineering data and develop relevant policies (http://www.nsf.gov/nsb/committees/dp/principles.pdf). These principles, in concert with those mentioned earlier by the OECD, GEO, and others, can provide the necessary guidance to help other agencies and governments have a more consistently open data policy.
Most research data should be viewed as a common, networked good that is generally open, except for legitimate ethical, but not proprietary, restrictions. Data should be used in an ethical framework where data producers are given fair attribution and data uncertainties are well characterized. Data managers play a vital role in this ethical framework by working with data producers to ensure the integrity and context necessary to support robust science. Creating this ethos is clearly a long-term challenge, but short-term strategies can help move us in the right direction.
• Data policies should include clear identification of roles, responsibilities, and resources. It is especially critical to define a mechanism for data deposit into an archive; this mechanism needs to be funded, ready-to-use, and easy. Additionally, there should be professional data curators responsible for identifying data for archiving and encouraging and assisting data producers with sharing their data.
• Data management plans and archiving should be required and funded as part of basic research proposals. The new NSF policy mandating data management plans in proposals is an example (http://www.nsf.gov/publications/pub_summ.jsp?ods_key=nsf11001).
• Policy implementation needs to make sure sponsors of the data collection involve all stakeholders in active leadership.
• Basic data management training should be included in the core scientific curriculum. Just as all types of researchers are required to take a research methods class to get an advanced degree, they should also be required to take a "data class".
• Sponsors, journals, reviewers, and academia at large need to encourage fair and formal credit and attribution for data producers. It is unclear how great an incentive attribution is, but if credit is not provided, it will dissuade many from contributing their data.
• Data curators need to establish close working relationships with data producers (as well as data users), based on mutual trust. They should clearly demonstrate the value of their archives to both data consumers and producers, and they should provide information to data producers about how their data are being used and by whom.
Ultimately, to develop a robust and useful e-infrastructure for data, the entire research community must work toward making data as open as possible while defining the boundaries of ethical use.