CRIS AND INSTITUTIONAL REPOSITORIES

CRIS (Current Research Information Systems) provide researchers, research managers, innovators, and others with a view over the research activity of a domain. IRs (institutional repositories) provide a mechanism for an organisation to showcase through OA (open access) its intellectual property. Increasingly, organizations are mandating that their employed researchers deposit peer-reviewed published material in the IR. Research funders are increasingly mandating that publications be deposited in an open access repository: some mandate a central (or subject-based) repository, some an IR. In parallel, publishers are offering OA but replacing subscription-based access with author (or author institution) payment for publishing. However, many OA repositories have metadata based on DC (Dublin Core), which is inadequate; a CERIF (Common European Research Information Format) CRIS provides metadata describing publications with formal syntax and declared semantics, thus facilitating interoperation or homogeneous access over heterogeneous sources. The formality is essential for research output metrics, which are increasingly being used to determine future funding for research organizations.

technique, such as an experimental protocol or a computer program for simulation or statistical reduction, from another topic area. As a result of one of the above or by an independent search, a researcher may find a potential collaborator or complementary co-worker for a research idea.
One measure of a researcher's capability is evaluation of the output produced. The more complete and accessible the outputs are, the better the quality of the evaluation. The metrics imposed on the raw data (i.e., how one ranks different publication channels such as journals) are a separate issue, but without complete and verifiable raw data, evaluations are worthless. Similarly, the performance of an organisational unit can be evaluated based on its outputs. Indeed, one could compare inputs (funding) with outputs as evaluated to obtain some idea of effectiveness and efficiency.
One may wish to evaluate the literature in different topic areas or fields of research. This may inform strategic decisions on research funding or areas of priority in a research institution. The literature provides a source of ideas, usually with associated research to demonstrate their potential use. This is a mine of information for the entrepreneur or innovator who wishes to invest venture capital to create products or services with associated wealth creation (jobs, profits for shareholders).
Today's teaching material is the research output of years ago. As the pace of learning increases and the volume of research output increases, there is a need for faster and easier access to appropriate research literature by educators. Modern learning is more project-based and less 'chalk and talk.' Students are encouraged to utilise technology to find relevant information.
Journalists and other media professionals need easy access to research outputs in order to find interesting 'stories' for popularising, to research (verify) the background to 'urban myths' about research, and to find researchers suitable for appearing on TV programmes or writing articles.

Conclusion
We can conclude that all these actors, in the various example roles discussed, require easy (fast, efficient) access to research output material. Technically this implies the need for excellent descriptive metadata, fast searching of metadata, fast searching of text and multimedia, and well-structured results. Furthermore, access to heterogeneous distributed repositories should appear homogeneous and local to the end-user. This implies reconciliation to a canonical syntax (structure) and semantics (meaning), which in turn is likely to involve translation of character sets, language, and ontological terms. Legalistically it requires unfettered access, although restrictive metadata may document, for software to enforce, claimed rights that should be respected (like attribution) and may even define a price for access. Economically it requires a business model where costs are minimised (ideally zero as seen by the end-user), any income lies where the work is done, and costs are borne where benefit is obtained. Furthermore, the actors ideally require the research output material in the context of the research projects, researchers, organizations involved, facilities and equipment, funding, etc.

CERIF-CRIS
CRIS (Current Research Information Systems) have been developed over the last 40 years. Currently an EU Recommendation to member states, CERIF (Common European Research Information Format), is being adopted quite widely, and it allows interoperation. A CRIS typically has information on projects, persons, organisational units, funding programmes, research outputs (products, patents, and publications), facilities and equipment, and events. The novelty of CERIF is its formal data structure, its use of linking relations to allow n:m relationships with role and temporal duration, its use of multiple character sets, and provision of multilinguality.
Consider the following case illustrated in Figure 1. In fact, the link tables include, as well as role, the temporal information concerning start and end date-time. In this example it may be that when A authored X she was no longer a member of M. This relatively simple example illustrates the power of CERIF as a data model. CERIF is maintained by the not-for-profit organisation euroCRIS (www.eurocris.org), from which details are available. Commercial CRIS offerings are available from [uniCRIS], which is fully CERIF-compatible, [Atira], and [Avedas]. Many funding agencies and research institutions have some form of 'home-brew' CRIS; the majority are more-or-less CERIF-compatible. The provision of CRIS in a modern e-infrastructure environment has been discussed in [Je2004].
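The idea of a linking relation that carries a role and a temporal duration can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Link`, the identifier strings, and all dates are hypothetical and are not taken from the CERIF specification or from Figure 1. It shows how a person A can be a member of organisational unit M for one period and the author of publication X at a later date, so that the two facts do not have to overlap.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Link:
    """A typed, time-bounded relation between two base entities (hypothetical sketch)."""
    source: str                  # e.g. a person identifier
    target: str                  # e.g. a publication or organisational unit identifier
    role: str                    # semantic type of the relation
    start: date
    end: Optional[date] = None   # None means the relation is still open

    def active_on(self, day: date) -> bool:
        """Return True if the relation held on the given day."""
        return self.start <= day and (self.end is None or day <= self.end)


# Invented data: person A, publication X, organisational unit M.
links = [
    Link("person:A", "orgunit:M", "member", date(2001, 1, 1), date(2004, 12, 31)),
    Link("person:A", "publication:X", "author", date(2006, 3, 1), date(2006, 3, 1)),
]


def roles_on(all_links, source, day):
    """Return (role, target) pairs that hold for a source entity on a given day."""
    return [(link.role, link.target)
            for link in all_links
            if link.source == source and link.active_on(day)]


print(roles_on(links, "person:A", date(2003, 6, 1)))
print(roles_on(links, "person:A", date(2006, 3, 1)))
```

Because the same person may appear in any number of links, and the same publication or unit may be the target of any number of links, a list of such records expresses n:m relationships without changing the base entities.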

Repositories
Repositories store and provide access to the detailed information. It is usual to separate repositories of research publications from repositories of research datasets and software (e-Science or, better, e-Research repositories) because of their different access patterns and different metadata requirements. The e-Research repositories require much more detailed metadata to control utilisation of the software and datasets in addition to metadata to allow discovery of the resources. At present they tend to be specific to an individual organisation because of their novelty and the differing requirements on metadata imposed by different (commonly international) communities, e.g., in space science, atmospheric physics, materials science, particle physics, humanities, or social science. Publication repositories typically use some form of Dublin Core metadata.

Metadata
Digitally-created articles rely heavily on both the metadata record and the articles themselves being deposited. International metadata standards and protocols must be applied to repositories so that retrieval is consistent, with appropriate recall and relevance (precision), and so that harvesting (or homogeneous retrieval access) across repositories can take place. A model for formalising metadata [Je2000]
Data Science Journal, Volume 9, 24 July 2010
even ORE (Object Re-use and Exchange, http://www.openarchives.org/ore/) as packaged metadata. Examples of such an approach have been utilised in the DELOS project and NoE (http://www.delos.info/) and DRIVER followed by DRIVER II (http://www.driverrepository.eu/). However, it is the experience of the authors that this is insufficient to meet the requirement.
The problems concern metadata. The DC 15-element standard has neither a sufficiently formal syntax nor declared semantics for effective processing. Although DC has been extended (Qualified DC) to improve the situation and recent work (2008) has extended DC with domains and ranges, this does not overcome the problem. This may be characterised as the need for machine-understandability as well as machine-readability of the metadata. Furthermore, the research output should be understood in context, that is, the publication or research dataset related to the research projects, persons and their roles, organisational units, funding, research facilities and equipment, etc., involved in the research that generated the output.
One example should suffice to explain the difficulty using DC. The element contributor is defined at http://purl.org/dc/elements/1.1/contributor as:
Label: Contributor
Definition: An entity responsible for making contributions to the resource
Comment: Examples of a Contributor include a person, an organization, or a service. Typically, the name of a Contributor should be used to indicate the entity.
The example illustrates exactly the problem: the 'type' of the element contributor is not defined (although with namespaces, domains, and ranges, a limited set of acceptable lexical terms can be defined). The kinds of contributor in the example would likely have different legal status, rights, and responsibilities. There is no concept of the relationship between contributors (except that there is, confusingly, another element named creator, the definition of which has a comment exactly as for contributor). Since 1999 (early in the life of DC) these criticisms have been made by members of euroCRIS and the alternative approach based on formal syntax and defined semantics (described below) proposed.
In fact, the DRIVER consortium itself in a public paper http://www.driversupport.eu/documents/DRIVER_Review_of_Technical_Standards.pdf criticises DC as unsuitable, criticises other formats (such as MODS), and even mentions CRIS and CERIF. A later (and very recent) paper http://www.driversupport.eu/documents/D4%203_Tech_Watch.pdf is more specific on the need for CERIF-like metadata and lifts much information from the euroCRIS website in section 2.3. Despite this later paper recommending integration of CRIS and OAR, there is no technical proposal of how this should be done. Thus, unfortunately these papers misunderstand and misrepresent the CERIF form and function, basically not comprehending the concept of semantically typed n:m relationships between base entities (objects) such as persons and publications recorded using a formal syntax.
Similarly, the Knowledge Exchange http://www.knowledge-exchange.info/ (which has an intersection of members with DRIVER) has considered the relationship between CRIS and repositories and even initiated a project (2008-2010) on this. This project was instigated following a meeting where a euroCRIS member presented the case for integrating CRIS and repositories. Unfortunately, this project again misunderstands and misrepresents the form and function of CERIF-CRIS in the same way as DRIVER. Both DRIVER and Knowledge Exchange have claimed there is no method for interoperation between a CRIS and an OA repository; in fact, within the euroCRIS community there are several working examples, and the solution proposed below is based on the euroCRIS members' experience of this. In the later DRIVER paper http://www.driversupport.eu/documents/D4%203_Tech_Watch.pdf, some case studies indicate in overview that such a linkage is possible, thus contradicting the earlier statement.
Thus, the current metadata standard (DC) and protocol (OAI-PMH) for interoperability are insufficient for scalable, automated retrieval with appropriate relevance (precision) and recall. DC is machine-readable but not machine-understandable. One basic problem is that a formalised syntax and semantics (vocabulary) for each relevant DC element was not specified in 'simple DC' and has only partially been overcome by the use of namespaces in 'qualified DC' as illustrated above. A second problem concerns the element set tags 'contributor,' 'creator,' and 'publisher,' which are actually roles of a person or organisational unit and should be represented by a relationship (between the article and the person or organisational unit) where the role belongs to a namespace and is temporally limited. A third problem is the tag 'relation,' which is extremely general; the real world is much better modelled through typed relations with role and temporal validity. Other problems include the tag 'coverage,' which only recently has been separated into temporal and spatial aspects, yet these are fundamental retrieval criteria for much material. A formalised version of DC overcoming these limitations has been suggested [Je99b] and defined [AsJe2005] to form also part of the CERIF model, allowing tight integration with CRIS. Recently the DC community has recognised these problems and, with more recent work [DCAM2007] [DCRDF2007], is attempting to address them.
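The second problem can be made concrete with a minimal Python sketch. It assumes a 'simple DC' style record held as a dictionary of element names to lists of strings; the record contents, the function name `to_typed_links`, and the lookup table of entity types are all hypothetical and serve only to illustrate the argument. The sketch turns the role-bearing elements into explicit relationships and shows that the kind of entity behind each name cannot be recovered from the flat record alone.

```python
# A 'simple DC' style record: roles appear as element tags, values are bare strings.
dc_record = {
    "title": ["An example article"],
    "creator": ["Smith, A."],
    "contributor": ["Example University", "Jones, B."],
    "publisher": ["Example Press"],
}

# Elements that are really roles of a person or organisational unit.
ROLE_ELEMENTS = ("creator", "contributor", "publisher")


def to_typed_links(record, resource_id, entity_types):
    """Turn role-bearing elements into (name, entity_type, role, resource) tuples.

    entity_types maps a name to 'person' or 'orgunit'. A name missing from it
    is reported as 'unknown', which is all the flat record itself can tell us.
    """
    links = []
    for role in ROLE_ELEMENTS:
        for name in record.get(role, []):
            entity_type = entity_types.get(name, "unknown")
            links.append((name, entity_type, role, resource_id))
    return links


# External knowledge (here invented) is needed to type the entities.
known_types = {"Smith, A.": "person", "Example Press": "orgunit"}

for link in to_typed_links(dc_record, "article:1", known_types):
    print(link)
```

The entries typed 'unknown' are exactly the cases where a harvester has to guess whether a contributor is a person, an organisation, or a service. A fuller model would also attach start and end dates to each relationship.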
To ensure that research output material is available for future generations, curation and preservation issues must be addressed. There is current work to define metadata standards to achieve this [OAIS] but a major problem concerns maintaining the articles on current (i.e., usable) media.

Linking CRIS and institutional repositories
The linking together at an institution of a 'green' OA repository of articles, a CRIS (to provide contextual information), and an OA repository of research datasets and software [JeAs2006a] {Figure 2} ensures that an institution can manage its IP for benefit whether that benefit is in innovation and investment, in educational resources, in stimulation of future research, or in publicity. Furthermore, the formalised structure of the CRIS allows a reliable workflow to be engineered, which in turn encourages deposit of research outputs. Such a system is being implemented progressively at STFC Rutherford Appleton Laboratory where the CERIF-CRIS is named the Corporate Data Repository, the OA repository is ePubs, and the e-research repository is the e-Science repository.
Linking together these institutional CRIS systems (which have a formal structure and hence can be interoperated reliably and in a scalable way) [Je2005] provides a network of access to institutional OA repositories (of articles) or e-research repositories via the CERIF-CRIS gateways, enhancing and controlling the access using the CERIF-CRIS information as formalised, structured, and contextual metadata, which is more detailed than DC and suitable for intelligent (machine-understandable) interoperation {Figure 3}. Interoperation of CERIF-CRIS has been demonstrated, most recently for euroHORCS (European Heads of Research Councils) in October 2006. However, as yet, the whole architecture has not been demonstrated.
The key point is that the metadata for a publication (or dataset) is stored in the CRIS in formal syntax and with defined semantics, and the repository just acts as a deposit space. In this way management information and analysis can be done using the (formal) CRIS while retrieval of the individual publication (or dataset) is done through the repository sanctioned by the CRIS. The repository may or may not also store metadata (usually in DC, for OAI-PMH interoperation and OAISTER retrieval), but this metadata is best generated from the CRIS. This is because the CRIS in a research institution is intimately linked to the researcher workbench and organisational workflow, and much of the metadata required for a publication (author, institution, rights) is already stored in the CRIS and does not need to be re-entered by the author.
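Generating the repository's DC metadata from the CRIS, rather than asking the author to re-enter it, might look like the following Python sketch. The in-memory `cris` dictionary, the role-to-element mapping, and the wrapper element named `record` are assumptions made for illustration; they are not a CERIF schema, nor the container element used by OAI-PMH. Only the DC element namespace http://purl.org/dc/elements/1.1/ is taken from the text above.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical CRIS content: base entities plus typed links between them.
cris = {
    "publications": {"pub:X": {"title": "An example article", "rights": "CC BY"}},
    "persons": {"person:A": "Smith, A."},
    "orgunits": {"org:M": "Example Laboratory"},
    "links": [
        ("person:A", "pub:X", "author"),
        ("org:M", "pub:X", "publisher"),
    ],
}

# Assumed mapping from CRIS link roles to DC element names.
ROLE_TO_DC = {"author": "creator", "publisher": "publisher"}


def dc_from_cris(cris_data, pub_id):
    """Build a flat DC record for one publication from the typed CRIS links."""
    pub = cris_data["publications"][pub_id]
    names = {**cris_data["persons"], **cris_data["orgunits"]}
    root = ET.Element("record")  # placeholder wrapper, not an OAI-PMH element
    ET.SubElement(root, f"{{{DC_NS}}}title").text = pub["title"]
    for entity_id, target, role in cris_data["links"]:
        if target == pub_id and role in ROLE_TO_DC:
            element = ET.SubElement(root, f"{{{DC_NS}}}{ROLE_TO_DC[role]}")
            element.text = names[entity_id]
    ET.SubElement(root, f"{{{DC_NS}}}rights").text = pub["rights"]
    return root


print(ET.tostring(dc_from_cris(cris, "pub:X"), encoding="unicode"))
```

The conversion is one-way by design: the typed links in the CRIS can always be flattened into DC elements, whereas the entity types and roles cannot be reliably recovered from the DC record afterwards.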

FUTURE
Looking to the future speculatively, it is possible to imagine 'green' OA repositories becoming commonplace and used heavily. At that point, some argue, one could change the business model so that an author deposits in an open access 'green' repository but, instead of submitting in parallel to a journal or conference peer-review process, the peer review is done either by:
a) a learned society managing a 'college' of experts and the reviewing process, for a fee paid by the institution of the author or the author; or
b) allowing annotation by any reader (with digital signature to ensure identification / authentication);
in both cases the reviewers are alerted by 'push technology' that a new article matching their interest profile has been deposited.
The former peer-review mechanism would maintain learned societies in business, would still cost the institution of the author or the author, but would probably be less expensive than publisher subscriptions or 'gold' (author or author institution pays) open access. The latter is much more adventurous and in the spirit of the internet; in a charming way it somehow recaptures the scholarly process of two centuries ago (initial draft, open discussion, revision, and publication) in a modern world context. It is this possible future that is feared by commercial publishers.