This page compiles guidelines and good practices for open research data sharing.
Why share my research data and materials?
Sharing your research data and research materials promotes Open Science (OS) principles, which universities and research institutes widely endorse (UNESCO 2021). Open Science drives innovation for and from the scientific community, enhances quality of research, and ensures that scientific efforts lead to real-world benefits, beyond academia.
Sharing your own research data and materials has multiple benefits. In general, sharing increases collaboration and impact of your research, and promotes your scientific career and networking opportunities for the future.
Key arguments for sharing of research data and materials are:
- Sharing promotes transparency, reproducibility, and verification of scientific results. When other scientists can see and access your data sets or source code this promotes honesty and openness in research, and supports validation and reuse of your research methods and results.
- Sharing increases citation, recognition and collaboration. Shared source code and data usually lead to more citations of the original work. Researchers who share their research materials are often seen as leaders in their field, enhancing their reputation and contributing to the scientific community. Sharing makes you a prominent collaborator to other researchers and research groups and strengthens your scientific networks.
- Sharing increases the value and impact of your research. Shared source code and data can be utilised by industries, non-profits, and government agencies, leading to practical applications and innovations that benefit society. When other scientists and industries can use shared source code and data, this enhances discoveries of new methods and solutions, and helps to avoid duplication of efforts.
Open Science practices have become an essential part of scientific publication processes, networking activities and dissemination. Many conferences, journals and funders require that your research data and source code be shared openly. For example, the Reproducibility Committee of the Association of Geographic Information Laboratories in Europe (AGILE) checks, as part of the peer review process, all submissions to the AGILE conferences from the reproducibility perspective (AGILE 2024).
How to make metadata for geospatial data?
Metadata describes your data – how, when, where and why it was created. When metadata is well described, it helps others to understand your data, and reuse it. Metadata guides often focus on data publication and what kind of metadata helps in such cases. For research data the requirements for metadata can be different.
Metadata describes the nature of your data, including how the data was produced. It is therefore best to start creating metadata at the very beginning of your research process and to compile it alongside data creation. This strengthens the reliability of the metadata, since the data lineage is still fresh in your mind and essential decisions and data evolution phases can be recorded as they happen.
Metadata is an instrumental element of data sharing and thus it is recommended to use a metadata standard suitable for your data type. For geospatial data, the most common metadata standard is the ISO 19115-1:2014. The European Union INSPIRE metadata guideline (Technical Guidance for the implementation of INSPIRE dataset and service metadata based on ISO/TS 19139:2007) could also be used. Often data sharing platforms have instructions about the metadata. Therefore, if you already know the data sharing platform you will use, check the metadata instructions and follow them.
If you are unfamiliar with how spatial metadata is structured, we suggest you start with the INSPIRE metadata documentation, because it is openly available. The INSPIRE document is also a metadata guideline, and therefore contains more explanations than the ISO 19115 standard document, which focuses on definitions.
You should consider including the following basic metadata elements in your metadata description. Not all of these are required for every dataset, and you should always think carefully which are useful, and which are not.
Metadata element | Typical contents | Notes
---|---|---
Dataset identification information | – Dataset title – Abstract – Keywords – Data categories | This information gives the user basic knowledge of what your dataset is about
Spatial characteristics | – Coordinate reference system – Projection (and datum) – Spatial and temporal extent – Scale/resolution | These elements describe the basic spatial characteristics of your data. Remember that many of these elements are typically embedded in the dataset itself, and take this into account when writing the metadata record
Data model | – Data geometry information – Attribute information – NoData value for rasters – Language used in the dataset | This information helps the user to interpret your data more easily
Data quality | – Known errors in the dataset – Positional precision and accuracy – Thematic and temporal quality – Completeness – Consistency | Several standards are available for measuring and characterising data quality, including the ISO 19157-1:2023 data quality standard for geographic data. A good starting point for familiarising yourself with data quality might be the data quality criteria indicators by Statistics Finland
Data lineage and processing information | – Source datasets and their versions – Data gathering methods for datasets created in the project – Processing methods and tools used to create the dataset | This information helps others to assess the suitability of your data for their own needs
Distribution information | – Data format (e.g. GeoPackage) – Data size – Instructions for accessing the data | This allows users to see where the data is available and how it can be taken into use
Citation and license information | – Author or organization, publisher, publication date, version – Data licence and version – Legal restrictions on use | We recommend publishing data with an open and permissive licence in order to make it useful for others
Maintenance information | – Maintenance or update frequency – Information about what has been updated and when – Date of last revision | This information is necessary only if the dataset is going to be maintained. A lot of research data is left as-is at the end of a research project. If the data is not going to be maintained after the project ends, this should also be mentioned
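As a minimal sketch, the elements above can also be collected into a machine-readable record that travels with the data. The field names below are simplified for illustration and are not taken from the ISO 19115 schema; map them to the corresponding ISO 19115 or INSPIRE elements when producing a formal metadata document.

```python
import json

# An illustrative metadata record covering the core elements from the table
# above. All values are placeholders for an example dataset.
metadata = {
    "identification": {
        "title": "Example land cover dataset",
        "abstract": "Land cover classification for the study area, 2020-2023.",
        "keywords": ["land cover", "remote sensing"],
    },
    "spatial": {
        "crs": "EPSG:3067",  # ETRS-TM35FIN
        "extent": [200000, 6600000, 740000, 7800000],  # xmin, ymin, xmax, ymax
        "resolution_m": 10,
    },
    "quality": {"completeness": "No known gaps", "known_errors": "None reported"},
    "lineage": "Classified from Sentinel-2 imagery; processing scripts in repository.",
    "distribution": {"format": "GeoTIFF", "size_gb": 1.2},
    "license": "CC BY 4.0",
    "maintenance": "Not maintained after project end.",
}

# Serialise the record next to the dataset so the description travels with it.
record = json.dumps(metadata, indent=2)
```

Keeping such a record under version control from the start of the project makes it easy to update as the data evolves.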
Which repositories for data management and storage are available and which one(s) to pick?
You may need to use services and platforms to manage and store your data during your research process. When you work with other researchers, it is critical that the platform provides seamless collaboration possibilities. Below are listed a few cloud-based solutions from CSC and other organisations. Cloud storage services offer scalable, secure, and flexible environments to store and manage data, making them ideal for visualisation applications that require real-time access.
CSC storage options during project lifetime, free of charge for Finnish open academic projects:
Service | Purpose | Size limit per project | Tools for access | Back-up |
---|---|---|---|---|
Allas | General storage | 200 TB | Tools with S3 or SWIFT | No
Supercomputers and Pouta clouds | Storage for computing projects | 20-50 TB | Tools for moving files to Linux | No
SD Connect | Sensitive data | 10 TB | Custom web interface and command line | No
Pukki | Database, PostGIS | 50 GB | PostgreSQL clients | Yes
In addition to Geoportti RI services, there are various commercial services that can be used to store data:
Why and how should I license my research data?
Useful resource: https://book.the-turing-way.org/reproducible-research/licensing
Releasing research data always requires licensing. With licences, data owners can determine how the shared data can be used further. Unlicensed data cannot legally be reused, and thus sharing your data and licensing it goes hand-in-hand.
When you share your data according to Open Science principles, you should use Open licences. However, keep the slogan “as open as possible, as closed as necessary” in mind when working with e.g. personal or sensitive data.
- Open licences are compliant with Open Access practices and FAIR data principles, and allow others to use and build upon your work.
Your research institute might already list recommended, internationally standardised, machine-readable open licences for research data sharing. In many cases, these are the Creative Commons licences.
- Creative Commons provides a handy Licence Chooser tool that helps you determine which version of the CC licences suits your case best.
Note! Not all Creative Commons licences are compliant with the Open Access definition. For example, if you choose to prohibit commercial use of your data (CC NonCommercial, CC-NC), your data sharing is no longer Open Access compliant. Your research institute, data sharing service, funder, or journal might require using Open Access compliant licences.
Note! The type of your data product may require using other licences. For example, if you alter or build upon OpenStreetMap data and re-share it, you must release it under the Open Data Commons Open Database Licence (ODbL).
How to share code with others?
In order to preserve the steps involved in processing and analysing your research data (geospatial or not), you may produce some code along the way. Let’s look at some good practices when working with and sharing research code:
Version control – tools like git allow you to track changes throughout your development journey and to “go back in time” to a working version in case of unforeseen problems or mistakes. They support collaboration, both with your future self and with others, by keeping development work separate from the main codebase through branching and merging, allowing you to manage your codebase effectively. They also support conflict resolution and documentation of changes. Moving your codebase to a platform like GitHub, GitLab or Bitbucket additionally enables sharing your work and collaborating with others.
Code documentation is essential at the latest when making your code public. It enables others to understand what your code is for and how it can be used. It can also help you keep track of where you are and what you have already done. You can start small, by using descriptive functions and variable names and adding comments to your code. README files can be sufficient for small projects, if all important information is included (see e.g. the CodeRefinery lesson on documentation for a checklist: https://coderefinery.github.io/documentation/wishlist/#creating-a-checklist). If a README is not enough, you may want to check out static site generators like Sphinx (https://www.sphinx-doc.org/en/master/) to create nice documentation web pages by writing simple markdown files.
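Starting small, as suggested above, can look like the following sketch: a hypothetical helper with a descriptive name, a docstring explaining parameters and return value, and an inline comment. The function itself is invented purely for illustration.

```python
def dataset_label(dataset_name, epsg_code):
    """Return a human-readable label combining a dataset name and its CRS.

    Parameters
    ----------
    dataset_name : str
        Name of the dataset, e.g. "land_cover_2023".
    epsg_code : int
        EPSG code of the coordinate reference system, e.g. 3067.

    Returns
    -------
    str
        A label such as "land_cover_2023 (EPSG:3067)".
    """
    # f-string keeps the formatting rule in one obvious place
    return f"{dataset_name} (EPSG:{epsg_code})"
```

Docstrings written in this style can later be harvested automatically by tools like Sphinx, so the effort pays off twice.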
Licensing is equally important for sharing software and code as it is for sharing data: code without a license cannot legally be reused. Since code differs from data, different licenses apply. The JoinUp licensing assistant by the European Commission (https://interoperable-europe.ec.europa.eu/collection/eupl/solution/joinup-licensing-assistant/jla-find-and-compare-software-licenses) can help you choose the most suitable license for your needs. The GNU GPL licenses are commonly used copyleft licenses: code under such a license requires all derivatives to be released under a similar license, preventing, for example, its use in a closed-source project. A common example of a permissive license is the MIT license.
Also remember to make it as easy as possible for others to cite your work, by for example adding a CITATION.cff file to your repository: https://citation-file-format.github.io/.
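A minimal CITATION.cff can be created in a few lines; the sketch below writes one with placeholder author and version values that you would replace with your own project details (the keys follow the Citation File Format 1.2.0).

```python
from pathlib import Path

# Minimal CITATION.cff content (Citation File Format 1.2.0).
# All values below are placeholders -- replace with your project's details.
citation_cff = """\
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Example analysis toolkit"
authors:
  - family-names: "Researcher"
    given-names: "Example"
version: "1.0.0"
date-released: "2024-01-15"
"""

# Place the file in the root of your repository so platforms like GitHub
# and Zenodo can pick it up automatically.
path = Path("CITATION.cff")
path.write_text(citation_cff)
```

Validators such as the one linked from citation-file-format.github.io can check the file before you commit it.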
Once your code is done, you can publish it by itself as a software publication (example journals: JOSS (https://joss.theoj.org/), SoftwareX (https://www.sciencedirect.com/journal/softwarex)) or alongside your research publication (some journals may require you to publish your code). This is best done by packaging your code and publishing it in a language-specific registry like conda-forge, PyPI or CRAN. You can also get a persistent identifier for your code by uploading it to a platform like Zenodo.
Useful resource: CodeRefinery lessons on tools and techniques for reproducible research: https://coderefinery.org/lessons/#lessons-that-we-teach-in-our-tools-workshops
Where to share my data?
We have compiled a table of platforms offering research data sharing services. Take a look at the table and different platforms’ characteristics when selecting the most suitable platform for you.
Repository | Discipline | File size | PID | Geospatial data visualisation | APIs | Who can publish? | Self-service publishing |
---|---|---|---|---|---|---|---|
Paituli | Spatial data only | Max ~ 2-3 TB | URN via Fairdata service | Yes | OGC APIs, STAC, metadata search via Fairdata | Finnish universities and research institutes only | No |
Fairdata (IDA, Qvain, Etsin) | Multi-disciplinary | Max ~ tens of TB | DOI or URN | No | Metadata search and updating API | Finnish universities and research institutes only | Yes |
Zenodo | Multi-disciplinary | Max 50 GB | DOI | No | Metadata search API, data upload and download APIs | Anybody | Yes |
Pangaea | Environmental sciences | | DOI | No | Metadata search and tabular data download APIs | Anybody | No
SD Submit (In development) | Multi-disciplinary, sensitive data | | | No | | Finnish universities and research institutes only | Yes
Avoindata | Multi-disciplinary | 5GB (?) | No | No | Metadata search and data download APIs | Anybody, mainly Finnish governmental organizations and municipalities | Yes |
UTU Geospatial data service | Spatial data only | | No | Yes | OGC APIs | UTU | Yes
SYKE open data | Mainly spatial data | | No | No | OGC APIs, metadata search API | SYKE | No
Hugging Face | General AI/ML data | | No | No | Metadata search, data download API | Anybody | Yes
Kaggle | General AI/ML data | | Yes | No | Metadata search, data upload and download API | Anybody | Yes
OpenML | General AI/ML data | | No | No | Metadata search, data upload and download API | Anybody | Yes
How to prepare your data for publishing?
- Technical recommendations:
- Format: Cloud-Optimized GeoTIFF for raster data, GeoPackage for vector data, LAZ for laser scanning data, NetCDF or Zarr for multi-dimensional data.
- Coordinate system: ETRS-TM35FIN (EPSG:3067) for data about Finland, WGS 84 (EPSG:4326) for global data.
- Divide the data into parts (map sheets) if the file size exceeds ~5 GB.
- Make sure that coordinate system information is included in the files and that, at least for raster data, NoData values are defined.
- Dataset organisation: organise your data in a meaningful manner so that it is easy for others to use.
Check the ready-made materials from The Turing Way.
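The map-sheet recommendation above can be sketched as a simple tiling computation; the extent and tile size below are illustrative, and the actual clipping of the raster would be done with a GIS tool such as GDAL.

```python
def tile_extents(xmin, ymin, xmax, ymax, tile_size):
    """Split a bounding box into square tiles of tile_size map units.

    Returns a list of (xmin, ymin, xmax, ymax) tuples; tiles at the
    right and top edges are clipped to the full extent.
    """
    tiles = []
    y = ymin
    while y < ymax:
        x = xmin
        while x < xmax:
            tiles.append((x, y,
                          min(x + tile_size, xmax),
                          min(y + tile_size, ymax)))
            x += tile_size
        y += tile_size
    return tiles

# Example: a 100 km x 50 km extent in EPSG:3067, split into 50 km map sheets.
sheets = tile_extents(200000, 6600000, 300000, 6650000, 50000)
print(len(sheets))  # 2
```

Naming each output file after its sheet extent (or an official map sheet grid) keeps the pieces easy to locate and recombine.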
Sensitive data. When your data contains personal or sensitive information about research subjects, open sharing as such is not an option. The same goes for sensitive data on the environment and infrastructure, such as IUCN red-listed species and geomorphological features of the seabed in Finland. You might need to anonymise the data before sharing, and obtain GDPR-compliant consent from the people the data covers. Other measures are to share simulations of the data, or to use closed licences. If sharing your data in any way is not an option, you may still publish the metadata of your sensitive research data.
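One simple technique sometimes used for sensitive locations, such as red-listed species observations, is random perturbation of coordinates. The sketch below is illustrative only: the offset magnitude is an arbitrary assumption, and real anonymisation always requires a case-specific disclosure risk assessment.

```python
import random

def perturb_points(points, max_offset_m, seed=None):
    """Displace each (x, y) point by a random offset of up to max_offset_m
    in each axis direction. Projected coordinates in metres are assumed
    (e.g. EPSG:3067).

    This is a simple illustration of location perturbation, not a
    complete anonymisation method.
    """
    rng = random.Random(seed)  # seeded for reproducibility in tests
    return [
        (x + rng.uniform(-max_offset_m, max_offset_m),
         y + rng.uniform(-max_offset_m, max_offset_m))
        for x, y in points
    ]

# Hypothetical observation points in EPSG:3067, blurred by up to 1 km.
observations = [(385000.0, 6672000.0), (386500.0, 6673200.0)]
blurred = perturb_points(observations, max_offset_m=1000, seed=42)
```

Note that perturbation alone may not prevent re-identification when records can be linked with other datasets, which is why a risk assessment is still needed.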
Big data. When you share massive datasets, you might run into file size restrictions on data sharing platforms, or additional costs for hosting big data. Think also about the usability of shared big data: should additional code for downloading or using the data on cloud servers be added to the metadata? Alternatives are to share subsets or samples of the full data on public platforms and provide an alternative method to access the full data from other servers.
How to accelerate data sharing impacts?
Creating data visualisations and other reader-friendly products like graphs, interactive maps, dashboards, and data stories is a great way to accelerate further usage and impact of your shared and published research materials. Data visualisations and stories help others to comprehend what the data is about, understand the phenomena it represents, get insights on interesting aspects the data entails, and get inspired to read more about your work. Visualisations and stories bring your work closer to the general public, and thus increase your audience.
Below are a few examples of tools, techniques and additional resources to use for increasing the usability and impact of your research outputs.
General resources:
Geospatial resources:
Interactive applications / dashboards:
- Python: Streamlit, Dash/Plotly, Pydeck, Solara
- R: Shiny
Story Maps:
How to make a data and software publication?
Consider writing a paper about your research data and publishing it in a peer-reviewed journal dedicated to data publications. Data papers are a great way to increase professional visibility, especially for early-career scientists.
Examples of journals dedicated to data and software publications:
- Nature Scientific Data: https://www.nature.com/sdata/
- Data in Brief: https://www.sciencedirect.com/journal/data-in-brief
- Journal of Open Source Software: https://joss.theoj.org/
- Environment and Planning B: Urban Data and Code: https://journals.sagepub.com/doi/full/10.1177/23998083211059670
Examples of data publications:
- Helsinki travel matrix
- Green?
- University of Helsinki, Spectre
- LiPheStream: https://www.nature.com/articles/s41597-024-04143-w