Dr. Heidi Seibold

Ask-Me-Anything: Open & Reproducible Data Science

Published about 1 year ago • 8 min read

In the spirit of Love Data Week 2023 (#lovedata23) I asked people to send in questions they have on Open and Reproducible Data Science (AMA). Here are my answers.


Format:

Question?

Answer.


How to figure out where to start with reproducible research and what are the ways to get good data for reproducible research?

Very good questions.

Where to start? I recommend starting with the things that are easy and fun for you. Reproducible research requires continued learning, but you don't have to be perfect right away. Here are some ideas for where you could start (the order does not matter).

Ways to get good data? I am not 100% sure if I understand the question correctly. I interpret it this way: "How can I ensure my data is of high quality so that my research is of high quality, too?". There are two paths:

  1. If you are involved in data collection: make sure to plan well from the beginning (work with a data management plan) and ensure you work with someone (e.g. a statistician) who knows what data needs to be collected in order to answer the question you have.
  2. If you are not involved in data collection: ensure that the data was collected in a way that can actually answer your question. You often need to be able to ask the original data collectors questions about the collection process and to draw on their content expertise.

What are some general approaches or tenets (in case of software engineering) one should keep in mind while approaching open data science?

That's a big question. Here are a few thoughts:

  • Doing good research is a process. You don't have to be perfect, but if you try to do good research, you will get better and better over time.
  • Good organisation is more important than using the fanciest tool out there. Version control is the tool that has personally helped me the most.
  • Learning good coding practices means becoming better at research.
  • You will never feel like your code is "good enough for making it openly available". I still don't feel like I write good code, but I got used to being vulnerable with my code 😅 and open source is the way to go.

Why is it so hard to produce reproducible research? Is it the nature of the data or the nature of the experiments? I have seen plenty of papers claiming astounding results, but it is often hard to verify their claims even when they open-source their code. In what way can researchers and software engineers verify whether their research is reproducible?

I think it is the nature of research itself. Research is hard and thus doing research well is hard.

If you want to have your research checked for reproducibility, there are several ways:


As it may be increasingly required by journals or funders, some scientists may perceive code and data sharing as an additional administrative burden and be tempted to cut corners by uploading "some files" without any real value. Several studies have demonstrated that even if data and code are shared, results often still cannot be reproduced (for a recent example from psychology, see e.g. Crüwell et al., doi: 10.1177/09567976221140828). How worried should we be about code and data sharing that provides no value and merely "ticks a box"? More practically, which mechanisms ensure that shared resources are actually relevant and useful, and who checks or enforces these procedures?

I would already be happy if researchers were at the point where they share their code and data. That is not the case for most research today. To those who do, I would already say "congrats and thanks", even if the work is not fully or easily reproducible. I agree, though, that even when code and data are available, that does not automatically make the project reproducible. I used to do reproducibility checks for the Journal of Statistical Software, and that really is work that is extremely hard to automate (I tried 😅). So in some way we need people who do these checks, and likely they will have to be paid by the journals (IMHO). Initiatives like CODECHECK and ReproHack are in that space as well, and that is why I love their work so much. So please check them out and contribute if you can 🙌


How can we ensure computational reproducibility of large research projects that may involve terabytes of data and thousands of lines of code across various programming languages? What are the tools and infrastructures that can be used?

The more complex a project becomes, the more we need to think about project management. That includes good organization (good folder organization, file naming, READMEs, ...) as well as good communication within the team. In projects with various programming languages I like using tools like Make, which help keep track of the analysis pipeline and work both on a cluster and on your local machine.
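To illustrate the kind of multi-language pipeline Make can coordinate, here is a minimal sketch. All file and script names (clean_data.py, analysis.R, report.Rmd) are hypothetical placeholders, not part of any real project:

```make
# Minimal Makefile sketch for a two-language analysis pipeline.
# Each rule lists its inputs, so `make` only reruns steps whose
# dependencies have changed.

all: report.html

# Step 1: clean the raw data with a Python script
data/clean.csv: data/raw.csv clean_data.py
	python clean_data.py data/raw.csv data/clean.csv

# Step 2: run the statistical analysis in R
results/model.rds: data/clean.csv analysis.R
	Rscript analysis.R

# Step 3: render the report from the results
report.html: results/model.rds report.Rmd
	Rscript -e 'rmarkdown::render("report.Rmd")'

.PHONY: all
```

Running `make` rebuilds only the out-of-date steps, and the same Makefile works unchanged on a laptop or a cluster node, which is what makes it useful for keeping mixed-language pipelines reproducible.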


When we talk about reproducible data and data provenance, are there already models for the dependencies between data? For example, if we consider data provenance, a simplistic view is that you have a set of input data and output data. But when you consider a large amount of data, the dependency graph can become quite complicated, and there are a lot of questions that do not have obvious answers. For example, suppose we use a software (a fictional software AlphaDe, say) to do some benchmarking and write a report, and then use that report to improve the performance of AlphaDe. The benchmark results depend on AlphaDe, the report depends on the benchmark results, and the new changes in AlphaDe depend on the report. Is there then a circular dependency of AlphaDe on itself? Or is it more useful to attach a version number to everything here? I think answering such questions would be extremely helpful, and I was wondering whether there is already some research in this direction.
Do you know any data management software which has a well-defined data and dependency model as mentioned above?

I agree that with increasing complexity of dependencies, things become more and more complicated. So far I have managed well by doing all data processing in code (in my case R), combined with good folder organization and documentation, literate programming (R Markdown), and automation (Make). If you need something more advanced than this, I recommend checking out DataLad. I haven't used it yet, but it sounds very promising with regard to your question.


Would you mind sharing your plan for rebooting academia? My understanding is that you have what it takes to excel in academia but you decided to leave it because you realized that the overlap between academic research and scientific research is diminishing and you care more about actually discovering something useful than securing a grant to write a thrilling story about how the next grant will save our species. You may be bothered by the fact that billions of dollars dedicated to "scientific progress" is being wasted on "academic promotions". I have no doubt that you can easily secure some funds to start a movement and establish yourself as a respected advocate of everything open, but I cannot see how that can fix academia. My assumptions about your perspectives may be wrong, but that's my take as an outsider. I guess I'm just looking for a savior to come down and save us all from academia and I just wanna believe that you are that savior! :D

Wow, that is both a very kind compliment and a lot of pressure on my tiny shoulders 😅.

My personal goal is to contribute to research quality in the ways that I can. I started out by trying to make my own research as good as possible, then by helping others do the same and by advocating for a better system. For a while I even tried starting a new university (that project was shut down).

In the bigger picture I see myself as a small puzzle piece in the endeavor of improving research quality. No single person can fix academia or fix research. But I do believe that there are enough of us out there who can work together and change a lot. I already see a lot of changes, e.g. funders who require open access, publishers who promote open data, and research institutions that change the way they assess research(er) quality.


Does the anonymous discussion system that you have work? Sometimes you post links for people to share their thoughts or ask questions anonymously in a document. But I assume there are always bad actors and trolls who would delete other people's text or write inappropriate stuff. I don't understand how you can pull that off without being bothered by them!

So far, I have been really lucky. I rarely get attacked by trolls, nor do they post or delete things in, e.g., the pads or Miro boards that I open for everyone to contribute to anonymously. I am really thankful for this and believe that the Open Science community deserves a good and welcoming reputation (most of the time 😉). I hope it will continue this way.

Side note: I do sometimes get aggravated by replies that I receive on social media. Most of them are due to mansplaining, and I try to either ignore them or make it really clear that I am an expert in my field 💪


A more personal question: How is your job experience as an Open Science Consultant so far? Do you think you have more impact on open science in academia this way? And do you think it's a viable career path for others to follow on?

I am really glad that I do what I do now. It was a wiggly path to where I am now and it was hard. But now I feel good about it.

I do think that being able to focus on Open Science helps me have a bigger impact. Not having to worry about writing papers and being evaluated based on some standardized output allows me to be creative and to focus on the things that matter (to me).

I would have stayed in academia if it had been made easier/possible for me, but it wasn't. Academic career paths are not yet flexible enough for the work I want to do.

Is it a viable career path? Well, for me it is, and it seems like it is for other Open Science freelancers too. But it really depends on what you want. I recommend talking about it to someone who just listens and asks the right questions (instead of giving recommendations). That can be a friend, a mentor, or a coach (e.g. SkillsWeaving). In the end you need to make the right decision for yourself, not the decision that makes others feel good.


What can we do to help you in your mission? Thank you!

🫶🫶🫶

Thank you for asking!

These three things help the mission of improving research quality:

  1. Stay curious and stand with your values. If you got into research to improve the world a little bit, do that! Don't let the system change you into a career-optimization-machine.
  2. Talk Open Science with your peers. Share useful learnings, posts, readings, .... Also share your burdens and worries.
  3. Get involved. Join an Open Science initiative, Open (Source) project, Open Access journal, ...

And if you want to help me personally:

  • Share my work. Forward my newsletters, share what you like about my trainings on social media, recommend my YouTube channel to your friends.
  • Get in touch. Let's see how we can collaborate and work towards better research together.

Thanks for all your wonderful questions! I hope my answers are useful 🤓

All the best,

Heidi


P.S. If this was useful to you, please consider supporting my work by leaving a tip.


Heidi Seibold, MUCBOOK Clubhouse, Elsenheimerstr. 48, Munich, 81375
Unsubscribe · Preferences · My newsletters are licensed under CC-BY 4.0
