Ethics in data mining

Image by Tumisu from Pixabay

https://link.springer.com/article/10.1007/s10664-021-10057-7

A lot of software engineering research studies use open source data and mine software repositories. It’s a common practice since it allows us to test our hypotheses before asking for precious resources from our collaborating companies. By mining open source data we can also learn whether our study makes sense; we can see it as a pilot study of sorts.

Mining software repositories has become a popular activity since we got access to platforms like GitHub. There are even guidelines for assessing this kind of study, e.g., https://sigsoft.org/EmpiricalStandards/docs/ and there are rules about what we can do with open source data. These can take the form of a license, a law (like the GDPR or the CCPA), or a requirement to obtain approval from an ethics board. However, there is also common sense: not everything that is legal is appropriate or ethical. We always need to ensure that no individual comes to harm as a result of our actions.

In the article that I want to bring up today, the authors discuss ethical frameworks for software engineering studies based on open source repositories. We need to follow four principles:

  1. Respect for persons, which stresses the need for approval and consent.
  2. Beneficence, which means that we need to minimize harm and maximize benefit.
  3. Justice, which means that we need to consider each individual equally.
  4. Respect for law and public interest, which entails conducting due diligence on which data we can use and in which way.

The most interesting part of this article is the analysis of different cases of mining software repositories, for example, analyzing the code, reviews, commit messages and other types of data in the repositories.

I recommend this article to anyone who is considering mining software repositories.

Author: Miroslaw Staron

I’m a professor in Software Engineering at the IT faculty. I usually blog about articles that I find interesting and my own reflections on the development of Software Engineering, AI, computer science and automotive software.