Recently I read a paper about the automatic detection of suicide-related tweets: "A machine learning approach predicts future risk to suicidal ideation from social media data."
The authors of this study, Arunima Roy and colleagues, trained neural network models to detect suicidal thoughts and reported suicide attempts in tweets.
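The authors' code is, of course, exactly what's being withheld, so to be clear: the sketch below is not it. Purely to give readers a sense of what "code for algorithm generation" means in this context, here is a minimal, generic text-classification pipeline in scikit-learn, a TF-IDF bag-of-words feeding a small feed-forward neural network. Every text, label and hyperparameter below is an invented placeholder.

```python
# Illustrative only: a generic text-classification sketch, NOT the
# authors' unpublished model. All data and settings are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Toy training data; a real study would use thousands of annotated tweets.
texts = [
    "I had a great day today",
    "looking forward to the weekend",
    "everything feels hopeless lately",
    "I can't see the point of going on",
]
labels = [0, 0, 1, 1]  # 0 = neutral, 1 = flagged as at-risk (hypothetical labels)

# Bag-of-words features feeding a small feed-forward neural network.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
)
model.fit(texts, labels)

print(model.predict(["feeling really down and alone"]))
```

The real models were presumably far more sophisticated, and trained on corpora orders of magnitude larger than this toy example, but the general shape of such a tool is much the same.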
When I reached the end of the article, I noticed that the authors state that the code used to carry out the study is too ethically sensitive to make public:
Code availability: Due to the sensitive and potentially stigmatizing nature of this tool, code used for algorithm generation or implementation on individual Twitter profiles will not be made publicly available.
Given that the paper describes an algorithm that could scan Twitter and identify suicidal people, it’s not hard to imagine ways in which it could be misused.
In this post I want to examine this kind of “ethical non-sharing” of code. This paper is just one example of a broader phenomenon — I’m not trying to single out these authors.
It’s generally accepted that researchers should share their code and other materials where possible, because this helps readers and fellow scientists to understand, evaluate and build on the work.
I think that sharing code is almost always the right thing to do, but there are cases where not sharing could be justified, and a Twitter suicide detector, given its clear potential for abuse, is surely one of them.
To me, the key question is this: Who gets to decide whether code should be published?
At the moment, as far as I can see, the authors make that call themselves, although the journal editors implicitly endorse it by publishing the paper. But this is an unusual situation: in other areas of science, scientists don’t serve as their own ethicists.
There are ethical review committees responsible for approving pretty much all research that involves experimenting on, or gathering data from, humans (and many animals). But the Twitter suicide study didn’t need approval, because it involved the analysis of an existing dataset (from Twitter), and this kind of work is usually exempt from ethical oversight.
It seems to me that questions about the ethics of code should be within the remit of an ethical review committee. Leaving the decision up to authors opens the door to conflicts of interest. For instance, sometimes researchers have plans to monetize their code, in which case they might be tempted to use ethics as a pretext for not sharing it, when they are really motivated by financial considerations.
With the best will in the world, authors might simply fail to consider possible misuses of their own code. A scientist working on a project can be so focused on the possible good it could do that they lose sight of the potential for harm.
Overall, while it would mean more paperwork, I’d be much more comfortable having decisions about software ethics made by an independent committee.