s12e09: Actually, it's about the ethics in dataset collection, attribution and licensing in machine learning

                        May 24, 2022

                    0.0 Context Setting
It’s Tuesday, 24 May 2022 and a grey, overcast day in Portland, Oregon.
Reading: No One Is Talking About This Book by Patricia Lockwood which, fittingly, I can only respond to with a sort of this. 

1.0 Some Things That Caught My Attention
Some shorts today:
Morally Motivated Networked Harassment as Normative Reinforcement is Alice E. Marwick’s paper from June 3, 2021 that puts forward a framework to explain “networked harassment on social media”, tying behavior to moral outrage. 
Caught my attention because: it’s a paper about online community behavior and abuse, of course it’s going to be interesting to me. (And likely it’s interesting to you, whether as a subject of such abuse (I’m sorry) or because you have some influence in advocating for or putting in place interventions to decrease its prevalence.)

The UK Government’s advice on documenting APIs is some more government-focussed advice on API documentation which I am noting here for my attention/interest thanks to certain government initiatives to move to API-based applications and their relative success and progress at doing so.

A team at UC Berkeley created the Berkeley Crossword Solver¹, which uses a neural network to solve crossword puzzles. 
Caught my attention because: a big deal because it’s not your usual text/image generation/comprehension model. Sure, it’s American crosswords, not British cryptic crosswords, but it involves some sort of semantic understanding (as it were) of content. But my big question was about the training dataset, and sure enough: 

We collected a dataset of over six million question-answer pairs from top online publishers such as The New York Times, The LA Times, and USA Today.

I’ll cut to the end - this data is copyrighted. Sure, it’s available for free, but it’s still under copyright by its owners. I’m not saying that screen-scraping is illegal, I’m not saying whether it’s good or bad, but what I am saying is that the question/answer pairs of crosswords are high-quality data, which the authors admit themselves:

Crosswords are also useful from a practical perspective as the data is abundant, well-validated, diverse, and constantly evolving. In particular, there are millions of question-answer pairs online, and unlike crowdsourced datasets that are often rife with artifacts (Gururangan et al., 2018; Min et al., 2019), crossword clues are written and validated by experts.

So, you’ve got an expert-written, curated and checked dataset. What else is valuable about it?

crossword data is diverse as it spans many years of pop culture, is written by thousands of different constructors, and contains various publisher-specific idiosyncrasies

and:

Compared to existing QA datasets, our crossword dataset represents a unique and challenging testbed as it is large and carefully labeled, is varied in authorship, spans over 70 years of pop culture, and contains examples that are difficult for even expert humans. We built validation and test sets by splitting off every question-answer pair used in the 2020 and 2021 NYT puzzles. We use recent NYT puzzles for evaluation because the NYT is the most popular and well-validated crossword publisher, and because using newer puzzles helps to evaluate temporal distribution shift.

So: super valuable dataset. Used for free, sure, in an academic context. The code for the model is available on GitHub, natch. Now, you’d need the training data to replicate, right? I mean, you can test the model but it would be good to have access to the data. The paper implies the dataset is available on GitHub, but I couldn’t find it. Which is all beside the point: 
It’s one thing to use creative commons licensed data for training. It’s another to do wholesale scraping of, e.g. Reddit forums without sufficient anonymization or scrubbing. And then it’s another to assert ownership of that dataset (which I’m not implying here). But in the UK at least (and things may have changed a lot) there used to be database rights, and look: it would be better, would it not, if the crossword setters had licensed their data explicitly? They could have licensed it for free, even – there’s no requirement for there to be compensation. 
Anyway. Super annoying. Clearly, because I’ve written this in 15 minutes, I’m just putting this down as a marker to do more reading into the ethics in dataset collection, attribution and licensing in machine learning. 

Okay, that’s it for today. Actually only 15 minutes.
How are you?
Best,
Dan

Supporting this newsletter
This newsletter is and will remain free. Here’s some ways you can support it: 

Upgrade your free subscription to become a paid supporter and get a free copy of Things That Caught My Attention, Volume 1
… or buy a copy of Things That Caught My Attention, Volume 1, collecting the best essays from the first 50 episodes. Free subscribers get a copy at 20% off
Or be a fantastic my-boss-is-paying supporter, and subscribe as a work expense: 
$25/month, or $270/year
$35/month, or $380/year
$50/month, or $500/year 

¹ Automated Crossword Solving, arXiv:2205.09665, Eric Wallace, Nicholas Tomlin, Albert Xu, Kevin Yang, Eshaan Pathak, Matthew Ginsberg, Dan Klein, what a horrible citation format I just did
