s16e06: Tech Journalism, But Not Tech Journalism; Sounds More Like A You Problem
0.0 Context Setting
A grey Monday, October 2nd 2023 in Portland, Oregon.
I'm open for my next block of consulting work and looking for interesting people to work with and interesting problems to solve. Unless you're in that special class of people who're ready to go, the best way to get started is for us to have a quick 30-minute chat, or drop me a line at get@verylittlegravitas.com.
A couple bigger things today, then one small thing.
1.0 Some Things That Caught My Attention
1.1 Tech Journalism, But Not Tech Journalism
The Markup published a piece today about how predictive policing is terrible at predictions. To make sure the lede isn't buried and is instead put front and center, the story's subhead reports that the algorithm used by a New Jersey police department was right...
... oh, I bet you can't guess.
I mean, I bet it's lower than your guess.
Less than 1% of the time!1
The Markup is one of my favorite news outlets because it uses technology to do journalism. Which is a clumsy phrase, but let me just quote from them:
Our approach is scientific: We build datasets from scratch, bulletproof our reporting, and show our work. We call this The Markup Method.
That method is essentially hypothesis-driven: build datasets from scratch, validate externally and through peer review, and then publish the datasets and code. So alongside this story about predictive policing, there's a companion write-up of what they did and how they did it2, which of course includes the code on GitHub3.
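If you want the gist of that method without reading their code: conceptually, the check reduces to asking, for each prediction, whether a matching crime was actually reported in that place and window. Below is a sketch of the shape of that calculation -- to be clear, the real thing is in their repo3, and every name here is my own illustrative invention:

```python
# The shape of the accuracy check, as I understand it. This is an
# illustrative sketch, not The Markup's actual code (that's on GitHub);
# all names here are made up. A prediction "hits" if a matching crime
# was reported in the predicted cell during the predicted window.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Prediction:
    box_id: str        # the geographic cell the software flagged
    crime_type: str
    start: datetime    # predicted time window
    end: datetime

def hit_rate(predictions: list[Prediction], crimes: list[dict]) -> float:
    """Fraction of predictions with at least one matching reported crime."""
    hits = sum(
        any(
            crime["box_id"] == p.box_id
            and crime["crime_type"] == p.crime_type
            and p.start <= crime["reported_at"] <= p.end
            for crime in crimes
        )
        for p in predictions
    )
    return hits / len(predictions)
```

Run that over tens of thousands of predictions and you get a single, hard-to-argue-with number, which in this case came out at under 1%.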
Two things here caught my attention: the first is that this is an in-depth investigation of the claims made by Geolitica (formerly known as PredPol, which is... certainly an interesting name for a company) about its predictive policing software. The deal is that a police department would use the software to figure out where to place police presence, Geolitica's hypothesis being that if there are police around, then crime is less likely to happen.
So the story is about the ineffectiveness of the software at reducing crime, because figuring out where to place police in order to deter crime is the whole point in the first place. How would you know if you're getting value for money? How would you know if it's even working? One of the most insightful comments from a police department that used the software essentially says that: a) they didn't think the software's predictions helped them reduce crime; and b) they didn't even use it often, "if at all".
And then there's the general point, which is that the problem is rarely the technology and software. There are likely so many other things that could be done to reduce crime -- or even better, that have been proven to reduce crime -- rather than relying on the claims of a software company. So again: how would you know? How would you check? This software can cost $20,500 for its first year and $15,500 a year after that, so that first year probably includes onboarding, setup, and customization.
I mean, there's nothing wrong with making better decisions about where to put your resources. That would of course require you to be pretty clear about what those resources are supposed to achieve, too. And then, however much I dislike it, you'd want to be sure that whatever this new method of placing resources is, it's (significantly?) better than what you're doing right now.
This, of course, would rely on data, and eh, do I have thoughts on "data" and "using data to make decisions".
The second thing that caught my attention about this was a reminder that The Markup exists. I was worried about them earlier because there were... issues around how their funders saw The Markup's mission, in what looks like messy startup drama handled badly by adults (none of the founders appear to be full-time at The Markup anymore).
The Markup is probably the best example of how to hold the use of technology in society to account and publish that information so society can make better-informed decisions. The good news is that they aren't the only ones:
- Just the other week, The Atlantic published a tool for searching the Books3 dataset4 used to train generative LLMs, although the not-so-great news is that the search tool is behind The Atlantic's paywall.
- Back in April, The Washington Post analyzed the Google C4 dataset with the Allen Institute for AI to understand the sources for that data, as well as providing a search engine5.
This is important. The thing is, The Markup is a non-profit newsroom, which means it likely relies on institutional funding as well as donations. They're a team of 21 people, which isn't cheap (and shouldn't be). Because they're a small, specialist outlet, it would be hard for them to get attention on their own, so it makes sense that they've partnered with better-known, more widely-read publications like WIRED -- their mission is to get the information and the data out. I'd describe them with a term I've just made up: a social benefit newsroom, as distinct from one that's profit-oriented (or, rather, more profit-oriented).
This type of reporting ends up in lots of places. I expect that one of the places it ends up is the (better) offices of elected representatives as part of briefings, or as material used by regulatory bodies or legislative committees. I suppose the issue of spreading awareness these days is different as well, given how much content stealing and summarizing goes on (in many cases without ethical attribution), even in otherwise reputable outlets. Race to the bottom and all that.
I casually mentioned in a previous episode the chap in Austria who single-handedly used scraping to bring attention to price-fixing. That's one person, in an environment able and willing to listen and, hopefully, act.
If I were to turn this around, then think about this: if private businesses invest in analytics and data to make decisions, and they do invest a lot, then when we talk about government "making decisions with data", that approach must extend to regulatory action. Otherwise there's a stupendous capability gap. If you can't understand what the market's doing, if you can't understand what actors are doing to within even an order of magnitude, then what hope do you have? Sure, adopt a laissez-faire approach, but if you're going to do that, at least watch closely.
1.2 Sounds More Like A You Problem
Three things via Simon Willison's weblog:
- Finding Bathroom Faucets using Embeddings
- Emily Bender on replacing the term AI with the word "automation"
- Andrej Karpathy on LLMs as "a whole new computing paradigm"
and one bit that caught my attention from a stream somewhere, about RLHF, reinforcement learning from human feedback7.
First, the thing about RLHF. There was a post or a tweet or whatever about the unreasonable effectiveness of RLHF and how RLHF algorithms are super awesome and behind all the phenomenal [citation needed] performance of this generation's automated text generation engines, i.e. ChatGPT and so on.
Fine, RLHF algorithms appear to have been super successful. But I want to point out that the reason why they've been successful is not the reinforcement learning algorithm part, but the economics of the human feedback part. It's a combination of:
- automated access to humans
- at low cost
- when you also have lots and lots and lots of money
That's it. You could have totally done this before, I think, but it's the scale that's possible because of the combination of those three items above, of which the latter two are, I think, the most material. It's not the unreasonable effectiveness of the algorithm. It's the money. (Which is, I recognize, not an entirely new observation...)
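To put (entirely made-up) numbers on that, a back-of-envelope sketch -- none of these figures come from any actual lab:

```python
# Back-of-envelope on "it's the money". Every number below is a made-up
# illustration, not a figure reported by any lab.
comparisons = 1_000_000   # human preference judgments to train a reward model
seconds_each = 30         # time per judgment
hourly_rate = 15.0        # USD/hour for low-cost contracted labeling

hours = comparisons * seconds_each / 3600
cost = hours * hourly_rate
print(f"{hours:,.0f} labeling hours, ~${cost:,.0f}")
# -> 8,333 labeling hours, ~$125,000
```

Trivial money for a well-funded lab, and well out of reach for nearly everyone else, before you've spent a cent on the reinforcement learning part.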
Anyway. On to the second part, once we shove aside a bunch of issues as "to be solved later by society while we get on with making the shiny new stuff", or as Somebody Else's Problem.
Finding Bathroom Faucets Using Embeddings is the story of someone who wanted to find a new faucet. So! Here's what they did:
- got around 20,000 images of bathroom fixtures and product data (likely descriptions, price etc), presumably by scraping them
- grabbed OpenAI's CLIP, which is a way to connect images and text
- "do a learning", which is my silly term for creating the embeddings (ie associations between images and text and, I think, their places in vector space)
- wire it up so you can find similar faucets in vector space
- ... and then do it the other way round, which is to say find faucets nearby in vector space starting from a text description, via the text-to-image embedding/association (a rough sketch of the whole pipeline follows this list)
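For flavor, here's roughly what that pipeline might look like. This is not the author's actual code: it's a minimal sketch assuming Hugging Face's transformers wrapper around OpenAI's CLIP, and image_paths is a stand-in for wherever the scraped product images live:

```python
# A minimal sketch of the faucet-search pipeline (not the author's code).
# Assumes Hugging Face transformers' CLIP wrapper; `image_paths` is a
# placeholder for your scraped product images.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# "Do a learning": embed every product image into CLIP's shared vector space.
images = [Image.open(p) for p in image_paths]
with torch.no_grad():
    image_embs = model.get_image_features(
        **processor(images=images, return_tensors="pt")
    ).numpy()
# Normalize to unit vectors so a dot product is cosine similarity.
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)

# Image-to-image: the five faucets nearest to faucet 0 in vector space.
similar = np.argsort(image_embs @ image_embs[0])[::-1][1:6]

# Text-to-image, the other way round: embed a description, search the same space.
with torch.no_grad():
    text_emb = model.get_text_features(
        **processor(text=["matte black single-handle bathroom faucet"],
                    return_tensors="pt", padding=True)
    ).numpy()[0]
text_emb /= np.linalg.norm(text_emb)
matches = np.argsort(image_embs @ text_emb)[::-1][:5]
```

At around 20,000 items, brute-force dot products are fine; you'd only need a proper vector database at much larger scale.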
This is a new kind of thing. It's the kind of thing places like Amazon would be super excited about. It is a thing enabled by scraped datasets, in particular the CLIP model. It is also something that one person could do in $an_amount_of_time, and crucially, do by essentially cobbling bits of code and data together.
It relates to the quote from Karpathy above, suggesting that large language models aren't chatbots, but "the kernel process of a New Operating System", because LLMs now cover text, audio and visual I/O, code interpretation and runtime, network access, and storage, and can be assessed by speed (i.e. token generation per second).
I do agree that chatbots are a user interface to whatever it is that large language models (and the systems around them) can do. They are a text-based interface that fakes up, or pretends to be, a natural language interface, and in that way are, following Bender, automated text generation engines, or automated query response engines.
But that's not all they do; or rather, the chat interface also exposes particular domains that are better suited to different kinds of text generation. One of those might be boilerplate code generation, and if you look at the abilities there, they're clearly not conscious. The trick in the text-based chat user interface is "pretending to be a human" when there's no need to, other than, for example, making an interface more accessible along certain traits. There doesn't need to be an "I" in the response.
Another part is summarizing and re-phrasing large amounts of text. One of the most interesting domain-specific examples I've personally seen is Dave Guarino's work testing how well LLMs can summarize complex government regulations like SNAP / food stamp eligibility requirements or unemployment insurance requirements. This is very useful, something that is hard for humans to do, and also, perhaps, relatively easy to check. Having a first-pass summarization/re-phrasing engine could be very useful!
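To make that concrete, here's a minimal sketch of what a first-pass re-phrasing engine might look like. To be clear, this isn't Guarino's actual setup: it assumes the OpenAI Python client, and snap_regulation_text is a hypothetical variable holding the source text:

```python
# A minimal first-pass summarization/re-phrasing engine, sketched with the
# OpenAI Python client. Not Dave Guarino's actual setup;
# `snap_regulation_text` is a hypothetical variable holding the regulation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_regulation(regulation_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Rewrite the following government regulation in plain "
                        "language at an eighth-grade reading level. Preserve "
                        "all eligibility criteria, amounts, and deadlines."},
            {"role": "user", "content": regulation_text},
        ],
    )
    return response.choices[0].message.content

plain_language = summarize_regulation(snap_regulation_text)
```

The "relatively easy to check" part is what makes this attractive: a caseworker can compare the plain-language output against the source criteria, which is a much cheaper job than writing the summary from scratch.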
And then there are papers concluding things like GPT-4 "significantly increasing the performance of BCG consultants"8. But at least that paper acknowledges there are some things that "AI" is good at and some things where it "falls short" (which I would read as "isn't good", as opposed to that more positive framing), and covers established ideas like "centaurs", who divide and delegate work, and, differently, "cyborgs", who integrate it into their workflow. Anyway, apparently using GPT-4 improves your average BCG consultant's speed by over 25%, human-rated performance by over 40%, and task completion by 12%.
My suspicion, without reading the paper, is that boilerplate text generation and repetitive, domain-common, low-level creativity tasks that also involve recall (come up with five ways to increase sales in this context) are the ones that get sped up. In which case... that's okay? If this AI stuff was supposed to be good, if the point of technology was to free us from drudgery, then great: freeing us from the white-collar drudgery of creating boilerplate text so people can focus on more creative, intellectually stimulating tasks that potentially provide more fulfillment is good, I think? (Leaving aside, again, the stupendous issues as to the ethical sourcing of training data.)
The question that gets raised at this point is the typical "well, needing to generate a whole bunch of drudge/repetitive/boilerplate work seems more like a you problem than a me problem". That perhaps it's a problem with the environment: why do we keep needing to repeat these things? Isn't that a sign that we should change the process? I mean, maybe? Probably? But in some cases, maybe you're also communicating domain knowledge to people who aren't familiar with it. You may well be reciting the same basic concepts, but what's required is some customization and fitting to slightly different contexts. Is this progression from standard template to automated template adaptation -- a Minimally Viable Contextual Adaptation -- okay? Good enough? Maybe it is good enough for particular cases.
1.3 Old stuff holding back new stuff
A piece in DJ Magazine about digital audio workstations and how legacy DAWs -- with which there's a lot of familiarity -- are holding back new ideas6. If that's the case, the answer most likely lies in small instances of breakthrough new approaches that gradually get incorporated into, or alongside, existing products.
Okay, that's it for the beginning of the week. Slightly shorter today, which is good, because honestly I don't know how much longer I can keep this word count up without feeling paralyzed if I can't write "enough".
Best,
Dan
How you can support Things That Caught My Attention
Things That Caught My Attention is a free newsletter, and if you like it and find it useful, please consider becoming a paid supporter, at pay-what-you-want.
Do you have an expense account or a training/research materials budget? Let your boss pay: $25/month or $270/year, $35/month or $380/year, or $50/month or $500/year.
Paid supporters get a free copy of Things That Caught My Attention, Volume 1, collecting the best essays from the first 50 episodes, and free subscribers get a 20% discount.
1. Predictive Policing Software Terrible At Predicting Crimes, Aaron Sankin, Surya Mattu, 2 October 2023, The Markup and WIRED ↩
2. How We Assessed the Accuracy of Predictive Policing Software, Aaron Sankin, Surya Mattu, 2 October 2023, The Markup ↩
3. Prediction: Bias follow-up, The Markup, GitHub ↩
4. These 183,000 books are fueling the biggest fight in publishing and tech, Alex Reisner, 25 September 2023, The Atlantic ↩
5. Inside the secret list of websites that make AI like ChatGPT sound smart, Kevin Schaul, Szu Yu Chen, Nitasha Tiku, 19 April 2023, The Washington Post ↩
6. What is the future of the DAW?, Declan McGlynn, 28 September 2023, DJ Magazine ↩
7. Reinforcement learning from human feedback, Wikipedia ↩
8. Navigating the Jagged Technological Frontier, D^3 Faculty, 18 September 2023, Harvard Business School Digital Data Design Institute, and Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality ↩