s17e01: How old things keep living; Impossible AI; Look, just call it AI
0.0 Context Setting
It’s Monday, 8 January, 2024 in Portland, Oregon, and most of the missing parts of the Alaska Airlines 737 MAX 9 -- which is to say, the door plug -- have been found nearby.
I think I might try something different this season -- separating out the episodes that cover the smaller things that caught my attention from the (probably longer) ones that are more full of reckons and thinking aloud.
0.1 Work with me!
Hey, it’s the new year! You’re excited about doing new things!
My consulting calendar has opened up for 2024 and I’m open to new clients. The quickest explanation for what I do is that I ask stupid questions as a service, plus you can read the great things people say about working with me.
1.0 Some Things That Caught My Attention
Three short things today:
1.1 How old things keep living
Here’s something mined from the orange website:
- A few years ago, someone (re)implemented a VAX system on FPGA1 -- an FPGA is a super nifty kind of chip that’s completely reprogrammable, so if you know what you’re doing, you can tell it to be any other sort of chip, from an old console to, well, a ~47-year-old computer architecture, at varying degrees of speed.
- The PDP-11 is a minicomputer of the same vintage, and apparently trains in Melbourne, Australia that used to be controlled by an actual PDP-11 are now controlled by a hardware PDP-11 CPU that’s stuck in a regular x86 PC. At least, so reckons an orange site commenter2.
- Back in 2013, General Electric made clear that there’s a bunch of nuclear power plant robotic automation that runs on (actual, non-emulated?) PDP-11s, and that (back then) they intended to use those PDP-11s through 2050 (which is past the Y2038 bug!)3. The PDP-11 is 54 years old, and I suppose that totally makes sense: there are 54-year-olds who’re still working, and that’s totally fine, too. Wikipedia reckons 600,000 of them were sold.
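(A quick aside on that Y2038 bug, in case it’s unfamiliar: a minimal sketch, in Python -- nothing to do with actual PDP-11 software -- of why January 2038 is the deadline. Classic Unix-style systems store time as a signed 32-bit count of seconds since 1970, and that counter runs out well before 2050.)

```python
# A minimal sketch of the Y2038 problem: a signed 32-bit time_t counts
# seconds since 1970-01-01 UTC and overflows in January 2038.
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2**31 - 1  # largest value a signed 32-bit counter can hold

# The last representable moment before overflow:
rollover = EPOCH + timedelta(seconds=MAX_INT32)
print(rollover.isoformat())  # 2038-01-19T03:14:07+00:00

# One second later, the counter wraps around to -2**31,
# i.e. a date back in 1901 -- a long way short of 2050.
wrapped = EPOCH + timedelta(seconds=-(2**31))
print(wrapped.year)  # 1901
```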
Caught my attention because: Some great examples of more old things that haven’t gone away -- they get emulated or encapsulated and wrapped in the newer things.
1.2 Impossible AI
A bunch of my feed today was taken up with commentary on a Guardian article with this headline: “‘Impossible’ to create AI tools like ChatGPT without copyrighted material, OpenAI says”4. Do not be surprised if the longer reckon this week is a thinking-out-loud about “what we should do” about generative artificial intelligence, stochastic parrots, spicy autocomplete and so on.
Anyway, here’s what happened:
- Last month, The New York Times sued OpenAI and Microsoft for copyright infringement5.
- Also last month (December 5, 2023), OpenAI made a submission to the U.K.’s House of Lords Communications and Digital Select Committee inquiry on large language models, and that’s where The Guardian’s money quote comes from:
“Because copyright today covers virtually every sort of human expression -- including blog posts, photographs, forum posts, scraps of software code, and government documents -- it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.”6
Here’s some hand-waving, and at this point I’m pretty sure that this will be in a longer episode later this week, given that other people have asked me what I think about this whole mess too:
What we’re doing is too important
What OpenAI is saying here is that the AI tools they’ve created (yes, it’s more complicated than that) are Too Important For Society, so if what they’ve been doing (slurping up a whole bunch of copyrighted content and then transforming it and making it accessible in different ways) has been illegal, then the laws should change. Because what they’re doing is Too Important and Too Useful.
Asking questions leads to information, which is good
Well, is what OpenAI doing Too Important and Too Useful? That’s a larger question, which is why you’ve got government bodies like the U.K.’s House of Lords asking questions about the whole issue. This is good, this is how democratic governments are supposed to work. We’d rather they ask questions than not, right?
Of course OpenAI will say that a thing that depends on ingesting lots of things will be better if it ingests more things rather than fewer things. From OpenAI’s point of view, they’re pointing out that they’d develop the “best” tool by slurping up everything, rather than just things that are out of copyright.
I’ve written this before: copyright is a societal construct. The easy version is that it’s an incentive for people to make stuff -- you get a short-term monopoly so you can make money off the stuff you make. Then there’s a bunch of carve-outs for Reasons That Society Thinks Are Important, like what happens if you write “No Copyright Intended” in the description of your YouTube video.
There’s the argument that slurping up everything that’s out there is totally fine (stupendously simplified) because Google does it and it turns out we let search engines do it. There’s a niggling feeling at the back of my mind that we “let them do it” because they’d already done it, and it turned out that we liked it quite a lot so we (i.e. “society” and “courts”) were persuaded that we should totally let Google keep doing that provided they keep providing a useful search service.
But! Here’s where it’s more complicated -- the NYT is essentially saying that OpenAI should license their content (which, yes, fine) and not just grab it for free. But in the interests of equity -- and this is what OpenAI is desperately trying to head off -- they’d then technically need to license content from everyone else. Like you. Or me.
I will quickly point out that it is much easier for, say, the NYT to get paid for licensing than it is for you or me; see: every single time someone’s photo gets ripped off.
The counterpoint to this is, well, if we agree that we want all content to be properly licensed for purposes that include “slurping it up to go into a giant language model and put together word vectors”, then only the organizations with a lot of money are going to be able to afford to license all that content, and we’re more aware than ever that letting the people with the money do the things that let them get even more money isn’t a great thing. Then people start getting worried about limiting competition -- how could a poorly capitalized company -- or even person -- possibly compete against a company that’s ostensibly got billions of dollars in the bank? They couldn’t!
Scale breaks things again
There are complicated and difficult things that happen when you massively scale. Here’s what I think makes intuitive sense to people: “ingesting” all of this content is the same as “reading” things or “looking” at things and then remembering those things and making new things based on those things. A lot of the time, the things you read or looked at cost money, and in different ways, you paid money to read or look at those things! If you had to pay -- in various ways -- to read all those books on your journey to adulthood, shouldn’t OpenAI?
OpenAI are kind of arguing that “look, if you had to put a number on How Important What We’re Doing Is, it would be a Really Big Number” and that because the number is So Big -- i.e., so much bigger than your number for you, Prakash, reading this right now -- then they shouldn’t need to pay to read all that stuff. Which... is an interesting argument? I mean, what if you put together a whole bunch of Prakashes? How many Prakashes would it take for their What We’re Doing Is Important number to be big enough to qualify, in OpenAI’s argument, for not needing to pay? I mean, obviously it is impossible to learn everything without, uh, learning everything.
For all mankind?
Okay, so sometimes it is totally worth it to let an arbitrary number of Prakashes read books. Which is how we end up with libraries? We all decide that’s a good idea, we all throw some money into the pot to buy a bunch of books and then, modulo waiting some time, Prakash(es) get to read the books. Job done.
“Hang on a second, Dan”, you say. Didn’t the libraries buy all those books just then? Didn’t they pay for them? And don’t we pay for those books by, I don’t know, paying taxes?
Well: a) yes, b) yes, and c) hahaha, have you heard about how well-funded libraries are these days? Apparently the very utility of libraries is under question!
And then you might also say “Just keep hanging, Dan, I have some more thoughts”, and I would say: this internal dialogue is going great, thanks, and you’d continue: “Okay, so libraries. The deal is that anyone can use a library, right? And that they’re free to use?”
And I’d have to say: a) yes! and b) oh, that’s interesting, they are free to use! -- while at the same time bashing over the head the people who are angry about having to pay taxes for things like libraries.
But the deal with OpenAI isn’t that everyone gets to use it for free. (I mean they do, now. Ish.) The deal with OpenAI is that you get to use it if you pay for it. OpenAI doesn’t want to just, I don’t know, be a responsible developer of non-human-hating artificial general intelligence (geez is this a complicated issue what with its corporate and non-profit status), but it also would like to make some money.
Roads aren’t supposed to make money. Libraries aren’t supposed to make money. Not directly. We have them because gestures they contribute to some sort of common outcome and benefit? They are enabling infrastructure: having those things lets us do things, and then, if you’re the sort of polity that works this way, the things the infrastructure lets you do lead to taxable productive activities!
We reserve our right to change our mind and to try things
Here are some ideas that, for the life of me, I can’t remember whether I’ve written before -- not least because I haven’t built a search engine, or a retrieval-augmented generation engine, for the entire newsletter:
- Assuming that generative AI is a net benefit for people making stuff that fuels economic activity if you’re into that kink, then is an open-access national training corpus a good idea?
- OK fine, let’s just have a grace period and let anyone who wants to slurp all the stuff and put it in models. Then we reserve the right to tax the fuck out of them in particular if they’re super successful.
- Even better, if they are super useful, then... nationalize them?
- Hell, can you eminent domain an entire application stack? “Hello, it’s super important for us to build a railroad here and you’re in the way, and it turns out we think everyone’s better off if we have a railroad and you move house.” becomes “Look, well done for making all this shit but it’s time people get to benefit more widely.” I mean clearly you can because in films SHIELD went and grabbed all of Dr. Foster’s stuff.
- Of course, you wouldn’t necessarily need to eminent domain the thing, you’d subsidize it if it became infrastructure, right? Like how you’d subsidize internet access or mobile phone plans if that turned out to be super important?
- All the above requires a government to actually have a point of view, a spine, and the political ability to change its mind.
Of course the (invariably American, west coast) people who founded all the AI companies that are part of the AI Spring7 [sic] have pointed out that if there were the slightest risk of their stupendously profitable work being taken from them, or even the hint of it being taxed slightly less favorably than it is now, then they’d take all their toys and go home and none of us would be figuring out what we’d look like as a Viking(?). Part of the position, I assume, is also “look, if we have to so much as check in every few months, that’s going to super piss us off as well”, which isn’t entirely the case, because it turns out governments are asking them to check in every few months so that they can provide statements like “obviously it’s impossible to make AI without being trained on copyrighted material”.
The long and the short of it -- as someone who last studied intellectual property law over twenty years ago and is now feeling super old -- is that it’s always a possibility to re-evaluate what copyright is for and to legislate for it if we want to reinforce or better ensure a purpose.
I have gone on so much over the years about the concept of universal service in plain-old telecomms and postal service applying also to universal compute -- and would point out that we can’t even reliably get universal service for internet yet! If this stuff is so useful then there’s still a model (as bad as it is) for “utility” companies that may also make money (and options for how much! Sometimes capped, sometimes not! Turns out you can just decide!) Governments also sometimes even like competition amongst utilities. Everyone can have electricity or water or a telephone, but you’ve got to pay for it. But these days, isn’t the argument as well that the marginal cost is, like, zilch? So maybe you can have some, as a member of that society, as a treat?
So I think that perhaps leaves a question: what’s the equivalent of a social utility in a training corpus?
1.3 Look, just call it AI
Meanwhile, Simon Willison reckons it’s okay to call all this stuff Artificial Intelligence8 because, I think, he’s pointing out that those of us who want to be more precise about what generative artificial intelligence actually is (stochastic parrots, spicy autocomplete, a bunch of applied statistics) are fighting a losing battle of prescriptivism versus descriptivism. Most people are calling these things “AI”, and if you ask a random person on the street what “AI” is, then odds are you’ll get an answer that includes ChatGPT or whatever.
Willison advocates for making sure that people know that what’s currently called “AI” is not artificial general intelligence, which is also fair enough because it’s... not? (The sticking point here, I think, is that a bunch of people do think AI is artificial general intelligence because it talks like an artificially intelligent science fiction computer talks.)
Caught my attention because: Sure. Fine. It’s as if there’s a need for some sort of taxonomy here. “AI” is an umbrella term, you’ll get my agreement on that.
But it’s undeniable that large language models like ChatGPT are really useful for people in some domains.
On the one hand, asking a recent version of ChatGPT to write (some) programs or scripts for you actually works, and to be more specific about the domain: if the programs or scripts are scoped tightly enough, they’re verifiable. This isn’t just a timesaver; it’s also bringing automation to people (admittedly, more often than not, bleeding-edge users).
Generative AI is good at replacing a blank page with something, which can be great for people who need a starting point. It isn’t great at producing insightful original work (citation needed), and, to date, if you’re writing a history essay to hand in to your tutor at Oxford, then the process of co-writing/guiding an AI to write that essay -- if you want -- does put you in situations where you can learn things!11
I was only supposed to spend 15 minutes writing this but instead I angrily bashed at my keyboard for... about an hour and a half?
How are you? It’s going to get cold here soon.
How you can support Things That Caught My Attention
Things That Caught My Attention is a free newsletter, and if you like it and find it useful, please consider becoming a paid supporter.
Let my boss pay!
Melbourne (Australia)'s train signals used to be controlled by PDP-11s running E... | Hacker News (archive.is), skissane, Hacker News, 7 July, 2021 ↩
‘Impossible’ to create AI tools like ChatGPT without copyrighted material, OpenAI says | OpenAI | The Guardian (archive.is), Dan Milmo, The Guardian, 8 January, 2024 ↩
Written evidence LLM0113 to the House of Lords Communications and Digital Select Committee Inquiry, OpenAI, 5 December, 2023 ↩