AI and licence erasure: It's getting worse, much worse. Code & fiction.

@pibara · 2025-10-04 22:18 · ai


A bit of a warning: this post is going to get a bit ranty.

As you may know if you follow my account, I have been following AI for a while. In the prologue of Ragnarok Conspiracy, of which I wrote the first draft eight years ago in the summer of 2017, I projected a summer-of-2027 future where dating-site scammers would be using AI. Looking at where AI was in 2017 and where we are now, I believe I did a pretty decent job of connecting the dots. We are still 20 months from the date of Ragnarok Conspiracy, and while I think AI might reach the level described in the prologue, I don't expect that level to be available on the hardware a 17-year-old hacker and scammer would likely possess. So while I predicted the development quite decently by connecting the dots in 2017, I was likely a little too optimistic about the timeline.

A short excerpt from the prologue:

As Dakila pushed the F6 key, the lifelike Ligaya simulation ran through a set of facial movements that had been Dakila’s ticket to success on oh so many occasions.
His AI combined aspects of a disarmingly childlike cuteness with enough of a sexual undertone to render these middle-aged horny westerners completely blind to the small flaws in the AI.

Dakila made none of the beginner's mistakes like making his sims too hot. No professional fashion models as template; No explicit sexual innuendo. No! Dakila’s sims were designed to be the type of girl that is cute but innocent. Not too innocent, no, and not too young. Don’t want to attract no pedophiles and through them attract coppers.

No, all of Dakila’s creations were based on local women in their late twenties. Add a bit of irresistible naivety and his own special brand of kitten-cute sensuality, and those midlife-crisis horndogs would let their guard down just enough for him to slowly start milking them for BareunCoin. And to be fair, most of them really had it coming. Their ring fingers more often than not showed discoloration and indentations that could only be explained by a wedding band having quite recently been worn on them. No, these assholes got what was coming to them, 'Karma motherfuckers', and if Dakila could make a lot of money milking them to improve his finances, that was just a nice bonus.

Theo was one of the worst and thus one of the best. Four different sims were milking Theo. Each of the sims dangled the prospect of romance and even marriage in front of him. Theo was clearly working towards a visit to the Philippines next summer, intent on taking advantage of what he believed to be real ladies. He must have thought himself a real player, but all the while the BareunCoin kept flowing into Dakila’s wallets.

But prediction is one thing, protecting what we create is another. That's why I drafted the Open World Licence.

As an author I have been concerned about copyright infringement and micro-plagiarism by AI. In my proposed Open World Licence, I tried to combine my pet peeve regarding platform DRM on e-books with my concern about AI consumption and micro-plagiarism of copyrighted material. Three excerpts from the licence text:

AI Restrictive:

The copyright holder of this work declares that usage of this work in AI training
sets does NOT fall under any kind of "fair-usage" and as such is not allowed under
this licence.

AI Lenient:

The copyright holder of this work grants the user of AI the right to use this
work as part of an AI training set under the same conditions as he does human
re-use.

The use of AI does not void the condition of attribution.

AI Permissive:

The copyright holder of this work considers AI training to be FAIR-USE of this
work and grants the user of AI the right to use this work as part of an AI
training set.

By using the Open World Licence, an author can be explicit about their intentions regarding AI usage of their copyrighted work.
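
To make a clause like this machine-checkable, a training-set ingestion pipeline could look for an explicit declaration before including a work. Below is a minimal Python sketch of that idea; the OWL-AI-* marker strings and the function name are hypothetical illustrations of mine, not part of the licence text.

# Hypothetical sketch: an ingestion pipeline honouring an explicit
# Open World Licence AI clause found in a work's front matter.
# The OWL-AI-* markers below are assumed, not defined by the licence.
AI_MARKERS = {
    "OWL-AI-RESTRICTIVE": "excluded",    # training is not fair use: skip
    "OWL-AI-LENIENT": "attribution",     # allowed, attribution survives
    "OWL-AI-PERMISSIVE": "included",     # training is declared fair use
}

def classify_for_training(front_matter: str) -> str:
    """Return how an ingestion pipeline should treat this work."""
    for marker, decision in AI_MARKERS.items():
        if marker in front_matter:
            return decision
    return "excluded"  # no explicit grant: the safe default is to skip

print(classify_for_training("Open World Licence, OWL-AI-LENIENT variant"))
# -> attribution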

Besides writing fiction, I am a developer too, and I have been using AI for coding for a while now. Not long ago one of the main AI companies (I'm not dropping names for legal reasons) released a new model for the frontend that I have been using, and a fun thing I always do to test for advances is feed the AI a chapter from my M.Sc. thesis from 2017 as the core of my prompt. Chapter 6.1.1 describes a container meant for page-cache-efficient access to a huge data file in a computer-forensics context. It's pretty niche, and as such a useful test for real-world quality.
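
For readers without the thesis at hand, the core idea of that chapter is scanning a data file far larger than RAM without evicting everything else from the operating system's page cache. A minimal Python sketch of that idea follows; this is not the thesis code, the chunk size and names are my own choices, and posix_fadvise makes it Linux-only.

# Sketch: sequentially scan a huge file while telling the kernel to
# drop consumed pages, so the scan does not trash the page cache.
import os

CHUNK = 1 << 20  # 1 MiB per read; an arbitrary choice for this sketch

def cache_friendly_scan(path, process):
    fd = os.open(path, os.O_RDONLY)
    try:
        # Announce sequential access so the kernel can read ahead.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        offset = 0
        while True:
            data = os.pread(fd, CHUNK, offset)
            if not data:
                break
            process(data)
            # Drop the pages we just consumed; a multi-terabyte scan
            # should not evict the rest of the page cache.
            os.posix_fadvise(fd, offset, len(data), os.POSIX_FADV_DONTNEED)
            offset += len(data)
    finally:
        os.close(fd)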

New iterations of the models have gradually gotten better at this test. But now comes the problem: the latest model that became available not that long ago wasn't just better than the one before it, it was strangely familiar. I won't publish the generated code, because it might identify the AI company and I don't want to provoke legal action; after a three-year stretched-out divorce that only got finalized this summer, I've seen enough legal bills for now. But I will share the old code that I wrote myself.

This iteration of the model wasn't just better, it was familiar. It looked so much like my own code that a prompt asking it to refactor my own code would likely have produced something less familiar. So far my skeptical views on AI haven't stopped me from using it for code, but with what I've seen now, I feel I can no longer morally justify using LLMs when coding. This isn't just micro-plagiarism; this is pure, obfuscated licence erasure.

Just like a book comes with a copyright notice, open source code comes with a licence. In my case that is the (old) 4-clause BSD licence.

Copyright (c) 2015, Rob J Meijer.
Copyright (c) 2015, University College Dublin
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in the
   documentation and/or other materials provided with the distribution.
3. All advertising materials mentioning features or use of this software
   must display the following acknowledgement:
   This product includes software developed by the <organization>.
4. Neither the name of the <organization> nor the
   names of its contributors may be used to endorse or promote products
   derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ''AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL <COPYRIGHT HOLDER> BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

As you can see, the near-verbatim generation of large parts of my own code isn't the problem here; it is open source after all. The problem is that the frontend did so without reproducing this licence alongside the generated code.
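
Checking for this yourself doesn't require anything fancy: compare the generated output against your published source and verify the licence came along. A rough sketch with Python's difflib; the 0.8 threshold is an arbitrary assumption, not a legal standard.

# Sketch: flag near-verbatim reproduction that drops the licence text.
import difflib

def similarity(generated: str, original: str) -> float:
    """Ratio in [0, 1]; values near 1.0 mean near-verbatim output."""
    return difflib.SequenceMatcher(None, generated, original).ratio()

def licence_erased(generated: str, original: str, licence: str) -> bool:
    # Near-verbatim code that no longer carries the licence is exactly
    # the licence erasure described above.
    return similarity(generated, original) > 0.8 and licence not in generated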

Given how very niche my prompt is, running into training-data regurgitation is actually not all that unexpected. What does seem telling is that it happens at a time when LLMs keep getting better even though many experts predicted progress should stall. My hypothesis is that LLM progress right now could be 100% due to an increased disregard for morality.

Another interesting coincidence: I'm guessing the release of new code models probably coincides with the release of new text models, and if those models are also more prone to training-data regurgitation, then a bit of a distraction might be useful for the AI companies.

Maybe I'm being paranoid, but right now a HUGE distraction seems to be happening, keeping many authors and publishers occupied. Social media is buzzing with anti-piracy talk. Authors are very passionate about the subject of piracy, and piracy advocates are really stirring the pot by baiting self-published authors into siding with AI companies, saying it's hypocritical to condemn book piracy when indie authors are using AI for their cover art.


Maybe I'm getting caught up in conspiracy theories, but the piracy debate is starting to feel quite conveniently timed for AI companies, if what I suspect about progress at the expense of ethics and training-data regurgitation is true.

In theory the whole licence-erasure problem should be trivial to fix with bucketed training: partition the training data by licence, so generated code can carry the licence it was trained under. This would result in multiple smaller training sets, which would reduce the power of code generation, but I feel that is a price developers should be willing to pay. In the end we as developers are responsible for the copyright notices we infringe upon, not the LLM and not the AI company. For me this currently means that I'm cancelling my subscription and won't be using any AI that I suspect might be doing massive licence erasure. Until I get convincing data, that currently means all of them.
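
For what it's worth, here is a minimal sketch of what I mean by bucketed training: partition the corpus by licence so each model only ever learns from compatibly licensed code and can emit the right notice with its output. The licence detection below is naive on purpose, and all names are my own; a real pipeline would use proper SPDX scanning.

# Sketch: group source files into per-licence training buckets.
from collections import defaultdict

KNOWN_LICENCES = ["BSD-4-Clause", "BSD-3-Clause", "MIT", "GPL-3.0", "Apache-2.0"]

def bucket_corpus(files):
    """files maps file name -> source text; returns licence -> file names."""
    buckets = defaultdict(list)
    for name, source in files.items():
        licence = next((l for l in KNOWN_LICENCES if l in source), "unknown")
        buckets[licence].append(name)
    return buckets

# Each bucket becomes its own training set, so generated code can carry
# the bucket's licence and attribution instead of silently erasing it.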

#ai #writing #opensource #ethics