Building LLMs with Ethically Sourced Data

July 27, 2024

Large Language Models (LLMs) are taking the internet by storm with their capabilities. But at what cost? LLMs like ChatGPT have used data sourced from the internet without permission. There is a way to fix this: ethically sourced data.

Large Language Models (LLMs) have taken over the AI industry as the bleeding edge of the technology. The most famous is ChatGPT, which took the world by storm with a chatbot that can hold basic human-level conversations with its users fairly easily (although it makes plenty of mistakes). This has brought to light a major issue with the current state of the AI industry: how companies source their data.

Many users were shocked when they found similarities between the chatbot's responses and popular blog posts or other independent, non-open-source public content, showcasing how AI companies have used data without the permission of its creators. The public's realization makes it seem like this data issue is new, driven by AI's need for human-created data. But this is not the case: your data has been used by the biggest tech companies since the early days of the internet. Many online products, such as Google and really anything that is free, make your data their product via their terms of service. This is how these free online products have been able to turn a profit. And now they are using the data that you gave them in their AI products.

Some would say this isn’t necessarily unethical, just shady, because these companies don’t make it clear that they will use your data in other products. I do think it is unethical, because they use your data to make their products better and then sell those products back to you. It’s a shady business model that exploits its customers multiple times in non-obvious and manipulative ways.

What if there was a way to use “ethically sourced” data in AI products, specifically LLMs? Well, there is: all you have to do is buy the data. You can buy data on various online marketplaces. The issue is that LLMs require a constant supply of high-quality language data from humans to keep up with ever-changing slang and the cultural landscape. How is this possible?

The quality of the data depends on the transparency and liquidity of the collection source. In other words, however the data is initially collected, the source needs to share the money with the data's creator. On Cuatex, we have done this by enabling moderators to sell their data directly to the buyer, in this case companies building LLMs. Moderation is very complex, as people come up with workarounds to say the bad words they intend to. But if moderators make money from tagging these words, they have an incentive to improve the quality of their moderation, which in turn produces high-quality data. This creates a true capitalist cycle that allows everyone to make money from their participation in the economy.

At Cuatex, we let our users sell their data, giving them an incentive to create higher-quality data, build a better internet, and make LLMs ethical.

Join over 313 users who want to challenge the status quo.