What Is Corpus In AI? Why Are Tech Leaders Talking About It?

Large language models such as OpenAI's ChatGPT are trained on massive collections of text, which is what allows them to generate coherent, human-like writing. Technology leaders call such a collection a “corpus”. Why are global tech influencers discussing it, and how might it shape humanity's future with AI?

Discussion of artificial intelligence (AI) surged in 2023, driven by the rise of generative AI and the launch of ChatGPT in November 2022. It seems that every day brings a new article speculating on whether AI will bring about the downfall or salvation of creative fields, jobs, or even humanity.

Amidst these debates, a peculiar term has emerged from the tech executives' lexicon: "corpus." Global tech leaders such as Reddit CEO Steve Huffman, Wikipedia co-founder Jimmy Wales, and Microsoft co-founder Bill Gates have all used the term in various contexts.

“[ChatGPT] is truly imperfect. Nobody suggests it doesn't make mistakes, and it's not very intuitive. And then, with something like math, it'll just be completely wrong. Before it was trained, its self-confidence in a wrong answer was also mind blowing. We had to train it to do Sudoku, and it would get it wrong and say, ‘Oh, I mistyped.’ Well, of course you mistyped, what does that mean? You don't have a keyboard, you don't have fingers! But you're ‘mistyping?’ Wow. But that's what the corpus [of training text] had taught it.” - Bill Gates, Co-Founder of Microsoft.

So, what exactly is an AI corpus?

For those familiar with Latin, the word "corpus" means "body". The legal phrase "habeas corpus" uses the word in that sense: it guarantees an arrested person the right to be brought before a judge. In literature and linguistics, a corpus (plural: corpora) is a collection of texts, typically one assembled for language research.

However, in the realm of artificial intelligence, "corpus" takes on a metaphorical sense, representing the collection of data that serves as training material for AI systems. It is this corpus that endows AI with intelligence specific to its designed purpose.

The ‘humanised’ corpus in AI

It's essential to note that each AI's corpus is unique since humans determine the type of data used for training, often with guidance from experts in AI consulting. The corpus selected depends on the desired proficiency of the AI in question. The possibilities for corpora are virtually limitless, tailored to the objectives set by the AI's creators.

Let's take the example of Midjourney, a popular generative art platform that uses AI to create images based on text prompts. To achieve this, Midjourney's AI must be trained on both images and associated text descriptions. For instance, if the prompt is to generate an image of a waterfall, the corpus would contain images of waterfalls accompanied by relevant text labels.
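To make the idea of paired training data concrete, here is a minimal Python sketch of what a few entries in an image-and-text corpus might look like. It shows only the shape of the data, not how Midjourney actually stores or trains on it; the `CaptionedImage` class, the file names, and the captions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionedImage:
    """One corpus entry: an image file paired with its text description."""
    image_path: str  # hypothetical file name, for illustration only
    caption: str


# A toy corpus; a real one would hold a vastly larger number of pairs.
corpus = [
    CaptionedImage("img_0001.jpg", "a waterfall in a rainforest"),
    CaptionedImage("img_0002.jpg", "a city skyline at night"),
    CaptionedImage("img_0003.jpg", "a frozen waterfall in winter"),
]


def find_examples(entries, keyword):
    """Return the entries whose caption mentions the keyword."""
    keyword = keyword.lower()
    return [e for e in entries if keyword in e.caption.lower()]


# The entries a model could learn "waterfall" from.
waterfall_examples = find_examples(corpus, "waterfall")
for entry in waterfall_examples:
    print(entry.image_path, "->", entry.caption)
```

If no caption in the corpus mentions a concept, the lookup returns nothing, which mirrors the point that a model can only learn what its corpus contains.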

Another notable AI platform, ChatGPT, is a large language model (LLM) designed for text-based conversation. Given a sufficiently large and diverse corpus, robust LLMs like ChatGPT can hold conversations with users. Depending on the composition of its corpus, ChatGPT can tackle complex questions, generate original creative works such as short stories, or even write the code for a space shooter game. Its capabilities are determined by the data present in the corpus used during training.
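The dependence of a model on its corpus can be illustrated with a toy example. The Python sketch below trains a bigram model, a far simpler technique than the neural networks behind ChatGPT, on a two-sentence corpus invented for illustration. The function names `train_bigrams` and `generate` are our own and do not come from any library.

```python
import random
from collections import defaultdict

# Toy corpus, invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]


def train_bigrams(sentences):
    """Map each word to the list of words that follow it in the corpus."""
    followers = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            followers[current].append(nxt)
    return dict(followers)


def generate(model, start, max_words=8, seed=0):
    """Generate text by repeatedly sampling a follower of the last word."""
    rng = random.Random(seed)
    words = [start]
    # Stop at the length limit or when the last word has no known follower.
    while len(words) < max_words and words[-1] in model:
        words.append(rng.choice(model[words[-1]]))
    return " ".join(words)


model = train_bigrams(corpus)
vocabulary = {w for s in corpus for w in s.split()}
sample = generate(model, "the")
print(sample)
```

Every word the model produces comes from the corpus. Started on a word it has never seen, such as "fish", it has nothing to continue with and simply stops.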

The corpus of an AI application is based on its baseline dataset

According to ChatGPT itself, its corpus consists of a wide range of text from the internet and other publicly available sources. It describes the corpus as “a breakdown encompassing websites, books, articles, research papers, conversational data, social media content from platforms like Twitter and Reddit, and text from Wikipedia articles spanning numerous topics.”

Notably, ChatGPT's corpus contains no images, since it is a text-based generator. Because it was never trained on images, it cannot generate them.

“I do think there are some interesting opportunities for human assistance where if you had an AI that was trained on the right corpus of things — to say, for example, here are two Wikipedia entries, check them and see if there are any statements that contradict each other and identify tensions where one article seems to be saying something slightly different to the other.” - Jimmy Wales, Co-Founder of Wikipedia.

The data fed into Midjourney and ChatGPT exemplify only two possibilities for corpus composition. Corpora can take various forms depending on the desired AI application. For instance, an AI designed to create music would require an audio corpus comprising songs, while an AI striving to emulate Hemingway's sparse writing style would need a corpus consisting solely of Hemingway's written works.

AI training with corpus: Legal complexities and copyright concerns

The use of corpora in AI training, however, raises complex legal questions surrounding copyright and intellectual property.

Without a corpus, an AI has nothing to learn from, and the size and quality of the corpus strongly influence the AI's proficiency. But the inclusion of copyrighted material in an AI's corpus raises concerns.

If an AI is trained on copyrighted works, does it violate copyright or intellectual property laws? For instance, if an AI generates artwork reminiscent of Banksy's style using a corpus comprising Banksy's works, does it infringe upon copyright? Similarly, if an AI with a corpus containing Rihanna's songs creates new, original songs resembling her voice, does it raise legal issues?

The ongoing discussion surrounding AI-generated media is prompting governments to consider legislation to regulate generative AI models. The European Union, for instance, is proposing a law that would require AI owners to disclose whether copyrighted material forms part of the AI's corpus. This transparency aims to help copyright holders identify where their work is used within corpora and seek appropriate compensation.
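What such a disclosure could look like in practice is still an open question. As a purely illustrative sketch, the Python below summarises a corpus manifest by rights status. The `CorpusItem` fields, the status labels, and the report format are assumptions of ours and are not taken from any proposed law or existing tool; the titles and rights holders are placeholders.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusItem:
    """One work in the corpus, with hypothetical rights metadata."""
    title: str
    rights_holder: str
    status: str  # assumed labels: "public-domain", "licensed", "copyrighted"


# Placeholder entries, invented for illustration.
manifest = [
    CorpusItem("Example Novel A", "Author A", "copyrighted"),
    CorpusItem("Example Essay B", "Author B", "licensed"),
    CorpusItem("Example Poem C", "Author C", "public-domain"),
    CorpusItem("Example Novel D", "Author A", "copyrighted"),
]


def disclosure_report(items):
    """Count works per rights status and list holders of copyrighted works."""
    counts = Counter(item.status for item in items)
    holders = sorted(
        {item.rights_holder for item in items if item.status == "copyrighted"}
    )
    return {"counts": dict(counts), "copyrighted_rights_holders": holders}


report = disclosure_report(manifest)
print(report)
```

A report of this kind only works if rights metadata is recorded when the corpus is assembled, which is one practical reason transparency rules matter to copyright holders.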

In the United States, the Congressional Research Service advises Congress to adopt a cautious approach, closely monitoring AI-generated copyright cases before updating legislation.

Unearthing the magic of corpus in generative arts

While the legal implications unfold, content creators have begun exploring revenue-generating opportunities presented by AI. Artists, authors, and musicians can package their works into corpora and sell access to AI companies. For instance, a painter could offer a corpus containing their artwork, authors could provide a corpus of their novels, and singers could sell a corpus of their vocals or negotiate compensation for AI-generated works fueled by their corpus.

Consequently, the emergence of a market for pre-packaged corpora could be as significant to the tech world as pickaxes were to the gold miners of the past. Ready-made corpora already exist: Devopedia, for instance, is a platform with many corpora that can be used for a variety of natural language processing (NLP) tasks.

As AI companies expand, the demand for comprehensive corpora will grow, potentially giving rise to a cottage industry of corpus sellers.

Unearthing the realm of AI with the power of its corpus

As AI continues to evolve and shape our world, the term "corpus" will become increasingly ingrained in our vernacular. Conversations about AI and its impact will invariably include discussions of corpora: debating their legalities, assessing their revenue potential, and exploring their influence on the development of AI systems.

While the future remains uncertain, one thing is clear: understanding corpora is crucial for comprehending the fascinating world of AI.