
Promoting data flow: How can crypto help break the bottleneck in AI training data?

BlockBeats2024/05/31 08:22
By:BlockBeats
Original title: The Data Must Flow
Original author: SHLOK KHEMANI
Original translation: TechFlow


See if you can spot all the carefully curated references in the image


In the two years since OpenAI, then a relatively unknown startup, released a chatbot application called ChatGPT, AI has come out of the shadows and into the spotlight. We are at a critical juncture in the process of machine intelligence permeating our lives. As the competition to control this intelligence intensifies, so does the demand for the data that drives its development. That data is the subject of this article.


We discuss the scale and urgency of the data AI companies need and the problems they face in acquiring it. We explore how this insatiable demand threatens the internet we love and its billions of contributors. Finally, we introduce some startups using crypto to address these problems and concerns.


A quick note before we dive in: This post is written from the perspective of training large language models (LLMs), not all AI systems. For this reason, I often use “AI” and “LLMs” interchangeably.


Showing the Data


LLMs require three main resources: computing power, energy, and data. Backed by a lot of capital, companies, governments, and startups are competing for these resources. Of the three, the competition for computing power has been the most notable, thanks in part to Nvidia’s rapidly rising stock price.



Training LLMs requires large numbers of specialized graphics processing units (GPUs), specifically NVIDIA's A100, H100, and upcoming B100 models. These are not devices you can buy from Amazon or your local computer store; they cost tens of thousands of dollars each, and NVIDIA decides how to allocate supply among AI labs, startups, data centers, and hyperscale customers.


In the 18 months after ChatGPT launched, demand for GPUs far outstripped supply, with wait times as high as 11 months. However, as startups shut down, training algorithms and model architectures improve, other companies launch specialized chips, and NVIDIA scales up production, supply and demand dynamics are normalizing and prices are falling.


Second is energy. Running GPUs in data centers requires a lot of energy. According to some estimates, by 2030, data centers will consume 4.5% of the world's energy. As this surging demand puts pressure on existing power grids, tech companies are exploring alternative energy solutions. Amazon recently purchased a data center campus powered by a nuclear power plant for $650 million. Microsoft has hired a nuclear technology director. OpenAI's Sam Altman has backed energy startups such as Helion, Exowatt, and Oklo.


From the perspective of training AI models, energy and computing power are just commodities. Choosing B100 instead of H100, or nuclear power instead of traditional energy, may make the training process cheaper, faster, or more efficient, but it will not affect the quality of the model. In other words, in the race to create the smartest and most human-like AI models, energy and computing power are just essential elements, not decisive factors.


The key resource is data.


James Betker is a research engineer at OpenAI. By his own account, he has trained more generative models than anyone should have the right to train. In a blog post, he noted that “trained long enough on the same dataset, almost every model with enough weights and training time will eventually converge to the same point.” This means that the factor that distinguishes one AI model from another is the dataset, and nothing else.


When we call a model "ChatGPT," "Claude," "Mistral," or "Llama," we are not talking about its architecture, the GPUs used, or the energy consumed, but about the dataset it was trained on.


If data is the food of AI training, then models are what they eat.


How much data does it take to train a state-of-the-art generative model? The answer is a lot.


GPT-4, which is still considered the best large-scale language model more than a year after its release, was trained using an estimated 12 trillion tokens (or about 9 trillion words). This data came from crawling the publicly available internet, including Wikipedia, Reddit, Common Crawl (a free, open repository of web crawled data), over a million hours of transcribed YouTube data, and code platforms like GitHub and Stack Overflow.


If you think that’s a lot of data, wait a minute. There’s a concept in generative AI called “Chinchilla Scaling Laws,” which means that for a given compute budget, it’s more efficient to train a smaller model on a larger dataset than to train a larger model on a smaller dataset. If we extrapolate the compute resources that AI companies are expected to use to train the next generation of AI models, such as GPT-5 and Llama-4, we find that these models are expected to require five to six times as much compute power, using up to 100 trillion tokens to train.
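
To make the scaling arithmetic concrete, here is a minimal sketch using the widely cited Chinchilla rules of thumb (roughly 20 training tokens per parameter, and about 6 x parameters x tokens of training compute). The model sizes are hypothetical placeholders, not figures for GPT-5 or Llama-4.

```python
# Back-of-the-envelope sizing using the Chinchilla rules of thumb:
# compute-optimal tokens ~ 20 x parameters, training FLOPs ~ 6 x parameters x tokens.
def chinchilla_optimal(params: float) -> tuple[float, float]:
    tokens = 20 * params
    flops = 6 * params * tokens
    return tokens, flops

for params in (70e9, 400e9, 5e12):  # hypothetical model sizes, in parameters
    tokens, flops = chinchilla_optimal(params)
    print(f"{params / 1e9:,.0f}B params -> {tokens / 1e12:,.1f}T tokens, {flops:.1e} training FLOPs")
```

Under these assumptions, a 5-trillion-parameter model trained compute-optimally would already need on the order of the 100 trillion tokens mentioned above.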



With much of the public internet data already being scraped, indexed, and used to train existing models, where will the additional data come from? This has become a cutting-edge research problem for AI companies. There are two solutions. One is to generate synthetic data, i.e., data generated directly by LLMs, rather than humans. However, the usefulness of this data in making models smarter has not been tested.


Another approach is to simply look for high-quality data instead of generating it synthetically. However, obtaining additional data is challenging, especially when the problems AI companies face threaten not only the training of future models but also the effectiveness of existing models.


The first data problem is legal. Although AI companies claim to be using "publicly available data," much of it is copyrighted. For example, the Common Crawl dataset contains millions of articles from publications such as The New York Times and The Associated Press, as well as other copyrighted material.


Some publications and creators are taking legal action against AI companies, alleging copyright and intellectual property infringement. The New York Times sued OpenAI and Microsoft for "unlawful copying and use of The New York Times' unique and valuable work." A group of programmers has filed a class-action lawsuit challenging the legality of using open source code to train GitHub Copilot, a popular AI programming assistant.


Comedian Sarah Silverman and author Paul Tremblay have also sued AI companies for using their work without permission.


Others are embracing change by partnering with AI companies. The Associated Press, the Financial Times, and Axel Springer have all signed content licensing deals with OpenAI. Apple is exploring similar deals with news organizations like Condé Nast and NBC. Google agreed to pay Reddit $60 million per year for access to its API for model training, while Stack Overflow has a similar deal with OpenAI. Meta has reportedly even considered buying the publishing house Simon & Schuster outright.


These arrangements coincide with the second problem facing AI companies: the closing of the open web.


Internet forums and social media sites have recognized the value that AI companies bring by training models using their platform data. Before reaching an agreement with Google (and potentially with other AI companies in the future), Reddit began charging for its previously free API, ending its popular third-party clients. Similarly, Twitter has restricted access to its API and raised prices, and Elon Musk uses Twitter data to train models for his own AI company, xAI.


Even smaller publications, fan fiction forums, and other niche corners of the internet that produce content for everyone to consume for free (and, if at all, monetized through advertising) are now closing down. The internet was once envisioned as a magical online space where everyone could find tribes that shared their unique interests and quirks. That magic seems to be slowly fading.


The rise of litigation threats, multimillion-dollar content deals, and the closing of the open web has two implications.


1. First, the data wars are heavily tilted in favor of the tech giants. Startups and small companies can neither access the previously open APIs nor afford to purchase usage rights without legal risk. This has a clear centralizing effect: the rich can buy the best data, build the best models, and get even richer.


2. Second, the business model of user-generated content platforms is increasingly skewed against users. Platforms like Reddit and Stack Overflow rely on the contributions of millions of unpaid human creators and administrators. However, when these platforms make multi-million dollar deals with AI companies, they neither compensate nor ask for permission from their users, without whom there is no data to sell.


Both Reddit and Stack Overflow saw significant user protests as a result of these decisions. The Federal Trade Commission (FTC) has also launched an investigation into Reddit's practice of selling, licensing, and sharing user posts with outside organizations for use in training AI models.


These issues raise pressing questions for the training of the next generation of AI models and for the future of internet content. As things stand, the future doesn't look promising. Could crypto-based solutions go some way toward leveling the playing field for smaller companies and internet users and addressing some of these concerns?


Data Pipeline


Training AI models and creating useful applications is complex and expensive work that requires months of planning, resource allocation, and execution. These processes consist of multiple stages, each with different purposes and data requirements.


Let's break down these stages to understand how crypto fits into the larger AI puzzle.


Pre-training


Pre-training is the first and most resource-intensive step in the LLM training process and forms the foundation of the model. In this step, the AI model is trained on a large amount of unlabeled text to capture general knowledge of the world and language usage information. When we say that GPT-4 was trained using 12 trillion tokens, this refers to the data used in pre-training.


To understand why pre-training is fundamental to LLMs, we need a high-level overview of how they work. Note that this is a simplified picture; you can find more thorough explanations in Jon Stokes' excellent article, Andrej Karpathy's fun video, or Stephen Wolfram's excellent book.


LLMs use a statistical technique called Next-Token Prediction. In simple terms, given a sequence of tokens (i.e., words), the model tries to predict the next most likely token. This process is repeated to form a complete response. Thus, you can think of large language models as “completion machines.”


Let’s understand this with an example.


When I ask ChatGPT "What direction does the sun rise from?", it first predicts the word "The" and then predicts each subsequent word in the sentence "The sun rises in the East." But where do these predictions come from? How does ChatGPT determine that "The sun rises in" should be followed by "the East" and not "the West," "the North," or "Amsterdam"? In other words, how does it know that "the East" is statistically more likely than the other options?



The answer is to learn statistical patterns from large amounts of high-quality training data. If you consider all the text on the Internet, what is more likely to appear - "the sun rises in the east" or "the sun rises in the west"? The latter may appear in specific contexts, such as literary metaphors ("that's as ridiculous as believing that the sun rises in the west") or discussions about other planets (such as Venus, where the sun does rise in the west). But in general, the former is more common.
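
To make next-token prediction concrete, here is a toy sketch of the idea (my own illustration, not how production LLMs work internally): a tiny corpus stands in for "all the text on the internet," and simple trigram counts stand in for the learned statistical patterns.

```python
from collections import Counter, defaultdict

# A tiny corpus standing in for the internet.
corpus = (
    "the sun rises in the east . "
    "the sun rises in the east every morning . "
    "believing the sun rises in the west is ridiculous . "
).split()

# Count which word follows each two-word context (a toy trigram model).
counts = defaultdict(Counter)
for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
    counts[(a, b)][c] += 1

def predict_next(a: str, b: str) -> dict:
    options = counts[(a, b)]
    total = sum(options.values())
    return {word: n / total for word, n in options.items()}

print(predict_next("in", "the"))  # {'east': 0.67, 'west': 0.33}: "east" wins
```

Real models replace these counts with billions of learned weights, but the principle of choosing the statistically likely continuation is the same.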



By repeatedly predicting the next word, the LLM develops a general worldview (what we might call common sense) and an understanding of the rules and patterns of language. Another way to think of an LLM is as a compressed version of the internet. This also helps explain why data needs to be both plentiful (more patterns to learn from) and high quality (so those patterns are learned accurately).


But as discussed earlier, AI companies are running out of data to train larger models. Training data requirements are growing much faster than new data is generated in the open internet. With lawsuits looming and major forums shutting down, AI companies face a serious problem.


This problem is exacerbated for smaller companies, who cannot afford to strike multi-million dollar deals with proprietary data providers like Reddit.


This brings us to Grass, a decentralized residential proxy provider that aims to solve these data problems. They call themselves the “data layer for AI.” Let’s first understand what a residential proxy provider is.


The internet is the best source of training data, and scraping the internet is the preferred method for companies to obtain this data. In practice, scraping software is hosted in data centers for scale, convenience, and efficiency. But companies with valuable data don't want their data to be used to train AI models (unless they are paid). To enforce these restrictions, they often block IP addresses of known data centers, preventing large-scale scraping.


This is where residential proxy providers come in handy. Websites only block IP addresses of known data centers, not connections of regular internet users like you and me, making our internet connections, or residential internet connections, valuable. Residential proxy providers aggregate millions of these connections to scrape data for AI companies at scale.
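
Mechanically, routing a scrape through a residential connection is just an HTTP request sent via a proxy. Here is a minimal sketch; the gateway address and credentials are placeholders, not any provider's real endpoint.

```python
import requests

# Hypothetical residential-proxy gateway. Real providers hand out credentials
# and an endpoint that routes each request through a household connection.
PROXY = "http://USERNAME:PASSWORD@residential-gateway.example.com:8000"

def scrape(url: str) -> str:
    resp = requests.get(
        url,
        proxies={"http": PROXY, "https": PROXY},
        headers={"User-Agent": "Mozilla/5.0"},  # look like an ordinary browser
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text

html = scrape("https://example.com/article")
```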


However, centralized residential proxy providers operate covertly and are often opaque about their intentions. If users knew a product was consuming their bandwidth, they might refuse to share it or, worse for the provider, demand compensation, which would eat into the provider's profits.


To protect their bottom line, residential proxy providers piggyback their bandwidth-consuming code into widely distributed free applications, such as mobile utility apps (like calculators and voice recorders), VPN providers, and even consumer TV screensavers. Users think they are getting a free product, when in reality a third-party residential provider is consuming their bandwidth (these details are often buried in rarely read terms of service).


Eventually, some of this data goes to AI companies, who use it to train models and create value for themselves.


While running his own residential proxy provider, Andrej Radonjic realized how unethical these practices were and how unfair they were to users. He saw the evolution of crypto as a way to create a fairer alternative. This was the context in which Grass was founded in late 2022. A few weeks later, ChatGPT was released, changing the world and putting Grass in the right place at the right time.



Unlike the covert tactics used by other residential proxy providers, Grass explicitly informs users about their bandwidth usage for the purpose of training AI models. In return, users are directly rewarded. This model disrupts the way residential proxy providers operate. By voluntarily providing bandwidth and becoming part owners of the network, users move from passive participants to active advocates, improving the reliability of the network and benefiting from the value generated by AI.


Grass’ growth has been impressive. Since its launch in June 2023, they have over 2 million active users who run nodes and contribute bandwidth by installing a browser extension or mobile app. This growth has been achieved without external marketing costs, thanks to a very successful referral program.


Using Grass’s service allows all kinds of companies, including large AI labs and open source startups, to obtain crawled training data at a low cost. At the same time, every ordinary user gets paid for sharing their internet connection and becomes part of the growing AI economy.



In addition to the raw crawled data, Grass also provides some additional services to its customers.


First, they convert unstructured web pages into structured data that is easy for AI models to process. This step, called data cleaning, is a resource-intensive task usually undertaken by AI labs. By providing structured, clean datasets, Grass enhances its value to customers. In addition, Grass has trained an open source LLM to automate the process of crawling, preparing, and labeling data.
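
As an illustration of what this cleaning step involves, here is a small generic sketch (not Grass's actual pipeline) that strips an HTML page down to a structured record:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def clean_page(html: str) -> dict:
    """Turn a raw HTML page into a small structured record."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()                                  # drop non-content markup
    title = soup.title.get_text(strip=True) if soup.title else ""
    text = soup.get_text(separator=" ", strip=True)
    return {"title": title, "text": text, "num_chars": len(text)}
```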


Second, Grass bundles datasets with undeniable proof of origin. Given the importance of high-quality data for AI models, ensuring that datasets have not been tampered with by malicious websites or residential proxy providers is critical for AI companies.


The severity of this problem is why organizations like the Data & Trust Alliance have been formed: a nonprofit group of more than 20 companies, including Meta, IBM, and Walmart, working together to create data provenance standards that help organizations determine whether a dataset is appropriate and trustworthy.


Grass is taking similar steps. Every time a Grass node scrapes a webpage, it also records metadata that verifies the origin of that webpage. These provenance proofs are stored on the blockchain and shared with clients (who can further share them with their users).


Although Grass is building on Solana, one of the highest throughput blockchains, it is not feasible to store provenance for each scrape on L1. Therefore, Grass is building a rollup (one of the first on Solana) that batches provenance proofs using a ZK processor and then publishes them to Solana. This rollup, which Grass calls the “data layer for AI,” becomes the data ledger for all the data they scrape.
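
To make the provenance idea concrete, here is a hedged sketch of what a per-scrape record and a batched on-chain commitment could look like. The field names and the Merkle construction are my own illustration, not Grass's rollup design.

```python
import hashlib
import json
import time

def provenance_record(url: str, content: bytes, node_id: str) -> dict:
    # Metadata a scraping node could attach to each page it collects.
    return {
        "url": url,
        "content_hash": hashlib.sha256(content).hexdigest(),
        "node_id": node_id,
        "timestamp": int(time.time()),
    }

def merkle_root(records: list[dict]) -> str:
    # Batch many records into a single hash that could be posted on-chain.
    layer = [hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
             for r in records]
    while len(layer) > 1:
        if len(layer) % 2:
            layer.append(layer[-1])          # duplicate the last node on odd layers
        layer = [hashlib.sha256((layer[i] + layer[i + 1]).encode()).hexdigest()
                 for i in range(0, len(layer), 2)]
    return layer[0]
```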


Grass's Web3-first approach gives it several advantages over centralized residential proxy providers. First, by using rewards to encourage users to share bandwidth directly, they distribute the value generated by AI more fairly (while also avoiding the cost of paying app developers to bundle their code). Second, they can charge a premium for providing customers with "legitimate traffic," which is highly valued in the industry.


Another protocol working on the "legitimate traffic" side is Masa. The network allows users to share their login credentials for social media platforms such as Reddit, Twitter, or TikTok. Nodes on the network then scrape highly contextual updates from these platforms. The advantage of this model is that the collected data is exactly what ordinary users see on their social feeds. In real time, you get rich datasets that capture sentiment or content that is about to go viral.


There are two main uses for these datasets.


1. Finance - If you can see what thousands of people see on their social media platforms, you can develop trading strategies based on this data. Autonomous agents that leverage sentiment data can be trained on Masa’s dataset.


2. Social - The advent of AI companions (or tools like Replika) means we need datasets that mimic human conversations. These conversations also need to be up-to-date. Masa’s data stream can be used to train an agent that can meaningfully discuss the latest Twitter trends.


Masa’s approach is to take information from closed gardens like Twitter with user consent and make that information available to developers to build applications. Such a social-first approach to data collection also allows for datasets to be built around regional languages.


For example, a bot that speaks Hindi could use data taken from social networks that operate in Hindi. The applications these networks open up are yet to be explored.


Model Alignment


Pre-trained LLMs are far from ready for production use. Think about it: the model only knows how to predict the next word in a sequence and nothing else. If you give a pre-trained model some text, like "Who is Satoshi Nakamoto", any of these would be a valid continuation:


1. Completing the question: "Who is Satoshi Nakamoto?"


2. Turning the phrase into a sentence: "...is a question that has plagued Bitcoin believers for years."


3. Actually answering the question: "Satoshi Nakamoto is the pseudonymous person or group that created Bitcoin, the first decentralized cryptocurrency, and its underlying technology, blockchain."


The third type of response is what we want from an LLM designed to provide useful answers. However, pre-trained models do not respond consistently or correctly. In fact, they often spit out random text that makes no sense to the end user. In the worst case, the model confidently gives information that is factually inaccurate, toxic, or harmful. When this happens, the model is "hallucinating."


This is how a pre-trained GPT-3 answers questions


The goal of model alignment is to make the pre-trained model useful to the end user. In other words, to transform it from a mere statistical text completion tool into a chatbot that understands and aligns with the user's needs and can have a coherent, useful conversation.


Conversation Fine-tuning


The first step in this process is conversation fine-tuning. Fine-tuning refers to taking a pre-trained machine learning model and further training it on a smaller, targeted dataset to help adapt it to a specific task or use case. For training LLM, this specific use case is to have human-like conversations. Naturally, this fine-tuning dataset is a set of human-generated prompt-response pairs that show the model how to behave.


These datasets cover different types of conversations (QA, summarization, translation, code generation) and are usually designed by highly educated humans (sometimes called AI tutors) who have excellent language skills and subject matter expertise.
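
For illustration, a few such prompt-response pairs might look like the following JSONL records (invented examples, not drawn from any real dataset):

```python
import json

pairs = [
    {"prompt": "Summarize the paragraph below in one sentence: ...",
     "response": "The paragraph argues that data, not compute, differentiates AI models."},
    {"prompt": "Translate 'Where is the train station?' into French.",
     "response": "Où est la gare ?"},
    {"prompt": "Write a Python function that reverses a string.",
     "response": "def reverse(s):\n    return s[::-1]"},
]

with open("finetune.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```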


State-of-the-art models like GPT-4 are estimated to be trained on ~100,000 such prompt-response pairs.


Examples of prompt-response pairs


Reinforcement Learning from Human Feedback (RLHF)


This step can be thought of as similar to how humans train their pet dogs: reward good behavior, punish bad behavior. The model receives a prompt, and its response is shared with human annotators, who rate it (e.g., 1 to 5) based on the accuracy and quality of the output. In another version of RLHF, the model generates multiple responses to a prompt, and human annotators rank them from best to worst.


RLHF Task Examples


RLHF aims to guide models toward human preferences and desired behaviors. In fact, if you are a ChatGPT user, OpenAI uses you as an RLHF data annotator too! This happens when the model occasionally generates two responses and asks you to choose the better one.


Even simple thumbs up or thumbs down icons, prompting you to rate the helpfulness of an answer, are a form of RLHF training for the model.
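
A single preference record produced by these ratings and rankings might look roughly like this (an invented example):

```python
# One pairwise-preference record of the kind RLHF pipelines collect:
# an annotator (or end user) saw two candidate answers and picked one.
preference = {
    "prompt": "Explain what a residential proxy is in one paragraph.",
    "chosen": "A residential proxy routes traffic through a household internet "
              "connection, so websites see an ordinary user's IP address...",
    "rejected": "Proxy is server. It does the proxy things for internet.",
    "annotator_id": "annotator_1234",
}
# A reward model is then trained so that
#   score(prompt, chosen) > score(prompt, rejected),
# and the LLM is optimized against that reward model.
```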



When using AI models, we rarely consider the millions of hours of human labor that went into it. This isn’t just a need unique to LLMs. Historically, even traditional machine learning use cases like content moderation, autonomous driving, and tumor detection have required significant human involvement in data annotation. (This excellent 2019 New York Times story gives a behind-the-scenes look at the Indian offices of iAgent, a company that specializes in human annotation.)


Mechanical Turk, which Fei-Fei Li used to create the ImageNet database, is what Jeff Bezos called "artificial artificial intelligence" because of the behind-the-scenes role its workers play in AI training.


In a bizarre story from earlier this year, Amazon’s Just Walk Out stores, where customers can simply grab items from shelves and walk out (and be automatically charged later), are driven not by some advanced AI but by 1,000 Indian contract workers manually sifting through store footage.



The point is, every large-scale AI system relies on humans to some degree, and LLMs have only increased the demand for these services. Companies like Scale AI, whose clients include OpenAI, reached an $11 billion valuation through this demand. Even Uber is reassigning some of its Indian workers to annotate AI output when they’re not driving their vehicles.


On their way to becoming a full-stack AI data solution, Grass is also entering this market. They will soon release an AI annotation solution (as an extension of their main product) where users can earn rewards for completing RLHF tasks.


The question is: what advantages does Grass have over the hundreds of centralized companies in the same space by doing this in a decentralized way?


Grass can bootstrap a network of workers through token incentives. Just as it rewards users with tokens for sharing their internet bandwidth, it can also reward people for annotating AI training data. In the Web2 world, paying a globally distributed gig workforce for small tasks is nowhere near as smooth as the near-instant settlement offered by a fast blockchain like Solana.


The crypto community, and Grass’s existing community in particular, already has a large population of educated, internet-native, and tech-savvy users. This reduces the resources Grass needs to spend on recruiting and training workers.


You might wonder if the task of annotating AI model responses in exchange for rewards would attract farmers and bots. I’ve wondered that, too. Fortunately, there has been a lot of research exploring the use of consensus techniques to identify high-quality annotators and filter out bots.
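
One simple flavor of consensus filtering is to have several annotators label the same items and flag anyone who disagrees with the majority too often. A toy sketch of the idea (my own illustration, not Grass's mechanism):

```python
from collections import Counter, defaultdict

labels = {  # item_id -> {annotator_id: label}
    "item1": {"alice": "good", "bob": "good", "carol": "bad"},
    "item2": {"alice": "bad",  "bob": "bad",  "carol": "good"},
    "item3": {"alice": "good", "bob": "good", "carol": "good"},
}

agreement = defaultdict(lambda: [0, 0])      # annotator -> [agreed, total]
for item, votes in labels.items():
    majority, _ = Counter(votes.values()).most_common(1)[0]
    for annotator, label in votes.items():
        agreement[annotator][0] += int(label == majority)
        agreement[annotator][1] += 1

scores = {a: agreed / total for a, (agreed, total) in agreement.items()}
suspicious = [a for a, s in scores.items() if s < 0.5]   # flags "carol"
```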


Note that Grass, at least for now, is only entering the RLHF market and not helping companies with conversational fine-tuning, which requires a highly specialized workforce and logistics that are more difficult to automate.


Specialized Fine-tuning


Once we’ve completed the pre-training and alignment steps, we have what we call a base model. The base model has a general understanding of how the world works and can hold fluent human-like conversations on a wide range of topics. It also has a good grasp of language and can easily help users write emails, stories, poems, essays, and songs.


When you use ChatGPT, you are interacting with the base model, GPT-4.


Base models are general-purpose. While they know enough about millions of topics, they do not specialize in any one of them. Ask for help understanding Bitcoin's token economics, and the response will be useful and mostly accurate. But you should not trust it as much when you ask it to list the security edge-case risks of a restaking protocol like EigenLayer.


Remember that fine-tuning refers to taking a pre-trained machine learning model and further training it on a smaller, targeted dataset to help it adapt to a specific task or use case. Previously we discussed fine-tuning when turning a raw text completion tool into a conversational model. Similarly, we can also fine-tune the resulting base model to specialize in a specific domain or task.


Med-PaLM2 is a fine-tuned version of Google's base model PaLM-2, designed to provide high-quality answers to medical questions. MetaMath is a mathematical reasoning model fine-tuned on Mistral-7B. Some fine-tuned models specialize in broad categories like storytelling, text summarization, and customer service, while others specialize in niche areas like Portuguese poetry, Hinglish translation, and Sri Lankan law.
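
For a sense of what this looks like in practice, here is a hedged sketch of domain fine-tuning with LoRA adapters using the Hugging Face transformers, peft, and datasets libraries. The base model name, file name, and hyperparameters are placeholders, not any lab's actual recipe.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"              # example base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token       # needed for batch padding
model = AutoModelForCausalLM.from_pretrained(base)

# Attach small trainable adapter matrices instead of updating all the weights.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
))

# domain_pairs.jsonl holds {"prompt": ..., "response": ...} records for the niche domain.
data = load_dataset("json", data_files="domain_pairs.jsonl")["train"]
data = data.map(
    lambda ex: tokenizer(ex["prompt"] + "\n" + ex["response"],
                         truncation=True, max_length=1024),
    remove_columns=data.column_names,
)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```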


To fine-tune a model for a specific use case, you need a high-quality dataset in the relevant domain. These datasets can come from specific websites (like the crypto content in this newsletter), proprietary sources (for example, a hospital's transcriptions of thousands of doctor-patient interactions), or the experience of experts (which requires detailed interviews to capture).



As we move into a world with millions of AI models, these niche, long-tail datasets are becoming increasingly valuable. Owners of these datasets, from large accounting firms like EY to freelance photographers in Gaza, are scrambling to acquire what are quickly becoming the hottest commodity in the AI arms race. Services like Gulp Data have emerged to help businesses fairly assess the value of their data.


OpenAI has even issued an open request for data partners, seeking entities with “large-scale datasets reflective of human society that are not currently easily publicly accessible.”


We know of at least one good way to match buyers and sellers of niche products: internet marketplaces. eBay created one for collectibles, Upwork for human labor, and countless platforms serve countless other categories. Not surprisingly, we are also seeing the emergence of marketplaces for niche datasets, some of them decentralized.


Bagel is building “artificial universal infrastructure,” a set of tools that enables holders of “high-quality, diverse data” to share their data with AI companies in a trustless and privacy-preserving manner. They use techniques like zero-knowledge (ZK) and fully homomorphic encryption (FHE) to achieve this.


Companies often have high-value data but cannot monetize it due to privacy or competitive concerns. For example, a research lab may have a large amount of genomic data but cannot share it to protect patient privacy, or a consumer product manufacturer may have supply chain waste reduction data but cannot disclose it without revealing competitive secrets. Bagel uses advances in cryptography to make these datasets useful while eliminating the attendant concerns.
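
To make the "compute on data you cannot see" idea concrete, here is a toy example using Paillier encryption via the `phe` library. Paillier is only additively homomorphic, a much simpler cousin of the FHE schemes mentioned above, but it illustrates the principle.

```python
from phe import paillier  # pip install phe

public_key, private_key = paillier.generate_paillier_keypair()

# A data owner encrypts sensitive values before sharing them.
readings = [12.5, 7.25, 30.0]
encrypted = [public_key.encrypt(x) for x in readings]

# A third party can aggregate the ciphertexts without ever seeing the data.
encrypted_sum = sum(encrypted[1:], encrypted[0])

# Only the data owner, holding the private key, can decrypt the result.
print(private_key.decrypt(encrypted_sum))  # 49.75
```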


Grass’s residential proxy service can also help create specialized datasets. For example, if you want to fine-tune a model that provides expert cooking recommendations, you can ask Grass to scrape data from subreddits like r/Cooking and r/AskCulinary on Reddit. Similarly, the creator of a travel-oriented model could ask Grass to scrape data from TripAdvisor forums.


While these aren’t exactly proprietary data sources, they can still be valuable additions to other datasets. Grass also plans to leverage its network to create archived datasets that any customer can reuse.


Context-Level Data


Try asking your favorite LLM "What is your training cutoff?" and you'll get an answer like November 2023. This means the underlying model only has information available up to that date, which is understandable when you consider how computationally intensive and time-consuming it is to train (or even fine-tune) these models.


To keep them up to date in real time, you’d have to train and deploy a new model every day, which is simply not possible (at least not yet).


However, for many use cases, an AI without up-to-date information about the world is useless. For example, a personal digital assistant that relied on LLM responses would be at a disadvantage when asked to summarize unread emails or name the goalscorers in Liverpool's last game.


To get around these limitations and provide responses based on real-time information, application developers can query external sources and insert the results into the underlying model's "context window." The context window is the input text the LLM can process to generate a response; it is measured in tokens and represents the text the LLM can "see" at any given moment.


So, when I ask my digital assistant to summarize my unread emails, the app first queries my email provider for the contents of all unread emails, inserts them into a prompt sent to the LLM, and appends an instruction like "I have provided a list of unread emails from Shlok's inbox. Please summarize them." With this new context, the LLM can complete the task and provide a response. Think of it as copy-pasting an email into ChatGPT and asking it to generate a reply, only happening in the background.
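
A hedged sketch of that background step is below. The `fetch_unread_emails` helper is hypothetical, and the model name is only an example.

```python
from openai import OpenAI

client = OpenAI()

def summarize_unread(fetch_unread_emails) -> str:
    emails = fetch_unread_emails()                       # e.g., via an email API
    context = "\n\n".join(
        f"From: {e['from']}\nSubject: {e['subject']}\n{e['body']}" for e in emails
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                             # example model name
        messages=[
            {"role": "system", "content": "You are a helpful email assistant."},
            {"role": "user", "content":
                "I have provided a list of unread emails from my inbox below. "
                "Please summarize them.\n\n" + context},
        ],
    )
    return response.choices[0].message.content
```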


To create applications with up-to-date responses, developers need access to real-time data. Grass nodes can scrape data from any website in real time, providing developers with low-latency live data and simplifying retrieval-augmented generation (RAG) workflows.


This is also where Masa is positioned today. As it stands, Alphabet, Meta, and X are the only big platforms with constantly updated user data because they have the user base. Masa levels the playing field for small startups.


The technical term for this process is Retrieval-Augmented Generation (RAG). The RAG workflow is at the heart of all modern LLM applications. It relies on text vectorization: converting text into arrays of numbers that computers can easily interpret, manipulate, store, and search.
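
Here is a toy sketch of vectorization and retrieval. A throwaway hashing "embedding" stands in for the learned embedding models real RAG systems use, but the retrieval step (similarity search over vectors) is the same idea.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy vectorizer: hash character trigrams into a fixed-size vector.
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        bucket = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16) % dim
        v[bucket] += 1
    norm = np.linalg.norm(v)
    return v / norm if norm else v

docs = [
    "Liverpool won 3-1; Salah scored twice and Nunez added a late goal.",
    "The central bank left interest rates unchanged this quarter.",
    "A new pasta restaurant opened downtown last weekend.",
]
doc_vectors = np.stack([embed(d) for d in docs])

query = "Who scored in Liverpool's last game?"
scores = doc_vectors @ embed(query)           # cosine similarity (unit vectors)
print(docs[int(np.argmax(scores))])           # retrieves the football result
```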


Grass plans to release physical hardware nodes in the future to provide customers with vectorized, low-latency, real-time data to streamline their RAG workflows.


Most industry players predict that context-level queries (also known as inference) will utilize the majority of resources (energy, compute, data) in the future. This makes sense. The training of a model is always a time-bound process that consumes a certain allocation of resources. Application-level usage, on the other hand, can have theoretically unlimited demand.


Grass has seen this, with most of the requests for text data coming from customers who want real-time data.


The context window of LLMs is expanding over time. When OpenAI first released GPT-4, its largest context window was 32,000 tokens. Less than two years later, Google's Gemini models offer a context window of over 1 million tokens. One million tokens is the equivalent of eleven 300-page books, which is a lot of text.


These developments extend the usefulness of context windows far beyond access to real-time information. Someone could, for example, drop the lyrics to every Taylor Swift song, or the entire archive of this newsletter, into a context window and ask the LLM to generate new content in a similar style.


Unless explicitly programmed not to, the model will generate a pretty decent output.


If you can sense where this discussion is headed, be prepared for what’s to come. So far, we’ve mainly discussed text models, but generative models are getting better at other modalities like sound, image, and video generation. I recently saw this really cool London illustration by Orkhan Isayen on Twitter.



Midjourney, a popular and really great text-to-image tool, has a feature called Style Tuner that generates new images that are similar in style to existing images (this feature also relies on a RAG-like workflow, but is not exactly the same). I uploaded Orkhan’s human-generated illustration and used the Style Tuner to prompt Midjourney to change the city to New York. Here’s what I got:



Four images that, if you were browsing the artist's illustrations, could easily be mistaken for their work. These images were generated by AI in 30 seconds from a single input image. I asked for "New York," but the subject could really be anything. Similar copying can be done in other modalities, like music.


Recall from the earlier section that creators are among those suing AI companies, and you can see why they have a point.


The internet was a boon to creators, enabling them to share their stories, art, music, and other forms of creative expression with the world, and to find their 1,000 true fans. Now, that same global platform is becoming the biggest threat to their livelihoods.


Why pay Orkhan a $500 commission when you can get a good enough similar work with a $30/month Midjourney subscription?


Sounds like a dystopia?


The wonderful thing about technology is that it almost always finds solutions to the problems it creates. Flip a situation that seems stacked against creators, and you see an unprecedented opportunity for them to monetize their talent.


Before AI, the amount of work Orkhan could create was limited by the hours in a day. With AI, they can now theoretically serve an unlimited number of customers.


To understand what I mean, let’s look at elf.tech, the AI music platform from musician Grimes. Elf Tech allows you to upload a recording of a song and it will transform it into the sound and style of Grimes. Any royalties earned by the song will be split 50/50 between Grimes and the creator. This means that as a fan of Grimes, or a fan of her sound, music, or releases, you can simply come up with an idea for a song and the platform will use AI to transform it into the sound of Grimes.


If the song goes viral, both you and Grimes benefit. This also allows Grimes to passively scale her talent and leverage her distribution.


TRINITI is the core technology behind elf.tech, developed by CreateSafe. Their paper describes one of the most interesting intersections of blockchain and generative AI we have seen.


Creator-controlled smart contracts expand the definition of digital content, while blockchain-based peer-to-peer micropayments reimagine distribution, letting any streaming platform instantly authenticate and access that content. Generative AI then executes instant micropayments and streams the experience to the consumer on the terms specified by the creator.


Balaji sums it up more simply.



As new mediums emerge, we rush to figure out how humans will interact with them, and when combined with networks, they become powerful engines of change. Books fueled the Protestant Reformation, radio and television were key instruments of the Cold War, and media in general are double-edged swords that can be used for good or ill.


Today, centralized companies hold vast amounts of user data. We trust these companies to do the right thing for our creativity, our mental health, and the development of a better society, handing enormous power to a handful of firms whose inner workings we barely understand.


We are in the early stages of the LLM revolution. Much like Ethereum in 2016, we have little idea of what kind of applications can be built with them. An LLM that can communicate with my grandma in Hindi? An agent that can find high-quality data in various information streams? A mechanism for independent contributors to share culturally specific nuances (like slang)? We don’t quite know what’s possible yet.


What is clear, however, is that building these applications will be limited by one key ingredient: data.


Protocols like Grass, Masa, and Bagel are building the infrastructure to access that data and to drive that access in a fair way. The human imagination is the only limit to what can be built on top of them. That seems exciting to me.


Original link


Welcome to join the official BlockBeats community:

Telegram subscription group: https://t.me/theblockbeats

Telegram chat group: https://t.me/BlockBeats_App

Official Twitter account: https://twitter.com/BlockBeatsAsia
