How we created a new-generation language model
Training large language models is one of the hottest areas in machine learning. The largest IT companies are competing to build ever more capable models, and Yandex is no exception: we have been creating YaLM neural networks and using them in our services for more than two years.
This year, improving these models became a company-wide priority. Internally, this work is known as the Genesis project, or YaLM 2.0, and it resulted in a big leap in the quality of our models.
The new model is called YandexGPT (YaGPT). You could try it for the first time a little more than two weeks ago in Alice, via the “Let’s Invent” request. Today we updated YaGPT: Alice has learned to write replies based on the history of previous messages. To mark the occasion, we would like to tell Habr the story of the entire project. In the near future, the new model will become part of other Yandex services.
A reminder about YaLM
In 2021, we told you for the first time about our YaLM family of GPT-like language models. At that time, it included models with 1 to 13 billion parameters. Later, a 28-billion-parameter model appeared, and finally the 100-billion-parameter model that we released as open source last year. We have applied YaLM in Search to generate short descriptions for object answers, in Alice to support dialogue, and in many other places.
Technically, the YaLM training process can be divided into two consecutive stages: pretraining and P-tuning.
At the first stage, terabytes of text from the internet, books, and other publicly available sources are broken into small fragments and fed into training. During training, the model is forced to guess the next word in a text fragment, taking into account the words that came before it.
To successfully solve this task, the model is forced to learn both the structure of the language (parts of speech, clauses, punctuation) and facts about the world. For example, to correctly guess the words and numbers in the text “The speed of light in a vacuum is 299,792,458 meters per second,” the model must remember what the speed of light is. It is Pretrain that determines the “erudition” of the neural network.
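To make the pretraining objective concrete, here is a minimal sketch of the next-word (next-token) prediction loss in PyTorch; the model here is a hypothetical decoder that maps token IDs to logits, not the actual YaLM training code.

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Standard causal language modeling loss on a batch of token IDs."""
    # token_ids: LongTensor of shape (batch, seq_len)
    inputs = token_ids[:, :-1]    # every position except the last
    targets = token_ids[:, 1:]    # the same sequence shifted by one token
    logits = model(inputs)        # assumed shape: (batch, seq_len - 1, vocab_size)
    # Cross-entropy between the predicted distribution and the actual next token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```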
We will consider the second stage with an example. Suppose you ask the neural network the following questions:
- “answer briefly: what is a rainbow”,
- “answer in detail: what is a rainbow”,
- “answer with a poem: what is a rainbow”.
You will get very different answers to these questions: the prefixes “answer briefly”, “answer in detail”, and “answer with a poem” strongly influence them. Now suppose a user simply asks “what is a rainbow”. The task of P-tuning is to automatically choose a good prefix: if such a prefix is added to the user’s request, the answer gets better.
More formally, P-tuning is a cheap way to adapt the model to a specific task by optimizing a small set of trainable prompt parameters. In an earlier article, we discussed how P-tuning is used in quick answers in Search.
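Here is a minimal sketch of the P-tuning idea in PyTorch, assuming a frozen decoder-only model that accepts input embeddings directly; the class name, prefix length, and embedding size are illustrative assumptions, not YaLM internals.

```python
import torch
import torch.nn as nn

class PTunedPrompt(nn.Module):
    """Trainable virtual-prefix embeddings in front of a frozen language model."""

    def __init__(self, model, prefix_len=20, emb_dim=4096):
        super().__init__()
        self.model = model                      # pretrained LM, kept frozen
        for p in self.model.parameters():
            p.requires_grad = False
        # The only trainable parameters: prefix_len "virtual token" embeddings
        self.prefix = nn.Parameter(torch.randn(prefix_len, emb_dim) * 0.02)

    def forward(self, input_embeddings):
        # input_embeddings: (batch, seq_len, emb_dim) for the user's request
        batch = input_embeddings.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the learned prefix and run the frozen model as usual
        return self.model(torch.cat([prefix, input_embeddings], dim=1))
```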
What we’ve greatly improved in YaGPT (YaLM 2.0)
New requirements for Pretrain
Everything that the model will know and be able to do is built into it at the Pretrain stage. Previously, we focused on the size of the Pretrain dataset, believing that if we put a lot of data into it, the model would learn a lot. This time we focused specifically on the quality of the data, which allowed us to significantly improve the final model. By quality we mean two important requirements for the dataset.
First, completeness of facts. We took a real stream of queries from our users and put together a dataset that contains answers to most of them. To assess this, we even built a separate, internal version of Yandex Search on top of the pretraining dataset. This is how we make sure that the model has the opportunity to learn all the useful knowledge about the world.
Second, data purity. Useless junk texts waste training time and teach the model wrong facts. Imagine that the pretraining dataset says “The speed of light in a vacuum is 999 meters per second”. Such data simply spoils the model, forcing it to learn false knowledge. Dealing with the sources of such garbage is an important part of data preparation. For YaGPT, we managed to reduce the amount of bad data in the pretraining dataset by a factor of four.
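The post does not describe the actual cleaning rules, so the following is only a toy illustration of the kind of heuristic document filtering commonly used to drop junk texts from a pretraining corpus; the thresholds are arbitrary assumptions.

```python
def looks_like_garbage(doc: str) -> bool:
    """Very rough heuristics for spotting junk documents."""
    words = doc.split()
    if len(words) < 20:                          # too short to carry real content
        return True
    if len(set(words)) / len(words) < 0.3:       # heavily repeated boilerplate
        return True
    letters = sum(ch.isalpha() for ch in doc)
    if letters / max(len(doc), 1) < 0.6:         # mostly digits, markup, symbols
        return True
    return False

def clean_corpus(docs):
    """Keep only documents that pass the heuristic checks."""
    return [d for d in docs if not looks_like_garbage(d)]
```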
Creating a large Alignment dataset and using Fine-tuning instead of P-tuning
P-tuning is good because it lets you solve a specific task quickly, but it has an important limitation: quality. This is because P-tuning optimizes only thousands of parameters (model inputs). How could they hold the full depth of Alice’s personality: all the facts about her, her history, her interests? And besides storing facts about the character, this tuning also has to teach the model to respond nicely to an arbitrary request, not to be rude, and much more.
To teach the model all of this, we collected a large dataset of hundreds of thousands of examples with good answers covering all the aspects mentioned above. For this amount of data, fine-tuning works best. It is the same kind of training as pretraining: it optimizes billions of model parameters. Such training is harder than P-tuning and requires a large training sample, but the result is better.
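A minimal sketch of a supervised fine-tuning step on instruction/answer pairs, assuming a PyTorch decoder-only model; masking the loss to the answer tokens is a common convention we assume here, not something stated in the post.

```python
import torch
import torch.nn.functional as F

def finetune_step(model, optimizer, instr_ids, answer_ids):
    """One supervised fine-tuning step on a batch of instruction/answer pairs."""
    # Concatenate instruction and reference answer into one token sequence
    tokens = torch.cat([instr_ids, answer_ids], dim=1)
    inputs, targets = tokens[:, :-1], tokens[:, 1:].clone()
    # Ignore the loss on instruction tokens: only the answer is graded
    targets[:, : instr_ids.size(1) - 1] = -100
    logits = model(inputs)                      # (batch, seq_len - 1, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
    loss.backward()                             # gradients flow into all model weights
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```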
Fine-tuning: as many different instructions as possible
It is important that the fine-tuning dataset contains examples of the most diverse tasks (such tasks are also called instructions). The more different tasks we collect, the better the model turns out. But collecting them is not so easy. Try to come up with at least a hundred different requests to Alice: most likely, many of them will be repetitive.
To collect a large number of different instructions, we parallelized this work: some were extracted from search queries, others from requests to Alice. In addition, we put out a call within the company and received tens of thousands of such tasks from helpful colleagues.
Fine-tuning: high-quality answers to instructions
It is very important that all answers to the collected instructions are of the highest quality. Training a model requires hundreds of thousands of responses, and writing them is a very large-scale task, because the writer often has to dig into an unfamiliar area, which can take an hour or two per answer. This specialty, still unusual for Russia, even has a name: AI trainer. We started hiring such people this year, and about a hundred AI trainers now help us prepare answers to instructions. In addition, we leveraged our experience in crowdsourcing and recruited over a thousand assessors to assist the trainers.
A separate reason for pride is our internal marathon. We reached out to all our colleagues through Etushka (our internal social network), the hural (similar to an all-hands meeting), and the corporate mailing list, and announced a marathon to write reference answers to instructions for further training of the model. In one week, 826 participants wrote more than 36 thousand answers for us! You may notice that this is the second time we have mentioned the collective help of our colleagues. The active, large-scale help of Yandexoids who were not even part of the project allowed us to improve the quality of the model and launch faster.
With the help of AI trainers, assessors, and colleagues from other teams, we managed to collect several hundred thousand answers to the instructions. Approximately half of them turned out to be of high quality and formed the basis of the dataset.
Sequence of training
Not all people write equally well. There are far fewer really good answers than average ones, but they are noticeably better. In practice, a scheme worked well in which the higher-quality texts are fed to the model at a later stage of training rather than mixed into one big pile.
Our best training scheme at the time of the launch in Alice
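A minimal sketch of this staged order of training, under the assumption that answers are bucketed by quality; the bucket labels and the train_one_epoch helper are hypothetical, for illustration only.

```python
def train_in_stages(model, examples, train_one_epoch):
    """Fine-tune on lower-quality answers first and on the best answers last."""
    # examples: list of (prompt, answer, quality), quality in {"ok", "good", "best"}
    for stage in ("ok", "good", "best"):        # the best data comes last
        bucket = [(p, a) for p, a, q in examples if q == stage]
        train_one_epoch(model, bucket)          # hypothetical training helper
    return model
```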
How we brought YaGPT to Alice
Let’s teach the model to be Alice
Almost a month ago, we finished training the base YaGPT model. We started testing it with the whole team and quickly realized that the model is cool, but it cannot be shipped in Alice as-is: product refinements are needed. For example, the model did not know that it was Alice (it could not state its name, its creators, its interests, and much more).
To train the model to be Alice, you need to collect a dataset with questions like “What’s your name”, “Who made you”, “What do you like” and the answers to them. To simplify this process, we invented a trick.
- We collected a list of such questions about Alice.
- We generated many more similar questions using YaGPT itself.
- We wrote a prompt that briefly describes Alice’s personality and asks the model to answer questions on her behalf.
- We fed the generated questions to YaGPT together with this prompt.
As a result, we got question-answer pairs, formed a new dataset from them, and trained the model on it. Our model is partially self-taught!
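A minimal sketch of this trick, where generate stands for a call to the model and the persona prompt text is a made-up placeholder rather than the one we actually used.

```python
# Hypothetical persona prompt; the real one is not published in the post.
PERSONA_PROMPT = (
    "You are Alice, a voice assistant created by Yandex. "
    "Answer the following question on Alice's behalf, briefly and politely.\n"
)

def build_persona_dataset(generate, seed_questions, n_variants=10):
    """Build question-answer pairs about Alice using the model itself."""
    dataset = []
    for seed in seed_questions:
        # 1. Ask the model to produce more questions similar to the seed one
        variants = [
            generate(f"Write a question similar to: {seed}")
            for _ in range(n_variants)
        ]
        # 2. Answer every question with the persona prompt prepended
        for question in [seed] + variants:
            answer = generate(PERSONA_PROMPT + question)
            dataset.append({"question": question, "answer": answer})
    return dataset
```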
On May 17th, we rolled out this model as the Let’s Invent skill in Alice.
We teach the model to work with context
Today we released the first major update for YaGPT in Alice. Our model has learned to take into account the history of the dialogue when writing answers. This is useful because it allows you to refine your request.
It turned out that our base YaGPT model was already decent at working with context right after training. Perhaps because some dialogues ended up in the pretraining data, or perhaps thanks to the variety of fine-tuning tasks, the model handled this completely new task well. We tested its dialogue skills as follows: we chatted with the model, sending the entire previous conversation as input with each message. The model answered sensibly.
The main problem turned out to be that the model clings to the context too much and does not notice when the topic has changed. We needed to teach it to ignore part of the message history. To do this, we chained several unrelated question-answer pairs into one dialogue and trained the model to predict only the last answer. This way the model learned that sometimes the context should be set aside.
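A minimal sketch of how such training examples can be assembled; the User/Assistant formatting and the sampling of three unrelated pairs are assumptions for illustration.

```python
import random

def make_context_switch_example(qa_pairs, k=3):
    """Chain k unrelated Q&A pairs into one dialogue; only the last answer is the target."""
    chain = random.sample(qa_pairs, k)          # k unrelated (question, answer) pairs
    history = []
    for question, answer in chain[:-1]:
        history += [f"User: {question}", f"Assistant: {answer}"]
    last_question, last_answer = chain[-1]
    history.append(f"User: {last_question}")
    prompt = "\n".join(history) + "\nAssistant:"
    # The loss is computed only on the last answer; earlier turns are distracting context
    return {"prompt": prompt, "target": " " + last_answer}
```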
In addition, we further trained the model on dialogues published as open source by participants of the Open Assistant project.
Another task we had to solve was finding a balance between context length and response generation speed. If an answer takes a long time to write, that is bad, because users expect an instant response from Alice. Our YaGPT model can handle 8,000 tokens (about 40,000 characters) of input, but in production a context of that length leads to a noticeable wait. Fortunately, such a length is usually unnecessary. So the model now takes into account up to 2,000 tokens (about 10,000 characters) or 50 individual messages, whichever limit is reached first. This significantly speeds up the model’s responses.
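A minimal sketch of this dual limit on dialogue history; count_tokens is a placeholder for whatever tokenizer is actually used.

```python
def truncate_history(messages, count_tokens, max_messages=50, max_tokens=2000):
    """Keep at most the last 50 messages and at most ~2,000 tokens of history."""
    kept, total = [], 0
    # Walk backwards from the most recent message and keep whatever still fits
    for msg in reversed(messages[-max_messages:]):
        n = count_tokens(msg)
        if total + n > max_tokens:
            break
        kept.append(msg)
        total += n
    return list(reversed(kept))                 # restore chronological order
```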
Conclusion
This year we have done a lot of work on the quality of our models. But this post is not about the completion of a project; it is about the beginning of a long journey. The launch of YandexGPT in Alice is only the first implementation. Here is what we plan to do next:
1. We will continue to improve the quality of our datasets (pretraining and fine-tuning).
2. We will introduce a new training stage: RLHF.
3. We will train larger models (for offline research and applications).
4. We will grow the number of AI trainers. We believe that training new-generation language models will be impossible without such specialists.
5. We will bring YaGPT into our other services as well.