A high-quality dataset from Microsoft for training compact yet powerful language models that generate code
Training large neural networks is an art. Two facts have long been known in the field of AI. First, high-quality training data significantly improves the performance of large models. Second, using such data makes it possible to challenge the scaling laws that tie performance to model and dataset size.
Inspired by these ideas, Microsoft’s research team conducted an experiment described in the paper Textbooks Are All You Need, available on arXiv.org. As part of the experiment, they created phi-1, a large language model for code generation, trained on a specially prepared dataset whose quality is comparable to that of programming textbooks. As a result, phi-1, despite having only 1.3 billion parameters, achieved results that surpass those of far larger language models.
The research focuses on training language models for code generation. It aims to demonstrate that high-quality data can change the current situation, in which improving a model’s capabilities is tied directly to increasing its size.
First, the team demonstrated a technique for creating high-quality datasets that improve the training results of compact models. During data preparation, a transformer-based classifier, with annotations produced by GPT-4, was used to select Python code samples of high educational value. The source code came from publicly available data on The Stack and StackOverflow. The model used to annotate the code samples was given a prompt asking it to “determine its educational value for a student whose goal is to learn basic coding concepts”, that is, to judge each sample from the perspective of a student trying to master basic programming. The researchers also made sure that the samples included in the final dataset were diverse and non-repetitive.
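To make the idea concrete, here is a minimal sketch of this kind of LLM-based filtering. This is not the paper’s exact pipeline: the prompt wording, the model name, the scoring scale, and the threshold below are illustrative assumptions.

```python
# A minimal sketch of LLM-based quality filtering (not the paper's pipeline):
# ask a strong LLM to rate each snippet's educational value, keep high scores.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

# Hypothetical prompt, loosely paraphrasing the one quoted above.
RATING_PROMPT = (
    "Determine the educational value of the following Python snippet "
    "for a student whose goal is to learn basic coding concepts. "
    "Answer with a single integer from 1 (low) to 10 (high).\n\n{code}"
)

def educational_score(code: str) -> int:
    """Ask the LLM for a 1-10 educational-value rating of one snippet."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": RATING_PROMPT.format(code=code)}],
    )
    return int(response.choices[0].message.content.strip())

def filter_corpus(snippets: list[str], threshold: int = 7) -> list[str]:
    """Keep only snippets rated at or above the (assumed) threshold."""
    return [s for s in snippets if educational_score(s) >= threshold]
```

In practice, scoring every file in a corpus the size of The Stack with GPT-4 would be expensive, so a cheaper classifier trained on a subset of such annotations is the more realistic way to scale this step.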
The researchers first created a base model, phi-1-base: a decoder-only transformer with 1.3 billion parameters, trained on the CodeTextbook dataset. The base model was then fine-tuned on the CodeExercises dataset, producing the phi-1 model. Experiments were also conducted with a reduced version of the model, phi-1-small, with 350 million parameters.
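A rough sketch of this two-stage recipe (pretraining, then fine-tuning) using the Hugging Face Trainer is shown below. The dataset files, the stand-in GPT-2 architecture, and all hyperparameters are placeholders, since CodeTextbook and CodeExercises are not public and the paper’s actual training setup differs.

```python
# Sketch of the two-stage recipe: pretrain on a textbook-quality corpus,
# then fine-tune on exercises. File names and settings are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in tokenizer
tokenizer.pad_token = tokenizer.eos_token

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=1024)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Stage 1: pretrain a decoder-only model on the textbook-quality corpus.
textbook = load_dataset("json", data_files="code_textbook.jsonl")["train"]
model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in architecture
Trainer(
    model=model,
    args=TrainingArguments("phi-1-base", num_train_epochs=1),
    train_dataset=textbook.map(tokenize, remove_columns=["text"]),
    data_collator=collator,
).train()

# Stage 2: fine-tune the pretrained base on the exercises dataset.
exercises = load_dataset("json", data_files="code_exercises.jsonl")["train"]
Trainer(
    model=model,
    args=TrainingArguments("phi-1", num_train_epochs=1, learning_rate=1e-5),
    train_dataset=exercises.map(tokenize, remove_columns=["text"]),
    data_collator=collator,
).train()
```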
The research showed that fine-tuning on the CodeExercises dataset led to significant improvements not only in the skills the fine-tuning was meant to strengthen, but also in unrelated ones. In particular, the model became able to use external libraries such as Pygame and Tkinter, even though these libraries did not appear in the exercises.
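This claim is easy to probe yourself, since the phi-1 checkpoint was later published on the Hugging Face Hub as microsoft/phi-1. The prompt below is an illustrative example of such a probe, not one from the paper.

```python
# Probe the emergent-skill claim: ask the released phi-1 checkpoint to
# complete a Tkinter function, even though Tkinter was absent from
# the fine-tuning exercises.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1", torch_dtype=torch.float32
)

prompt = '''def counter_window():
    """Open a Tkinter window with a button that increments a counter."""
'''
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=120)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```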
In the experiments, the phi-1 model achieved a score of 50.6% on the HumanEval (pass@1) benchmark and 55.5% on MBPP (pass@1).
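For context, pass@1 is the probability that a single sampled completion passes a problem’s unit tests. Both benchmarks are commonly scored with the unbiased pass@k estimator introduced in the Codex paper; with one sample per problem, pass@1 reduces to the fraction of problems solved.

```python
# Unbiased pass@k estimator from the Codex paper:
# pass@k = 1 - C(n - c, k) / C(n, k)
# where n = samples drawn per problem, c = samples that passed the tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 101 pass -> pass@1 = 0.505
print(pass_at_k(200, 101, 1))
```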
In conclusion, this study confirms the importance of developing methods for creating high-quality datasets. Such methods are a key element of the effort to improve the capabilities of large language models, code generation in particular.
Oh, and come work for us! 🤗 💰
We at wunderfund.io have been doing high-frequency algorithmic trading since 2014. High-frequency trading is a continuous competition between the best programmers and mathematicians around the world. By joining us, you will become part of this exciting battle.
We offer interesting and challenging tasks in data analysis and low-latency development for enthusiastic researchers and programmers. Flexible schedule, no bureaucracy: decisions are made and implemented quickly.
We are currently looking for C++ and Python developers, data engineers, and ML researchers.
Join our team