A quality dataset from Microsoft for training compact yet powerful language models that generate code

Training large neural networks is an art. Two facts have long been known in the field of AI. First, high-quality training data significantly improves the performance of large models. Second, such data can challenge the scaling laws that tie model capability to model and dataset size.

Microsoft’s research team, inspired by these ideas, conducted an experiment described in the paper Textbooks Are All You Need, available on arXiv.org. As part of the experiment, the team created a large language model for code generation called phi-1. The model was trained on a specially prepared dataset whose quality is comparable to programming textbooks. As a result, phi-1, despite having only 1.3 billion parameters, outperforms far larger language models.

The research focuses on training language models for code generation. It aims to demonstrate that high-quality data can change the current situation, in which improving a model’s capabilities is directly tied to increasing its size.

Data of different educational value (left – high, right – low)

First, the team demonstrated a technique for creating high-quality datasets that improves the training of compact models. During data preparation, a classifier built on top of a pretrained transformer, with annotations produced by GPT-4, was used to select Python code samples of high educational value. The source of the code was publicly available data from The Stack and Stack Overflow. The model used to annotate the code samples was given a prompt along the lines of “determine its educational value for a student whose goal is to learn basic coding concepts”, that is, it was asked to judge the educational value of the text from the perspective of a student trying to master basic programming concepts. The researchers also made sure that the code samples included in the final dataset were diverse and non-repetitive.
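To make the filtering step concrete, here is a minimal sketch of how such a selection loop might look. The `build_annotation_prompt` and `rate_educational_value` helpers are hypothetical: the paper’s actual pipeline relies on GPT-4 annotations, while the stand-in scorer below is a trivial heuristic used only so the example runs.

```python
# Hypothetical sketch of the educational-value filtering step.
# The paper annotates samples with GPT-4; rate_educational_value below
# is a trivial stand-in so the example is runnable.

def build_annotation_prompt(code: str) -> str:
    """Wrap a code sample in the educational-value question."""
    return (
        "Determine the educational value of the following code for a "
        "student whose goal is to learn basic coding concepts:\n\n" + code
    )

def rate_educational_value(prompt: str) -> float:
    """Stand-in scorer; a real pipeline would send the prompt to a
    GPT-4-style model and parse the score from its answer."""
    code = prompt.split("\n\n", 1)[1]
    score = 0.0
    if '"""' in code or "#" in code:   # documented code reads like a textbook
        score += 0.5
    if "def " in code:                 # self-contained functions
        score += 0.5
    return score

def filter_samples(samples: list[str], threshold: float = 1.0) -> list[str]:
    """Keep only samples rated as highly educational."""
    return [
        code for code in samples
        if rate_educational_value(build_annotation_prompt(code)) >= threshold
    ]
```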

The researchers first built a base model, phi-1-base: a decoder-only transformer with 1.3 billion parameters. The base model was pretrained on the CodeTextbook dataset. It was then fine-tuned on the CodeExercises dataset, producing phi-1. Experiments were also conducted with a reduced version of the model, phi-1-small.
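As a sanity check on the 1.3-billion-parameter figure, a back-of-the-envelope count for a decoder-only transformer is easy to do. The layer counts and widths below follow the configurations reported in the paper (24 layers, hidden size 2048, MLP size 8192 for phi-1; 20 layers, hidden size 1024, MLP size 4096 for phi-1-small); the 50k vocabulary is an assumption, and the estimate ignores biases, layer norms, positional embeddings and the output head, so the totals are approximate.

```python
# Rough parameter count for a decoder-only transformer.
# Layer counts and widths follow the paper's reported configurations;
# the 50k vocabulary is an assumption, and biases, layer norms and
# positional embeddings are ignored, so totals are approximate.

def transformer_params(n_layers: int, d_model: int, d_mlp: int, vocab: int) -> int:
    attn = 4 * d_model * d_model       # Q, K, V and output projections
    mlp = 2 * d_model * d_mlp          # up- and down-projections
    return n_layers * (attn + mlp) + vocab * d_model  # + token embeddings

print(f"phi-1       ~ {transformer_params(24, 2048, 8192, 50_000) / 1e9:.2f}B")
print(f"phi-1-small ~ {transformer_params(20, 1024, 4096, 50_000) / 1e9:.2f}B")
# -> roughly 1.31B and 0.30B, consistent with the reported 1.3B and 350M
#    sizes once the omitted terms are added back
```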

Performance of the phi-1, phi-1-base and phi-1-small models. The phi-1-base model struggles to interpret the logical connections in the prompt, while phi-1 interprets the question correctly and produces the right answer. Even the small phi-1-small model, with 350 million parameters, gives a wrong answer but shows some level of understanding of the task.

The research showed that fine-tuning the model on the CodeExercises dataset led to significant improvements not only in the skills the fine-tuning was meant to strengthen, but also in unrelated ones. In particular, the model became able to use external libraries such as Pygame and Tkinter, even though these libraries were not used in the exercises.

The number of imports of different libraries in the 879,486 exercises used to train the model (libraries imported fewer than 10 times are not shown). This diagram was generated by the phi-1 model given the following prompt: “I have a dictionary, first sort the dictionary using the value, from largest to smallest. Then generate a pyplot bar plot. First set font size to be 7, then rotate x-axis label by 90 degree, x-axis is the key, y-axis is the value of the dictionary. Use log-scale on y-axis. Moreover, set the y-axis label to be ‘Log Number of Times’ and the x-axis label to be ‘Imports’. Set dpi to be 1000.”
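The plot itself is straightforward to reproduce by hand. Below is a sketch of the kind of matplotlib code that prompt asks for; the import counts in the dictionary are made up for illustration, since the real figure uses the counts from the CodeExercises dataset.

```python
import matplotlib.pyplot as plt

# Made-up import counts for illustration; the real figure uses the
# counts from the 879,486 CodeExercises programs.
counts = {"math": 42000, "random": 13000, "typing": 9000, "collections": 4000}

# Sort by value, from largest to smallest, as the prompt asks.
items = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
keys = [k for k, _ in items]
values = [v for _, v in items]

plt.rcParams["font.size"] = 7                 # font size 7
fig, ax = plt.subplots(dpi=1000)              # dpi 1000
ax.bar(keys, values)
ax.set_yscale("log")                          # log-scale on y-axis
ax.set_xlabel("Imports")
ax.set_ylabel("Log Number of Times")
ax.tick_params(axis="x", labelrotation=90)    # rotate x-axis labels by 90 degrees
plt.tight_layout()
plt.savefig("imports.png")
```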

Despite being much smaller than the other models, phi-1 outperforms them on the HumanEval and MBPP benchmarks. The only exceptions are GPT-4, and WizardCoder, which beats phi-1 only on HumanEval.

Experimental results show that phi-1 scores 50.6% on HumanEval (pass@1) and 55.5% on MBPP (pass@1).
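For reference, pass@1 means a problem counts as solved if a single generated sample passes its tests. HumanEval-style evaluations typically estimate pass@k with the unbiased estimator introduced alongside the benchmark (Chen et al., 2021); a minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator:
    n = samples generated per problem, c = samples that pass the tests.
    Returns the probability that at least one of k samples passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 101 passing -> pass@1 = 101/200 = 0.505
print(pass_at_k(200, 101, 1))
```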

In conclusion, this study confirms the importance of developing methods for creating high-quality datasets. Such methods are an important element of improving the training of large language models, and of their code generation capabilities in particular.

Oh, and come work for us? 🤗 💰

We at wunderfund.io have been engaged in high-frequency algo trading since 2014. High-frequency trading is a continuous competition among the best programmers and mathematicians in the world. By joining us, you will become part of this exciting battle.

We offer interesting and challenging tasks in data analysis and low-latency development for enthusiastic researchers and programmers. We have a flexible schedule and no bureaucracy: decisions are made and implemented quickly.

We are currently looking for professionals: Python developers, data engineers, and ML researchers.

Join our team
