Datasets are the most important resource for training any machine learning model; even a simple linear regression needs one. In this post, we dive deeper into the datasets that are commonly used to train LLMs. There are thousands of datasets you can find on the internet. LLMs are trained on text data, and text data is everywhere; even this post could be a source of data to train an LLM.
Many LLM research papers detail the resources used to train and evaluate their models. As a result, thousands of published datasets are quite similar to one another: a slight modification or only a subset of an existing dataset is often released as if it were brand new (at least, that is how it looks to me). It is important to understand how these datasets are collected, processed, and formatted in a way the model can consume.
Text Datasets
Text datasets are used for the next-word prediction task, where the next word serves as the label for the preceding words. This is a self-supervised learning method: the data itself is unlabeled, and the labels are generated automatically.¹
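To make this concrete, here is a minimal sketch of how raw text becomes input/label pairs without any manual annotation. The whitespace tokenizer and the window size of 4 are simplifications for illustration; real pipelines use subword tokenizers and much longer contexts.

```python
# Minimal sketch: turn plain text into next-word prediction pairs.
text = "the quick brown fox jumps over the lazy dog"
tokens = text.split()  # real pipelines use a subword tokenizer, not split()

context_size = 4  # illustrative; real models use much longer contexts
pairs = []
for i in range(len(tokens) - context_size):
    context = tokens[i : i + context_size]
    label = tokens[i + context_size]  # the next word is the label
    pairs.append((context, label))

for context, label in pairs[:3]:
    print(context, "->", label)
# ['the', 'quick', 'brown', 'fox'] -> jumps
# ['quick', 'brown', 'fox', 'jumps'] -> over
# ['brown', 'fox', 'jumps', 'over'] -> the
```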
Book Corpus
[paper] [dataset] was originally created to align movies with books in order to provide rich, descriptive explanations of visual content. Think of a movie on Netflix where every action the characters take and every detail of their surroundings is described by a narrator; that is what this dataset aims to enable by aligning books with their movie adaptations. To do so, the researchers released two parts, one for the movies and one for the books. The book part consists of more than 11K books from Smashwords. However, the original download link has since been retracted, and others have created similar resources.
Project Gutenberg
[web] is known for hosting open-access ebooks. It contains more than 70K free ebooks in multiple languages such as English, Chinese, Arabic, Japanese, and Dutch. The website itself is easy to navigate for web scrapers, so many people scrape it to obtain the contents of the books. Each book page also provides downloadable files such as EPUB and HTML, which is why many researchers use this resource to train language models.
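As a rough sketch, a single book can be fetched directly as plain text. The book id and URL pattern below are assumptions based on how Project Gutenberg currently serves its cached files, so verify them before relying on this; note also that each file carries a Project Gutenberg header and footer that you would normally strip before training.

```python
import requests

# Fetch one book (id 1342, "Pride and Prejudice") as plain text.
# The URL pattern is an assumption about the current cache layout.
url = "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"
response = requests.get(url, timeout=30)
response.raise_for_status()

text = response.text
print(text[:200])          # starts with the Project Gutenberg header
print(len(text.split()))   # rough word count of the book
```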
Common Crawl
[web] is a web portal that (as the name suggests) crawls websites and stores the results in an open, free repository for everyone. It contains petabytes of data and is collected regularly over time. Many people derive cleaned, task-specific subsets from this repository. Given the huge amount of data that can be mined from it, it has become common to see large language models trained on these subsets.
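For example, C4 (the "Colossal Clean Crawled Corpus") is one popular cleaned subset of Common Crawl hosted on the Hugging Face hub. The dataset id, config name, and field names below are taken from its dataset card and may change, so treat this as a sketch; streaming avoids downloading the full dump.

```python
from datasets import load_dataset

# Stream a cleaned Common Crawl subset (C4) instead of downloading it all.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for example in c4.take(2):
    print(example["url"])
    print(example["text"][:100])
```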
Wikipedia
[web] [dataset] is an online encyclopedia that provides general and specific knowledge about basically everything, from history to science, in multiple languages. The website publishes data dumps regularly (see). However, the raw dumps are noisy and you need to parse the page structure yourself. Fortunately, there is a ready-made dataset card on Hugging Face, so you can use it directly to get a cleaned and structured version.
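A minimal sketch of loading such a pre-parsed dump is below. The dataset id and snapshot date are assumptions; check the dataset card for the snapshots that are actually available.

```python
from datasets import load_dataset

# Load a pre-parsed Wikipedia dump instead of parsing the raw XML yourself.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en",
                    split="train", streaming=True)

for article in wiki.take(1):
    print(article["title"])
    print(article["text"][:200])
```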
CodeSearchNet
[web] is a dataset crafted for code search tasks: finding the relevant code given a query. The dataset creation is quite simple: parse the documentation that sits right below the function signature. In Python this is called a docstring; it describes what the function does, and the lines that follow contain the code implementing that logic. The dataset takes the docstring as the query and the whole function as the expected result of that query. The original task of CodeSearchNet is to find relevant code, not to generate it. However, I found it possible to generate code using this dataset, although it seems necessary to add an extra layer that corrects the generated code (like grammar correction) and compiles it to make sure it is runnable.
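A rough sketch of this docstring-to-query pairing, using Python's built-in ast module on a made-up function, looks like this (the real pipeline covers several languages and is more involved):

```python
import ast

# CodeSearchNet-style pairing: docstring as the query, the function as the code.
source = '''
def add(a, b):
    """Return the sum of two numbers."""
    return a + b
'''

pairs = []
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.FunctionDef):
        docstring = ast.get_docstring(node)
        if docstring:  # skip undocumented functions
            code = ast.unparse(node)  # Python 3.9+; the whole function, docstring included
            pairs.append({"query": docstring, "code": code})

print(pairs[0]["query"])  # Return the sum of two numbers.
print(pairs[0]["code"])
```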
Instruction Tuning Datasets
Instruction tuning datasets aim to fine-tune a pre-trained model so it can solve specific tasks. There are a lot of instruction tuning datasets, and we only cover the popular ones here. The format of these datasets can vary because there is no single standard format for this type. However, the LLM's ability to generate text based on a prompt is not compromised by this.
Alpaca
[web] Alpaca is a LLaMA model fine-tuned on a synthetic dataset generated with OpenAI's text-davinci-003. The approach used to produce this dataset is similar to Self-Instruct, with some modifications such as the model used to produce the instructions and the number of instances (a sample consisting of an instruction, an input, and a response).
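To give a feel for the format, here is a sketch of a single Alpaca-style instance and how it is typically rendered into a training prompt. The record itself is made up for illustration, and the template follows the wording described in the Alpaca release, so double-check the source for the exact text.

```python
# One made-up Alpaca-style record: instruction, optional input, and response.
example = {
    "instruction": "Classify the sentiment of the sentence.",
    "input": "The movie was surprisingly good.",
    "output": "Positive",
}

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

prompt = PROMPT_TEMPLATE.format(**example)
target = example["output"]  # the model is fine-tuned to produce this part
print(prompt + target)
```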
OpenOrca
[dataset] is a reproduction of Orca, a model that claims to be as good as GPT-3.5. Unfortunately, the original researchers did not publish their artifacts, so a group of people from Alignmentlab.ai reproduced the methods from the Orca paper. The dataset consists of the FLANv2 collection augmented with ChatGPT responses.
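A hedged sketch of inspecting it from the Hugging Face hub follows; the dataset id and column names are taken from the dataset card and may change.

```python
from datasets import load_dataset

# Stream a few OpenOrca rows to inspect the system prompt / question / response format.
orca = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)

for row in orca.take(1):
    print(row["system_prompt"])
    print(row["question"][:200])
    print(row["response"][:200])
```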
Conclusion
There are a lot of text datasets on the internet; the usable ones can be found among the Hugging Face hub datasets. You need to pay close attention when using them, because they may contain duplicates when you combine them with other datasets.
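As a minimal sketch of what that attention looks like in practice, exact duplicates can be dropped by hashing a normalized version of each document; real pipelines usually add near-duplicate detection (for example MinHash) on top of this.

```python
import hashlib

def dedup(documents):
    """Keep only the first copy of each exact (normalized) duplicate."""
    seen = set()
    unique = []
    for doc in documents:
        key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

combined = ["A sample document.", "a sample document. ", "Another document."]
print(dedup(combined))  # ['A sample document.', 'Another document.']
```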
¹ Think of it like time series prediction.