Synthetic Data Kit is a tool from Meta Llama that helps you generate high-quality synthetic datasets for fine-tuning large language models (LLMs). It simplifies data preparation by providing a command-line interface (CLI) with a modular four-command flow.
One of the key features of synthetic-data-kit is its use of the Lance format for storing and ingesting datasets. Lance allows for efficient storage and retrieval of data, which is crucial when working with large datasets.
Key Features:
- Data Ingestion: The toolkit can ingest various file formats, including PDF, HTML, YouTube transcripts, DOCX, PPT, and TXT.
- Fine-tuning Format Creation: It can create different fine-tuning formats, such as question-answer (QA) pairs, QA pairs with Chain-of-Thought (CoT), and summarization formats.
- Data Curation: The tool uses Llama as a judge to curate high-quality examples, ensuring the quality of the generated dataset.
- Flexible Saving Options: You can save the generated datasets in various formats compatible with your fine-tuning workflow, including Hugging Face, JSONL, and JSON.
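To make the output formats concrete, the snippet below sketches plausible record shapes for QA, QA-with-CoT, and summarization examples. The field names are illustrative assumptions for this article, not the toolkit's guaranteed schema.

```python
import json

# Hypothetical record shapes for the generated fine-tuning formats.
# Field names are illustrative assumptions, not the toolkit's exact schema.
qa_pair = {
    "question": "What does the report identify as the main cost driver?",
    "answer": "Energy consumption in the manufacturing stage.",
}

cot_pair = {
    "question": "Why did costs rise in Q3?",
    "reasoning": "The report notes higher energy prices and increased output, "
                 "both of which raise manufacturing costs.",
    "answer": "Because energy prices and production volume both increased.",
}

summary_record = {
    "document": "Full source text goes here...",
    "summary": "A one-paragraph condensation of the document.",
}

# All three serialize cleanly to JSON, one record per line (JSONL).
for record in (qa_pair, cot_pair, summary_record):
    print(json.dumps(record))
```

The CoT variant differs from a plain QA pair only by the extra `reasoning` field, which is what lets a fine-tuned model learn to show its work.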
How it Works:
The synthetic-data-kit follows a simple four-step process:
- Ingest: Import your input files into the toolkit. The data is stored in the Lance format for efficient processing.
- Create: Generate diverse fine-tuning datasets, such as reasoning, summarization, and QA pairs, from the ingested documents.
- Curate: Use Llama to filter and select high-quality examples from the generated dataset.
- Save-as: Export the curated dataset in your preferred format.
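The four steps above compose naturally into a pipeline. The stubs below are a minimal stand-in to show the shape of the flow; none of this is the toolkit's actual implementation.

```python
import json

# Stubbed sketch of the four-stage flow. Each function stands in for one of
# the toolkit's commands (ingest, create, curate, save-as); the bodies are
# placeholders, not the real implementation.

def ingest(paths):
    # Parse source files into plain-text documents (stub).
    return [f"text extracted from {p}" for p in paths]

def create(documents):
    # Generate one QA pair per document (stub).
    return [{"question": f"What does {d[:30]}... cover?", "answer": d}
            for d in documents]

def curate(pairs):
    # Trivial quality gate standing in for the Llama-as-judge step.
    return [p for p in pairs if len(p["answer"]) > 10]

def save_as(pairs):
    # Serialize to JSONL: one JSON record per line.
    return "\n".join(json.dumps(p) for p in pairs)

jsonl = save_as(curate(create(ingest(["docs/report.pdf"]))))
print(jsonl)
```

Each stage consumes the previous stage's output, which mirrors how the real CLI commands chain together on files the earlier commands produced.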
Usage
The synthetic-data-kit CLI uses the Lance format to store and manage the data that you ingest. The workflow is a series of commands that build on each other, starting with the ingest command.
Here is an example of the end-to-end workflow:
- Ingest data into a LanceDB dataset
This command takes a directory of source files and creates a LanceDB dataset from them.
synthetic-data-kit ingest docs/report.pdf --multimodal
# This will create a Lance dataset at data/parsed/report.lance
# with 'text' and 'image' columns.

# Generate multimodal QA pairs from the ingested data:
synthetic-data-kit create data/parsed/report.lance --type multimodal-qa
- Create fine-tuning data
This command uses the LanceDB dataset created in the previous step to generate synthetic data in the desired format.
synthetic-data-kit create data/parsed/report.lance
- Curate the data
This step uses a language model to curate the generated data and ensure its quality.
synthetic-data-kit curate report.json
- Save the final dataset
Finally, save the curated data to a file in the desired format.
synthetic-data-kit save-as report.json --save_path ./my_finetuning_data.jsonl
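To make the curate and save-as steps concrete, here is a hedged sketch of what they accomplish. The scoring function stands in for the Llama judge the toolkit actually uses; the scoring rule and threshold are illustrative assumptions.

```python
import json

# Sketch of the curate + save-as stages. judge_score is a stub standing in
# for the Llama-as-judge model; the scoring rule and threshold below are
# illustrative assumptions, not the toolkit's real logic.

def judge_score(pair):
    # A real judge would prompt a Llama model to rate the pair's quality.
    # This stub just rewards answers with more substance.
    return min(10, len(pair["answer"].split()))

def curate(pairs, threshold=5):
    # Keep only pairs the judge scores at or above the threshold.
    return [p for p in pairs if judge_score(p) >= threshold]

def save_as_jsonl(pairs, path):
    # One JSON record per line: the JSONL layout used for fine-tuning data.
    with open(path, "w") as f:
        for p in pairs:
            f.write(json.dumps(p) + "\n")

pairs = [
    {"question": "What is Lance?",
     "answer": "A columnar format designed for fast random access to large ML datasets."},
    {"question": "Is this a good pair?", "answer": "Yes."},  # too thin; filtered out
]
kept = curate(pairs)
save_as_jsonl(kept, "my_finetuning_data.jsonl")
print(len(kept))  # 1
```

Filtering before saving is what keeps low-quality generations out of the final training file, which is the whole point of the curate stage.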
This workflow allows you to go from a collection of documents to a high-quality fine-tuning dataset with just a few commands. The use of LanceDB in the background makes the process efficient and scalable.
Getting Started:
To get started with synthetic-data-kit, you can clone the GitHub repository and install the necessary dependencies.