Synthetic Data Kit is a tool from Meta Llama that helps you generate high-quality synthetic datasets for fine-tuning large language models (LLMs). It simplifies data preparation by providing a command-line interface (CLI) with a modular four-command flow.
One of the key features of the synthetic-data-kit is its use of the Lance format for storing and ingesting datasets. This allows for efficient storage and retrieval of data, which is crucial when working with large datasets.
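To make that concrete, here is a minimal sketch of writing and reading a Lance dataset directly with the pylance library (`pip install pylance`); the table contents are illustrative, not output from the kit:

```python
import lance
import pyarrow as pa

# Lance stores data column by column: a dataset is built from an
# Arrow table whose fields become Lance columns.
table = pa.table({
    "text": ["First document chunk.", "Second document chunk."],
    "source": ["report.pdf", "report.pdf"],
})

# Persist the table as a Lance dataset on disk.
lance.write_dataset(table, "example.lance")

# Re-open the dataset and project only the 'text' column; reading
# just the columns you need is part of what keeps retrieval fast.
ds = lance.dataset("example.lance")
print(ds.to_table(columns=["text"]))
```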
The synthetic-data-kit follows a simple four-step process: ingest, create, curate, and save-as. The data you ingest is stored and managed in Lance format, and each command builds on the output of the previous one, starting with ingest.
Here is an example of the end-to-end workflow:
1. Ingest data into a LanceDB dataset
This command takes a source file (or a directory of source files) and creates a LanceDB dataset from it.
synthetic-data-kit ingest docs/report.pdf --multimodal
# This will create a Lance dataset at data/parsed/report.lance
# with 'text' and 'image' columns.
# Generate multimodal QA pairs from the ingested data
synthetic-data-kit create data/parsed/report.lance --type multimodal-qa
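Before generating anything, you can sanity-check what ingest produced by opening the dataset with pylance. A short sketch, assuming the data/parsed/report.lance path and the text and image columns described above:

```python
import lance

# Open the Lance dataset written by the ingest step.
ds = lance.dataset("data/parsed/report.lance")

# With --multimodal, the schema should include 'text' and 'image' columns.
print(ds.schema)
print("rows:", ds.count_rows())

# Peek at the first few parsed text chunks.
print(ds.to_table(columns=["text"], limit=5))
```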
2. Create fine-tuning data
This command uses the LanceDB dataset created in the previous step to generate synthetic data in the desired format.
synthetic-data-kit create data/parsed/report.lance
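Generation has to stream records out of the Lance dataset rather than load everything into memory. A sketch of how a consumer might scan it in batches (pylance again; the batch size is arbitrary):

```python
import lance

ds = lance.dataset("data/parsed/report.lance")

# Stream the dataset in Arrow record batches; this is what makes
# large corpora manageable at generation time.
for batch in ds.to_batches(batch_size=64):
    for text in batch.to_pydict()["text"]:
        # Each text chunk would be sent to the LLM here to draft
        # question/answer pairs.
        pass
```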
3. Curate the data
This step uses a language model to curate the generated data and ensure its quality.
synthetic-data-kit curate report.json
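Conceptually, curation scores each generated pair with a judge model and keeps only the pairs above a quality threshold. The sketch below shows that filtering pattern; rate_pair is a hypothetical stand-in for the LLM judge, and the file names are illustrative, not the tool's internal logic or layout:

```python
import json

def rate_pair(pair: dict) -> float:
    # Hypothetical placeholder: in practice an LLM assigns a
    # quality score (e.g. 1-10) to each question/answer pair.
    return min(10.0, len(pair.get("answer", "")) / 20)

def curate(in_path: str, out_path: str, threshold: float = 7.0) -> None:
    # Load generated pairs, keep the ones the judge rates highly,
    # and write the survivors back out.
    with open(in_path) as f:
        pairs = json.load(f)
    kept = [p for p in pairs if rate_pair(p) >= threshold]
    with open(out_path, "w") as f:
        json.dump(kept, f, indent=2)

curate("report.json", "report_curated.json")
```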
4. Save the final dataset
Finally, save the curated data to a file in the desired format.
synthetic-data-kit save-as report.json --save_path ./my_finetuning_data.jsonl
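The saved JSONL file plugs into most fine-tuning stacks directly. For example, it can be loaded with the Hugging Face datasets library (assuming the output path above):

```python
from datasets import load_dataset

# Each line of the JSONL file becomes one training example.
dataset = load_dataset("json", data_files="./my_finetuning_data.jsonl", split="train")
print(dataset[0])
```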
This workflow takes you from a collection of documents to a high-quality fine-tuning dataset in just a few commands, and the use of LanceDB in the background keeps the process efficient and scalable.

To get started with the synthetic-data-kit, clone the GitHub repository and install the necessary dependencies.