Generating realistic fake data using Snowfakery
During application development, fake data is often needed to demonstrate or test functionality. You can create it manually, but when you need a large volume of data with internal relationships that follow certain logic, it makes sense to use ready-made utilities or services.
There are many online services, tools, utilities, and frameworks for generating fake data. I would like to highlight three that, in my view, are among the most popular: Pydbgen (open-source), Faker (an open-source framework available for Python, Ruby, and JS), and Snowfakery (an open-source utility created by Salesforce).
What is Snowfakery?
Snowfakery is an open-source tool for generating large-scale fictitious data with support for relationships between entities. With it, you can generate a dataset in which every row is a dummy record, yet each one is unique, like a snowflake.
This tool is suitable for creating data in CSV, SQL, bulk insert, or JSON file formats. Here are some interesting features of this tool:
- allows us to easily scale data up to millions of rows
- allows us to enrich real data with plugins (for example geo-data, postal addresses, names, and so on)
- automatic management of links between objects
- clear syntax
How to use this tool?
To install Snowfakery you will need some kind of virtual environment, such as VirtualEnv or Anaconda. To install in Anaconda, you should run the following commands:
$ conda install pipx
$ pipx install snowfakery
If you are using VirtualEnv, you can simply use pip instead of conda. After that, you can run a YAML script that describes the objects and their dependencies:
$ snowfakery script.yml
All that remains is to write the script that generates the data.
Generating a dataset
Let’s try to create something more or less realistic, for example, data for an online store. To do this, let’s write a script that generates data according to the following schema:
For realism, we will also add the following business rules:
- The minimum price of the product is $5, and the maximum is $60
- Only 35% of registered users make purchases. 60% of purchases contain one product, 30% contain two, 8% contain three, and 2% contain four
- 60% of users do not fill out their profiles completely, so we do not have full information about them. Of the remaining 40%, 25% of all users are women and 15% are men
- The best-selling volume is 1 fl. oz.
- The dataset will contain dates for the last 90 days
The requirements are ready, so let’s go. Here are the main points for understanding how to create a script (for more detailed information, please read the documentation). The link to the full code will be at the bottom of the article.
Setting the language
Let’s set the language. This means that fake data will be generated using this locale:
- var: snowfakery_locale
  value: en_US
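To see the effect of the locale, a minimal sketch can pair it with a couple of fake fields. The Person object, its field names, and the fr_FR locale are illustrative assumptions and are not part of the article's script.

```yaml
# Illustrative only: fr_FR, Person, FullName, and City are assumed names.
- var: snowfakery_locale
  value: fr_FR
- object: Person
  count: 3
  fields:
    # Fake values are drawn from the locale set above
    FullName:
      fake: Name
    City:
      fake: City
```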
Entities
Now let’s add objects according to the graph. An entity can be marked so that it is generated only once (just_once), or you can set the number of records you need (count). Here is an example of how to generate the Category and Product tables. The Category table will have only two records. The Product table will have 250 records, randomly distributed between the two categories. Note that objects with the same name in the script are written to the same table.
- object: Category
  just_once: True
  fields:
    Name: Unique category name 1
- object: Category
  just_once: True
  fields:
    Name: Unique category name 2
- object: Product
  count: 250
  fields:
    CategoryId:
      random_reference: Category
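The price rule from the requirements (minimum $5, maximum $60) can probably be attached to Product in the same way. This is a minimal sketch, assuming a field named Price (the name is hypothetical, not taken from the full script) and assuming the Category objects are defined earlier in the recipe. Note that random_number yields whole numbers, so prices here have no cents.

```yaml
- object: Product
  count: 250
  fields:
    # "Price" is an assumed field name; the value is an integer from 5 to 60
    Price:
      random_number:
        min: 5
        max: 60
    # Requires Category objects declared earlier in the recipe
    CategoryId:
      random_reference: Category
```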
Entity fields
Fields can be generated in different ways. For example, you can set a static value, call a fake data generator (fake), or make a random selection with given probabilities. In the example below, we create 500 customer records, aged from 16 to 65, that belong to the customers group. 40% will be male, and 60% will be female. The email is generated to look like a real one (be careful, it may match a real address). Fake first names are generated depending on gender.
- object: Contact
  count: 500
  fields:
    Age:
      random_number:
        min: 16
        max: 65
    Group: customers
    Email:
      fake: RealisticMaybeRealEmail
    Gender:
      random_choice:
        - choice:
            probability: 40%
            pick: Male
        - choice:
            probability: 60%
            pick: Female
    FirstName:
      if:
        - choice:
            when: ${{Gender=='Male'}}
            pick:
              fake: FirstNameMale
        - choice:
            when: ${{Gender=='Female'}}
            pick:
              fake: FirstNameFemale
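The profile rule from the requirements (60% incomplete profiles, 25% women, 15% men) could be expressed with the same random_choice and if constructs. This is a sketch under assumptions: the Unknown label and the final fallback choice without a when clause are my own illustration, not taken from the full script.

```yaml
- object: Contact
  count: 500
  fields:
    Gender:
      random_choice:
        # "Unknown" is an assumed label for incomplete profiles
        - choice:
            probability: 60%
            pick: Unknown
        - choice:
            probability: 25%
            pick: Female
        - choice:
            probability: 15%
            pick: Male
    FirstName:
      if:
        - choice:
            when: ${{Gender=='Male'}}
            pick:
              fake: FirstNameMale
        - choice:
            when: ${{Gender=='Female'}}
            pick:
              fake: FirstNameFemale
        # Fallback for Unknown: any first name
        - choice:
            pick:
              fake: FirstName
```

Keeping the three probabilities in one random_choice makes it easy to check that they add up to 100%.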
References between entities
To generate a link in the form of an ‘ID’ field between entities, you can use random_reference:
- object: Product
  count: 250
  fields:
    CategoryId:
      random_reference: Category
    BrandId:
      random_reference: Brand
Alternatively, you can use friends to generate child objects:
- object: Order
  count:
    random_choice:
      - choice:
          probability: 65%
          pick: 0
      - choice:
          probability: 35%
          pick:
            random_number:
              min: 1
              max: 10
  friends:
    # products in order
    - object: OrderItems
      count:
        random_choice:
          - choice:
              probability: 60%
              pick: 1
          - choice:
              probability: 30%
              pick: 2
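The remaining order rules (8% of purchases with three products, 2% with four, and dates within the last 90 days) might be covered as in the sketch below. The field names OrderDate, OrderId, and ProductId are assumptions, and the relative date -90d relies on Faker-style date strings being accepted by date_between; check the documentation before relying on it.

```yaml
- object: Order
  count: 5
  fields:
    # Assumed field name; a random date in the last 90 days
    OrderDate:
      date_between:
        start_date: -90d
        end_date: today
  friends:
    - object: OrderItems
      count:
        random_choice:
          - choice:
              probability: 60%
              pick: 1
          - choice:
              probability: 30%
              pick: 2
          - choice:
              probability: 8%
              pick: 3
          - choice:
              probability: 2%
              pick: 4
      fields:
        # Link each item to the Order that owns it
        OrderId:
          reference: Order
        # Requires Product objects declared earlier in the recipe
        ProductId:
          random_reference: Product
```

The fixed count of 5 only keeps the sketch short; in the full recipe the order count would come from the 65%/35% random_choice shown above.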
Putting everything together and taking the above rules and requirements into account, you get approximately the following code for generating contacts:
This part of the script above generates 500 contacts according to the requirements and will automatically add dependent objects. The full code is quite long, so I didn’t add it here. You can download it from this link. There you can see how other tables are created.
Data generation
To generate a set as CSV files, use the following command:
$ snowfakery fake-dataset.yml --output-format csv
That’s what we got in the end:
You can read more about all the features and possibilities of generation in the documentation. For example, as an output parameter, you can generate a model in the form of an image with graphs.
References
Repository:
Sources: