Generating realistic fake data using Snowfakery

Feb 26, 2022 · 5 min read

Application development often calls for fake data to demonstrate or test functionality. You can, of course, create data manually, but if you need a large amount of it, especially with internal relationships that follow certain logic, it makes sense to use ready-made utilities or services.

There are a huge number of online services, tools, utilities, and frameworks for generating fake data. I would like to focus on the ones that, from my point of view, are the most popular: Pydbgen (open-source), Faker (an open-source framework that exists for Python, Ruby, and JS), and Snowfakery (an open-source utility created by Salesforce).

What is Snowfakery?

Snowfakery is an open-source tool for generating large-scale fictitious data with support for relationships between entities. With this tool, you can generate a dataset in which every row is a dummy record, yet each row is unique, like a snowflake.

This tool is suitable for creating data in CSV, SQL, bulk insert, or JSON file formats. Here are some interesting features of this tool:

allows us to easily scale data up to millions of rows
allows us to enrich real data with plugins (for example geo-data, postal addresses, names, and so on)
automatic management of links between objects
clear syntax

How to use this tool?

To install Snowfakery you will need some kind of virtual environment, such as VirtualEnv or Anaconda. To install in Anaconda, you should run the following commands:

$ conda install pipx
$ pipx install snowfakery

If you are using VirtualEnv, simply use pip instead of conda in the first command. After that, you can run a YAML script that describes the objects and their dependencies.

$ snowfakery script.yml

All that remains is to write the script that generates the data.

Generating a dataset

Let’s try to create something more or less realistic, for example data for an online store. To do this, let’s create a script that generates data according to the following schema:

Also, for realism, we will add the following business rules:

The minimum price of the product is $5, and the maximum is $60
Only 35% of registered users buy. 60% of purchases contain one product, 30% contain two products, 8% contain three products, and 2% contain four products
60% of users do not fill out their profiles completely, so we lack full information about them. Of all users, 25% are known to be women and 15% are known to be men
The best-selling volume is 1 fl. oz.
The dataset will contain dates for the last 90 days

The requirements are ready, so let’s go. Here are the main points for understanding how to create a script (for more detailed information, please read the documentation). The link to the full code is at the bottom of the article.

Setting the language

Let’s set the language. Fake data will then be generated using this locale:

- var: snowfakery_locale
  value: en_US
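
As a small illustration of what the locale affects, here is a minimal sketch that assumes a hypothetical Address object (it is not part of the store schema). With en_US set, address-style fakes such as City and State should come back in US form:

```yaml
# Sketch only: the Address object and its field names are assumptions
- var: snowfakery_locale
  value: en_US
- object: Address
  count: 3
  fields:
    # Locale-dependent fake values
    City:
      fake: City
    State:
      fake: State
```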

Entities

Now let’s add the objects from the schema. An entity can be marked to be generated only once (just_once), or you can set the number of records you need (count). Here is an example of how to generate the Category and Product tables. The Category table will have only two records. The Product table will have 250 records, randomly distributed between the two categories. Note that objects with the same name in the script are written to the same table.

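# Two fixed categories, each generated exactly once
- object: Category
  just_once: True
  fields:
    Name: Unique category name 1
- object: Category
  just_once: True
  fields:
    Name: Unique category name 2
# 250 products, each assigned to a random category
- object: Product
  count: 250
  fields:
    CategoryId:
      random_reference: Category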

Entity fields

Fields can be generated in different ways. For example, you can set a static value, call a fake data generator (fake), or make a random selection with given probabilities. In the example below, we create 500 customer records, aged from 16 to 65, which belong to the customers group. 40% will be male and 60% will be female. The email is generated so that it looks like a real one (be careful, it may match a real address). Fake first names are generated depending on gender.

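- object: Contact
  count: 500
  fields:
    # Random age between 16 and 65
    Age:
      random_number:
        min: 16
        max: 65
    # Static value shared by all records
    Group: customers
    # Realistic-looking address that may coincide with a real one
    Email:
      fake: RealisticMaybeRealEmail
    # Weighted random choice: 40% Male, 60% Female
    Gender:
      random_choice:
        - choice:
            probability: 40%
            pick: Male
        - choice:
            probability: 60%
            pick: Female
    # First-name generator chosen according to Gender
    FirstName:
      if:
        - choice:
            when: ${{Gender=='Male'}}
            pick:
              fake: FirstNameMale
        - choice:
            when: ${{Gender=='Female'}}
            pick:
              fake: FirstNameFemale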
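
The same field mechanisms can also express the business rules listed earlier. The sketch below is only an illustration: the field names Price and OrderDate and the value Unknown are assumptions, not taken from the author's full recipe.

```yaml
# Sketch only: Price, OrderDate and "Unknown" are hypothetical names
- object: Product
  count: 250
  fields:
    # Rule: price between $5 and $60
    Price:
      random_number:
        min: 5
        max: 60
- object: Contact
  count: 500
  fields:
    # Rule: 60% incomplete profiles, 25% women, 15% men
    Gender:
      random_choice:
        - choice:
            probability: 60%
            pick: Unknown
        - choice:
            probability: 25%
            pick: Female
        - choice:
            probability: 15%
            pick: Male
- object: Order
  fields:
    # Rule: dates fall within the last 90 days
    OrderDate:
      date_between:
        start_date: -90d
        end_date: today
```

With a third Gender value, a FirstName condition like the one above would probably also need a fallback branch, for example a final choice without a when clause.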

References between entities

To generate a link in the form of an ‘ID’ field between entities, you can use random_reference

- object: Product
  count: 250
  fields:
    CategoryId:
      random_reference: Category
    BrandId:
      random_reference: Brand

or you can use friends to generate child elements.

- object: Order
  # 65% of the time no orders are created; otherwise between 1 and 10
  count:
    random_choice:
      - choice:
          probability: 65%
          pick: 0
      - choice:
          probability: 35%
          pick:
            random_number:
              min: 1
              max: 10
  friends:
    # products in order
    - object: OrderItems
      count:
        random_choice:
          - choice:
              probability: 60%
              pick: 1
          - choice:
              probability: 30%
              pick: 2
          - choice:
              probability: 8%
              pick: 3
          - choice:
              probability: 2%
              pick: 4
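
A child created through friends usually needs to point back at its parent. A minimal sketch of how that might look, assuming hypothetical field names OrderId and ProductId: reference links each item to the Order it was generated for, while random_reference picks any existing Product.

```yaml
# Sketch only: OrderId and ProductId are hypothetical field names
- object: Product
  count: 250
- object: Order
  friends:
    - object: OrderItems
      fields:
        # Points at the Order this item was generated for
        OrderId:
          reference: Order
        # Picks one of the previously generated Product rows
        ProductId:
          random_reference: Product
```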

Now, putting everything together and taking the rules and requirements above into account, you get approximately the following code for generating contacts:

That part of the script generates 500 contacts according to the requirements and automatically adds the dependent objects. The full code is quite long, so I didn’t include it here. You can download it from the repository listed in the References section, where you can also see how the other tables are created.
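
As a rough illustration of how contacts and their dependent objects could be tied together, the sketch below nests Order under Contact via friends, so the order count is evaluated once per contact. The nesting and the ContactId field are assumptions for illustration, not the author's published recipe.

```yaml
# Sketch only: ContactId and the nesting under Contact are assumptions
- object: Contact
  count: 500
  fields:
    Email:
      fake: RealisticMaybeRealEmail
  friends:
    # Evaluated per contact: 65% get no orders, 35% get 1 to 10
    - object: Order
      count:
        random_choice:
          - choice:
              probability: 65%
              pick: 0
          - choice:
              probability: 35%
              pick:
                random_number:
                  min: 1
                  max: 10
      fields:
        # Links the order to the contact it was generated for
        ContactId:
          reference: Contact
```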

Data generation

To generate a set as CSV files, use the following command:

$ snowfakery fake-dataset.yml --output-format csv

That’s what we got in the end:

You can read more about all the features and generation options in the documentation. For example, by choosing an image output format, you can render the data model as a diagram.

References

Repository:

GitHub - koav/SnowfakerySample (github.com)

Sources:

GitHub - SFDO-Tooling/Snowfakery: A tool for generating fake data that has relations between tables (github.com)

Snowfakery documentation (snowfakery.readthedocs.io)


Written by Andrei Kaliada
