As a QA engineer, ensuring the system has realistic data that reflects real-world scenarios is a critical challenge. But what happens when the data platform you rely on doesn’t have a development environment? That was exactly the situation I faced while testing a financial institution's new tiering system.

Our goal was to generate 1 million records to validate the system's performance and functionality by simulating real customer behavior. The catch was that the platform sending the data had no development environment, which meant we couldn't directly pull the test data we needed.

Faced with this roadblock, we had to rethink our approach. Instead of relying on existing data, we delved into data generation and built a solution that worked on our terms.

The Real Challenge: Generating Test Data from Scratch

When we realized we couldn’t rely on an existing dev environment, I knew we had to think outside the box. The challenge wasn’t just about generating data—it had to be realistic, interconnected, and scalable. This wasn’t a one-off test. To validate the system under realistic load, our data had to mimic complex production relationships.

The stakes were high. Without accurate, interconnected test data, we couldn’t assess the system’s performance under pressure. And without a dev environment to pull data from, we had only one option: create it from scratch.

How I Solved the Data Deficit Challenge

Once we understood the challenge, I collaborated with our data engineering experts to generate the required data using Python. The solution we came up with wasn’t just a generic data dump—it was a carefully crafted set of interconnected tables (Customers, Accounts, and Transactions) designed to reflect real-world scenarios. 

Building Relationships: The Key to Realistic Test Data

The foundation of the system we were testing was built on relationships—customers have accounts, and accounts have transactions. The customer ID had to link to its associated accounts, and transactions needed to be tied to the right accounts. I knew that maintaining these relationships was critical for realistic testing.

Step 1: Generating Realistic Customer Data with Faker

I used the Faker library to generate random but realistic customer information. Faker produces diverse, real-world-looking data such as names, addresses, emails, and phone numbers, which made it a natural fit for creating test data.

Here’s how I used the Faker library to generate realistic customer data:

from faker import Faker

# Initialize Faker
fake = Faker()

# Generate a single customer's data
def generate_customer():
    return {
        'customer_id': fake.uuid4(),
        'first_name': fake.first_name(),
        'last_name': fake.last_name(),
        'email': fake.email(),
        'phone_number': fake.phone_number(),
        'address': fake.address(),
        'account_balance': fake.random_number(digits=5)  # balance of up to five digits
    }

# Example of generating one customer
customer_data = generate_customer()
print(customer_data)

This function creates a realistic customer with a unique customer ID, name, email, phone number, and even an address. The account_balance simulates the amount of money the customer may have in their account.
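One practical note for QA runs, not shown in the original script but standard Faker usage: seeding both Faker and Python's random module makes the generated dataset reproducible across test runs, so a failing test can be replayed against identical data. A minimal sketch, with an arbitrary seed value:

from faker import Faker
import random

# Seed both generators so every run produces the same data;
# the seed value 42 is arbitrary
Faker.seed(42)
random.seed(42)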

Step 2: Creating Accounts and Linking Them to Customers

Next, I needed to simulate each customer's accounts. Each customer could have between one and three accounts, and the customer_id carried over as a foreign key to maintain the relationship between customers and their accounts.

import random

# Generate account data linked to a customer
def generate_account(customer_id):
    num_accounts = random.randint(1, 3)  # between 1 and 3 accounts per customer
    accounts = []
    for _ in range(num_accounts):
        account_data = {
            'account_id': fake.uuid4(),
            'customer_id': customer_id,
            'account_type': random.choice(['Savings', 'Checking', 'Business']),
            'balance': fake.random_number(digits=4)
        }
        accounts.append(account_data)
    return accounts

# Generate accounts for the customer created above
accounts_data = generate_account(customer_data['customer_id'])
print(accounts_data)

Here, each account gets a unique account_id, and the customer_id links it back to its customer. The account type is picked at random and the balance is randomly generated to cover different account scenarios.
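Because the whole dataset hinges on referential integrity, it's cheap to sanity-check the linkage right after generation. This assertion is my illustration rather than part of the original script:

# Every generated account should reference the customer it was created for
assert all(acc['customer_id'] == customer_data['customer_id']
           for acc in accounts_data)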

Step 3: Creating Transactions for Each Account

The last piece of the puzzle was generating transactions for each account. Each account needed between 5 and 10 transactions, and each transaction carried a unique transaction ID and a date. I spread the transaction dates over several years to simulate realistic customer behavior over time.

# Generate transaction data for an account
def generate_transactions(account_id):
    num_transactions = random.randint(5, 10)  # between 5 and 10 transactions per account
    transactions = []
    for _ in range(num_transactions):
        transaction_data = {
            'transaction_id': fake.uuid4(),
            'account_id': account_id,
            'amount': fake.random_number(digits=3),  # random amount for each transaction
            'date': fake.date_this_decade(),  # random date within the current decade
            'transaction_type': random.choice(['Deposit', 'Withdrawal', 'Transfer'])
        }
        transactions.append(transaction_data)
    return transactions

# Generate transactions for the first account
transactions_data = generate_transactions(accounts_data[0]['account_id'])
print(transactions_data)

Here, each transaction is tied to the account ID, and I used fake.date_this_decade() to ensure that transactions are spread across a realistic timeline.
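One caveat worth knowing: date_this_decade() draws dates from the start of the current decade up to today, so early in a decade the spread can be narrow. If you want a strict ten-year window instead, Faker's date_between accepts relative offsets; a sketch:

# Draw a date uniformly from the last ten years, regardless of
# where the current decade boundary falls
transaction_date = fake.date_between(start_date='-10y', end_date='today')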

Step 4: Scaling and Automating the Data Generation

To generate 1 million records, I automated the process end to end and scaled it on Databricks, a cloud-based platform that let us run the generation efficiently. The generated tables were then written to CSV files for loading into the test environment:

import csv

# Function to save data to CSV files
def save_to_csv(customers, accounts, transactions):
    with open('customers.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=customers[0].keys())
        writer.writeheader()
        writer.writerows(customers)

    with open('accounts.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=accounts[0].keys())
        writer.writeheader()
        writer.writerows(accounts)

    with open('transactions.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=transactions[0].keys())
        writer.writeheader()
        writer.writerows(transactions)

# Example data to save (a small subset for illustration)
save_to_csv([customer_data], accounts_data, transactions_data)

This function exports the generated customer, account, and transaction data into CSV files, allowing easy import into the test environment.
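The export function covers persistence; the loop that tied the three generators together looked roughly like this. It's a simplified sketch: NUM_CUSTOMERS is illustrative, and the actual run was distributed on Databricks rather than executed as a single local loop.

# Illustrative target; tune until the combined row count across
# the three tables reaches ~1 million
NUM_CUSTOMERS = 100_000

all_customers, all_accounts, all_transactions = [], [], []

for _ in range(NUM_CUSTOMERS):
    customer = generate_customer()
    all_customers.append(customer)

    # 1-3 accounts per customer, each carrying the customer_id foreign key
    accounts = generate_account(customer['customer_id'])
    all_accounts.extend(accounts)

    # 5-10 transactions per account, each carrying the account_id foreign key
    for account in accounts:
        all_transactions.extend(generate_transactions(account['account_id']))

save_to_csv(all_customers, all_accounts, all_transactions)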

The Results: Data Generation Like Never Before

By the end of the process, we had succeeded in generating:

  • Realistic, interconnected test data that mirrored actual customer behavior and relationships.
  • A scalable solution capable of handling 1 million records and beyond.
  • A streamlined, automated process that saved time and allowed the QA team to focus on optimizing the system’s performance.

With the data generated and loaded, we were able to test the system’s performance under realistic conditions. This gave us confidence that we were validating the system with diverse, well-structured data that accurately mimicked production.

Key Takeaways: Why Realistic Data Matters in QA Testing

Looking back, this experience taught me a lot about the power of realistic test data. It’s not just about populating tables with random numbers; it’s about crafting data that reflects how the system will interact with users in the real world. By preserving relationships between customers, accounts, and transactions, we generated data that helped test and understand real-world system behavior.

While we didn’t use dbldatagen—a powerful library from Databricks for generating large-scale synthetic data—this is certainly a tool worth considering for generating big datasets. However, for our specific case, we needed more fine-grained control over the relationships between data entities (such as customers, accounts, and transactions). dbldatagen didn’t provide the level of customization we required, so we opted for a more tailored, Python-based solution that let us define custom data relationships and simulate real-world behavior more effectively. 
