Non-Production Data – does anyone have the answer?
Data in your Test systems can be a difficult thing to get right as there are conflicting requirements. We often see companies struggle with their non-Production (or ‘Test’) data.
A Test environment should be seen as training for the real world: Production (or ‘Live’). If you entered a marathon but only trained for a 10k, you’d be lucky to make it halfway round, because your training wouldn’t have been sufficient preparation. The same is true for your applications. Your Test data needs to be a fair representation of what will be found in Live, otherwise your application may fall flat on reaching the Production environment.
What you must also do is ensure Personally Identifiable Information (PII) is not present in the Test system. Sensitive Customer data should be treated with the utmost respect. It is given to you as an organisation with the implicit trust and explicit permission of the owner (the Customer). And the trust of the Customer is something you do not want to abuse. GDPR also states that Customer data must only be used for a defined ‘purpose’. Altering or extending that ‘purpose’ must have explicit consent from the Customer.
People are generally less security-conscious with non-Production systems: users are more likely to be administrators, share generic passwords, have weak passwords and so on. Systems are also less likely to be locked down by an IP whitelist. For all these reasons, it is imperative that all PII is absent from non-Production source data. For the record, we advise that security for non-Production systems is kept at the same level as for a Production system.
Any ISO 27001 auditor will also take a very dim view of actual Customer data in a Test system, resulting in red flags and mitigation actions being raised at the very least.
Another significant risk is where Customers are inadvertently sent email or SMS messages from Test systems. This can be a huge own goal for an organisation, causing enormous confusion and frustration among Customers.
Data must also be fresh. It’s no use the data in your Test system being 2 years old when your application is only interested in dates from the past 30 days. To be meaningful, dates must be dynamically generated or updated when they are applied to your Test system.
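One way to keep dates fresh is to shift them all by the age of the data set at the moment it is deployed. The sketch below is illustrative only: the function name `refresh_dates`, the field name `order_date` and the idea of recording a `snapshot_date` alongside the data are assumptions, not part of any particular product.

```python
from datetime import date


def refresh_dates(rows, date_fields, snapshot_date, deploy_date):
    """Return copies of rows with each listed date field moved forward
    by the gap between when the data was captured and when it is deployed."""
    offset = deploy_date - snapshot_date
    refreshed = []
    for row in rows:
        updated = dict(row)  # copy, so the source rows are left untouched
        for field_name in date_fields:
            if updated.get(field_name) is not None:
                updated[field_name] = updated[field_name] + offset
        refreshed.append(updated)
    return refreshed


# Hypothetical rows captured on 15 Jan 2022 and deployed on 1 Mar 2024.
orders = [
    {"order_id": 1, "order_date": date(2022, 1, 10)},
    {"order_id": 2, "order_date": date(2021, 12, 20)},
]
fresh = refresh_dates(
    orders,
    ["order_date"],
    snapshot_date=date(2022, 1, 15),
    deploy_date=date(2024, 3, 1),
)
```

An order placed five days before the snapshot is still five days old after the refresh, so logic that only looks at the past 30 days keeps finding data to work with.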
- Our data isn’t up to date so we can’t do proper testing
- The dates are all from ages ago so we can’t work with it properly
- We carry out UAT in our Test system but when we put code changes Live, we often find problems occur that we didn’t spot in Test
- We’ve never refreshed the data in non-Production because it’s too difficult
Various options exist when refreshing non-Production data, each with their pros and cons. There’s no silver bullet here. It’s a matter of selecting and working with the options that work best for your organisation, budget and time constraints. It also depends on what type of testing you are doing.
Synthetic Targeted Test Data
It’s possible to create scripts that generate sets of test data that fit the range of scenarios you want to test. This would involve first identifying and documenting the scenarios then writing the scripts to generate the required data.
Pros:
- Same results every time
- No chance of PII creeping into the data or of inadvertently sending Customers test email or test SMS messages
- Quick to run
- You can define specific test cases that represent business scenarios, e.g. you might create a Customer called Mr Customer-Who-Has-Never-Bought-Anything to test this specific scenario, so it’s obvious to everyone what the purpose of this Customer is in the system
Cons:
- Schema changes (changes to table designs, structure and formats) will usually require a change to these scripts
- You’ll need to be aware of the scenarios you want to use as test cases. If you have a comprehensive understanding of the operational environment this might work for you.
- Can be time-consuming to create
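As a rough illustration of the scripted approach, the sketch below builds a small, fixed set of named scenario Customers. Apart from Mr Customer-Who-Has-Never-Bought-Anything, the scenario names, the `Customer` fields and the `build_scenarios` function are hypothetical. The email addresses use the reserved example.com domain so they can never reach a real person.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Customer:
    name: str
    email: str
    order_dates: list = field(default_factory=list)


def build_scenarios(today):
    """Return the same named scenario Customers on every run.
    Dates are relative to 'today' so the data never goes stale."""
    return [
        Customer(
            "Mr Customer-Who-Has-Never-Bought-Anything",
            "never.bought@example.com",
        ),
        # Hypothetical scenario: a single order inside a 30-day window.
        Customer(
            "Ms Customer-With-One-Recent-Order",
            "one.recent.order@example.com",
            order_dates=[today - timedelta(days=3)],
        ),
        # Hypothetical scenario: only orders older than 30 days.
        Customer(
            "Mr Customer-With-Only-Old-Orders",
            "only.old.orders@example.com",
            order_dates=[today - timedelta(days=400), today - timedelta(days=90)],
        ),
    ]


scenarios = build_scenarios(date.today())
```

Because every Customer is named after the scenario it represents, a tester can see at a glance why each record exists. The trade-off is that any change to the Customer structure means revisiting the script.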
Random Data Generators
There are some products on the market that generate random test data. This can be useful but can also be hit and miss. It will have neither the complexity of Live data nor the targeted patterns of the Synthetic Targeted approach, but it is a quick way of generating high volumes of test data safely. It could be a good option where boundary testing is the main aim (where the minimum and maximum ranges, lengths and values are tested) or where no Production data yet exists (i.e. in a brand-new system).
Pros:
- No risk of PII in test data
- Can create large volumes of data quickly
- Can be carried out by non-technical staff and simply run multiple times
- Can provide the full range of data values in any field
Cons:
- No consistency between runs due to the random nature of the data
- No targeting of specific scenarios
- Complexity of Live data may not be matched
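Commercial generators differ in their features, so the following is only a minimal sketch of the idea using Python’s standard library. The field names and the limits (`max_name_len`, `max_age`) are assumptions chosen for illustration.

```python
import random
import string


def random_customer(rng, max_name_len=50, max_age=120):
    """Build one random record whose values stay inside the field limits."""
    name_len = rng.randint(1, max_name_len)  # both ends are inclusive
    return {
        "name": "".join(rng.choice(string.ascii_letters) for _ in range(name_len)),
        "age": rng.randint(0, max_age),
    }


def generate(count, seed=None, **limits):
    """Generate 'count' records. Passing a seed makes the run repeatable."""
    rng = random.Random(seed)
    return [random_customer(rng, **limits) for _ in range(count)]


bulk = generate(10_000)           # different data on every run
repeatable = generate(5, seed=7)  # same data on every run
```

Supplying a seed is one way to soften the consistency drawback, because the same seed reproduces the same records. It does nothing for the other drawbacks: the records still don’t target business scenarios or mirror the complexity of Live data.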
Refresh from Live
This option is good if you also need a representative volume of data in your non-Production systems, for example if you’re carrying out performance or load tests. You’ll get an exact copy of Live which itself has some intrinsic pros and cons. Note: Anonymisation is an essential (not optional) component of this approach.
Pros:
- Representative volume and complexity of Live, assuming your Live system is mature enough and already has a good range of real-world data
Cons:
- If you need to identify and test specific scenarios, you may find it difficult to find existing data that you can use as a Test case
- Volume of data may be a problem. You may not want thousands of customers, for example, so this might not be a good fit
- Anonymisation of data is essential. All instances of Customer names, addresses, phone numbers and email addresses must be located and anonymised.
- Some PII may be missed and creep into a non-Production system, creating a security risk
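The anonymisation step might look something like the sketch below. The column names (`customer_id`, `name`, `address`, `phone`, `email`) and the salted-hash token are assumptions for illustration. A real schema will have more places where PII can hide, such as free-text notes, which is exactly how PII gets missed.

```python
import hashlib


def pseudonym(value, salt):
    """Derive a short, repeatable token that does not reveal the input."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def anonymise(row, salt):
    """Return a copy of row with the known PII columns overwritten.
    Columns that are not listed here are copied across unchanged."""
    token = pseudonym(str(row["customer_id"]), salt)
    cleaned = dict(row)
    cleaned["name"] = f"Customer {token}"
    cleaned["address"] = f"{token} Test Street"
    cleaned["phone"] = "00000000000"
    # example.com is a reserved domain, so test email cannot reach a real inbox.
    cleaned["email"] = f"{token}@example.com"
    return cleaned


# Hypothetical Live row, for illustration only.
live_row = {
    "customer_id": 1042,
    "name": "Jane Smith",
    "address": "1 High Street",
    "phone": "07700900123",
    "email": "jane.smith@example.org",
    "balance": 250,
}
test_row = anonymise(live_row, salt="keep-this-out-of-test")
```

Deriving the token from the Customer ID keeps the replacement values stable between refreshes. The salt should be kept out of the Test environment, otherwise the tokens could be recomputed from known IDs.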
A combination of two or more approaches might be a good option for those that need the benefits of each, mitigating most of the cons, but this will certainly be the most costly option to manage and operate.
A Golden Copy of data is a database that you’ve spent time grooming and getting into great shape. It is likely a mix of anonymised Live data and synthetic manufactured data. The dates will be dynamic so you can deploy it with confidence, knowing it will be fresh each time. This is probably the Holy Grail that you could aim for, as it has most of the benefits and few of the drawbacks of the options listed here. But it will need investment in time and effort to get it right and keep it relevant.
There are various options for populating non-Production data, but there are some absolute rules that must be followed. Personally Identifiable Information (PII) must never be present. Taking a pragmatic, forward-thinking approach and treating non-Production data as an asset means you’ll be able to manage this function effectively without creating additional security risks, and reap the benefits far into the future.