Test Data Management: moving beyond the logistics
Is the right data available throughout a sprint?
More and more of the organizations we work with are finding that a lack of quality data, and slow, manual data provisioning, prevent them from meeting their tight, continuous release deadlines. No matter how fast code is deployed, or how quickly builds are automated, testing lags behind, and projects and sprints overrun.
Though the concept of test data management is not new, this increasing awareness of its critical role in delivering quality software on time is. However, common conceptions of test data management are yet to catch up with this awareness.
Many approaches remain purely logistical, even when they promise to provision all the data needed by testers, while driving down infrastructure costs, improving quality and ensuring regulatory compliance.
The focus is still on copying production data to test environments as quickly as possible, but this leaves testers facing many of the same challenges as before, in terms of both quality and efficiency.
Starting with quality, production data is highly unlikely to contain all the data combinations needed for rigorous testing. It is drawn from past, “business as usual” transactions, and is by its very nature sanitized to exclude the bad data that would break a system.
As a consequence, sampled production data is almost always out-of-date from a testing perspective, and typically provides just 10-20% functional test coverage. It does not include unexpected results, future scenarios or outliers, yet it is exactly these that are most likely to cause a system to fail.
What’s more, when faced with unwieldy copies of high-volume, low-variety production data, testers frequently spend far too long searching for the exact data sets they need, or creating any missing data by hand. In addition, the limited copies of production data are rarely available in parallel, so yet more time is wasted waiting for data to become available ‘upstream’.
Any organization serious about Continuous Delivery must look beyond this logistical approach to Test Data Management, and this linear approach to test data provisioning. Their test data management strategy in turn must go beyond sampling, masking and subsetting production data.
Firstly, sampled production data must either be supplemented or replaced with the additional test data variables needed to achieve 100% test data coverage. As mentioned, manual data creation is time-consuming and complex, but automated synthetic test data generation can be used to systematically and quickly create all the data needed for testing.
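As an illustration of the idea, a synthetic generation routine can enumerate every combination of the variables a system must handle, deliberately including the negative and boundary values that sampled production data lacks. The sketch below is a minimal, hypothetical example for a payments system; the variable names and values are illustrative assumptions, not a specific tool or data model.

```python
import itertools

# Hypothetical test variables for a payments system; in practice these
# would be derived from a model of the application under test.
ACCOUNT_TYPES = ["current", "savings", "closed"]
CURRENCIES = ["GBP", "USD", "EUR"]

# Boundary and invalid amounts are included deliberately: production
# data is sanitized and will rarely contain them.
AMOUNT_VALUES = {
    "zero": 0.00,
    "negative": -50.00,
    "max_boundary": 999999.99,
    "typical": 120.50,
}

def generate_synthetic_rows():
    """Yield one row per combination of the declared variables."""
    for account, currency, amount_case in itertools.product(
        ACCOUNT_TYPES, CURRENCIES, AMOUNT_VALUES
    ):
        yield {
            "account_type": account,
            "currency": currency,
            "amount": AMOUNT_VALUES[amount_case],
            "case": amount_case,
        }

rows = list(generate_synthetic_rows())
print(len(rows))  # 3 account types * 3 currencies * 4 amounts = 36
```

Even in this toy form, the combinatorial approach produces every pairing systematically in milliseconds, where creating the same rows by hand would be slow and error-prone.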
Data should be generated from a model of the production data, to ensure that the referential integrity of the data needed for testing is maintained. Adopting a coverage-driven approach, any missing variables needed for rigorous testing can then be identified by comparing the sampled data against the model, with data visualization helping to expose the gaps.
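The coverage-driven comparison described above can be sketched as a set difference: enumerate every combination the model requires, subtract the combinations observed in the production sample, and what remains is the data that must be synthesized. The model and sample below are illustrative assumptions.

```python
import itertools

# Variables declared in a (hypothetical) model of the production data.
MODEL = {
    "account_type": ["current", "savings", "closed"],
    "status": ["active", "suspended"],
}

# Combinations observed in a sampled production extract: the low
# variety is typical of "business as usual" data.
production_sample = [
    {"account_type": "current", "status": "active"},
    {"account_type": "savings", "status": "active"},
]

def missing_combinations(model, sample):
    """Return every combination required by the model that does not
    appear in the sampled data."""
    keys = sorted(model)
    required = set(itertools.product(*(model[k] for k in keys)))
    seen = {tuple(row[k] for k in keys) for row in sample}
    return [dict(zip(keys, combo)) for combo in sorted(required - seen)]

gaps = missing_combinations(MODEL, production_sample)
print(len(gaps))  # 6 required combinations, 2 sampled, 4 missing
```

The list of gaps is exactly the specification handed to the synthetic generator, so coverage is driven by the model rather than by whatever production happened to contain.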
This data must then be stored in such a way that the exact combination of variables needed for a specific test is available instantly, throughout a sprint. Taking a parameterized approach to data storage, exact variables within the stored data can be requested, combined and delivered on demand.
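A parameterized store of this kind can be thought of as an indexed lookup: a test declares the exact variables it needs, and only matching rows are returned, each as a fresh copy so parallel consumers do not interfere with one another. The class and data below are a minimal sketch under those assumptions, not a real provisioning tool.

```python
# Minimal sketch of parameterized test data storage: rows are held
# centrally, and the exact combination of variables a test needs is
# requested on demand by keyword.

class TestDataStore:
    def __init__(self, rows):
        self.rows = rows

    def request(self, **criteria):
        """Return fresh copies of every row matching the requested
        variables, so the same data can be served to tests in parallel."""
        return [
            dict(row) for row in self.rows
            if all(row.get(k) == v for k, v in criteria.items())
        ]

store = TestDataStore([
    {"account_type": "closed", "currency": "EUR", "amount": -50.00},
    {"account_type": "current", "currency": "GBP", "amount": 0.00},
    {"account_type": "current", "currency": "GBP", "amount": 120.50},
])

# A tester requests the exact combination a test case needs:
matches = store.request(account_type="current", currency="GBP")
print(len(matches))  # 2
```

Because each request copies the matching rows rather than handing out the originals, the stored data is never consumed or corrupted by a test run, which is what makes on-demand, parallel provisioning possible.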
Ideally, this should be self-service and automated, while data should be copied as it is provisioned to make it available in parallel. As only the exact data set needed for a given task is delivered, infrastructure costs should not skyrocket, while testers will not waste time looking for data.
What’s more, data becomes fully re-usable in this approach, so that it is not lost during a refresh, and does not become redundant with new releases. In other words, testers have all the data they need to execute 100% of possible tests, on demand and in parallel.