Why cleaning your data is the key to unlocking its real worth
Mining transforms data into knowledge. Without mining, there can be no patterns, no insight, and no business intelligence. Without data mining, data itself is just metrics – gathered and stored, but never fully exploited.
As an intermediate step between data collection and the production of strategic business insights, data mining employs machine learning and artificial intelligence to analyze massive data sets more effectively, and with greater nuance, than human teams could ever match. Moreover, as mining increasingly takes place in the cloud rather than on-premises, it’s available to a broader range of enterprises at lower cost than ever before.
No organization that wants to maintain its edge can afford to fall behind.
Why is data cleaning important?
The phrase ‘rubbish in, rubbish out’ has been with us in various forms since the late 1950s, but it’s never been more relevant than today. The speed at which AI can process massive amounts of data means that, if it’s working on false assumptions, it can just as quickly deliver an equally large cache of erroneous results. Worse, human operators – or automated actions triggered by thresholds within the output – may treat the sheer volume of results as confirmation of a bias, while machine learning, which iterates on previous results, may skew the intelligence it shapes still further.
Clean data is therefore a must – and embarking too early on the task of data mining can be a costly mistake. As the European Commission’s statistical office, Eurostat, warns, “not cleaning data can lead to a range of problems, including linking errors, model misspecification, errors in parameter estimation and incorrect analysis leading users to draw false conclusions.”
How to clean your data
The first step, then, is to identify potential problems and strategize appropriate solutions. If the task is merely a matter of converting values between base sets, normalizing sentence case or highlighting incomplete records, it is largely trivial and easily automated. More complex issues, while still capable of being addressed in an automated manner, will require greater initial intervention – and a minority of records may need to be flagged for manual attention. For example, where data is incomplete, or contains records that are similar but not identical, deletion or amalgamation may need manual authorization unless the discrepancy falls below an established confidence threshold.
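As an illustration, a minimal pass of this trivial kind might look like the following Python sketch, in which the field names and sample records are purely hypothetical:

```python
# A minimal, illustrative cleaning pass: normalize text case and flag
# incomplete records. Field names and data are hypothetical examples.
REQUIRED_FIELDS = ("name", "email", "country")

records = [
    {"name": "ada lovelace", "email": "ADA@EXAMPLE.COM", "country": "UK"},
    {"name": "Grace Hopper", "email": "", "country": "US"},
]

def clean(record):
    cleaned = dict(record)
    cleaned["name"] = record["name"].title()            # normalize name casing
    cleaned["email"] = record["email"].strip().lower()  # normalize email casing
    # Flag the record if any required field is missing or empty
    cleaned["incomplete"] = any(not cleaned.get(f) for f in REQUIRED_FIELDS)
    return cleaned

for r in map(clean, records):
    print(r)
```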
How does automated data cleaning work?
Designing an automated process for data cleaning therefore requires a thorough understanding of the data’s eventual use, which in turn will inform the schema used to store the data once it’s been cleaned. With these factors in mind, operators will need to define the parameters within which the clean-up process is conducted. Must the results conform to a strict pattern, for example, or should leeway be permitted? Once again, this can only be answered with reference to the final use case. If the processes that rely on the data will be unaffected by minor variations within the set, passing those records through unchanged may speed up initial ingestion.
However, should the process require granular metrics, such as geospatial data correct to six decimal places, the data cleaning phase will be more time-consuming – and potentially more costly – if the original data identifies only a broader target area. The correct solution in such a scenario will depend on the business case. If the data is to be used in deciding where to site a new supermarket, padding the data (with three additional zeroes, for example) may suffice. If it’s to be used in targeting missiles, such a solution would be wildly inappropriate, dangerous and irresponsible.
This is, perhaps, an extreme use case, but in more general use, setting an error threshold against which the acceptability of the raw data can be gauged is good practice. Automated processes can then test how well the data complies with the set criteria, generate a score and use it to determine which items require additional processing.
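A sketch of that scoring idea is shown below; the validation rules, field names and threshold value are illustrative assumptions rather than prescribed settings:

```python
# Hypothetical sketch: score each record against a set of validation rules
# and flag anything that falls below an agreed acceptability threshold.
import re

RULES = {
    "year_is_four_digits": lambda r: bool(re.fullmatch(r"\d{4}", str(r.get("year", "")))),
    "email_looks_valid":   lambda r: "@" in str(r.get("email", "")),
    "amount_is_positive":  lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] > 0,
}
THRESHOLD = 0.67  # example threshold: at least two of the three rules must pass

def score(record):
    passed = sum(rule(record) for rule in RULES.values())
    return passed / len(RULES)

record = {"year": "2024", "email": "not-an-email", "amount": -5}
s = score(record)
print(f"score={s:.2f}", "needs review" if s < THRESHOLD else "accepted")
```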
Data mining algorithms and techniques
Automated data mining is a repetitive, predictable process, in which the same algorithm is applied to multiple data points to generate insight. However, it is only effective if there is consistency both within and between the data sets used.
Consistency within a single set is easy to check – as when ensuring all dates have a four-digit year, for example, or that hyphens are always used in place of en- or em-dashes. Consistency between sets can also be tested, but the criteria may be less obvious in the first pass, and success can rely on intelligent techniques, not merely smart algorithms.
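By way of example, the following sketch, using hypothetical date and free-text fields, shows how such within-set checks might be automated:

```python
# Illustrative within-set checks: ensure dates carry a four-digit year and
# replace en- or em-dashes with plain hyphens. The sample values are invented.
import re

dates = ["2023-05-01", "99-05-01", "2024-11-30"]
notes = ["on–site audit", "follow—up call", "plain-hyphen note"]

four_digit_year = re.compile(r"^\d{4}-\d{2}-\d{2}$")
bad_dates = [d for d in dates if not four_digit_year.match(d)]

normalized_notes = [n.replace("–", "-").replace("—", "-") for n in notes]

print("dates needing attention:", bad_dates)
print("normalized notes:", normalized_notes)
```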
What about inconsistencies in data?
As software engineer Omar Elgabry explains, “Inconsistency occurs when two values in the data set contradict each other. A valid age, say 10, might not match with the marital status, say divorced.”
The scientist working with this data would need to take such inconsistencies into account: each data point may be valid on its own, yet the two are incompatible in combination. This is what we mean when we talk about consistency between sets. Although the marital status and the age may each be correctly formatted and within the acceptable bounds of their individual sets, the example above demonstrates that one necessarily contradicts the other, and the algorithm used to score the data should therefore flag it for cleaning (if possible) or review.
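A simple cross-field rule in the spirit of Elgabry’s example might look like the sketch below; the minimum-age value is a hypothetical business rule, not a universal constant:

```python
# Cross-field consistency check: each field may be valid on its own, but the
# combination is contradictory. MIN_PLAUSIBLE_MARRIAGE_AGE is an assumed rule.
MIN_PLAUSIBLE_MARRIAGE_AGE = 16

def check_consistency(record):
    issues = []
    if record.get("marital_status") in {"married", "divorced", "widowed"}:
        if record.get("age", 0) < MIN_PLAUSIBLE_MARRIAGE_AGE:
            issues.append("age contradicts marital status")
    return issues

print(check_consistency({"age": 10, "marital_status": "divorced"}))
# -> ['age contradicts marital status']
```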
Accounting for irregularities
Raw data can exhibit any number of inconsistencies and imperfections, caused by errors introduced at the point of creation or by technical corruption. However, data scientists have devised algorithms to help plug any gaps, or to correct data points that are present but erroneous.
K-means clustering can be used to organize the data into clusters, the members of which are determined by their proximity to a calculated cluster centre, or centroid. Analysis can then be performed on individual clusters, to generate insights based on multiple subsections of the overall data set.
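For illustration, the following sketch applies scikit-learn’s k-means implementation to synthetic two-dimensional points; in practice the features and the number of clusters would be dictated by the data set and the business question:

```python
# Sketch of k-means clustering with scikit-learn on synthetic 2-D points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("cluster centres:\n", kmeans.cluster_centers_)
# Analysis can then be run per cluster, e.g. points[kmeans.labels_ == 0]
```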
Dealing with gaps in data
Where there are known or suspected gaps in the data, either as a whole or within clusters, an expectation-maximization (EM) algorithm can further be used to derive likelihoods within the set. Thus, breaking down a national survey by age or another common demographic and applying EM would allow the data scientist to identify an increasingly representative mean within the data. If used in market research, this could help in the formulation of products or policies likely to appeal to the identified demographic.
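As a rough illustration, scikit-learn’s GaussianMixture model, which is fitted using the EM algorithm, can estimate representative means within a one-dimensional variable such as survey respondents’ ages; the data and component count below are invented for the example:

```python
# GaussianMixture is fitted via EM; here it estimates representative group
# means within a synthetic 1-D "age" variable. Data and settings are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
ages = np.concatenate([
    rng.normal(24, 3, 400),   # younger respondents
    rng.normal(58, 6, 600),   # older respondents
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=1).fit(ages)
print("estimated group means:", gmm.means_.ravel())
print("estimated group weights:", gmm.weights_)
```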
Similarly, should the data scientist want to examine which data points frequently appear together, an Apriori algorithm can be used to identify pairs, trios and so on. Applied to the political arena, Apriori might suggest to a candidate already championing welfare reform that like-minded voters would similarly be in favor of increased spending on public healthcare. Knowing this would help in the formulation of speeches and other campaign material in the closing stages of a close-run race.
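The sketch below is a deliberately simplified illustration of the frequent-itemset idea behind Apriori, counting which pairs of issues co-occur across a handful of invented voter records; production work would normally use a full implementation, such as the one in the mlxtend library:

```python
# Toy illustration of the frequent-itemset idea behind Apriori: count which
# pairs of issues co-occur and keep those above a support threshold.
from itertools import combinations
from collections import Counter

voters = [
    {"welfare reform", "public healthcare", "transport"},
    {"welfare reform", "public healthcare"},
    {"public healthcare", "defense"},
    {"welfare reform", "public healthcare", "education"},
]
MIN_SUPPORT = 0.5  # a pair must appear in at least half of the records

pair_counts = Counter(
    pair for issues in voters for pair in combinations(sorted(issues), 2)
)
frequent_pairs = {
    pair: count / len(voters)
    for pair, count in pair_counts.items()
    if count / len(voters) >= MIN_SUPPORT
}
print(frequent_pairs)
# e.g. ('public healthcare', 'welfare reform') appears in 3 of the 4 records
```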
How human error and behaviors are handled during data mining
Where the original data was gathered automatically – via telematics or IoT sensors, for example – it should exhibit a high degree of conformity. However, where human input was involved, and particularly where that input was voice- or keyboard-based, that’s less often the case.
In such instances, approximate string matching can be used to gauge the variance between actual and anticipated values. This allows the AI underpinning the process to assign a degree of certainty to the data, on the basis of which it can make an appropriate decision. That certainty is determined by what’s known as the ‘edit distance’ – or, in plain English, the number of single-character edits needed to transform the raw value into a likely intended value. The more edits required, the more ‘distant’ the raw data is from a likely result, and the more ‘approximate’ and less precise the match.
Such techniques are routinely used to check spelling, filter messages for spam and deliver web search results, but they could equally differentiate between ‘engish’ and ‘engnering’ and rewrite them as English and engineering in the skill records of an employee database. However, without further information, it’s unlikely an algorithm could accurately assign ‘eng’ to either of these. Although its edit distance places it closer to English than to engineering, the match is unlikely to achieve a high degree of confidence, and further analysis will be required.
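To make the idea concrete, the following sketch implements a basic Levenshtein edit distance and matches raw strings against a small controlled vocabulary; the distance threshold is an arbitrary example value:

```python
# Illustrative Levenshtein (edit) distance plus a simple controlled-vocabulary
# match; the max_distance threshold is an arbitrary example value.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

VOCAB = ["English", "engineering"]

def best_match(raw, max_distance=3):
    scored = sorted((edit_distance(raw.lower(), v.lower()), v) for v in VOCAB)
    distance, candidate = scored[0]
    return candidate if distance <= max_distance else None

print(best_match("engish"))     # 'English' (distance 1)
print(best_match("engnering"))  # 'engineering' (distance 2)
print(best_match("eng"))        # None – too ambiguous, flag for review
```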
How clustering is handled during data mining
These are far from the only options, and a full range of complementary tools, algorithms and processes can be applied in sequence so that each improves on the output of the one before it. Others, like hierarchical clustering, can be applied in different ways depending on the required output. Bottom-up (agglomerative) clustering, for example, can be used to merge diverse data points into progressively broader, more homogeneous groups, while the same process run in reverse – from the top down – can break a single data pool into increasingly granular subsets.
In the above example, the bottom-up approach would allow the organization to find which members of a group, like an identified set of potential customers, are more likely to respond to specific prompts. So, if it held detailed demographic data on 10,000 prospects, and a budget to produce five pieces of marketing material, it could assign each of its 10,000 leads to one of five groups with broadly similar characteristics and develop material to target each one.
Alternatively, if the material has already been produced, it may take the top-down approach, using characteristics of the five marketing campaigns to break down its 10,000-strong customer database to identify which cohorts are most likely to respond positively to the assets it already has to hand.
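A minimal sketch of the bottom-up approach might look like the following, using scikit-learn’s agglomerative clustering on invented demographic features and a smaller sample than the 10,000 prospects in the example above, purely to keep the illustration light:

```python
# Sketch of bottom-up (agglomerative) clustering: group invented prospect
# data into five segments, one per planned marketing campaign.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)
# Hypothetical demographic features (e.g. age, income band, basket size)
prospects = rng.normal(size=(2_000, 3))

labels = AgglomerativeClustering(n_clusters=5).fit_predict(prospects)
for segment in range(5):
    print(f"segment {segment}: {np.sum(labels == segment)} prospects")
```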
What skills are required for data mining?
“Data mining analysts turn data into information, information into insight and insight into business decisions,” explains the Data Science blog at Southern Methodist University.
However, data scientists also spend around half of their time cleaning data, at the very start of that chain. Skills in this field are therefore a key requirement for anyone who wants to undertake data mining. And, if the data is to remain useful into the future, cleaning will need to be repeated periodically to mitigate data decay – the continual process by which stored data becomes less relevant as real-world conditions drift away from the static metrics on record.
Data cleaning for compliance
Cleaning data for a second or third time may also be required not only because the data has become stale, but because the environment in which the data holder does business has been subject to regulatory changes. As Zoominfo notes, “with data and security laws more stringent than ever, cleaning your data is critical for staying compliant. You must follow opt-out processes. The last thing you want is a fine for blasting emails that violate GDPR or CCPA”.
Programming in data mining
Depending on the workflow, data mining may require competent programming skills. These will be used not only to manipulate the data in the delivery of strategic business insights, but during its initial processing, too. Such data is frequently managed using a language like R or Python, although many of the processes undertaken in either environment can be simplified through the use of intuitive front-ends. The Rattle GUI gives access to core functions within R for cleaning and mining data and is widely used within some national government departments – notably in Australia. Orange, for Python, allows users to set up flow-based pathways that move data through a range of processing nodes to isolate relevant metrics and identify patterns.
Useful data mining applications
Many stand-alone applications further reduce the code burden for operators. OpenRefine, for example, uses a spreadsheet metaphor and formulae to transform data, while WinPure Clean & Match breaks the process of cleansing data down into seven specific sections, allowing users to focus on each in turn while promising to be easy to use, even without specialized training.
Data science for business
Whether in sales, defense or electioneering, data mining is key to extracting strategic insight, gaining competitive advantage and planning for effective resource allocation. Data cleaning, as a key part of that process, is the factor by which its success is ultimately decided.
Data has never been easier to collect in large quantities, cheaper to store, or quicker to analyze. As such, there have never been more opportunities than there are now to make mistakes by using data that is unclean, incomplete, superfluous, or which has become stale and degraded over time.
This should concern us, as an increasing number of business problems can be solved through data-analytic thinking, in which data forms the starting point of a decision-making process, rather than being gathered in retrospect to support a decision already shaped by past experience, prejudice or preference.
As such, data-analytic thinking will only be successful – and productive – when the data on which it’s based is known to be accurate and true: two things that will only be possible if the importance of data cleaning to the overall process of data mining is recognized and respected.