The state of data quality in 2020 – O’Reilly


We suspected that data quality was a topic brimming with interest. Those suspicions were confirmed when we quickly received more than 1,900 responses to our mid-November survey request. The responses show a surfeit of concerns around data quality and some uncertainty about how best to address those concerns.

Key survey results:


  • The C-suite is engaged with data quality. CxOs, vice presidents, and directors account for 20% of all survey respondents. Data scientists and analysts, data engineers, and the people who manage them comprise 40% of the audience; developers and their managers, about 22%.
  • Data quality might get worse before it gets better. Comparatively few organizations have created dedicated data quality teams. Just 20% of organizations publish data provenance and data lineage. Most of those that don’t say they have no plans to start.
  • Adopting AI can help data quality. Almost half (48%) of respondents say they use data analysis, machine learning, or AI tools to address data quality issues. These respondents are more likely to surface and address latent data quality problems. Can AI be a catalyst for improved data quality?
  • Organizations are dealing with multiple, simultaneous data quality issues. They have too many different data sources and too much inconsistent data. They don’t have the resources they need to clean up data quality problems. And that’s just the beginning.
  • The building blocks of data governance are often lacking within organizations. These include the basics, such as metadata creation and management, data provenance, data lineage, and other essentials.

The top-line good news is that people at all levels of the enterprise seem to be alert to the importance of data quality. The top-line bad news is that organizations aren’t doing enough to address their data quality issues. They’re making do with inadequate (or non-existent) controls, tools, and practices. They’re still struggling with the basics: tagging and labeling data, creating (and managing) metadata, managing unstructured data, and so on.

Respondent demographics

Analysts and engineers predominate

Nearly one-quarter of respondents work as data scientists or analysts (see Figure 1). A further 7% are data engineers. On top of this, close to 8% manage data scientists or engineers. That means that about 40% of the sample consists of front-line practitioners. This is hardly surprising. Analysts and data engineers are, arguably, the people who work most closely with data.

In practice, however, almost every data scientist and analyst also doubles as a data engineer: she spends a significant proportion of her time locating, preparing, and cleaning up data for use in analysis. In this sense, data scientists and data analysts arguably have a personal stake in data quality. They’re often the first to surface data quality problems; in organizations that do not have dedicated data quality teams (or analogous resources, such as data quality centers of excellence), analysts play a leading role in cleaning up and correcting data quality issues, too.

Roles of survey respondents
Figure 1. Roles of survey respondents.

A switched-on C-suite?

Respondents who work in upper management, i.e., as directors, vice presidents, or CxOs, constitute a combined one-fifth of the sample. This is surprising. These results suggest that data quality has achieved salience of some kind in the minds of upper-level management. But what kind of salience? That’s a tricky question.

Role-wise, the survey sample is dominated by (1) practitioners who work with data and/or code and (2) the people who directly manage them, most of whom, notionally, also have backgrounds in data and/or code. This last point is important. A person who manages a data science or data engineering team (or, for that matter, a DevOps or AIOps practice) functions for all intents and purposes as an interface between her team(s) and the person (also usually a manager) to whom she directly reports. She’s “management,” but she’s still on the front line. And she likely also groks the practical, logistical, and political issues that (in their intersectionality) combine to make data quality such a thorny problem.

Executives bring a different, transcendent perspective to bear in assessing data quality, particularly with respect to its impact on business operations and strategy. Executives see the big picture, not only vis-à-vis operations and strategy, but also with respect to problems (and, especially, complaints) in the units that report to them. Executive buy-in and support is usually seen as one of the pillars of any successful data quality program because data quality is more a people-and-process-laden problem than a technological one. It isn’t just that different groups have differing standards, expectations, or priorities when it comes to data quality; it’s that different groups will go to war over those standards, expectations, and priorities. Data quality solutions almost always boil down to two big issues: politics and cost. Some group(s) are going to have to change the way they do things; the money to pay for data quality improvements must come out of this or that group’s budget.

Executive interest can be a useful, if not infallible, proxy for an organization’s posture with respect to data quality. Historically, the executive who understood the importance of data quality was an exception, with few enlightened CxOs spearheading data quality initiatives or helping kick-start a data quality center of excellence. Whether as a consequence of organizations becoming more data driven, or the increased attention paid to the effects of data quality on AI efforts, increased C-suite buy-in is a positive development.

Organizational demographics

About half of survey respondents are based in North America. Slightly more than a quarter are in Europe (inclusive of the UK), while about one-sixth are in Asia. Combined, respondents in South America and the Middle East account for just under 10% of the survey sample.

Drilling down deeper, almost two-fifths of the survey audience works in tech-laden verticals such as software, consulting/professional services, telcos, and computers/hardware (Figure 2). This could impart a slight tech bias to the results. On the other hand, between 5% and 10% of respondents work in each of a broad swath of other verticals, including healthcare, government, higher education, and retail/e-commerce. (“Other,” the second largest category, with about 15% of respondents, encompasses more than a dozen other verticals.) So concern about tech-industry bias is probably offset by the fact that almost all industries are, in effect, tech-dependent.

Industries of survey respondents 
Figure 2. Industries of survey respondents.

Size-wise, there’s a good mix in the survey base: nearly half of respondents work in organizations with 1,000 employees or more; slightly more than half, at organizations with fewer than 1,000 employees (Figure 3).

Organization size 
Figure 3. Organization size.

Data quality issues and impacts

We asked respondents to select from among a list of common data quality problems. Respondents were encouraged to select all issues that apply to them (Figure 4).

Primary data quality issues faced by respondents’ organizations
Figure 4. Primary data quality issues faced by respondents’ organizations.

Too many data sources, too little consistency

By a wide margin, respondents rate the sheer preponderance of data sources as the single most common data quality issue. More than 60% of respondents selected “Too many data sources and inconsistent data,” followed by “Disorganized data stores and lack of metadata,” which was selected by just under 50% of respondents (Figure 4).

There’s something else to consider, too. This was a select-all-that-apply-type question, which means you’d expect to see some inflation for the very first option on the list, i.e., “Poorly labeled data,” which was selected by just under 44% of respondents. Picking the first item in a select-all-that-apply list is a human behavior statisticians have learned to expect and (if necessary) to control for.

But “Poorly labeled data” was actually the fifth most common problem, trailing not only the issues above, but “Poor data quality controls at data entry” (selected by close to 47%) and “Too few resources available to address data quality issues” (selected by slightly less than 44%), as well. On the other hand, the combination of “Poorly labeled data” and “Unlabeled data” tallies close to 70%.

There’s good and bad in this. First, the bad: reducing the number of data sources is hard.

IT fought the equivalent of a rear-guard action against this very problem through much of the 1990s and 2000s. Data management practitioners even coined a term, “spreadmart hell,” to describe what happens when multiple different individuals or groups maintain spreadsheets of the same data set. The self-service use case helped exacerbate this problem: the first generation of self-service data analysis tools eschewed features (such as metadata creation and management, provenance/lineage tracking, and data synchronization) that are essential for data quality and good data governance.

In other words, the sheer preponderance of data sources isn’t a bug: it’s a feature. If history is any indication, it’s a problem that isn’t going to go away: multiple, redundant, often inconsistent copies of useful data sets will always be with us.

On the good side, technological progress (e.g., front-end tools that generate metadata and capture provenance and lineage; data cataloging software that manages provenance and lineage) could tamp down on this. So, too, could cultural transformation: e.g., a top-down push to educate people about data quality, data governance, and general data literacy.

Organizations are flunking Data Governance 101

Some common data quality issues point to larger, institutional problems. “Disorganized data stores and lack of metadata” is essentially a governance issue. But just 20% of survey respondents say their organizations publish information about data provenance or data lineage, which, along with robust metadata, are essential tools for diagnosing and resolving data quality issues. If the management of data provenance/lineage is taken as a proxy for good governance, few organizations are making the cut. Nor is it surprising that so many respondents also cite unlabeled or poorly labeled data as a problem. You can’t fake good governance.

Still another hard problem is that of “Poor data quality controls at data entry.” Anyone who has worked with data knows that data entry issues are persistent and endemic, if not intractable.

Some other common data quality issues (Figure 4), e.g., poor data quality from third-party sources (cited by about 36% of respondents), missing data (about 37%), and unstructured data (more than 40%), are less insidious, but no less frustrating. Practitioners may have little or no control over suppliers of third-party data. Missing data will always be with us, as will an institutional reluctance to make it whole. As for a lack of resources (cited by more than 40% of respondents), there’s at least some reason for hope: machine learning (ML) and artificial intelligence (AI) could provide a bit of a boost. Data engineering and data analysis tools use ML to simplify and substantively automate some of the tasks involved in discovering, profiling, and indexing data.
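To make the idea of automated profiling concrete, here is a minimal, stdlib-only sketch (an illustration, not any vendor’s implementation) that scans a table for three of the problems respondents cite: missing values, duplicate rows, and inconsistent value types within a column:

```python
from collections import Counter

def profile(rows, columns):
    """Produce a simple data quality profile: missing values per column,
    duplicate-row count, and the mix of value types seen in each column."""
    missing = Counter()
    types = {c: Counter() for c in columns}
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(row.get(c) for c in columns)
        if key in seen:          # exact repeat of an earlier row
            duplicates += 1
        seen.add(key)
        for c in columns:
            v = row.get(c)
            if v in (None, ""):  # treat None and empty string as missing
                missing[c] += 1
            else:
                types[c][type(v).__name__] += 1
    return {"missing": dict(missing), "duplicates": duplicates,
            "types": {c: dict(t) for c, t in types.items()}}

# Hypothetical sample data illustrating each problem class.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": ""},     # missing value
    {"id": 3, "age": "34"},   # type inconsistency: str vs int
    {"id": 1, "age": 34},     # duplicate row
]
report = profile(rows, ["id", "age"])
```

A real profiler adds statistical and ML-driven checks on top of this (outlier detection, inferred schemas, semantic type matching), but the underlying pass over the data looks much like the above.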

Not surprisingly, almost half (48%) of respondents say they use data analysis, machine learning, or AI tools to address data quality issues. A deeper dive (Figure 5) provides an interesting take: organizations that have dedicated data quality teams use analytic and AI tools at a higher rate, 59% compared to the 42% of respondents from organizations with no dedicated data quality team. Having a team focused on data quality can provide the space and motivation to invest in trying and learning tools that make the team more productive. Few data analysts or data engineers have the time or capacity to make that commitment, instead relying on ad hoc methods to address the data quality issues they face.

Figure 5: Data quality survey '20. Effect of dedicated data quality team on using AI tools. 
Figure 5. Effect of a dedicated data quality team on using AI tools.

That being said, data quality, like data governance, is essentially a socio-technical problem. ML and AI can help to an extent, but it’s incumbent on the organization itself to make the necessary people and process changes. After all, people and processes are almost always implicated in both the creation and the perpetuation of data quality issues. Ultimately, diagnosing and resolving data quality problems requires a genuine commitment to governance.

Data conditioning is expensive and resource intensive (and decidedly not sexy), which is one of the reasons we don’t see more formal support for data quality among respondents. Increasing the focus on resolving data issues requires carefully scrutinizing the ROI of data conditioning efforts in order to concentrate on the most worthwhile, productive, and effective of them.

Biases, damned biases, and missing data

Just under 20% of respondents cited “Biased data” as a primary data quality issue (Figure 4). We often talk about the need to address bias and fairness in data. But here the evidence suggests that respondents see bias as less problematic than other common data quality issues. Do they know something we don’t? Or are respondents themselves biased, in this case, by what they can’t imagine? This result underscores the importance of acknowledging that data contains biases; that we should assume (not rule out) the existence of unknown biases; and that we should promote formal diversity (cognitive, cultural, socio-economic, physical, background, and so on) and processes to detect, acknowledge, and address those biases.

Missing data plays into this, too. It isn’t just that we lack the data we believe we need for the work we want to do. Sometimes we don’t know, or can’t imagine, what data we need. A textbook example of this comes via Abraham Wald’s analysis of how to improve the placement of armor on World War II-era bombers: Wald wanted to study the bombers that were shot down, which was practically impossible. Still, he was able to make inferences about the effect of what is now called survivor bias by factoring in what was missing, i.e., that the planes that returned from successful missions had an inverse pattern of damage relative to those that were shot down. His insight was a corrective to the collective bias of the Army’s Statistical Research Group (SRG). The SRG couldn’t imagine that it was missing data.
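Wald’s reasoning can be sketched in a few lines of code. The figures below are invented purely for illustration: in this toy fleet, planes hit in the engine rarely return, so engine damage is underrepresented among survivors even though engine hits are just as common as fuselage hits:

```python
# Toy model of survivorship bias (all numbers hypothetical).
# Each plane is a (area_hit, survived) pair.
fleet = ([("engine", False)] * 80 + [("engine", True)] * 20
         + [("fuselage", False)] * 10 + [("fuselage", True)] * 90)

def hit_rate(planes, area):
    """Fraction of the given planes that were hit in `area`."""
    return sum(1 for a, _ in planes if a == area) / len(planes)

# The SRG could only inspect the planes that made it back.
survivors = [p for p in fleet if p[1]]

naive = hit_rate(survivors, "engine")   # 20/110, under one-fifth
actual = hit_rate(fleet, "engine")      # 100/200, fully half
```

The naive reading of the survivor data says engine hits are rare, so armor the fuselage; accounting for the missing (shot-down) planes reveals the opposite: engine hits are common, they just don’t come home.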

No data quality issue is an island entire of itself

Organizations aren’t dealing with just one data quality issue. It’s more complicated than that, with more than half of respondents reporting at least four data quality issues.

Figure 6, below, combines two things. The dark green portion of each horizontal bar shows the percentage of survey respondents who reported that specific number of discrete data quality issues at their organizations (i.e., 3 issues or 4 issues, and so on). The light gray/green portion of each bar shows the aggregate percentage of respondents who reported at least that number of data quality issues (i.e., at least 2 issues, at least 3 issues, and so on).

A few highlights to help navigate this complicated chart:

  • Respondents most often report either three or four data quality issues. The dark green portion of the horizontal bars shows about 16% of respondents for each of these results.
  • Looking at the aggregates of the “at least 4” and “at least 3” items, we see the light gray/green section of the chart shows 56% of respondents reporting at least four data quality issues and 71% reporting at least three data quality issues.

That organizations face myriad data quality issues isn’t a surprise. What is surprising is that organizations don’t more often take a structured or formal approach to addressing their own unique, gnarly mix of data quality challenges.

Number of data quality issues reported
Figure 6. Number of data quality issues reported.

Lineage and provenance continue to lag

A significant majority of respondents, almost 80%, say their organizations don’t publish information about data provenance or data lineage.

If this is surprising, it shouldn’t be. Lineage and provenance are inextricably bound up with data governance, which overlaps significantly with data quality. As we saw, most organizations are failing Data Governance 101. Data scientists, data engineers, software developers, and other technologists use provenance data to verify the output of a workflow or data processing pipeline, or, as often as not, to diagnose problems. Provenance notes where the data in a data set came from; which transformations, if any, have been applied to it; and other technical minutiae.

With respect to business intelligence and analytics, data lineage provides a mechanism business people, analysts, and auditors can use to trust and verify data. If an auditor has questions about the values in a report or the contents of a data set, they can use the data lineage record to retrace its history. In this way, provenance and lineage give us confidence that the content of a data set is both explicable and reproducible.
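As a minimal illustration of what such a record might contain (the field names and the `s3://example-bucket` source are assumptions for the sketch, not any standard’s schema), a pipeline step can emit a provenance entry alongside its output: the source, the step applied, and content hashes that let an auditor later verify the data hasn’t changed:

```python
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(rows):
    """Content hash of a data set, so later readers can verify it is unchanged."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def run_step(name, source, rows, transform):
    """Apply a transformation to each row and emit a provenance record."""
    out = [transform(r) for r in rows]
    record = {
        "step": name,                      # which transformation ran
        "source": source,                  # where the input came from
        "input_hash": fingerprint(rows),   # state of the data before
        "output_hash": fingerprint(out),   # state of the data after
        "run_at": datetime.now(timezone.utc).isoformat(),
    }
    return out, record

raw = [{"price": "10.5"}, {"price": "7.25"}]
clean, prov = run_step("normalize-prices", "s3://example-bucket/raw.json",
                       raw, lambda r: {"price": float(r["price"])})
```

An auditor who recomputes `fingerprint` over the published data set and matches it against `output_hash` can confirm the data is exactly what this step produced, which is the “trust and verify” role lineage records play.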

Data provenance and lineage tools 
Figure 7. Data provenance and lineage tools.

Of the 19% of survey respondents whose organizations do manage lineage and provenance, slightly less than 30% say they use a version control system, à la Git, to do this (Figure 7). Another one-fifth use a notebook environment (such as Jupyter). The remaining 50% (i.e., of respondents whose organizations do publish lineage and provenance) use a smattering of open source and commercial libraries and tools, most of which are mechanisms for managing provenance, not lineage.

If provenance and lineage are so important, why do so few organizations publish information about them?

Because lineage, especially, is hard. It imposes access and use constraints that make it more difficult for business people to do what they want with data, especially as regards sharing and/or altering it. First-generation self-service analytic tools made it easier (and, in some cases, possible) for people to share and experiment with data. But the ease of use and agency that these tools promoted came at a price: first-gen self-service tools eschewed data lineage, metadata management, and other, similar mechanisms.

A best practice for capturing data lineage is to incorporate mechanisms for generating and managing metadata, including lineage metadata, into front- and back-end tools. ETL tools are a textbook example of this: almost all ETL tools generate granular (“technical”) lineage data. Until recently, however, most self-service tools lacked rich metadata management features or capabilities.

This could explain why nearly two-thirds of respondents whose organizations do not publish provenance and lineage answered “No” to the follow-up question: “Does your organization plan on implementing tools or processes to publish data provenance and lineage?” For the vast majority of organizations, provenance and lineage is a dream deferred (Figure 8).

Plans for publishing data provenance and lineage 
Figure 8. Plans for publishing data provenance and lineage.

The good news is that the pendulum may be swinging in the direction of governance.

Slightly more than one-fifth selected “Within the next year” in response to this question, while about one-sixth answered “Beyond next year.” Most popular open source programming and analytic environments (Jupyter Notebooks, the R environment, even Linux itself) support data provenance via built-in or third-party projects and libraries. Commercial data analysis tools now offer increasingly robust metadata management features. In the same way, data catalog vendors, too, are making metadata management, with an emphasis on data lineage, a priority. Meanwhile, the Linux Foundation sponsors Egeria, an open source standard for metadata management and exchange.

Data quality isn’t a team effort

Based on feedback from respondents, comparatively few organizations have created dedicated data quality teams (Figure 9). Most (70%) answered “No” to the question “Does your organization have a dedicated data quality team?”

Presence of dedicated data quality teams in organizations
Figure 9. Presence of dedicated data quality teams in organizations.

Few respondents who answered “Yes” to this question actually work on their organization’s dedicated data quality team. Nearly two-thirds (62%) answered “No” to the follow-up question “Do you work on the dedicated data quality team?”; just 38% answered “Yes.” Only respondents who answered “Yes” to the question “Does your organization have a dedicated data quality team?” were permitted to answer the follow-up. All told, 12% of all survey respondents work on a dedicated data quality team.

Real-time data on the rise

Relatedly, we asked respondents who work in organizations that do have dedicated data quality teams whether those teams also work with real-time data.

Almost two-thirds (about 61%) answered “Yes.” We know from other research that organizations are prioritizing efforts to do more with real-time data. In our recent analysis of Strata Conference speakers’ proposals, for example, terms that correlate with real-time use cases were entrenched in the first tier of proposal topics. “Stream” was the No. 4 overall term; “Apache Kafka,” a stream-processing platform, was No. 17; and “real-time” itself sat at No. 44.

“Streaming” isn’t identical with “real-time,” of course. But there is evidence of overlap between the use of stream-processing technologies and so-called “real-time” use cases. Similarly, the rise of next-gen architectural regimes (such as microservices architecture) is also driving demand for real-time data: a microservices architecture consists of hundreds, thousands, or tens of thousands of services, each of which generates logging and diagnostic data in real time. Architects and software engineers are building observability (basically, monitoring on steroids) into these next-gen apps to make it easier to diagnose and fix problems. This is a compound real-time data and real-time analytics problem.

The world isn’t a monolith

For the most part, organizations in North America seem to be dealing with the same problems as their counterparts in other regions. Industry representation, job roles, employment experience, and other indicia were surprisingly consistent across all regions, although there were a few intriguing variances. For example, the proportion of “directors/vice presidents” was about one-third higher for North American respondents than for the rest of the world, while the North American proportion of consulting/professional services respondents was close to half the tally for the rest of the globe.

Our analysis surfaced at least one other intriguing geographical variance. As noted in Figure 9, we asked each participant whether their organization maintains a dedicated data quality team. While North America and the rest of the world had about the same percentage of respondents with dedicated data quality teams, our North American respondents were less likely to work on that data quality team.


A review of the survey results yields a few takeaways organizations can use to realistically address how they can condition their data to improve the efficacy of their analytics and models.

  • Most organizations should take formal steps to condition and improve their data, such as creating dedicated data quality teams. But conditioning data is an ongoing process, not a one-and-done panacea. This is why C-suite buy-in, as difficult as it is to obtain, is a prerequisite for sustained data quality remediation. Promoting C-suite understanding and commitment may require education, as many execs have little or no experience working with data or analytics.
  • Conditioning is neither easy nor cheap. Committing to formal processes and dedicated teams helps set expectations about the difficult work of remediating data issues. High costs should compel organizations to take an ROI-based approach to how and where to deploy their data conditioning resources. This includes deciding what isn’t worth addressing.
  • Organizations that pursue AI initiatives usually discover that they have data quality issues hiding in plain sight. The problem (and partial solution) is that they need quality data to power their AI projects. Think of AI as the carrot, and of poor data as the proverbial stick. The upshot is that investment in AI can become a catalyst for data quality remediation.
  • AI is an answer, but not the only one. AI-enriched tools can improve productivity and simplify much of the work involved in improving data efficacy. But our survey results suggest that a dedicated data quality team also helps to foster the use of AI-enriched tools. What’s more, a dedicated team is motivated to invest in learning to use these tools well; conversely, few analysts and data engineers have the time or capacity to fully master them.
  • Data governance is all well and good, but organizations need to start with more basic stuff: data dictionaries to help explain data; tracking provenance, lineage, and recency; and other essentials.

