Nowadays, any discussion of data is inseparable from the exponential rate at which it is being generated. This surge in data production can be attributed to several factors, including the proliferation of data-generating devices, the emergence of new data formats, the increased preservation of fine-grained details for analysis, and heightened security and compliance demands. According to Statista, global data creation is projected to grow from 79 zettabytes in 2021 to 180 zettabytes (equivalent to 180 billion terabytes) by 2025. This growth led to the emergence of "Big Data" as a set of technologies capable of managing, processing, and navigating the explosive growth of data, all while opening fresh avenues for data utilization.
While "Big Data" refers to the large volume of data, it's important to mention that volume is probably the most significant property of Big Data but certainly not the only one. Indeed, when discussing Big Data, we also refer to factors such as variety, velocity, veracity, and others. Initially, Big Data was characterized by three properties or three Vs (Volume, Variety, and Velocity). However, in recent years, additional Vs have come into play. Today, we recognize more than 50 Vs in this context. The figure below illustrates the most prevalent attributes, commonly called the "8 Vs of Big Data."
While it's clear that businesses can benefit from this data growth, leaders must be cautious and aware of the challenges they will face, especially regarding data quality. Poor-quality data is not only unusable; it also undermines the reliability of analysis and, consequently, the results obtained. Thus, regardless of your data's application domain, data quality is a crucial aspect. In this regard, it is vital to develop an end-to-end data quality strategy that supports data throughout its lifecycle. Such a solution enables comprehensive and automated data management, primarily based on artificial intelligence and, more specifically, predictive analysis. To that end, this post discusses data quality and the value chain, two vital elements of Big Data management. We also present 12 quality metrics grouped into five data quality aspects. Furthermore, we explain metadata quality in a Big Data context, which has received less attention despite its significant impact on data governance.
![](https://static.wixstatic.com/media/728df5_18e728d4a21f4ffe8653a82b55bf9323~mv2.png/v1/fill/w_698,h_690,al_c,q_90,enc_auto/728df5_18e728d4a21f4ffe8653a82b55bf9323~mv2.png)
The Big Data Value Chain (BDVC)
Given the unique characteristics of Big Data, data collected in a Big Data environment cannot be directly exploited; it requires special processing that meets Big Data requirements and transforms raw, unstructured data into actionable data. Thus, the Big Data Value Chain (BDVC) was introduced as a process describing the various stages data goes through within a Big Data environment. Over the years, the BDVC has incorporated more phases. The figure below presents the most recent and comprehensive BDVC, defined in 7 stages; a minimal pipeline sketch follows the stage descriptions:
![](https://static.wixstatic.com/media/728df5_ca766127186e4e3699d22e6bf812880f~mv2.jpg/v1/fill/w_980,h_110,al_c,q_80,usm_0.66_1.00_0.01,enc_auto/728df5_ca766127186e4e3699d22e6bf812880f~mv2.jpg)
Data Generation: This first stage refers to the data production process. The generated data can be structured, semi-structured, or unstructured, depending on the source type (generated by humans, processes, or machines).
Data Acquisition: This stage involves collecting data from various sources for processing.
Data Preprocessing: The preprocessing phase is a crucial step in the BDVC. It involves cleaning the data collected in the previous stage of inconsistencies and inaccurate values and transforming it into a clean, usable format. This phase significantly improves data quality through transformation, integration, cleaning, and data reduction.
Data Storage: A clean copy of the data is stored for future use after the preprocessing phase.
Data Analysis: This phase involves analyzing and manipulating previously cleaned data to identify unknown correlations and patterns and transform the data into actionable knowledge. Various Big Data analysis techniques, such as machine learning, deep learning, and data exploration, can be used in this phase.
Data Visualization: In this phase, the analysis results are visualized in a readable format, making it easier to understand using visual elements such as graphs, maps, and dashboards.
Data Exposition: This final stage involves sharing and exposing the information and value generated throughout each phase of the BDVC. The extracted information could be used to sell specific services, shared with the public as open data, or shared internally to improve organizational performance.
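To make these stages more concrete, here is a minimal, illustrative sketch of a BDVC-style pipeline. It is not a reference implementation: the function names, file paths, and the use of pandas (with pyarrow for Parquet storage) are assumptions for demonstration only; a production pipeline would rely on dedicated Big Data tooling.

```python
# Illustrative sketch only: a minimal skeleton chaining several BDVC stages.
# Function names, file paths, and the pandas/pyarrow-based implementation are
# assumptions for demonstration; real pipelines use dedicated Big Data tooling.
import pandas as pd


def acquire(source_path: str) -> pd.DataFrame:
    """Data Acquisition: collect raw records from a source."""
    return pd.read_csv(source_path)


def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Data Preprocessing: drop duplicates and fully empty rows."""
    return raw.drop_duplicates().dropna(how="all")


def store(clean: pd.DataFrame, target_path: str) -> None:
    """Data Storage: persist a clean copy for later use (requires pyarrow)."""
    clean.to_parquet(target_path, index=False)


def analyze(clean: pd.DataFrame) -> pd.DataFrame:
    """Data Analysis: derive simple aggregate insights."""
    return clean.describe(include="all")


def run_pipeline(source_path: str, target_path: str) -> pd.DataFrame:
    raw = acquire(source_path)       # Data Generation happens upstream
    clean = preprocess(raw)
    store(clean, target_path)
    insights = analyze(clean)
    # Visualization and Exposition would consume `insights` downstream.
    return insights
```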
Indeed, adopting the value chain has allowed for the definition and implementation of strategies and processes that ensure the proper routing, storage, and processing of data throughout their lifecycle. However, this cannot guarantee the accuracy and quality of data, a crucial factor for the company's strategic decisions.
Big Data Quality
In a Big Data project, data quality management is a critical component. Indeed, data collected in a Big Data environment is typically unstructured and of poor quality. Using such data undermines the reliability of analysis results and, in turn, decision-making. According to Gartner's analysis, more than 25% of critical data in large enterprises is incorrect. Therefore, faced with ever-growing volumes of unreliable, inconsistent, redundant, and poorly defined data, companies face both the challenge and the necessity of improving data quality.
Data quality can be defined along several dimensions: accuracy, completeness, consistency, freshness, uniqueness, validity, and more. Each data quality dimension can be quantified using metrics. Over the years, the number of defined dimensions has grown to more than 50. However, the dimensions actually measured in practice rarely exceed 8. This significant gap between defined and measured dimensions highlights the pressing need to translate quality dimensions into measurable quantities that can be exploited and implemented.
In response to this need, Big Data quality frameworks aim to evaluate data quality in a Big Data context by extending the measured dimensions to quality metrics, including Completeness, Currency, Volatility, Uniqueness, Compliance, Consistency, Ease of Manipulation, Relevance, Readability, Security, Accessibility, and Integrity. Several measurement formulas can translate each quality dimension's definition into a quantifiable metric. For instance, to measure dataset completeness, we calculate the ratio of non-empty values to the total number of values as follows:
Completeness (%) = (Number of non-empty values / Total number of values) × 100
Similarly, we can evaluate other metrics such as Currency, which captures the time elapsed since the last modification relative to the record's age, or Ease of Manipulation, which corresponds to the proportion of values changed between the raw and preprocessed data:
Currency (%) = ((Current Date - Last Modification Date) / (Current Date - Creation Date)) × 100
Ease of Manipulation (%) = (Number of differences between the original and cleaned tables / Total number of values) × 100
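As an illustration, here is a minimal sketch of how these three example metrics could be computed with pandas. The column names `creation_date` and `last_modified`, and the assumption that the raw and cleaned tables share the same shape and index, are hypothetical choices for this example rather than part of any specific framework.

```python
# A minimal sketch of the three example metrics, assuming pandas DataFrames.
# Column names (`creation_date`, `last_modified`) are hypothetical, and the
# Ease of Manipulation helper assumes the raw and cleaned tables share the
# same shape and index (here, NaN cells also count as differences).
import pandas as pd


def completeness(df: pd.DataFrame) -> float:
    """Share of non-empty cells over all cells, in percent."""
    return 100 * df.notna().sum().sum() / df.size


def currency(df: pd.DataFrame, now: pd.Timestamp) -> float:
    """Average time elapsed since last modification, relative to record age."""
    age = (now - df["creation_date"]).dt.total_seconds()
    staleness = (now - df["last_modified"]).dt.total_seconds()
    return 100 * (staleness / age).mean()


def ease_of_manipulation(df_raw: pd.DataFrame, df_clean: pd.DataFrame) -> float:
    """Share of cells that changed between the raw and cleaned tables."""
    diffs = (df_raw != df_clean).sum().sum()
    return 100 * diffs / df_raw.size
```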
Following the same approach, it is possible to define explicit measurement formulas for all the Big Data quality metrics, which can then be implemented and reused to assess the five aspects of quality: Reliability, Availability, Usability, Compliance, and Validity. The figure below groups the 12 metrics into these five quality aspects:
![](https://static.wixstatic.com/media/728df5_055fb958ddbc48cf9df9666e6dfd3699~mv2.png/v1/fill/w_980,h_551,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/728df5_055fb958ddbc48cf9df9666e6dfd3699~mv2.png)
This classification aimed not only to encapsulate metrics with similar properties but also to provide a better understanding of the overall aspects of data quality.
Big Data quality frameworks also contribute to the accuracy and reliability of measurements. They allow generating a weighted quality score that accounts for the importance of information, metrics, and aspects. Indeed, the information in the data is not equally important from a business perspective; some data is more critical than other data. Therefore, relevant data should have a more significant impact on data quality measurements. Thus, to ensure high precision and reliability of the measures, it is possible to rely on weighted scoring of data quality at three levels, as sketched in the example after this list:
At the field level: fields containing the most critical information have a greater impact on data quality scores.
At the quality metric level: the metrics that matter most to the organization, according to its governance strategy, have a greater impact on data quality scores.
At the quality aspect level: the overall quality aspects most crucial for the company have a greater impact on data quality scores.
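The sketch below shows one possible way to combine these three weighting levels into a single score. The weighting scheme, the default weights of 1.0, and the metric-to-aspect mapping passed in by the caller are illustrative assumptions; an actual framework may aggregate differently.

```python
# Hedged sketch of a three-level weighted quality score: per-field metric
# values are weighted by field importance, each metric by its own weight,
# and each metric's contribution by the weight of the aspect it belongs to.
# The weighting scheme and default weights of 1.0 are illustrative assumptions.

def weighted_quality_score(
    metric_scores: dict[str, dict[str, float]],  # metric -> field -> score (0..100)
    field_weights: dict[str, float],             # field -> importance weight
    metric_weights: dict[str, float],            # metric -> importance weight
    aspect_of_metric: dict[str, str],            # metric -> quality aspect
    aspect_weights: dict[str, float],            # aspect -> importance weight
) -> float:
    total, weight_sum = 0.0, 0.0
    for metric, per_field in metric_scores.items():
        # Level 1: weight each field's score by that field's importance.
        fw_sum = sum(field_weights.get(f, 1.0) for f in per_field)
        metric_value = sum(
            score * field_weights.get(f, 1.0) for f, score in per_field.items()
        ) / fw_sum
        # Levels 2 and 3: weight the metric by its own and its aspect's weight.
        w = metric_weights.get(metric, 1.0) * aspect_weights.get(
            aspect_of_metric.get(metric, ""), 1.0
        )
        total += metric_value * w
        weight_sum += w
    return total / weight_sum
```

With this kind of scheme, doubling the weight of the Reliability aspect, for instance, makes its metrics dominate the overall score without changing the individual measurements.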
Metadata Quality for Big Data
Metadata is data that describes data, providing relevant information about data such as resource type, source, creation date, size, usage, and more. In a Big Data environment, metadata is also used to store information related to data preprocessing and cleaning, analysis results, quality measurements, and technical details about data storage and processing. Therefore, data quality cannot be discussed without considering metadata. Indeed, metadata management provides valuable information about data and enables cost-effective and rapid control. Nowadays, Big Data tools primarily rely on metadata for data storage and processing.
Other uses for metadata have also emerged, such as data identification and discovery, data quality assessment, content tracking and monitoring, and more. However, metadata management cannot be effective without considering metadata quality. In Big Data environments, collected metadata are unstructured and heterogeneous.
Moreover, collected metadata may contain anomalies such as empty fields, duplicate records, or inconsistent values. Despite the impact of metadata on data quality, metadata quality is often overlooked in Big Data processing approaches. This is why it is essential to manage and improve metadata quality throughout its processing in Big Data environments. This processing path is known as the "Metadata Value Chain" and refers to the different stages metadata passes through in a Big Data context, as presented below:
![](https://static.wixstatic.com/media/728df5_778a1a73623c4874ac0710a5bb3f0c7a~mv2.jpg/v1/fill/w_943,h_251,al_c,q_80,enc_auto/728df5_778a1a73623c4874ac0710a5bb3f0c7a~mv2.jpg)
The first phase is the collection phase, which involves retrieving metadata from various repositories. Once retrieved, metadata is augmented with additional information about the collection process itself. Big Data ingestion and cataloging tools, such as Open Calais or Apache Atlas, can be used for metadata collection at this stage.
Once collected and extracted, metadata is stored in the data lake in a separate database called the "Metadata Repository." For unstructured data, the Hadoop Distributed File System (HDFS) is widely used for both data and metadata storage. In HDFS, metadata is stored separately on the NameNode and linked to the DataNodes where the data blocks reside.
The third step is metadata provisioning, which involves making stored metadata available for internal use, such as data analysis or visualization. Metadata could also be published for external users or consuming applications using specialized APIs, such as data portals with open access.
In a Big Data context, stored metadata is not static; it is continually fed and updated with each data processing operation. Therefore, a final maintenance step is necessary to ensure data operability.
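To illustrate how these stages fit together, here is a minimal sketch of a metadata record moving through collection, storage, and maintenance, with provisioning left to an API layer. The field names and the in-memory repository are assumptions made for this example; in practice, a catalog such as Apache Atlas backed by HDFS would play these roles.

```python
# Illustrative sketch only: a minimal metadata record following the Metadata
# Value Chain (collection, storage, provisioning, maintenance). Field names
# and the in-memory "repository" are assumptions for this example; a real
# deployment would rely on a catalog (e.g., Apache Atlas) backed by HDFS.
from datetime import datetime, timezone

metadata_repository: dict[str, dict] = {}  # stand-in for a metadata repository


def collect(dataset_id: str, source: str, size_bytes: int) -> dict:
    """Collection: capture descriptive fields plus provenance of the harvest."""
    return {
        "dataset_id": dataset_id,
        "source": source,
        "size_bytes": size_bytes,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "quality_scores": {},  # filled in later by quality checks
    }


def store_metadata(record: dict) -> None:
    """Storage: persist the record in the metadata repository."""
    metadata_repository[record["dataset_id"]] = record


def maintain(dataset_id: str, **updates) -> None:
    """Maintenance: update the record after each data processing operation."""
    record = metadata_repository[dataset_id]
    record.update(updates)
    record["updated_at"] = datetime.now(timezone.utc).isoformat()


# Provisioning would expose `metadata_repository` through an API or data portal.
store_metadata(collect("sales_2024", "crm_export", 1_048_576))
maintain("sales_2024", quality_scores={"completeness": 97.5})
```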
To ensure good metadata quality, it is essential to consider quality throughout metadata processing. Thus, by identifying the issues encountered in each phase of this process, the following model defines quality metrics to address them:
![](https://static.wixstatic.com/media/728df5_a7f14cb16c5f42a3983c1f8bf24304d4~mv2.jpg/v1/fill/w_456,h_443,al_c,q_80,enc_auto/728df5_a7f14cb16c5f42a3983c1f8bf24304d4~mv2.jpg)
Improving metadata quality is essential for developing a comprehensive data governance strategy that covers all data-related elements, such as data policy, process management, risk management, and data quality.
The adopted data governance framework should also define organizational structures and roles such as owners, managers, or stewards. Thus, with a sound governance strategy where data quality is integral to every business process, the organization can fully leverage data and extract value. While data specialists do their best to propose data quality improvement solutions, the governance strategy cannot be managed by multiple separate systems, hence the need for a complete end-to-end data management solution that supports data throughout its lifecycle.
In Conclusion
Analysis of Big Data cannot deliver great value if the data is of poor quality. Since collected data is generally unstructured, data quality must be assessed throughout Big Data processing. In this post, we reviewed the properties of Big Data, its value chain, and the most common quality dimensions. We presented a comprehensive data quality framework measuring 12 quality metrics. And because most Big Data tools and processes rely heavily on metadata, metadata quality must be considered when developing a data management strategy. In this regard, we explained how metadata quality dimensions can be applied to the metadata management process.