The modern data stack (MDS)

What is the modern data stack?

The modern data stack (MDS) is the combination of technologies and tools that businesses use to store, process, and analyze data in the digital age. With the proliferation of data-driven decision-making, organizations need a robust and efficient data stack to derive insights and drive innovation. With one in place, they can make informed decisions, improve their operations, and drive growth.

The importance of the MDS transcends markets and industries. Take e-commerce: with vast amounts of data on customer behavior, purchase history, and demographics, an e-commerce company can use a modern data stack to personalize the shopping experience for each individual customer. This can lead to increased customer satisfaction and loyalty, ultimately resulting in higher sales and revenue.

The healthcare industry offers further examples. The MDS can help healthcare organizations understand and predict disease outbreaks, prioritize and allocate resources, and improve patient outcomes. By analyzing healthcare data using a modern data stack, organizations can identify trends and patterns that may not be immediately apparent, leading to more effective treatment and prevention strategies.

A typical modern data stack consists of the following parts, which work in synergy to create one holistic experience:

Data Collection

This is the first step in the data management process, where data is collected from various sources such as external platforms, web servers, mobile devices, and IoT sensors. This step allows companies to gather information about their customers, products, and operations and create a conduit for data flows that regularly refresh the data.

One of the primary benefits of data collection is the ability to retain data for downstream processing and reprocessing. Once collected, data can be kept indefinitely and reused whenever needed, which is why it is important to retain detailed raw data.

For example, a company that sells clothing online may collect data on what types of products are most popular with customers, as well as their browsing and purchasing habits. This information can help the company tailor its marketing efforts and improve its product offerings. It can also be used in machine learning and data science models and can be reused for years to come to improve model accuracy.

Detailed data collection is also important for optimizing business operations. By combining data on production processes and supply chain management, companies can identify bottlenecks and inefficiencies, leading to cost savings and improved efficiency. For example, a manufacturer may collect data on machine usage and maintenance schedules to identify opportunities for automation and streamlining of processes.

Typically, collected data falls into one of three categories:

  1. Structured
  2. Semi-structured
  3. Unstructured

Therefore, it is essential to have a data management system in place that can ingest and transform the kinds of data structures the company generates and consumes. Tools like Apache Flume and Apache Kafka are commonly used for data collection, as in the sketch below.
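For illustration, here is a minimal sketch of publishing a collected event to Kafka using the kafka-python client. The broker address, topic name, and event shape are assumptions, not part of any specific setup.

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Connect to a Kafka broker; the address is an assumed local default.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A semi-structured event, e.g. a page view captured by a web server.
event = {"user_id": 42, "action": "page_view", "url": "/products/123"}
producer.send("raw-events", event)  # "raw-events" is an illustrative topic
producer.flush()                    # block until the event is delivered
```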

On a more granular level, the data collection process has a few important components that make it work reliably and ensure consistent, accurate delivery of data:

  1. Data Source Connector
  2. Data Destination Connector
  3. Scheduling Mechanism
  4. Monitoring and Logging

For example, a company may want to connect to Facebook, extract its advertising data, and send that data to a Postgres database. Alternatively, it may want to send the same data to a MySQL database. Each destination requires its own destination connector, and in every case the company may want to automate the extraction with a scheduler that runs at a specific time, while logging how much data was extracted and delivered to each destination. The sketch below shows how these components fit together.
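Here is a minimal sketch of the four components under stated assumptions: the ad-platform endpoint and table schema are hypothetical, SQLite stands in for a Postgres or MySQL destination to keep the example self-contained, and the third-party schedule library provides the timing.

```python
import logging
import sqlite3

import requests
import schedule  # pip install schedule

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("collector")


def extract_ads(api_url: str) -> list[dict]:
    """Source connector: pull advertising rows from an HTTP API."""
    resp = requests.get(api_url, timeout=30)
    resp.raise_for_status()
    return resp.json()


def load_rows(rows: list[dict], db_path: str) -> None:
    """Destination connector: write rows into a relational table."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS ads (campaign TEXT, clicks INTEGER)"
        )
        conn.executemany(
            "INSERT INTO ads (campaign, clicks) VALUES (:campaign, :clicks)",
            rows,
        )


def run_job() -> None:
    rows = extract_ads("https://api.example.com/ads")  # placeholder URL
    load_rows(rows, "warehouse.db")
    log.info("extracted and delivered %d rows", len(rows))  # monitoring/logging


# Scheduling mechanism: run the extraction every day at 02:00.
schedule.every().day.at("02:00").do(run_job)
# A long-running process would then loop:
#   while True: schedule.run_pending(); time.sleep(60)
```

Note that swapping the destination from Postgres to MySQL would only change the destination connector; the source connector, scheduler, and logging stay the same.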

Data Storage

Once the data has been collected, it needs to be stored in a scalable and reliable data repository. A data store or data warehouse is a centralized data storage medium that allows companies to save, manage, and analyze large volumes of data. It is an essential component of the modern data stack, as it enables businesses to retain their most valuable assets over long periods of time.

Benefits of data storage

One of the primary benefits is having a single, centralized location in which to store and manage large amounts of data. This is especially important for businesses that generate vast amounts of data from a multitude of sources. Without a centralized data store, it would be very difficult to manage pieces of data scattered across a plethora of heterogeneous systems that all behave in different ways.

Another benefit of data stores is their ability to support querying and analysis. Data stores typically provide tools and interfaces that allow users to run queries and perform analysis on the data, enabling them to gain insights and make data-driven decisions. For example, a company that operates a fleet of delivery vehicles may use a data store to track and analyze data on vehicle maintenance, fuel consumption, and delivery routes. This information can be used to optimize routes, reduce fuel costs, and improve the overall efficiency of the fleet. 
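As a minimal sketch of the fleet example, the query below computes average fuel consumption per delivery route; the database, table, and column names are assumptions for illustration.

```python
import sqlite3

# Query an assumed "deliveries" table for fuel efficiency per route.
with sqlite3.connect("fleet.db") as conn:
    rows = conn.execute(
        """
        SELECT route_id,
               AVG(fuel_litres / distance_km) AS litres_per_km
        FROM deliveries
        GROUP BY route_id
        ORDER BY litres_per_km DESC
        """
    ).fetchall()

for route_id, litres_per_km in rows:
    print(f"route {route_id}: {litres_per_km:.3f} L/km")
```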

Data Storage Options

There are many different data storage options available, each with its own set of features and capabilities. 

Relational database management systems (RDBMS): These are traditional database systems that use a structured table format to store data. Examples include MySQL, PostgreSQL, Oracle, and Microsoft SQL Server. RDBMS are widely used because they are relatively easy to set up and maintain, and they provide robust querying and analysis capabilities. However, they can be inflexible when it comes to storing and processing unstructured data.

NoSQL databases: These are non-relational database systems designed to handle large volumes of unstructured data. Examples include MongoDB, Cassandra, and DynamoDB. NoSQL databases are often used for applications that require fast read and write performance, such as real-time analytics and internet-scale applications. However, they may not offer the same level of querying and analysis capabilities as RDBMS.

Hadoop: This is an open-source framework for distributed storage and processing of large data sets. It consists of a distributed file system (HDFS) and a parallel processing engine (MapReduce). Hadoop is often used for big data analytics and machine learning, as it can store large volumes of data and handle complex processing tasks. However, it can be challenging to set up and maintain, and it may not be suitable for real-time data processing.

Cloud data stores: These are data storage systems hosted in the cloud and accessed over the internet. Examples include Amazon S3, Google Cloud Storage, and Microsoft Azure Storage. Cloud data stores are attractive to businesses because they are easy to set up and scale, and they offer high availability and durability. However, they can be more expensive than on-premises solutions, and they may not offer the same level of control over data security and compliance.

Data lakes: These are centralized repositories that store structured and unstructured data at large scale. They are designed to support big data analytics and machine learning, and they typically use distributed file systems, such as HDFS, to store data. Data lakes are attractive to businesses because they offer flexibility and scalability, and they can handle a wide variety of data types and structures. However, they can be complex to set up and maintain, and they may require specialized skills and tools to analyze the data.

The choice of storage option depends on the size, complexity, and type of data being stored.
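As one concrete illustration, here is a minimal sketch of archiving a file to a cloud data store, using Amazon S3 via the boto3 client; the bucket name and object key are hypothetical, and credentials are assumed to come from the standard AWS configuration.

```python
import boto3  # pip install boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="daily_orders.csv",      # local file to archive
    Bucket="example-data-lake",       # hypothetical bucket name
    Key="raw/orders/2024-01-01.csv",  # illustrative object key
)
```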

Data Processing

After the data has been collected and stored, it needs to be processed to extract insights and generate reports. Data transformation is a critical step in the modern data stack, as it allows companies to clean, standardize, and reshape data in a way that is more useful and meaningful for various business purposes. In today’s data-driven world, companies rely on large amounts of data from a variety of sources to inform their decision-making and drive their business operations. Transformation typically starts after data delivery (the data collection or data integration step) and may involve tasks such as:

  • Filtering
  • Aggregating
  • Deduplicating
  • Cleaning
  • Pivoting
  • Transforming

The data that is delivered from data sources is often raw and unstructured, making it difficult to extract insights and value from it. 

Data transformation helps to address this issue by providing a way to manipulate and reorganize data in a more meaningful and useful way. For example, a company may receive data from multiple sources, such as customer databases, social media platforms, and sensor networks. This data may be in different formats and may not be organized in a way that is easily understandable or comparable. Data transformation processes can be used to standardize the data, such as by converting it to a common format or structure, and to clean it, such as by removing duplicates or inconsistencies.
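As a minimal sketch of standardization and cleaning with pandas, the snippet below converts two sources’ date formats to a common type and removes duplicate records; the column names and formats are assumptions.

```python
import pandas as pd

# Two sources with different date formats and an exact duplicate row.
crm = pd.DataFrame(
    {"email": ["a@x.com", "a@x.com"], "signup": ["2024-01-05", "2024-01-05"]}
)
web = pd.DataFrame({"email": ["b@y.com"], "signup": ["05/01/2024"]})

# Standardize: parse each source's format into one common datetime dtype.
crm["signup"] = pd.to_datetime(crm["signup"], format="%Y-%m-%d")
web["signup"] = pd.to_datetime(web["signup"], format="%d/%m/%Y")

# Clean: combine the sources and drop exact duplicates.
customers = pd.concat([crm, web], ignore_index=True).drop_duplicates()
print(customers)
```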

In addition to standardization and cleaning, data transformation also allows companies to reshape data in a way that is more useful for specific business purposes. For example, a company may want to aggregate data from multiple sources to get a more comprehensive view of its customers. Data transformation processes can be used to group and combine data from different sources in a way that enables the company to gain insights into customer behavior and preferences. This is where KPI libraries become important: a KPI library helps a company understand where and how the data needs to be aggregated.

A company may want to integrate data from its customer database with data from its financial systems in order to better understand its customers’ purchasing habits. Data transformation processes can be used to match and link data from different sources in a way that enables the company to gain insights into customer behavior and preferences.
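A minimal pandas sketch of this matching and linking might look as follows; the keys, columns, and figures are illustrative assumptions.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "segment": ["retail", "pro"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [20.0, 35.0, 110.0]})

# Match and link the two sources on their shared key ...
joined = orders.merge(customers, on="customer_id", how="left")

# ... then aggregate into a KPI, e.g. total spend per customer segment.
spend_by_segment = joined.groupby("segment")["amount"].sum()
print(spend_by_segment)
```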

Overall, data transformation is another important step in the modern data stack that allows companies to clean, standardize, and reshape data in a way that is more useful and meaningful for various business purposes. By enabling companies to extract insights and value from their data, data transformation is a key driver of data-driven decision-making and business operations in the modern world.

Apache Hadoop and Apache Spark are popular tools for distributed data processing, while SQL is a common choice for querying and manipulating data stored in relational databases.
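For distributed processing, a minimal PySpark sketch might roll raw orders up into daily revenue as below; the input and output paths and the column names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-rollup").getOrCreate()

# Read raw order events from an assumed data-lake path.
orders = spark.read.parquet("s3a://example-data-lake/raw/orders/")

# Aggregate in parallel across the cluster: revenue per calendar day.
daily = (
    orders.groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("revenue"))
)

daily.write.mode("overwrite").parquet("s3a://example-data-lake/marts/daily_revenue/")
```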

Data Visualization

Once the data has been processed, it is vital to present it in a way that is easy to understand and interpret. Data visualization is the representation of data in graphical form. It aims to simplify insights and make them easy to understand for users. From a technical perspective, it is important to choose the right tool, type of chart, style, and layout for the data, making it easy for the end user to follow the story being told. Even more important is understanding the purpose of a visual or dashboard (a combination of related visual elements):

  • What requirements do business users have?
  • What questions do they expect to find answers to?
  • What level of detail should be in the visual?

All such questions need to be discussed in order to create meaningful and insightful visuals, rather than good-looking but unhelpful charts.

Data visualization is a form of communication that portrays raw data in a neat, appealing graphical form that makes it easier to extract meaningful information. To create a good visual, the following main principles need to be followed.

Accuracy

A visual must represent correct and full information. Misleading visuals are not acceptable.

Standardization

Often, visuals are not a single item but a set of visuals or dashboards, and they can be part of boards, workbooks, stories, etc. All of these should have a standard form where possible, in terms of design, data structures, layouts, etc.

Convenience

A visual must be user-friendly in terms of style, layout, readability, etc.

Interactivity

In most cases, interactivity allows the user to gain more insights from the visual and look at it from different perspectives.

Scalability

A visual must have the ability to accommodate increasing volumes of data and data sources so that reports can be kept up to date. 

Simplicity 

A visual must include only those elements that are required for the report and that bring value to it.

Most of these principles must not be compromised, but some can be omitted for specific use cases. For example, if a visual will be used only as a static view (maybe as part of a presentation), the “interactivity” principle can be omitted.

Data visualization tools such as Tableau, Qlik, and D3.js enable businesses to create interactive charts, graphs, and maps to communicate data-driven insights.
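As a minimal sketch of the accuracy and simplicity principles in code, the matplotlib example below plots made-up revenue figures with explicit labels and a zero-based axis to avoid a misleading scale.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 128, 150]  # illustrative values, in $k

fig, ax = plt.subplots()
ax.bar(months, revenue)
ax.set_title("Monthly revenue")  # tell one clear story
ax.set_ylabel("Revenue ($k)")    # label units explicitly
ax.set_ylim(bottom=0)            # start at zero so bars are not misleading
plt.show()
```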

Data Governance

As businesses increasingly rely on data to drive decision-making, it is paramount to have a system in place to ensure the integrity, security, and accessibility of the data. Data governance frameworks, such as the one published by the Data Governance Institute (DGI), provide a set of best practices and guidelines for establishing and maintaining data governance within an organization.

Generally speaking, data governance is the process of managing, controlling, and monitoring the collection, storage, use, and dissemination of data within an organization. It is an essential aspect of data management, as it ensures that data is accurate, consistent, and compliant with relevant laws and regulations.

One of the primary reasons why data governance is important for organizations is that it helps ensure the integrity and quality of the data. Poorly managed data can lead to inaccuracies, inconsistencies, and errors, which can negatively impact the reliability of the data and the decisions made from it. By implementing data governance policies and procedures, organizations can ensure that data is accurate and consistent and that any errors are identified and corrected in a timely manner.

Another important aspect of data governance is compliance with laws and regulations. Organizations are subject to a variety of laws and regulations that govern how they collect, store, and use data. These include laws such as the General Data Protection Regulation (GDPR) in Europe and the Health Insurance Portability and Accountability Act (HIPAA) in the United States. By implementing data governance practices, organizations can ensure that they are compliant with these laws and regulations and that they are taking appropriate steps to protect the privacy and security of their data.

There are many examples of how companies can use data governance to improve their operations. For example, a healthcare company may use data governance to ensure that patient data is accurate and up-to-date, and that it is protected against unauthorized access. This can help the company provide better care to patients, and it can reduce the risk of errors or breaches that could harm patients or damage the company’s reputation.

Another example is a retail company using data governance to ensure that customer data is used in a way that complies with relevant laws and regulations. This can help the company avoid legal trouble while maintaining the data needed to create more targeted marketing campaigns. Using the governed data, the company can also improve customer relationships by providing better service and more personalized offers.

Data governance can bring many benefits to the companies that implement it. By ensuring that data is accurate, consistent, and compliant with relevant laws and regulations, data governance can help improve the reliability of the data and the decisions made from it. It can also help ensure that data is protected against unauthorized access, reducing the risk of errors, breaches, and reputational harm. Ultimately, data governance can help organizations make better decisions, gain a competitive edge, and operate more efficiently.

Conclusion

A modern data stack is a critical component of data-driven decision-making. By carefully selecting and implementing the right technologies and tools at each layer of the stack, businesses can effectively collect, store, process, visualize, and govern their data to drive innovation and derive insights.