Data Engineering for Streaming Data - How to ensure data quality
The term data pipeline has become very popular in today's data-driven landscape. Its popularity arises from its role in enabling data-driven decision-making, fostering real-time insights, and enhancing business operations.
In this article, we will delve into the crucial topic of ensuring data quality within event-driven data pipelines. Additionally, we'll take a closer look at the advantages offered by Crosser's low-code platform, tailored for handling streaming data seamlessly from edge to cloud.
What is a Data Pipeline?
A data pipeline describes step-by-step processing of data as a sequence of processing stages. A common scenario is to first collect the data, then process it, and finally deliver the processed data or insights.
How does this work in practice?
- Independent tasks: Each step performs a specific task on individual chunks of data, also known as events, messages, files or batches, and then passes the result on to the next step.
- Streaming data pipelines: This is when you constantly feed the pipeline with new data. It could be time series data from sensors, video files, repeated calls to a database, and much more.
The key thing with this kind of pipeline is that as soon as one processing step has finished, it is ready to process the next chunk of data. You can then constantly feed the pipeline with new data, and all the steps operate in parallel on different chunks of data. This is how you get a streaming pipeline. Data can also be ingested from multiple sources by having more than one ingest task.
- Conditional processing: Data can be sent through different paths in the pipeline, from input to output, depending on the values received, the source of the data or the results of previous tasks (see the sketch after this list).
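To make these ideas concrete, here is a minimal conceptual sketch in plain Python (not Crosser code): independent tasks connected by queues, each running in parallel and processing one message at a time, with a simple conditional route for out-of-range values. The sensor names and thresholds are made up for illustration.

```python
# Conceptual sketch: ingest -> process -> deliver, with conditional routing.
import queue
import random
import threading
import time

raw = queue.Queue()        # ingest -> process
alerts = queue.Queue()     # conditional path for out-of-range values
normal = queue.Queue()     # default path

def ingest(n_messages: int) -> None:
    """Simulate a sensor constantly feeding the pipeline with new data."""
    for _ in range(n_messages):
        raw.put({"sensor": "temp-1", "value": random.uniform(10, 120)})
        time.sleep(0.01)
    raw.put(None)  # sentinel: no more data

def process() -> None:
    """Route each message down a different path depending on its value."""
    while (msg := raw.get()) is not None:
        (alerts if msg["value"] > 100 else normal).put(msg)
    alerts.put(None)
    normal.put(None)

def deliver(source: queue.Queue, label: str) -> None:
    """Deliver results from one path, e.g. to a dashboard or database."""
    while (msg := source.get()) is not None:
        print(f"[{label}] {msg}")

# All steps run in parallel and work on different chunks of data.
threads = [
    threading.Thread(target=ingest, args=(20,)),
    threading.Thread(target=process),
    threading.Thread(target=deliver, args=(alerts, "alert")),
    threading.Thread(target=deliver, args=(normal, "ok")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```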
Why are Data Pipelines a popular concept when implementing data processing?
Some of the most common reasons why this concept has become so popular are:
- Simplicity: It is a natural way of breaking a complex problem into smaller pieces that are easy to manage, implement and visualize. It helps you better understand and explain what each use case or pipeline is doing.
- Reusability: When you start working with pipelines you will notice that many operations recur. So while building your pipelines you are also creating a library of pre-built tasks that you can reuse in other use cases. This saves a lot of time, since you don't need to start from scratch each time you build a new pipeline.
- Scalability: Because the process is broken into smaller tasks, it is easier to scale each task individually inside the pipeline. You don't need to scale the whole process end to end, just the specific task that needs it.
For example, say the processing stage performs "heavy" work, like running machine learning models that require a lot of resources. You can run multiple instances of that step in parallel without having to change the input and output tasks. As long as the capacity is consistent throughout the pipeline, you can scale individual steps as needed (a minimal sketch of this idea follows below).
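The sketch below illustrates the idea in plain Python (not Crosser code): only the heavy processing step is scaled out to a pool of workers, while the ingest and delivery tasks stay unchanged. The message fields and the "heavy" function are hypothetical stand-ins.

```python
# Conceptual sketch: scale only the heavy processing stage.
from concurrent.futures import ThreadPoolExecutor
import time

def heavy_processing(msg: dict) -> dict:
    """Stand-in for a resource-intensive step, e.g. running an ML model."""
    time.sleep(0.05)                      # simulate expensive work
    return {**msg, "score": msg["value"] * 2}

# Ingest task (unchanged): produce a stream of messages.
messages = ({"id": i, "value": i * 1.5} for i in range(20))

# Processing stage scaled to 4 parallel instances; change max_workers to scale.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = pool.map(heavy_processing, messages)

    # Delivery task (unchanged): consume results as they become available.
    for result in results:
        print(result)
```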
Data Pipelines with Crosser
With Crosser, these are exactly the kinds of scenarios you can build with your data pipelines, and furthermore, you can execute them in different, distributed environments. This is one of the key differentiators when comparing Crosser with other solutions available in the market.
- Build pipelines with ease: The Crosser platform includes a rich library of pre-built modules and connectors for implementing commonly used tasks. By choosing the modules and/or connectors you want to use, and connecting them in the design tool, you get your streaming data pipelines ready to use. There is no need to write code, just drag-and-drop and configure accordingly.
The design and development of your pipelines is performed in Crosser's cloud service, called the Control Center. You can also test run the pipeline directly in the Control Center, either with generated test data or with real data.
There might be use cases where you need to write your own code, or use third-party snippets, for your designated processing. For this reason, Crosser is an open tool where you can easily extend the platform capabilities with your own custom functionality.
Options available:
- Code Modules: Support for C#, Python and JavaScript code that you can mix with the native modules provided by Crosser (a hedged Python sketch follows after this list).
- SQL Executer Modules: Write your own SQL statements and queries when the no-code Select and Insert modules are not sufficient.
- Bring your own ML/AI: Run any Python ML/AI model in Crosser.
- Universal Connector: Build connectors for any REST API using a five-step wizard.
- Open SDK: Build your own custom modules using C#.
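As a hedged illustration only: the exact interface of a Crosser Python code module is not described in this article, so the sketch below assumes a hypothetical contract where the platform calls a msg_handler(msg) function for each incoming message and forwards whatever it returns to the next module. The field names are made up.

```python
# Hypothetical custom processing step written in Python (illustrative only;
# the actual Crosser code module interface may differ).
def msg_handler(msg: dict):
    """Convert a temperature reading from Celsius to Fahrenheit and drop
    messages that are missing the expected field."""
    if "temperature_c" not in msg:
        return None  # drop invalid messages instead of passing them on
    msg["temperature_f"] = msg["temperature_c"] * 9 / 5 + 32
    return msg
```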
Explore the module Library here
The execution environment (runtime) for these flows or pipelines is called the Crosser Node. This is a generic real-time engine running in a Docker container that can be installed anywhere and designed to fit the set-up for your specific use case - on-premise, in the cloud or at the edge. This gives you full flexibility in the type of architecture you want to build; just mix and match at your convenience.
Crosser gives you an all-in-one solution that covers multiple use cases - collect data from any source and process it inside your pipelines, all built with full flexibility to choose the architecture that works best for you.
Data Transformation and Validation
Take the data you get, and create the data you need
Data transformation is the most common operation you will perform in your data flows. The main reason for this is that data rarely arrives in the format that you want or need. The most common data transformation operations are:
- Interpret encoding (CSV, JSON, XML…) - convert incoming data into something you can start processing
- Change the structure or the naming convention being used
- Harmonize naming
- Detect/remove outliers and missing values
- Change units (C → F, mbar → psi…)
- Time alignment (sampling periods and sampling time on multi-source data)
Transformation assumes you know what to expect as input data; based on those expectations you build transformations that convert it into what you want. The sketch below illustrates a few of these operations.
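Here is a conceptual sketch in plain Python (not Crosser modules) of some common transformation steps: interpret the encoding, harmonize naming, drop invalid readings and convert units. The field names and ranges are hypothetical.

```python
# Conceptual sketch of a few typical transformation steps.
import json

def transform(raw: str):
    """Turn a raw JSON payload into the harmonized record we actually need."""
    record = json.loads(raw)                      # interpret encoding (JSON)

    harmonized = {                                # harmonize naming/structure
        "sensor_id": record.get("SensorID"),
        "temperature_c": record.get("tempC"),
    }

    temp = harmonized["temperature_c"]
    if temp is None or not -50 <= temp <= 150:    # missing value / outlier
        return None

    harmonized["temperature_f"] = temp * 9 / 5 + 32   # change units (C -> F)
    return harmonized

print(transform('{"SensorID": "temp-1", "tempC": 21.5}'))
```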
What if the data you get doesn't match what you expect?
Data validation is about making sure that the data you receive meets the expectations you have built into your processing.
The goal is to detect deviations from expected characteristics, so that you can take action. Typical data validation operations include:
- Values outside expected ranges/categories
- New or missing structural elements, e.g. a new column is received from a database query or expected metadata is missing
- The encoding is invalid, so the data cannot be interpreted/used
- The statistical properties have changed
Many of these problems will be detected automatically and create errors in your pipeline. To be able to continue processing, you need to make sure that such data doesn't enter the following steps in the pipeline, where it will create issues.
However, there are some scenarios where no errors will be created, like changes in the statistical properties over time. Such changes can still make your outputs unreliable. Detecting them requires a more thorough analysis over a longer period of time, but it is definitely an area that needs to be considered and acted on. A simple sketch of such a check follows below.
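As a conceptual sketch in plain Python (not a Crosser module), one simple way to spot drifting statistical properties is to compare the mean and spread of a recent window against a baseline and flag large relative changes. The tolerance and sample values are made up for illustration.

```python
# Simple drift check: compare recent statistics against a baseline window.
from statistics import mean, stdev

def drift_detected(baseline, recent, tolerance=0.25):
    """Return True if the mean or spread has shifted more than `tolerance`."""
    base_mean, base_std = mean(baseline), stdev(baseline)
    rec_mean, rec_std = mean(recent), stdev(recent)
    mean_shift = abs(rec_mean - base_mean) / max(abs(base_mean), 1e-9)
    std_shift = abs(rec_std - base_std) / max(base_std, 1e-9)
    return mean_shift > tolerance or std_shift > tolerance

baseline_window = [20.1, 20.4, 19.8, 20.0, 20.3, 19.9]
recent_window = [26.2, 25.8, 26.5, 27.0, 25.9, 26.4]   # sensor has drifted
print(drift_detected(baseline_window, recent_window))  # -> True
```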
Crosser provides a number of tools to address these operations. You have modules that can help you with:
- Structure and Naming
- Encoding: use different types of encoding
- Values and Statistics: recalculate values and compute statistics
- Validation: use modules for full validation of data sets using JSON Schema, where you can validate both values and structure. You can detect issues and errors and then notify someone, or block the data from entering the following steps in the pipeline (a hedged example of the JSON Schema idea follows after this list).
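To illustrate the JSON Schema concept, the sketch below uses the open-source Python jsonschema package; it is not the Crosser validation module itself, just the underlying idea of validating both structure and values before a message enters the next step. The schema and message are hypothetical.

```python
# Validate structure and values against a JSON Schema (pip install jsonschema).
from jsonschema import Draft7Validator

schema = {
    "type": "object",
    "properties": {
        "sensor_id": {"type": "string"},
        "temperature_c": {"type": "number", "minimum": -50, "maximum": 150},
    },
    "required": ["sensor_id", "temperature_c"],
    "additionalProperties": False,   # flag unexpected new fields
}

validator = Draft7Validator(schema)

message = {"sensor_id": "temp-1", "temperature_c": 999}   # value out of range
errors = list(validator.iter_errors(message))

if errors:
    # Block the message and notify someone instead of passing it downstream.
    for err in errors:
        print(f"Validation error: {err.message}")
```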
By using these tools you can ensure the optimal performance of your pipelines and quality of your data.
Conclusion:
To unlock the potential of enterprise data, and create what are called ready-to-use data pipelines throughout the entire organization, many global companies turn to low-code platforms like Crosser. Partly to simplify innovation and development, partly to bridge organizational constraints and silos, but also simply to save time and cost in development and shorten time to value for their digital engagements. At the end of the day, data is most valuable when it is accurate, reliable, on time and contextualized - which is the sole purpose of a stream analytics platform like Crosser.
You can learn more about the Crosser Module Library here
Watch the webinar with Crosser's CTO Dr. Göran Appelquist, or schedule a private demo with one of our Experts.