A common problem in our industry is data mismatch: we want to use data from different sources, but they deliver different formats, or we have data in one format and need to send it to a system that expects another. These situations are where data transformation and harmonization come into play.
The Crosser streaming analytics solution is the optimal tool for resolving these types of issues. In this blog we will look at some examples of how this can be done.
Data Transformation
Transformation and harmonization are closely related; the only difference is that with harmonization we are dealing with multiple sources and need to apply a different transformation to each of them, with the goal of converting them into a common format. Let's start with some typical transformations.
Transformations can be divided into two groups: structural and content-related.
Structural transformations deal with the format of the data, such as:
- Hierarchical structures
- Arrays
- Objects
- Naming conventions
Depending on what we get and what we want, we must be able to convert back and forth between these alternatives.
Content transformations change the actual data, e.g.:
- Scaling of values (e.g. change units)
- Change resolution/sampling rates
- Remove outliers and missing values
- Remove noise
Let’s look at an example:
We want to get the values of 5 registers in a PLC every second and store them in a database. The data from the PLC looks like this:
[
{"Name": "Reg1", "Value": 77},
{"Name": "Reg2", "Value": 935},
{"Name": "Reg3", "Value": "True"},
{"Name": "Reg4", "Value": 8192},
{"Name": "Reg5", "Value": "Good"}
]
We get an array of objects, one per register, each with corresponding "Name" and "Value" properties.
The database expects a key/value map so that values can be mapped to the correct column when we are adding a new row of data. The output we want should look like this:
{
"Temperature": 25,
"RPM": 935,
"Running": true,
"Pressure": 12.5,
"Quality": "Good"
}
Let’s see what we need to do to get this output, starting with the input data we have. First the structural problems:
- The array with name/value properties must be changed into an object with key/value pairs
- The names we get from the PLC must be replaced with the proper column names in the database
There are also a couple of content issues we need to deal with:
- The temperature value we get from the PLC has the wrong unit: we get Fahrenheit while the database expects Celsius. A conversion is needed.
- The running state is delivered as a string, while the database expects a boolean. A type conversion is needed.
- The pressure value is delivered as a 16-bit integer in the range 0-65535, while it actually represents an analog value between 0 and 100 psi. A scaling is needed.
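To make the steps concrete, here is a minimal sketch in plain Python of what such a flow computes. The function name, the name map, and the rounding are assumptions for illustration; in Crosser this would be built from standard library modules rather than code:

```python
# PLC register names -> database column names (assumed mapping)
NAME_MAP = {"Reg1": "Temperature", "Reg2": "RPM", "Reg3": "Running",
            "Reg4": "Pressure", "Reg5": "Quality"}

def transform(registers):
    # Structural: array of {"Name", "Value"} objects -> key/value map
    # with the proper database column names
    row = {NAME_MAP[r["Name"]]: r["Value"] for r in registers}
    # Content: Fahrenheit -> Celsius
    row["Temperature"] = round((row["Temperature"] - 32) * 5 / 9, 1)
    # Content: string -> boolean
    row["Running"] = row["Running"] == "True"
    # Content: 16-bit raw value (0-65535) -> analog 0-100 psi
    row["Pressure"] = round(row["Pressure"] / 65535 * 100, 1)
    return row
```

Applied to the PLC payload above, this produces a key/value map ready to be written as a database row.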
This example shows some basic transformations you may encounter when working with machine data. Implementing these types of transformations is easy with the Crosser Streaming Analytics system using standard functions from the Crosser module library. The transformations above would end up in a processing flow like this with Crosser:
Other transformations such as removing outliers/noise and changing resolution (aggregation/filtering) can easily be added to the flow above using other standard modules from the library.
Data Harmonization
Data harmonization comes into play when we have multiple sources with different formats and want to combine the data so that it can be treated in the same way regardless of the original source. To harmonize the data we typically apply a different transformation to each of the sources to produce a common format.
The above example also introduces a transformation before the output. Sometimes it is advantageous to first transform each of the inputs to a format that is optimized for processing and then apply a transformation before the output to adapt the data to the requirements of the receiving system.
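The pattern can be sketched as one input transform per source feeding a common internal format, plus a single output transform for the receiving system. All names below are illustrative, not Crosser APIs:

```python
def from_plc(msg):
    # Source A delivers an array of name/value objects
    return {r["Name"]: r["Value"] for r in msg}

def from_gateway(msg):
    # Source B delivers a flat object with its own key names
    return {"Temperature": msg["temp"], "Pressure": msg["press"]}

def to_database(common):
    # Output transform: adapt the common format to the receiver,
    # e.g. lower-case column names
    return {key.lower(): value for key, value in common.items()}
```

With this split, adding a new source only requires writing its input transform; the processing logic and the output transform stay unchanged.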
The transformations of each of the inputs are of the same type as described above. When harmonizing time series data from multiple sources there might be an additional issue that must be dealt with: data from different sources arriving at different times or with different sample rates.
Depending on the requirements of the processing and/or receiving system we might have to align data on common time steps. This can be done by shifting the data, if the sampling rate is the same, or by interpolating/aggregating data if different sampling rates are used.
This is especially important if the data will be used with machine learning models, since these expect each new sample to contain data from every source the model was trained on. A similar problem arises when we are missing data from one source at a specific time. We may then need to fill in a value as best we can, for example by repeating the last known value or interpolating a value from the data we have received.
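As an illustration of the simplest fill strategy, the sketch below aligns a series of (time, value) samples onto common time steps by repeating the last known value. The helper is hypothetical, not a Crosser API; interpolation would be an alternative strategy:

```python
def fill_forward(samples, grid):
    """Align (time, value) samples onto the common time steps in `grid`
    by repeating the last known value ("forward fill").
    `samples` must be sorted by time."""
    out, last, i = [], None, 0
    for t in grid:
        # Consume all samples up to and including time t
        while i < len(samples) and samples[i][0] <= t:
            last = samples[i][1]
            i += 1
        out.append((t, last))
    return out
```

A sample arriving at t=2.5 is then reported at grid step t=3, and the value from t=0 is repeated at steps 1 and 2.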
Again, these types of data preparations are easily implemented using standard modules from the Crosser library.
Summary
Crosser provides the perfect tools to help you make your data useful. Get insights to optimize your operations and take appropriate actions immediately based on your data. Contact us to discuss how Crosser can be relevant and how to get going in no time with the self-service capabilities of the platform.