Loading and Row Usage for Nested Data Structures

Note: This article is not applicable to all destinations and data types. It applies to:

  • Destinations, which do not natively support nested data structures:
    • Amazon Redshift
    • Amazon S3 (CSV)
    • PostgreSQL
  • Data types
    • PostgreSQL JSON & JSONB datatypes*

*Not applicable for PostgreSQL (v1) – data types they will be stored as strings. In the PostgreSQL (HP) (v2) integration, these data types are stored as JSON objects.

 

For a better understanding of how SageData loads the received data, you need to know more about JSON and its structure.

NoSQL databases and many SaaS integrations are using nested structures, in which, as the name states, each attribute (or column) could have its own set of attributes. For better querying through the data, SageData will split these structures into separate tables.

 

JSON data structures

When pulling data from an integration, SageData is extracting a series of JSON records. JSON defines only two data structures: objects and arrays.

 

Objects

An object is a set of name-value pairs, where each set is called a property. Objects are enclosed in braces ({}), name-value pairs are separated by a comma (,), and the name and value in a pair are separated by a colon (:). Names in an object are strings, whereas values may be of any value type, including another object or an array.

{
"customer_id": "123",
"firstName": "John",
"lastName": "Doe",
"age": 22,
"address": {                // object begins
"street":"Fake Street",
"streetNumber":"5",
"zipCode":"12345",
"city": "JSON City"
} // object ends
}

If SageData would receive an object, the name-value pairs inside the object are split into a table, where columns are created. The created columns have the following naming convention: [object_name]__[property_name]

Following the example above, the columns would be created like that:

 

customer_idfirstNamelastNameageaddress__streetaddress_streetNumberaddress_zipCodeaddress_city
123JohnDoe22Fake Street512345JSON city

 

Arrays

An array is a list of values. Arrays are enclosed in brackets ([]), and their values are separated by a comma (,). Each value in an array may be of a different type, including another array or an object.

Here’s an example:

{
"order_id":"1234567",
"customer_id":"1000",
"order_items":[ // array begins
{
"product_id":"243745",
"price":"7.95",
"currency":"EUR",
}
] // array ends
}

 

SageData will split a nested array or an array that is inside a JSON record (like the example above) and will create a subtable.

 

Deconstruction of nested arrays

For a better understanding of how SageData is deconstructing nested arrays, we will give you a walk-through example using an order record. The order record will consist of 3 parts:

  • Main order data
  • Order items
  • Payment type and fees

Here’s what the JSON for the order looks like:

{
"order_id":"1234567",
"created_at":"2022-05-02 12:45:58",
"customer_id":"1000",
"order_items":[ // order record begins
{
"product_id":"243745",
"price":"7.95",
"currency":"EUR",
"quantity":"2"
"payment_type":[ // payment type record begins
{
"type":"card",
"provider":"MasterCard",
"fee":"0.58"
}
] // payment type record ends
}
]

In this record, you can find three levels of data because of nested arrays. This record contains three levels of data due to the nested arrays. SageData will start deconstructing the JSON from the top-level (main order information) and will create subtables.

 

From this one order record, three tables will be created:

Table nameDescription
ordersContains the core order data: order_id, created_at timestamp, and customer ID.
Order_line _itemsContains order info: product_id, price, currency, and quantity.
orders__line_items__payment_type_linesThis table contains the payment type info: type, provider, fee

 

Subtable connection to top-level data

The subtables created, need to contain information that connects them to top-level records. Therefore, SageData will append a few columns, which will be used as composite keys. Those keys will help you track back the sub records to their parent. The tables, based on the example above, will show you how this exactly works.

In this section:

  • Top-level: Core order data
  • Second level: Order line items
  • Third level: Payment type lines

First level: Core order data

This table contains the order record’s Primary Key, order_id.

order_id [pk]created_atcustomer
12345672022-05-02 12:45:581000

Second level: Order line items

SageData will add further columns to the second level tables, besides the key-value pairs in the nested record:

Column nameDescription
_sdc_source_key_[pk_column_name]Contains the top-level record’s Primary Key.

[pk_column_name] will take the name of the Primary Key column for the top-level record. In this example: sdc_source_key_order_id.

_sdc_level_0_idPart of the composite primary key, used for connecting nested rows down the line. It uses auto-increment for each unique record in the table, beginning with 0. In a combination with _sdc_source_key, it creates a unique identifier for the row.

For our example, the value for the first line item record would be 0, the second 1, the third 2, and so on.

 

Here’s what the order__line_items table would look like if another line item were added to the order record:

_sdc_source_key_order_id_sdc_level_0_idproduct_idpricequantitycurrency
123456702437457.952EUR
1234567187238215.401EUR

 

If you wanted to return all line items for order number 1234, you’d run the following query:

SELECT *
FROM orders__line_items oli
WHERE _sdc_source_key_order_id = 1234567

Third level: Payment type lines

SageData also adds further columns to third level records:

Column nameDescription
_sdc_source_key_[pk_column_name]Contains the top-level record’s Primary Key.

[pk_column_name] will take the name of the Primary Key column for the top-level record. In this example: sdc_source_key_order_id.

_sdc_level_0_idThe foreign key for the second level (order__line_items) table. When combined with the source key (_sdc_source_key_order_id), it helps you find the parent
_sdc_level_1_idPart of the composite primary key, used for connecting nested rows down the line. It uses auto-increment for each unique record in the table.

For our example, the value for the first line item record would be 0, the second 1, the third 2, and so on.

 

Here’s what the order__line_items__payment_type_lines table would look like if another payment type line were added to the order record:

_sdc_source_key_order_id_sdc_level_0_id_sdc_level_1_idpricetypefee
1234567007.95Mastercard0.58
12345671015.40PayPal0.11

 

If you wanted to find the order line items and payment type lines for order number 1234567, you will use the following query:

SELECT *
FROM order__line_items oli
INNER JOIN order__line_items__payment_type_lines ptl        ON ptl.__sdc_level_0_id = oli.__sdc_level_0_id       AND ptl.__sdc_source_key_order_id = oli.__sdc_source_key_order_id
WHERE _sdc_source_key_order_id = 1234567

Total row count in your destination

As SageData is using a special method of deconstructing nested arrays, it is very likely that you will see more rows in SageData and your destination, than the initial data in the source. This means that the row count in the initial data source won’t be equal to the row count in your destination and the SageData app.

Looking at the example above for order 1234567: it is not a single row in the destination

Top-level record – row in the orders table:

order_id [pk]created_atcustomer
12345672022-05-02 12:45:581000

 

Second level record – rows in the order__line_items table:

 

_sdc_source_key_order_id_sdc_level_0_idproduct_idpricequantitycurrency
123402437457.952EUR
1234187238215.401EUR

 

Third level record – rows in the orders__line_items__tax_lines table:

_sdc_source_key_order_id_sdc_level_0_id_sdc_level_1_idpricetypefee
1234567007.95Mastercard0.58
12345671015.40PayPal0.11

 

The total number of rows created by SageData in your destination in this case will be a count of 5. As it is visible from the example – there is a single order record, that can create 5 different rows on replication in your destination.