
Unlocking the Power of Your Data: The Essential Guide to Data Integration


The value of data continues to increase as data-hungry platforms (Machine Learning and Artificial Intelligence) are integrated into your business processes. Organizations rely on a complex ecosystem of applications and databases that store, collect, or generate data. To unlock the true potential of this information, seamless integration between these systems is crucial. But building these bridges can be a delicate weaving process – one wrong step and you might end up with a data deluge or even expose sensitive information.


We'll dive into the key considerations you need to make when designing and implementing data integration between systems. We'll explore potential pitfalls like data priority, data age, duplicate records, and privacy concerns, along with strategies to address them using both manual batch processes and automated rules.


Too often, privacy considerations are only tacked on at the end, when it becomes very painful to do well. Instead, I would recommend including your legal teams in the design process so that these concerns can be taken into account before technical constraints impede your ability to create a privacy-respecting solution.


Safeguarding the Flow: Privacy Concerns in Data Integration

Data privacy regulations like GDPR and CCPA place strict obligations on organizations regarding data handling, but they generally don't set specific limitations. It is up to the business to navigate the business need for retention, the type of data, and the time frame for which the individual would reasonably expect their data to be retained.

Note: Some regulations (e.g., HIPAA for healthcare data) mandate specific data retention periods for certain data types. Retention rules should ensure compliance with all relevant regulations.

Types of Data

  1. Personal Data: This includes any information that can be used to directly or indirectly identify an individual, such as name, address, contact details, and online identifiers like IP addresses.

  2. Sensitive Personal Data (Special Categories of Personal Data): This category includes data that requires stricter protection, such as genetic data, biometric data, health data, and data related to race, ethnicity, political opinions, religious beliefs, and sexual orientation.

  3. High Sensitivity Data: This refers to data that, if compromised or destroyed in an unauthorized transaction, would have a catastrophic impact on the organization or individuals. Examples include financial records, intellectual property, and authentication data.

Data Purpose

  • Operational Data: This data is essential for daily operations, like transaction logs or server activity logs. Retention periods are typically short (days or weeks) unless needed for troubleshooting purposes.

  • Analytical Data: This includes customer behavior data or website analytics used for business insights. Retention rules might depend on the value of the data for future analysis. Some data might be aggregated and anonymized for long-term analysis, while raw data might be deleted after a set period.

  • Historical Data: This data might be archived for legal or compliance reasons, or for historical research purposes. Retention rules are typically long-term, often dictated by regulations or internal policies.
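
To make these purpose-based retention windows concrete, here is a minimal sketch in Python. The category names and retention periods are illustrative assumptions, not recommendations; the actual windows must come from your legal team and the regulations that apply to you.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention windows per data purpose; real values must come
# from your legal and compliance teams and the regulations that apply.
RETENTION_POLICY = {
    "operational": timedelta(days=30),     # e.g., transaction or server logs
    "analytical": timedelta(days=365),     # raw behavioral data before aggregation
    "historical": timedelta(days=365 * 7), # archived for compliance or research
}

def is_past_retention(purpose, collected_at, now=None):
    """Return True when a record of the given purpose has outlived its window."""
    now = now or datetime.now(timezone.utc)
    window = RETENTION_POLICY.get(purpose)
    if window is None:
        # Unknown purpose: keep the record and flag it for manual review
        # rather than silently deleting it or retaining it forever.
        return False
    return now - collected_at > window

# Example: an operational log entry collected 45 days ago is past retention.
collected = datetime.now(timezone.utc) - timedelta(days=45)
print(is_past_retention("operational", collected))  # True
```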


At a minimum, inform your users of your privacy policy when they first interact with you, and make it easy for them to review on your website. Ideally, if there is a constant exchange of data under different expectations, you should integrate visibility / collection reminders as part of the process.

Minimizing your stored data might feel like it only has a downside. But if you only integrate the data essential for your business goals, you actually reduce the liability your organization takes on when a data breach happens (many agree it's not a matter of "if" but rather "when"). The amount of data your organization stores has real-world impact on insurance rates and more.


Prioritizing the Data Deluge: Where Does Your Truth Lie?

Imagine you have customer information stored in both your CRM and your e-commerce platform. You will likely want to create a field-level list of data name/description, approved values, category, and system to get an inventory to start answering the questions below (a minimal sketch of such an inventory follows the list).

  • Where do you have duplicate fields (between systems, within the same system)?

  • How are the fields populated and at what cadence?

  • Are there similar fields where some should be archived and removed?

  • How can you centralize data intake (by field) to standardize how it gets processed and flows to other systems? This will likely drive process and integration updates to streamline operations.

  • Which data variations are only needed by one system and should stay only in that system?
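
As a starting point, this inventory can be as simple as a spreadsheet or a small script. The sketch below uses hypothetical field names and systems to show how such an inventory makes duplicate fields visible.

```python
from collections import defaultdict

# Hypothetical field-level inventory; the field names and systems are illustrative.
FIELD_INVENTORY = [
    {"name": "email", "description": "Primary contact email", "approved_values": None,
     "category": "personal", "system": "CRM"},
    {"name": "email", "description": "Checkout email", "approved_values": None,
     "category": "personal", "system": "e-commerce"},
    {"name": "title", "description": "Job title", "approved_values": None,
     "category": "personal", "system": "CRM"},
]

# Group by field name to surface duplicates between (or within) systems.
by_name = defaultdict(list)
for field in FIELD_INVENTORY:
    by_name[field["name"]].append(field["system"])

for name, systems in by_name.items():
    if len(systems) > 1:
        print(f"'{name}' exists in multiple places: {systems}")
```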


In case of discrepancies, which system's data takes precedence?

Data source priority is a critical decision. Here are some strategies to establish a hierarchy:

  • System of Record (SoR): Identify the system that acts as the single source of truth for each data entity (customer, product, order, etc.). This system's data should be prioritized during integration. For each field, stack-rank the systems that could supply that data.

  • Data Freshness: Newer data often reflects the latest changes. Consider prioritizing data based on timestamps, ensuring your integrated view reflects the most recent information.

  • Data Completeness: If one system has more complete information for a specific data point (e.g., a customer's full address), prioritize that system's data during integration.


Manual Batch Process: Regularly review and manually adjust data discrepancies based on the established priority rules. This can be helpful for initial data migration or for resolving conflicts in specific cases. You will need to do this at least once to true up all systems, but you should consider doing an evaluation quarterly to see if there are leaks in your data flow.

Automated Rules: Implement data quality checks within the integration process. These rules can automatically flag discrepancies and route them for manual review or even automatically update the lower priority system based on pre-defined logic.
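
Here is a minimal sketch of such an automated rule, assuming a hypothetical per-field system ranking and candidate values expressed as (system, value, updated_at) tuples. It picks a winner by system priority first and freshness second, and flags disagreements for review.

```python
from datetime import datetime, timezone

# Hypothetical per-field system ranking; the system of record comes first.
SYSTEM_PRIORITY = {"phone": ["CRM", "e-commerce"], "email": ["e-commerce", "CRM"]}

def resolve_field(field, candidates):
    """candidates: list of (system, value, updated_at).
    Returns the winning value plus any lower-priority values that disagree."""
    ranking = SYSTEM_PRIORITY.get(field, [])
    ordered = sorted(
        candidates,
        key=lambda c: (ranking.index(c[0]) if c[0] in ranking else len(ranking),
                       -c[2].timestamp()),  # tie-break on freshness
    )
    winner = ordered[0]
    conflicts = [c for c in ordered[1:] if c[1] != winner[1]]
    return winner[1], conflicts  # conflicts can be routed for manual review

value, conflicts = resolve_field("phone", [
    ("e-commerce", "555-0100", datetime(2024, 1, 5, tzinfo=timezone.utc)),
    ("CRM",        "555-0199", datetime(2023, 6, 1, tzinfo=timezone.utc)),
])
# value == "555-0199" because the CRM is ranked first for phone;
# the e-commerce value is flagged as a conflict for review.
```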


The River of Time: How Fresh is Your Data?

Data has a shelf life. Imagine integrating product information that hasn't been updated in months – your integrated view becomes outdated and potentially misleading.

Data age is a crucial factor to consider. Here are some approaches to address it:

  • Data Retention Policies: Establish clear policies for how long data is considered valid in each source system.

  • Versioning: If historical data is valuable, implement data versioning mechanisms to track changes and allow access to older versions.

  • Incremental Updates: Integrate only the most recent data changes instead of transferring the entire dataset periodically.

Manual Batch Process: Regularly schedule data refreshes or clean-up processes to remove outdated data from the integrated system. This should never be a surprise to users of your systems. You can mark the data, give your internal customers a window to review it, and have a process to evaluate objections and their validity, along with ground rules for temporary expiration extensions. This likely needs input from legal teams.

Automated Rules: Set up triggers within the integration process to automatically filter out data that doesn't meet the freshness criteria and flag outdated versions for archival based on your versioning policy.
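
As an illustration of such a trigger (the data set names and age limits below are assumptions, not recommendations), a sync job could partition incoming records before loading them:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness criteria per data set; tune these against your
# retention policies and versioning rules.
MAX_AGE = {"product_info": timedelta(days=90), "web_analytics": timedelta(days=30)}

def partition_by_freshness(dataset, records, now=None):
    """Split records into (fresh, stale); stale records become candidates
    for archival rather than being silently deleted."""
    now = now or datetime.now(timezone.utc)
    limit = MAX_AGE[dataset]
    fresh, stale = [], []
    for record in records:
        (fresh if now - record["updated_at"] <= limit else stale).append(record)
    return fresh, stale
```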


Unfortunately, much of this can't be done directly in many CRMs because they don't keep timestamps for field-level updates. Let's take a simple example of an SFDC contact (John Smith). A salesperson gets a new phone number for them and updates the contact. Two years ago we had enriched the record with Title and a standardized Company name. The problem: we now have a Modified Date in SFDC with a new date, and the usual assumption is that the entire record was updated... but in this case (and likely most cases) only one piece of info changed. Now, because the record has a newer date in SFDC, it will likely update all other systems with this info.

  1. We don't know the age of each piece of data, only the record as a whole (so we can't use age to do progressive profiling and ask for updates on old fields).

  2. If our syncs to other systems are not near real-time, we may have new data in a field (e.g., Title) that will get overwritten.

This is where having a central system manage a "golden" record that timestamps field-level changes becomes helpful.
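
A minimal sketch of what field-level timestamping could look like (the class and field names are illustrative, not any particular vendor's API):

```python
from datetime import datetime, timezone

# A "golden" record sketch: each field keeps its own value and the time it
# last actually changed, independent of any record-level Modified Date.
class GoldenRecord:
    def __init__(self):
        self.fields = {}  # field name -> (value, updated_at)

    def apply(self, field, value, updated_at=None):
        """Update a field only if the value actually changed, keeping a
        per-field timestamp so age-based rules can target individual fields."""
        updated_at = updated_at or datetime.now(timezone.utc)
        current = self.fields.get(field)
        if current is None or current[0] != value:
            self.fields[field] = (value, updated_at)
            return True   # a real change, safe to push to other systems
        return False      # no-op, nothing to propagate

contact = GoldenRecord()
contact.apply("title", "VP of Sales", datetime(2022, 3, 1, tzinfo=timezone.utc))  # old enrichment
contact.apply("phone", "555-0100")  # new phone from the salesperson, stamped now
# Only the phone field carries a fresh timestamp; Title keeps its 2022 date.
```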


Avoiding Data Echoes

Data echoes, as discussed earlier, occur when the same data record gets duplicated and transmitted multiple times between systems. This can lead to inaccurate data, wasted storage space, and inefficient workflows.


This happens in systems where we only have a modification timestamp on the record, not field-level changes, and data transformations happen in both the target and source systems. System A has a field updated by a process and updates the timestamp. System B sees there is an updated record, pulls the record for sync, runs its data processes, and updates the record and timestamp. System A sees the updated record, syncs it, again runs data processes to update values, and updates the timestamp. You can see how this loop can grow if you don't have a central system managing field-level updates and pushing out updates to all other systems only when there is an actual change.
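
One way to break this loop, sketched below under the assumption that you control the sync logic, is to compare record content rather than timestamps and only propagate when the business values actually differ. The field names and normalization here are illustrative.

```python
import hashlib
import json

def content_fingerprint(record, sync_fields):
    """Hash only the business fields involved in the sync, in a stable order,
    so timestamp-only or cosmetic updates produce the same fingerprint."""
    payload = {f: record.get(f) for f in sorted(sync_fields)}
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True, default=str).encode()
    ).hexdigest()

last_synced = {}  # record id -> fingerprint of what was last pushed

def should_propagate(record_id, record, sync_fields=("email", "phone", "title")):
    fingerprint = content_fingerprint(record, sync_fields)
    if last_synced.get(record_id) == fingerprint:
        return False          # same content: suppress the echo
    last_synced[record_id] = fingerprint
    return True               # real change: push to downstream systems
```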


Tips: Design Pitfalls to Avoid

Beyond the points mentioned above, here are some additional design pitfalls to be aware of:

  • Data Mapping Mismatches: Ensure data values, formats, and structures are compatible between source and target systems. Mismatches can lead to data loss or errors during integration (see the mapping-check sketch after this list).

  • Data Transformation Errors: Test your data transformation logic thoroughly to avoid errors in converting data from one format to another.

  • Security Concerns: Implement robust security measures to protect sensitive data during the integration process.

  • Data Profiling Tools: Utilize data profiling tools to understand the structure and quality of data in your source systems.
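
As a small illustration of catching mapping mismatches before they cause data loss, here is a pre-flight check against a hypothetical target schema. The field names, types, and approved values are assumptions for the example.

```python
# Hypothetical target schema: expected type and, where relevant, approved values.
TARGET_SCHEMA = {
    "country": {"type": str, "allowed": {"US", "CA", "GB"}},
    "opt_in":  {"type": str, "allowed": {"yes", "no"}},
}

def mapping_issues(record):
    """Return human-readable problems instead of silently loading bad data."""
    issues = []
    for field, rules in TARGET_SCHEMA.items():
        value = record.get(field)
        if value is None:
            issues.append(f"{field}: missing in source record")
            continue
        if not isinstance(value, rules["type"]):
            issues.append(f"{field}: expected {rules['type'].__name__}, got {type(value).__name__}")
        elif "allowed" in rules and value not in rules["allowed"]:
            issues.append(f"{field}: value {value!r} not in approved list")
    return issues

print(mapping_issues({"country": "France", "opt_in": True}))
# Flags the unapproved country value and the bool-vs-str type mismatch.
```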



