The Ultimate Guide to Data Masking

In this article, you'll learn about data masking, why it's important, how to implement it, and the complexities involved.

Michael Nyamande • Nov 28, 2022

Data masking is a technique used in data security to protect private and confidential data in non-production environments. It works by creating fake but realistic data that is structurally similar to the original. This masked data can then be used for training, testing, analytics, and sharing with third parties.

As data breaches become more frequent, the adoption of data masking is increasing. Organizations use data masking to ensure that users' private data and intellectual property remain safe even if their systems are breached. Data masking also helps businesses comply with regulations, including the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA).


What Is Data Masking?

Data masking is a way of obscuring sensitive data in your database by replacing it with fake data. This means that if a breach occurs in an environment that holds only masked data, the attacker will not obtain any private data. Because the replacement data is realistic-looking, it remains useful for other organizational purposes, such as testing or analytics.

The following types of data are usually regarded as important and obfuscated during data masking:

  1. Personally identifiable information: This includes social security numbers, names, or license numbers that attackers can use to identify individuals or users of the system.

  2. Payment card information: This includes credit card numbers and banking details that can leave your users vulnerable if exposed. The Payment Card Industry Data Security Standard requires all organizations that handle payment data to keep it reasonably secure.

  3. Health records: These are any medical records of an individual that you maintain. According to regulations like HIPAA, this data should be kept private and never exposed.

  4. Intellectual property: This is data that is private to its organization. This can include confidential data such as business plans or patents.

As you can see, there are all kinds of data that need to be masked, and doing so helps you prevent data breaches, create safe test data, add an extra layer of internal protection, and ensure regulatory compliance.

Data breaches are expensive: the average breach costs organizations USD 4.35 million. Data masking helps protect confidential data, reducing the impact of a breach and ultimately saving you money.

Data masking is also a reliable way to replicate your production environment for testing without compromising your users' data. It gives testing and development teams access to an environment similar to production without the risk of data leaks.

In addition, it adds an extra layer of internal protection, ensuring only users with proper access rights can access users' private data. Anyone without adequate access will be shown masked data. For example, an analytics team might need access to production data; data masking allows that access but returns masked values for private or protected fields.

Types of Data Masking

There are many ways in which businesses can implement data masking. The choice depends on several factors, including the size of the data, its intended users, and their purposes. Following are the various types of data masking that you can consider implementing:

Static Data Masking

Image of static data masking courtesy of Michael Nyamande

Static data masking (SDM) is a masking technique in which the original database is duplicated, the data in the copy is masked, and the masked copy is then replicated in a test, development, or staging environment. This technique is referred to as static because a copy is made and masked, and then the masked data is made available.

This form of data masking is usually best when creating an environment for third-party vendors. It gives vendors a copy of production data that they can use and manipulate, but the copy isn't up to date and doesn't reveal sensitive information.

Dynamic Data Masking

Image of dynamic data masking courtesy of Michael Nyamande

Dynamic data masking (DDM) is a masking technique in which data is masked in real time as it's accessed from the production database. This type of data masking uses role-based security, where only authorized personnel can view sensitive data.

In DDM, all data requests are sent through a database proxy. The proxy checks access rights and alters the results, masking sensitive data if the users do not have permission to access it. This extra process of access rights checking and real-time masking leads to a performance overhead that can make DDM slower than SDM.

DDM can be useful for support personnel who need access to real-time data but shouldn't have access to clients' personal information. For example, support personnel of a bank might need to view clients' banking details and cards. For security purposes, it is best to mask the card number so that only the last four digits are visible. This way, the support agent can assist but cannot view the client's sensitive information.
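The card-number example above can be sketched as follows. This is a minimal illustration, not a full DDM proxy: the function name `mask_card_number` and the `can_view_full` flag are hypothetical stand-ins for whatever access check your system performs.

```python
def mask_card_number(card_number: str, can_view_full: bool = False) -> str:
    """Mask all but the last four digits unless the caller is authorized.

    `can_view_full` is a hypothetical stand-in for a real role check.
    """
    if can_view_full:
        return card_number

    total_digits = sum(ch.isdigit() for ch in card_number)
    digits_seen = 0
    masked = []
    for ch in card_number:
        if ch.isdigit():
            digits_seen += 1
            # Keep only the last four digits visible.
            masked.append(ch if digits_seen > total_digits - 4 else "*")
        else:
            # Preserve spaces and dashes so the format stays recognizable.
            masked.append(ch)
    return "".join(masked)
```

With this sketch, `mask_card_number("4111 1111 1111 1234")` returns `"**** **** **** 1234"`, while an authorized caller passing `can_view_full=True` gets the original value back.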

Deterministic Data Masking

In deterministic data masking, the goal is to ensure that the same original value is always replaced with the same masked value, wherever it appears. For example, if you replace the name "Danielle Jones" with "Jane Doe" in the customer table, the name should be masked the same way if it appears again in the orders and delivery tables. This ensures data consistency and maintains your relations even after masking.
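One way to get this behavior, shown here only as a sketch, is to derive the replacement from a keyed hash of the original value, so the same input always selects the same fake value. The pool `FAKE_NAMES` and the key `SECRET_KEY` are hypothetical; a real pool would need to be much larger to avoid two different people sharing one masked name.

```python
import hashlib
import hmac

# Hypothetical pool of replacement names; a real pool would be far larger.
FAKE_NAMES = ["Jane Doe", "John Smith", "Alex Brown", "Sam Taylor"]

# Hypothetical key; in practice, load it from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"


def mask_name(name: str) -> str:
    """Map a name to a fake name; equal inputs always give equal outputs."""
    digest = hmac.new(SECRET_KEY, name.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]
```

Because the result depends only on the key and the input, masking the customer table and the orders table in separate runs still produces matching values.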

On-the-Fly Data Masking

On-the-fly data masking is a hybrid between SDM and DDM. With on-the-fly data masking, two different database environments are maintained. Data from the production environment is then streamed to the duplicate environment at set intervals. During this streaming process, the data is masked, and by the time it reaches the secondary environment, the sensitive data has been replaced. This type of masking ensures your two environments remain in sync while removing the performance overhead of DDM.

On-the-fly data masking is especially well suited to teams that practice continuous integration since it ensures the dev, staging, and production environments remain similar while keeping your data safe.

Data Masking Techniques

Data masking techniques refer to the method by which data is masked (i.e., how to convert real data into its masked form). The following techniques are widely used for masking data:

Substitution

Substitution involves replacing the original data with a fake but credible value. For example, you can replace credit card numbers with numbers that appear real but can't process payments. This provides you with realistic-looking data for testing while keeping sensitive data secure.
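As an illustration of substitution for card numbers, the sketch below generates a random number that passes the Luhn checksum used by card numbers, so it looks plausible to format validators. The function names are hypothetical, and the sketch does not check that the generated number is unassigned; a real pipeline would need its own safeguard for that.

```python
import random


def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a string of digits."""
    total = 0
    # Walk right to left, doubling every second digit starting with the rightmost.
    for position, ch in enumerate(reversed(payload)):
        digit = int(ch)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def luhn_is_valid(number: str) -> bool:
    """Return True if the full number (including check digit) passes Luhn."""
    total = 0
    for position, ch in enumerate(reversed(number)):
        digit = int(ch)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def substitute_card_number(rng: random.Random, length: int = 16) -> str:
    """Generate a random, Luhn-valid replacement for a card number."""
    payload = "".join(str(rng.randint(0, 9)) for _ in range(length - 1))
    return payload + str(luhn_check_digit(payload))
```

Passing in a `random.Random` instance keeps the sketch testable; it does not make the output secret, since the replacement carries no information about the original anyway.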

Encryption

Just like it sounds, encryption encrypts your data and outputs the encrypted form as a masked value. This makes your data secure since someone can only view sensitive information if they have access to the encryption key.

Nulling

In nulling, the sensitive data is simply removed instead of substituted or encrypted. Any attempt to access the data will return a null response. While nulling ensures the safety of your data, it can also make it less usable, especially for analytical purposes.

Redaction

Redaction is similar to nulling, but the data is replaced with a random value instead of a null value. Because the replacement carries no information about the original, the masking cannot be reverse-engineered, yet the data keeps a structure similar to the original's.
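The difference between nulling and redaction can be sketched as follows. Both helper names are hypothetical. The redaction helper swaps each digit for a random digit and each letter for a random letter, which is one possible way to keep the structure; it uses Python's `random` module, which is adequate for an illustration but is not a cryptographically secure source.

```python
import random
import string


def null_out(record: dict, sensitive_fields: set) -> dict:
    """Nulling: drop the sensitive values entirely."""
    return {
        key: (None if key in sensitive_fields else value)
        for key, value in record.items()
    }


def redact(value: str, rng: random.Random) -> str:
    """Redaction: replace characters with random ones of the same kind."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            out.append(rng.choice(string.ascii_letters))
        else:
            out.append(ch)  # keep separators so the structure survives
    return "".join(out)
```

A nulled social security number is simply `None`, whereas a redacted one still has the familiar three-two-four digit layout.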

Averaging

Averaging or data generalization is a masking technique in which data values are replaced with average values. For example, in a remuneration table, the salary field can be replaced with an average value. This maintains the integrity of your data since the total and mean salary values remain the same but with no particular salary being attributed to one user.
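A minimal sketch of averaging, assuming the salaries are held in a plain list (the function name is hypothetical):

```python
from statistics import mean


def average_mask(salaries: list) -> list:
    """Replace every salary with the mean, preserving total and mean."""
    average = mean(salaries)
    return [average] * len(salaries)
```

For salaries of 40000, 50000, and 60000, every masked value is 50000, so the total stays at 150000 while no individual figure is revealed.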

Shuffling

Another technique that can be used to maintain the integrity of your data is shuffling. Instead of being averaged, the salaries are shuffled among the records. All salary figures remain but cannot be linked to any one individual.
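Shuffling a single column can be sketched like this, assuming rows are dictionaries (the function name and row shape are hypothetical):

```python
import random


def shuffle_column(rows: list, column: str, rng: random.Random) -> list:
    """Return copies of the rows with one column's values shuffled."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]
```

Note that a random shuffle can leave some values in their original position by chance, so a stricter implementation might reshuffle until no row keeps its own value.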

Date Aging

Date aging is used for masking dates. In this technique, each actual date is shifted by a fixed amount (e.g., setting all dates back by 300 days). Unfortunately, this technique can be easily reverse-engineered by anyone who can figure out one date and, therefore, calculate all the other dates in the data set.
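The sketch below shows date aging with the 300-day offset from the example, along with why it is easy to reverse. The function name is hypothetical.

```python
from datetime import date, timedelta


def age_date(original: date, days: int = 300) -> date:
    """Shift a date back by a fixed number of days."""
    return original - timedelta(days=days)


# Weakness: knowing one original/masked pair reveals the offset for all rows.
known_original = date(2022, 11, 28)
known_masked = age_date(known_original)
recovered_offset = known_original - known_masked
```

Here `recovered_offset` is exactly the 300 days that were applied, so an attacker who learns it can add it back to every other masked date.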

Scrambling

Scrambling is a masking technique in which the original value is jumbled up in random order. This technique can be applied to figures and IDs to hide their original value. For example, an order ID number of 45673, when jumbled, can read as 36745.
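Scrambling can be sketched as a random reordering of the characters in a value (the function name is hypothetical):

```python
import random


def scramble(value: str, rng: random.Random) -> str:
    """Return the characters of the value in a random order."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)
```

The output always contains the same characters as the input, so short values offer little protection, and the shuffle can occasionally return the original order unchanged.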

Value Variance

Value variance is commonly applied to figures and dates. This technique adjusts all figures or dates by a variable amount (e.g., 10 percent). A common practice is to vary the values so they all fall within the range from lowest to highest. For example, in the remuneration table, you can generate new figures for all the salaries but ensure that they all stay in the range of acceptable salaries (between the highest and lowest figure).
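One possible reading of value variance, sketched under the assumption that each value is varied by a random amount of up to 10 percent in either direction and then clamped to the observed range (the function and parameter names are hypothetical):

```python
import random


def vary_value(value, rng, max_pct=0.10, low=None, high=None):
    """Vary a number by up to +/- max_pct, then clamp it to [low, high]."""
    factor = 1 + rng.uniform(-max_pct, max_pct)
    varied = value * factor
    if low is not None:
        varied = max(low, varied)
    if high is not None:
        varied = min(high, varied)
    return varied


salaries = [30000.0, 55000.0, 90000.0]
rng = random.Random(7)
varied_salaries = [
    vary_value(s, rng, low=min(salaries), high=max(salaries)) for s in salaries
]
```

Clamping keeps every masked salary between the lowest and highest original figure, at the cost of slightly distorting values near the edges of the range.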

Pseudonymization

Pseudonymization is a data masking technique in which you replace sensitive data with a pseudonym, and it is one of the safeguards that GDPR explicitly encourages. For instance, you can replace a user's name with User123 for all the database records. This pseudonym can be derived using other techniques, such as substitution, encryption, or redaction.

Pseudonymization allows a separate record to be kept, which can map the masked data back to its original form. This record, however, should be securely stored under the control of a data controller.
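The idea of a separately kept mapping can be sketched as follows. The class name `Pseudonymizer` and the in-memory dictionaries are hypothetical; in practice the mapping would live in a separately secured store rather than alongside the masked data.

```python
import itertools


class Pseudonymizer:
    """Assign stable pseudonyms and keep the mapping needed to reverse them."""

    def __init__(self, prefix: str = "User"):
        self._prefix = prefix
        self._counter = itertools.count(1)
        self._forward = {}  # original value -> pseudonym
        self._reverse = {}  # pseudonym -> original value (store securely)

    def pseudonymize(self, value: str) -> str:
        if value not in self._forward:
            pseudonym = f"{self._prefix}{next(self._counter)}"
            self._forward[value] = pseudonym
            self._reverse[pseudonym] = value
        return self._forward[value]

    def reidentify(self, pseudonym: str) -> str:
        """Map a pseudonym back; only the data controller should do this."""
        return self._reverse[pseudonym]
```

Deleting the reverse mapping is what turns this reversible scheme into an irreversible one.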

Anonymization

Anonymization goes a step further than pseudonymization: there is no way to recover the original data. All sensitive information, such as names, birthdays, and social security numbers, is masked, and the original data is securely deleted to prevent reversal of the process.

Implementing Data Masking

If you're looking to get started with data masking, the following are some best practices and tips that you should take into account before you begin.

Document the Process

The first and most important step to take when beginning your data masking journey is to document the entire masking process. This ensures the process can be repeated and improved upon as the structure of your data and user needs change over time.

Identify and Map Out Sensitive Information

In addition, before you begin to attempt to mask data, you need to identify all instances of sensitive data in your database. Data that is nonsensitive but can be used to infer sensitive data should also be cataloged and masked. Moreover, it's essential that you note who should have access to the data, whether in its masked or unmasked form. Not only should you note who these users are, but you also need to record their access rights as well as how they will use the data. The usage of the data will inform which technique you should use to mask your data.
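A first pass at locating sensitive data can be automated with pattern matching. The sketch below is only a starting point: the two patterns (a US-style social security number and a simple email shape) and the function name are assumptions, and pattern matching will miss sensitive data that has no recognizable format.

```python
import re

# Hypothetical, deliberately simple patterns; extend them for your own data.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def find_sensitive_columns(rows: list) -> dict:
    """Return {column: set of labels} for columns with matching values."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            if not isinstance(value, str):
                continue
            for label, pattern in PATTERNS.items():
                if pattern.search(value):
                    findings.setdefault(column, set()).add(label)
    return findings
```

The resulting catalog of columns can then be reviewed by hand and paired with the access rights and usage notes described above.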

Choose Appropriate Masking Techniques

Once you've begun documenting the process and identified the sensitive data in your database, you need to choose which techniques are ideal for masking your data. You'll likely implement different masking techniques based on your data, its type, and usage.

Keep the following factors in mind when choosing data masking techniques:

Format Preservation

It's crucial when masking data that the data maintain its original format. For example, when replacing dates, it's best to replace them with other dates instead of random text. If the original structure is lost, the data loses its meaning and becomes potentially useless. This also applies to data like card numbers, phone numbers, and postcodes. The masked data must remain realistic.
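A simple way to check format preservation, sketched here with hypothetical helper names, is to reduce each value to a signature of character classes and compare the original's signature with the masked value's:

```python
def format_signature(value: str) -> str:
    """Reduce a value to its shape: digits -> 9, letters -> A, rest kept."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in value
    )


def preserves_format(original: str, masked: str) -> bool:
    """Return True if the masked value has the same shape as the original."""
    return format_signature(original) == format_signature(masked)
```

A masked phone number such as `555-123-4567` passes this check against `555-867-5309`, while replacing a date like `2022-11-28` with the word `unknown` fails it.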

Semantic Integrity

Masked data should also not break the semantics of the data it's replacing. For example, if your user database only takes members older than eighteen, you can't break this rule when masking your data. All masked dates should be within a range acceptable to your given database and system rules.
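For the age rule in the example, a masked birth date can be drawn from a range that keeps every user at least eighteen. This is a sketch under assumptions: the function names and the upper age limit of ninety are hypothetical.

```python
import random
from datetime import date, timedelta


def years_before(day: date, years: int) -> date:
    """Return the same calendar day a number of years earlier."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        # February 29 has no counterpart in a non-leap year.
        return day.replace(year=day.year - years, day=28)


def mask_birth_date(rng: random.Random, today: date,
                    min_age: int = 18, max_age: int = 90) -> date:
    """Pick a random birth date that keeps the age within the allowed range."""
    latest = years_before(today, min_age)
    earliest = years_before(today, max_age)
    span_days = (latest - earliest).days
    return earliest + timedelta(days=rng.randint(0, span_days))
```

Every date this produces satisfies the system's minimum-age rule, so masked records remain valid test data.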

Data Uniqueness

Another thing to keep in mind is that data must maintain its uniqueness once masked. If a table uses a user's SSN as its primary key, masking this table with repeated values will cause errors.

This should be applied to all unique data, as repeating it will distort your data and its meaning. You should replace all unique values with equally unique masked values.
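Uniqueness can be enforced by tracking which masked values have already been issued. The sketch below, with a hypothetical function name, generates SSN-shaped replacements and retries on collision:

```python
import random


def mask_unique(values: list, rng: random.Random) -> dict:
    """Map each distinct value to a distinct, randomly generated SSN-like value."""
    mapping = {}
    used = set()
    for value in values:
        if value in mapping:
            continue  # repeated input keeps its existing masked value
        while True:
            candidate = (
                f"{rng.randint(0, 999):03d}-"
                f"{rng.randint(0, 99):02d}-"
                f"{rng.randint(0, 9999):04d}"
            )
            if candidate not in used:
                break
        used.add(candidate)
        mapping[value] = candidate
    return mapping
```

Because no masked value is issued twice, the masked column can still serve as a primary key.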

Referential Integrity

The final key integrity concern is referential integrity. When masking data that is referenced in other tables, it's essential that those records are masked as well and masked in the same way as the primary data. This means the same technique should be used to mask primary and secondary data.
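Referential integrity can be sketched by applying one shared mapping to the key column in every table that contains it. The table layout, the `customer_id` column, and the function name are hypothetical.

```python
def mask_key_everywhere(tables: dict, key: str, mapping: dict) -> dict:
    """Apply the same key mapping to every table that has the key column."""
    masked_tables = {}
    for name, rows in tables.items():
        masked_rows = []
        for row in rows:
            new_row = dict(row)
            if key in new_row:
                new_row[key] = mapping[new_row[key]]
            masked_rows.append(new_row)
        masked_tables[name] = masked_rows
    return masked_tables


tables = {
    "customers": [{"customer_id": "C1"}, {"customer_id": "C2"}],
    "orders": [
        {"order_id": 1, "customer_id": "C1"},
        {"order_id": 2, "customer_id": "C2"},
        {"order_id": 3, "customer_id": "C1"},
    ],
}
masked = mask_key_everywhere(tables, "customer_id", {"C1": "M7", "C2": "M3"})
```

After masking, every order still points at a customer that exists in the masked customers table, so joins behave as they did before.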

Test and Secure Masking Techniques

The QA and testing team must ensure that the data masking techniques work as expected. The security team should also make sure that all techniques employed are themselves secure. If an encryption key is used to mask data, policies and safeguards should be in place to make sure it does not get into the wrong hands.

Continual testing and securing of user data are some of the key tenets of GDPR (Article 32) and should be addressed with due care.

Develop a Workflow

Data masking, in most cases, needs to be done frequently as your production environment changes, possibly because of more entries, changes in data structures, or even regulations. You must develop a data masking workflow to make the process repeatable and, at times, even automated. Over time, this will result in lower implementation costs.
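A repeatable workflow often comes down to a declarative set of rules that can be rerun whenever the data changes. The sketch below assumes a simple mapping from column name to masking function; `RULES` and `mask_rows` are hypothetical names, and the individual rules are placeholders for whichever techniques you chose.

```python
# Hypothetical rule set: column name -> function that masks one value.
RULES = {
    "name": lambda _value: "Jane Doe",                          # substitution
    "ssn": lambda _value: None,                                 # nulling
    "card": lambda value: "*" * (len(value) - 4) + value[-4:],  # partial mask
}


def mask_rows(rows: list, rules: dict) -> list:
    """Apply the rule for each column; columns without a rule pass through."""
    return [
        {
            column: rules[column](value) if column in rules else value
            for column, value in row.items()
        }
        for row in rows
    ]
```

Keeping the rules in one place makes the process easier to document, review, and automate as tables and regulations change.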

As you can see, data masking is a complex process with many security and implementation concerns. While you can implement it manually, using tools that already come with data masking built-in is ideal. For instance, tools like CometChat, an in-app communication platform that comes with built-in data masking, make life so much easier.

CometChat provides a data masking filter, making it hassle-free to implement data masking in your communications. Out of the box, it can mask private details, including credit cards and social security numbers. This reduces the burden and cost of implementation while keeping your data secure and your organization compliant with regulations.

Conclusion

This article served as a comprehensive introduction to data masking. You learned about what data masking is and why it's important in safeguarding your organization's data. You also learned about data masking techniques, how to implement data masking in your organization, and why it's easier said than done.

Implementing data masking from scratch can be a tedious and stressful task. That's why using tools like CometChat makes your life so much easier. If you want to learn more about all of CometChat's capabilities, you can do so in its official docs, which include information about end-to-end encryption, the XSS filter, the virus scanner, and disappearing messages. Give it a try for free!

Michael Nyamande

CometChat

A digital product manager by day, Michael is a tech enthusiast who is always tinkering with different technologies. His interests include web and mobile frameworks, NoCode development, and blockchain development.
