How Grab’s Data Streaming Team Protects User Privacy with PII Masking
Introduction
Grab’s data engineers play a crucial role in designing and building machine learning models that provide strategic insights using the data that flows through the Grab Platform. To refine these models and ensure they work effectively in production, data engineers require access to actual production data. However, to maintain user privacy, data engineers must not be able to access any Personally Identifiable Information (PII) of the users.
PII Tagging
Grab leverages the Protocol Buffers (protobuf) data format to structure in-transit data. When creating a new stream, developers must describe its fields in a protobuf schema that is then used for serialising the data wherever it is sent over the wire, and deserialising it wherever it is consumed. Here, the developers must tag fields containing PII with a PII label like PII_TYPE_NAME. The passengerName field is an example of PII, which must be flagged accordingly.
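To make the idea of field-level tagging concrete, here is a minimal Python sketch that models the tags as metadata attached to each field. It is an analogy only: in the real system the tag lives in the protobuf schema, and the exact syntax used there is not shown in this article. The Booking class, the bookingId and fare fields, and the pii_tags helper are hypothetical; only passengerName and PII_TYPE_NAME come from the text above.

```python
from dataclasses import dataclass, field, fields


# Hypothetical stand-in for a protobuf message. The "pii" metadata key is an
# assumption made for this sketch, not Grab's actual annotation mechanism.
@dataclass
class Booking:
    bookingId: str
    passengerName: str = field(metadata={"pii": "PII_TYPE_NAME"})
    fare: float = 0.0


def pii_tags(schema) -> dict:
    """Return {field name: PII label} for every field that carries a PII tag."""
    return {
        f.name: f.metadata["pii"]
        for f in fields(schema)
        if "pii" in f.metadata
    }
```

Calling pii_tags(Booking) yields a mapping with a single entry for passengerName, which is the kind of information a downstream tool needs in order to know which fields to treat as PII.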
CI Pipeline
A Continuous Integration (CI) pipeline ensures that all PII fields are correctly tagged. Developers must publish the schema of their new stream to Coban’s Git repository. The CI pipeline runs an in-house Python script that scans each variable name in the committed schema and tests it against an extensive list of PII keywords. If there is a match and the variable is not tagged with the expected PII label, the pipeline fails. Variables that match a keyword but do not actually contain PII can be exempted through a whitelist, and any update to that whitelist requires approval from the Coban team.
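The check described above could look roughly like the following Python sketch. It is not Grab’s in-house script: the keyword list, the labels other than PII_TYPE_NAME, the function names, and the representation of a schema as a plain dict of field name to tag are all assumptions made for illustration.

```python
import re

# Illustrative keyword list mapping a keyword to the label expected on any
# field whose name contains it. Only PII_TYPE_NAME appears in the article;
# the other labels are hypothetical.
PII_KEYWORDS = {
    "name": "PII_TYPE_NAME",
    "email": "PII_TYPE_EMAIL",
    "phone": "PII_TYPE_PHONE",
}


def split_words(identifier: str) -> list:
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", identifier)
    return [p.lower() for p in parts]


def find_violations(schema: dict, whitelist=frozenset()) -> list:
    """Return (field name, expected label) for fields that look like PII
    but are not tagged with the expected label.

    schema maps each field name to its PII tag, or None if untagged.
    """
    violations = []
    for field_name, tag in schema.items():
        if field_name in whitelist:
            continue  # explicitly exempted field
        for word in split_words(field_name):
            expected = PII_KEYWORDS.get(word)
            if expected is not None and tag != expected:
                violations.append((field_name, expected))
                break
    return violations
```

In a CI job, a wrapper around find_violations would typically print the offending fields and exit with a non-zero status so that the pipeline fails. A field such as serviceName, which matches the keyword "name" without holding personal data, is the kind of entry that would go on the whitelist.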
Production Environment
Data streaming at Grab uses Kafka to produce and consume data. In the production environment, data is generated as users interact with the Grab superapp. The booking service, for instance, generates Kafka records and produces them for other services to consume. Machine learning pipelines are among the consuming services that require PII data. Because access to the production environment is highly restricted and monitored, PII is not masked in this process.
PII Masking
Data engineers are not granted access to the production environment. Instead, they access a staging environment where PII is masked. An in-house Flink application residing in the production environment performs the PII masking. It consumes the original data as a regular Kafka consumer and dynamically masks the PII based on the PII tags of the schema. It then produces the sanitised data to the Kafka cluster in the staging environment, where it is consumed by the staging machine learning pipelines. The Kafka cluster in the staging environment is secured with authorisation and authentication.
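The per-record transformation at the heart of this step can be sketched in a few lines of plain Python. This is not the Flink application itself and involves no Kafka or Flink APIs; the mask_record function, the placeholder value, and the flat dict representation of a record are assumptions for illustration. The article does not say how values are masked, so a fixed placeholder is used here.

```python
# Hypothetical placeholder; the actual masking strategy is not described.
MASK = "<masked>"


def mask_record(record: dict, pii_tags: dict) -> dict:
    """Return a copy of record with every PII-tagged field replaced by MASK.

    pii_tags maps field names to PII labels, as derived from the schema.
    Fields without a tag pass through unchanged, and the input is not mutated.
    """
    return {
        name: (MASK if name in pii_tags else value)
        for name, value in record.items()
    }


# Example: only the tagged field is masked.
tags = {"passengerName": "PII_TYPE_NAME"}
original = {"bookingId": "b-1", "passengerName": "Jane Doe", "fare": 12.5}
sanitised = mask_record(original, tags)
```

Because the function is driven entirely by the tag mapping, a newly tagged field is masked without any change to the masking code, which is what makes the tag-based approach dynamic.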
Conclusion
Grab’s data streaming team (Coban) enforces PII masking on machine learning data streaming pipelines to protect the security and privacy of users while enabling data engineers to refine their models with sanitised production data. The CI pipeline verifies that all fields describing PII are correctly tagged, and the in-house Flink application dynamically masks the PII. Grab’s mature privacy programme helps ensure that users’ data is well protected against human error.