Part 1: PII Detection and Anonymization of Real-Time Data
An introduction to anonymization of personally identifiable information in unstructured streaming data
Introduction
GitHub repo for this blog: https://github.com/stephenmooney1976/pii-detection
Personally Identifiable Information (PII) is information that, when used alone or with other relevant data, can uniquely identify an individual. PII may contain identifiers (e.g., social security number) that can directly identify a person, or quasi-identifiers (e.g., nationality) that can be combined with other quasi-identifiers (e.g., date of birth, hospital procedure) to successfully recognize an individual.
Examples of PII include but are not limited to:
Full Name
Social Security Number (SSN)
Driver’s License Number
Mailing Address
Credit Card / Debit Card Numbers
Passport Information
Financial Information
Medical Records
Protecting PII is essential for personal privacy, data privacy, data protection, information privacy, and information security. Even though this is not a new problem, the explosion of new data sources and social media has expanded opportunities for data exploits.
To further this point, the variety of data and its available sources makes detecting PII all the more difficult. While structured PII anonymization challenges still exist, unstructured data from social media, text messages, email, and free-text internet feeds creates an even more challenging environment for protecting individual privacy.
State-of-the-art technology as a solution
Until recently, the conventional wisdom was that while AI was better than humans at data-driven decision making, it was still inferior to humans at cognitive and creative tasks. Within just a handful of years, language-based AI has made great advances, changing common notions of what this technology can do.
Natural language processing (NLP), the branch of AI focused on how computers can process language the way humans do, has made profound advances. The broad awareness of technologies such as ChatGPT shows just how far we have come. In terms of dealing with PII, leveraging these technologies is a natural progression.
Presidio (from the Latin praesidium, ‘protection, garrison’) is an SDK that uses AI to provide fast identification and anonymization modules for private entities in text, such as credit card numbers, names, locations, social security numbers, and financial data (it also works on image data).
By default, Presidio uses spaCy as its engine for PII identification and extraction. While this works well for detecting PII in text, the variety of fields that may constitute PII continues to evolve. Thankfully, Presidio is extensible with third-party Named Entity Recognition (NER) models to detect and mask even more PII.
This post will demonstrate how to implement Presidio with an additional HuggingFace NER model to detect and anonymize PII in a real-time Kafka stream.
The implementation
PII Detection
The first things we need are some base Python libraries. These are in a requirements.txt file and include:
huggingface_hub
presidio_analyzer
presidio_anonymizer
presidio_image_redactor
spacy
The code later in this post also imports transformers, kafka-python, openai, and python-dotenv; if those are not already in your environment, install them as well.
!pip install -r requirements.txt
Next, we want to download the NER model used to extend our PII detection:
from huggingface_hub import snapshot_download
repo_id = 'dslim/bert-base-NER'
model_id = repo_id.split('/')[-1]
snapshot_download(repo_id=repo_id, local_dir=model_id)
Now that we have everything downloaded and installed, the next step is to create our base Presidio analyzer and extend it with a custom recognizer that wraps the HuggingFace model. To do this, we need to create a class that extends the Presidio EntityRecognizer class and implements its load and analyze methods. Documentation for implementing this class can be found here.
These sections are images in the blog due to formatting issues; the code can be found in the GitHub repo mentioned at the beginning and end of this blog.
Here is a minimal sketch of the class used to implement the NER EntityRecognizer (the full version lives in the repo). It assumes the transformers package is installed and maps the model's PER/LOC/ORG labels onto Presidio entity types:
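from presidio_analyzer import EntityRecognizer, RecognizerResult
from transformers import pipeline

class TransformerRecognizer(EntityRecognizer):
    """Presidio recognizer that wraps a HuggingFace token-classification model."""

    # map dslim/bert-base-NER entity groups onto Presidio entity types
    LABEL_MAP = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION"}

    def __init__(self, model_dir, supported_language="en"):
        super().__init__(
            supported_entities=list(self.LABEL_MAP.values()),
            supported_language=supported_language,
            name="TransformerRecognizer",
        )
        self.model_dir = model_dir
        self.pipeline = None

    def load(self):
        # build the HuggingFace NER pipeline from the downloaded snapshot;
        # aggregation_strategy="simple" merges word pieces into whole entities
        self.pipeline = pipeline(
            "ner",
            model=self.model_dir,
            tokenizer=self.model_dir,
            aggregation_strategy="simple",
        )

    def analyze(self, text, entities=None, nlp_artifacts=None):
        if self.pipeline is None:
            self.load()

        results = []
        for prediction in self.pipeline(text):
            entity_type = self.LABEL_MAP.get(prediction["entity_group"])
            if entity_type is None:
                continue  # skip labels (e.g. MISC) we don't map to PII types
            results.append(
                RecognizerResult(
                    entity_type=entity_type,
                    start=prediction["start"],
                    end=prediction["end"],
                    score=float(prediction["score"]),
                )
            )
        return results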
Now that we have the NER EntityRecognizer built, we can use it by creating a Presidio AnalyzerEngine instance and registering the recognizer with the execution pipeline. This is done as follows:
from presidio_analyzer import AnalyzerEngine

model_dir = 'bert-base-NER'  # directory we downloaded the HuggingFace model to above

xfmr_recognizer = TransformerRecognizer(model_dir)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(xfmr_recognizer)
A quick execution of this analyzer shows us that our analyzer is able to locate PII:
text = "His name is Mr. Jones and his phone number is 212-555-5555"
analyzer_results = analyzer.analyze(text=text, language="en")
print(analyzer_results)
"""
[type: PERSON, start: 16, end: 21, score: 0.944421648979187,
type: PHONE_NUMBER, start: 46, end: 58, score: 0.75
]
"""
By default, Presidio uses generic tags to replace detected data. For instance, anonymizing with the default settings would yield text that looks like this:
from presidio_anonymizer import AnonymizerEngine

# initialize the anonymizer. this is not the extended EntityRecognizer
anonymizer_engine = AnonymizerEngine()

# create anonymized results
anonymized_results = anonymizer_engine.anonymize(
    text=text, analyzer_results=analyzer_results
)

print(anonymized_results)
"""
text: His name is Mr. <PERSON> and his phone number is <PHONE_NUMBER>
items:
[
{'start': 49, 'end': 63, 'entity_type': 'PHONE_NUMBER', 'text': '<PHONE_NUMBER>', 'operator': 'replace'},
{'start': 16, 'end': 24, 'entity_type': 'PERSON', 'text': '<PERSON>', 'operator': 'replace'}
]
"""
We can also use custom operators in our code to provide more desirable masking of PII. Below is an example:
from presidio_anonymizer.entities import OperatorConfig

operators = {
    "DEFAULT": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"}),
    "PHONE_NUMBER": OperatorConfig(
        "mask",
        {
            "type": "mask",
            "masking_char": "*",
            "chars_to_mask": 12,
            "from_end": True,
        },
    ),
    "US_SSN": OperatorConfig(
        "mask",
        {
            "type": "mask",
            "masking_char": "#",
            "chars_to_mask": 10,
            "from_end": False,
        },
    ),
    "TITLE": OperatorConfig("redact", {}),
}
# initialize the anonymizer. this is not the extended EntityRecognizer
anonymizer_engine = AnonymizerEngine()

# create anonymized results
anonymized_results = anonymizer_engine.anonymize(
    text=text, analyzer_results=analyzer_results, operators=operators
)

print(anonymized_results)
print(anonymized_results)
"""
text: His name is Mr. <ANONYMIZED> and his phone number is ************
items:
[
{'start': 53, 'end': 65, 'entity_type': 'PHONE_NUMBER', 'text': '************', 'operator': 'mask'},
{'start': 16, 'end': 28, 'entity_type': 'PERSON', 'text': '<ANONYMIZED>', 'operator': 'replace'}
]
"""
Streaming PII detection
In order to generate PII for this example, I used OpenAI to generate some random PII text. In this installment, I will feed this PII through a Kafka producer to a Redpanda cluster running locally. For those of you who are not familiar with Redpanda, it is a drop-in replacement for Kafka that is compatible with all Kafka tools and APIs. It is also written purely in C++, which makes it extremely fast.
First, to generate random PII, I created a program that builds a simple OpenAI prompt and loops until I get the desired number of output JSON records. This could be done in one program rather than writing to an intermediate file; I am simply trying to keep my OpenAI costs down:
#!/usr/bin/env python

from dotenv import load_dotenv
import json
import openai
import os

load_dotenv('.env')
# pass the key loaded from .env to the openai client (the module reads the
# environment at import time, which happens before load_dotenv runs)
openai.api_key = os.getenv('OPENAI_API_KEY')

model_engine = "text-davinci-003"
num_samples = 10

prompt = "Generate text with a bunch of random PII. Please do not use the name John Smith or Jane Doe. Please use a random name."

output_results = list()

for ii in range(num_samples):
    completion = openai.Completion.create(
        engine=model_engine,
        prompt=prompt,
        max_tokens=2048,
        n=1,
        stop=None,
        temperature=0.5
    )

    response_text = completion.choices[0].text.strip()
    output_results.append(dict(inputs=response_text))

with open('data/pii_records.json', 'w', encoding='utf-8') as f:
    json.dump(output_results, f, ensure_ascii=True, indent=2)
Sample records of the resulting file look like this:
[
{
"inputs": "Meet Mary Johnson, a 34-year-old software engineer from Chicago, Illinois. Mary has been working in the industry for 11 years and currently works for a large tech company. She recently moved to the Bay Area and is enjoying the new lifestyle. Mary is a big fan of outdoor activities such as camping and hiking, and she loves to travel around the world. She\u2019s also an avid reader and enjoys cooking on the weekends. Mary\u2019s social security number is 847-51-6329 and her driver\u2019s license number is L934-908-1228."
},
{
"inputs": "Meet George Bailey, a 25-year-old software engineer from Chicago, Illinois. George enjoys playing video games, reading fantasy novels, and exploring the outdoors. He recently moved to Seattle, Washington and is excited to explore the city. George's Social Security number is 872-32-9087 and his driver's license number is C188815. George's email address is georgebailey@example.com and his phone number is (206) 555-0130."
},
...
]
We can then write a simple Kafka JSON producer that reads this data and publishes it to the Kafka/Redpanda topic:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Created on Wed Aug 9 13:21:29 2023

@author: Stephen Mooney
"""

from dotenv import load_dotenv
import json
from kafka import KafkaProducer
import os
import time

""" read in information about started redpanda environment """
load_dotenv('redpanda.env')

""" create producer """
producer = KafkaProducer(
    bootstrap_servers=os.environ.get('RPK_BROKERS'),
    value_serializer=lambda m: json.dumps(m).encode('ascii')
)

topic = "random-pii-text"

def on_success(metadata):
    print(f"Message produced to topic '{metadata.topic}' at offset {metadata.offset}")

def on_error(e):
    print(f"Error sending message: {e}")

""" read in OpenAI generated PII """
with open('../data/pii_records.json') as f:
    l_json_data = json.load(f)

""" push messages to topic from OpenAI """
for ii in range(len(l_json_data)):
    msg = dict(id=ii, inputs=l_json_data[ii]['inputs'])
    future = producer.send(topic, msg)
    future.add_callback(on_success)
    future.add_errback(on_error)
    time.sleep(0.100)

""" flush and close producer """
producer.flush()
producer.close()
Running Redpanda locally does require Docker Desktop. I created a script that starts a three-node Redpanda cluster using the rpk command. Documentation for this command can be found here. I am massaging the output of rpk and saving these values as environment variables for use in other programs within this blog.
#!/usr/bin/env bash
L_NUM_CONTAINERS=3
rpk container start -n ${L_NUM_CONTAINERS} | grep export | sed -e 's/^[ \t]*//' > redpanda.env
exit 0
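For reference, the redpanda.env file written by this script holds the exported broker addresses that load_dotenv reads in the producer and consumer. The ports are assigned when the containers start, so the exact values will vary, but the file looks something like this:
export RPK_BROKERS=127.0.0.1:50397,127.0.0.1:50401,127.0.0.1:50405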
Then, to produce data to the topic from within the Jupyter environment, we can run the producer above to publish the raw messages containing PII:
Message produced to topic 'random-pii-text' at offset 0
Message produced to topic 'random-pii-text' at offset 1
Message produced to topic 'random-pii-text' at offset 2
Message produced to topic 'random-pii-text' at offset 3
Message produced to topic 'random-pii-text' at offset 4
Message produced to topic 'random-pii-text' at offset 5
Message produced to topic 'random-pii-text' at offset 6
Message produced to topic 'random-pii-text' at offset 7
Message produced to topic 'random-pii-text' at offset 8
Message produced to topic 'random-pii-text' at offset 9
Message produced to topic 'random-pii-text' at offset 10
Message produced to topic 'random-pii-text' at offset 11
Message produced to topic 'random-pii-text' at offset 12
Finally, to consume messages and apply the Presidio transformation to the stream, we create a consumer group that applies the analyzer and anonymizer to each message:
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    bootstrap_servers=os.environ.get('RPK_BROKERS'),
    group_id="demo-group",
    auto_offset_reset="earliest",
    enable_auto_commit=False,  # offsets are not committed, so re-running replays the topic
    consumer_timeout_ms=1000,
    value_deserializer=lambda m: json.loads(m.decode('ascii'))
)

topic = "random-pii-text"
consumer.subscribe(topic)

anonymized_json = list()

try:
    for message in consumer:
        topic_info = f"topic: {message.topic} ({message.partition}|{message.offset})"
        message_info = f"key: {message.key}, {message.value}"

        original_json = message.value
        original_text = original_json['inputs']

        # detect PII with the extended analyzer, then anonymize it
        # with the custom operators defined earlier
        analyzer_results = analyzer.analyze(text=original_text, language="en")
        anonymized_text = anonymizer_engine.anonymize(text=original_text,
                                                      analyzer_results=analyzer_results,
                                                      operators=operators)

        original_json['inputs'] = anonymized_text.text
        anonymized_json.append(original_json)
except Exception as e:
    print(f"Error occurred while consuming messages: {e}")
finally:
    consumer.close()
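At this point, anonymized_json holds the scrubbed records. As a sketch of one possible next step (the output topic name anonymized-pii-text is hypothetical, not part of the repo), we could publish the anonymized records back to a separate topic so downstream consumers never see the raw PII:

from kafka import KafkaProducer

# sketch: publish anonymized records to a separate (hypothetical) output topic
out_producer = KafkaProducer(
    bootstrap_servers=os.environ.get('RPK_BROKERS'),
    value_serializer=lambda m: json.dumps(m).encode('ascii')
)

for record in anonymized_json:
    out_producer.send('anonymized-pii-text', record)

out_producer.flush()
out_producer.close()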
Here is an example of the original input data (first 5 records):
Meet Mary Johnson, a 34-year-old software engineer from Chicago, Illinois. Mary has been working in the industry for 11 years and currently works for a large tech company. She recently moved to the Bay Area and is enjoying the new lifestyle. Mary is a big fan of outdoor activities such as camping and hiking, and she loves to travel around the world. She’s also an avid reader and enjoys cooking on the weekends. Mary’s social security number is 847-51-6329 and her driver’s license number is L934-908-1228.
Meet George Bailey, a 25-year-old software engineer from Chicago, Illinois. George enjoys playing video games, reading fantasy novels, and exploring the outdoors. He recently moved to Seattle, Washington and is excited to explore the city. George's Social Security number is 872-32-9087 and his driver's license number is C188815. George's email address is georgebailey@example.com and his phone number is (206) 555-0130.
Alice Johnson just moved to the city and is looking for a new job. She recently updated her driver's license with her new address, which is 123 Main Street, Anytown, USA. Her social security number is 789-456-1234 and her date of birth is April 10th, 1985. Alice also recently applied for a new credit card with her new address, and the card number is 4242-4242-4242-4242.
Meet George Harris, a 21-year-old from San Francisco, California. George is a student at the University of California, Berkeley, studying computer science. He has a driver's license number of GX6A9-2B5F-42Y1 and a social security number of 890-21-9456. George's phone number is (415) 567-3490 and his email address is georgeh@example.com. He is passionate about programming and loves to play video games in his spare time.
Hello, my name is Paulina Bautista. I'm a 25-year-old student from Miami, Florida. I recently graduated from Florida State University with a degree in Business Administration. I'm currently looking for a job in the finance sector. I'm also interested in learning more about data privacy and security, as I'm sure many of my peers are. I'm excited to see what the future holds for me!
After anonymization, the first 5 records now look like this:
Meet <ANONYMIZED>, a <ANONYMIZED> software engineer from <ANONYMIZED>, <ANONYMIZED>. <ANONYMIZED> has been working in the industry for <ANONYMIZED> and currently works for a large tech company. She recently moved to <ANONYMIZED> and is enjoying the new lifestyle. <ANONYMIZED> is a big fan of outdoor activities such as camping and hiking, and she loves to travel around the world. She’s also an avid reader and enjoys cooking on <ANONYMIZED>. <ANONYMIZED>’s social security number is ########### and her driver’s license number is <ANONYMIZED>-908-1228.
Meet <ANONYMIZED>, a <ANONYMIZED> software engineer from <ANONYMIZED>, <ANONYMIZED>. <ANONYMIZED> enjoys playing video games, reading fantasy novels, and exploring the outdoors. He recently moved to <ANONYMIZED>, <ANONYMIZED> and is excited to explore the city. <ANONYMIZED>'s Social Security number is ########### and his driver's license number is <ANONYMIZED>. <ANONYMIZED>'s email address is <ANONYMIZED> and his phone number is (2************.
<ANONYMIZED> just moved to the city and is looking for a new job. She recently updated her driver's license with her new address, which is 123 <ANONYMIZED>, <ANONYMIZED>, <ANONYMIZED>. Her social security number is ************ and her date of birth is <ANONYMIZED>. <ANONYMIZED> also recently applied for a new credit card with her new address, and the card number is <ANONYMIZED>.
Meet <ANONYMIZED>, a <ANONYMIZED> from <ANONYMIZED>, <ANONYMIZED>. <ANONYMIZED> is a student at the <ANONYMIZED>, <ANONYMIZED>, studying computer science. He has a driver's license number of GX6A9-2B5F-42Y1 and a social security number of ###########. <ANONYMIZED>'s phone number is (4************ and his email address is <ANONYMIZED>. He is passionate about programming and loves to play video games in his spare time.
Hello, my name is <ANONYMIZED>. I'm a <ANONYMIZED> student from <ANONYMIZED>, <ANONYMIZED>. I recently graduated from <ANONYMIZED> with a degree in Business Administration. I'm currently looking for a job in the finance sector. I'm also interested in learning more about data privacy and security, as I'm sure many of my peers are. I'm excited to see what the future holds for me!
Conclusion and next steps
In this post, we took records containing PII from a Kafka stream and anonymized them using a Kafka consumer (with Redpanda as our Kafka implementation). As you can see, it is relatively straightforward to detect and anonymize PII within real-time data streams.
In the next part of this series, we will explore creating stream processors that will allow these transformations to take place within Kafka itself. Ultimately in this series, we are going to integrate this and other Large Language Models (LLMs) into real-time databases.
The GitHub repo for this example can be found here. I look forward to any questions and comments.