The rise of digital tools has made it easier than ever for organizations to collect information and analyze it for trends. This sort of data can go a long way to fuel innovation, support scientific advancements, and drive business and research decisions.
But this sort of data can also be dangerous: it can be used to identify an individual. On an education platform, for instance, it could reveal that a student has a disability, or even expose their home address.
So how can platforms and tools address the risks posed by Personally Identifiable Information (PII)? This article explores some innovative tools our team has been developing and outlines what lies ahead for them.
PII refers to any data that could potentially identify a specific individual. This includes names, social security numbers, addresses, and other unique identifiers. The presence of PII in datasets is a significant concern for researchers due to privacy and security implications. Unauthorized access to PII can lead to identity theft, financial loss, and even lawsuits.
Researchers are particularly wary when handling “unstructured” data like writing samples because the removal of PII from this type of dataset is a daunting and expensive task. The sheer volume of data can make manual review impractical or prohibitively costly. Automated solutions often lack the precision required to ensure that all PII is identified and removed. As a result, many valuable datasets remain unpublished, stifling potential discoveries and innovations.
To address this pressing issue, one of us, Langdon Holmes, and Scott Crossley at Vanderbilt University, alongside The Learning Agency Lab, created the PII Detection Data Science Competition. This competition invited data scientists from around the globe to develop algorithms capable of accurately detecting PII in large datasets. The goal was ambitious yet crucial: to create a tool that could automatically and reliably identify PII, thus allowing for the safe release of datasets without the prohibitive costs associated with manual review.
This competition presented a unique challenge. Data science competitions require a shared dataset, but the goal of the competition was to detect unshareable information. To circumvent this issue, the team at Vanderbilt labeled all the PII in a large collection of student essays. Then they replaced all of this PII, which could be used to identify real students, with fake identifiers, and released this version of the data to competitors.
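The replacement step can be sketched in a few lines. This is a hypothetical, minimal illustration of surrogate substitution, not the actual Vanderbilt pipeline: given PII spans that have already been labeled, each one is swapped for a fake identifier of the same type (the `SURROGATES` table and span format here are assumptions for the example).

```python
# Minimal sketch of surrogate replacement for labeled PII spans.
# Hypothetical example; the real anonymization pipeline is more involved.

SURROGATES = {"NAME": "Jordan Smith", "EMAIL": "jordan.smith@example.com"}

def replace_pii(text, spans):
    """Replace labeled (start, end, label) spans with surrogate values.

    Spans are processed right-to-left so earlier offsets stay valid
    after each substitution.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + SURROGATES[label] + text[end:]
    return text

essay = "My name is Maria Lopez and my email is maria@school.edu."
spans = [(11, 22, "NAME"), (39, 55, "EMAIL")]
print(replace_pii(essay, spans))
```

Replacing from the end of the string backward is the key design detail: substituting a surrogate of a different length would otherwise invalidate the offsets of every span that follows it.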
Participants in the competition were tasked with creating and training their algorithms on the competition dataset. The challenge required a deep understanding of natural language processing, machine learning, and data privacy principles. Competitors worked to ensure their models could accurately distinguish between PII and non-PII data, such as telling a respondent’s name apart from a famous name like “Abraham Lincoln.” An effective algorithm must not only detect PII but also preserve information that is not PII.
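The distinction above can be made concrete with a toy example. Real entries used sophisticated machine-learning models; the naive regex detector and the `PUBLIC_FIGURES` allowlist below are hypothetical stand-ins, shown only to illustrate the idea that a detected name is PII only if it does not refer to a well-known public figure.

```python
import re

# Toy illustration of the PII-vs-non-PII distinction. The allowlist and
# the regex "recognizer" are hypothetical simplifications.
PUBLIC_FIGURES = {"Abraham Lincoln", "Marie Curie"}

def detect_names(text):
    # Stand-in for a real named-entity recognizer: naively treat any
    # pair of capitalized words as a candidate name.
    return re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)

def flag_pii_names(text):
    """Return candidate names that are likely PII (not public figures)."""
    return [name for name in detect_names(text) if name not in PUBLIC_FIGURES]

sample = "In my essay I quote Abraham Lincoln. My teacher is Dana Reyes."
print(flag_pii_names(sample))  # only the non-famous name should remain
```

A production system would replace both pieces with learned components, but the filtering logic captures why blindly redacting every name would destroy useful, non-identifying information.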
Over 2,000 teams participated in the competition between January and April 2024. The competition drew 55,847 submissions and featured top talent from around the world. The top-performing algorithms demonstrated an impressive ability to identify PII with remarkable recall and precision. These algorithms can now be utilized by researchers and organizations to automatically scan and sanitize large datasets, making it feasible to publish valuable data without compromising individual privacy.
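For readers unfamiliar with the two metrics mentioned above, here is a minimal sketch of how precision and recall can be computed for token-level PII predictions. The toy data and the `(position, label)` representation are assumptions for illustration; the competition's official scoring differed in its details.

```python
def precision_recall(predicted, actual):
    """Token-level precision and recall over sets of (position, label) pairs.

    Precision: what fraction of flagged tokens were truly PII.
    Recall: what fraction of true PII tokens were flagged.
    """
    tp = len(predicted & actual)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall

actual = {(3, "NAME"), (4, "NAME"), (17, "EMAIL")}
predicted = {(3, "NAME"), (4, "NAME"), (9, "NAME")}
print(precision_recall(predicted, actual))
```

High recall matters most for privacy (a missed identifier is a leak), while high precision preserves the non-PII content that makes the released data useful.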
Significantly, these top algorithms are available at no charge: anyone can inspect the competition code and run it on their own data. This democratization of PII detection technology ensures that even small organizations and independent researchers can access powerful tools to safeguard privacy. By lowering the barrier to data publication, the competition has created new opportunities for data-driven innovation.
As we move forward, the ability to effectively and affordably identify PII will empower researchers to share their data more freely, ultimately accelerating progress across a multitude of fields. In this way, the competition not only safeguards privacy but also fosters a culture of collaboration and discovery in the data science community.