I love shopping for tools. You may have heard the saying “you can’t have too many tools.” It is quite frustrating when you don’t have the right tool for the job! In the software world, developers tend to look at data the same way that I look at buying tools: “I might need this in the future.” In this article, I hope to convince you to abandon this gobble-up-all-the-data thought process.
Let’s visualize your application as a ship. The data you use serves as the ship’s ballast, a heavy material, such as water or sand, placed low in a vessel to improve its stability. Just enough ballast keeps the boat stable and maximizes your cargo. Any more ballast than the minimum required adds unnecessary risk to your journey (and also reduces the cargo that you can carry).
Too much data may translate into a poor experience and a loss of user trust. Your users may ask themselves, "Why do I have to share my location on this Solitaire app?"
Too little data may also translate into a poor user experience because you may not be able to tailor and personalize the UX for your users.
Having the right (minimal) amount of data should maximize the user experience. Every piece of data must serve a purpose for the user and bring value.
Developers want to provide the best experience for their users. Often, providing a great user experience means collecting their data. For example, in an application, a developer may use location data to set a geofence to improve the user’s experience while they visit a physical store. What data should they collect and how should they handle it?
For each piece of user data collected, there are hidden costs in the form of:
- Compliance (GDPR, HIPAA, PCI, etc.)
Each of these topics is an entire discipline, and there is a lot of literature, valuable or otherwise, covering them in heavy detail. We will be focusing broadly on reducing our risk across these areas.
Real example: Abusing Data From Open Sources and Breaches
Let’s consider a real-world attack in which attackers abused data from open sources and recent data breaches, such as Equifax and other smaller ones, to run up ~$25,000 in credit card charges.
Someone I know was directly affected when attackers defeated security controls using data gathered through Open Source Intelligence (OSINT) and, likely, breach data purchased on the dark web.
We’re going to call this person Candice (fake name to protect the innocent). One day, Candice opens her credit card statement and notices that there's ~$25,000 on the card that she didn't spend and that a new account was added.
Candice and I did a little research, and here’s what we think happened.
Candice is your average consumer. She turns on multi-factor authentication (MFA) for her banking accounts when it is required and she is also moderately successful at using different passwords for each of her accounts.
Like many of us, Candice’s information, including her Social Security number, was exposed in the Equifax breach and some smaller breaches. She probably also re-used her password elsewhere. With some credential stuffing, where attackers try previously breached passwords on other sites in an attempt to gain access, attackers were likely able to pass the first step of authentication but were not able to get past the multi-factor authentication challenge.
However, with a little extra OSINT to obtain her address, phone number, age, family members, pets’ names, etc., the attackers were able to build a whole profile on Candice and use it for social engineering and identity theft. Armed with the right combination of details about Candice, attackers were able to bypass the security controls enforced by the bank through customer support. The bank eventually flagged the fraudulent charges and, luckily, Candice’s credit ended up fine.
As a savvy user of the internet, it is always a best practice to consider that the information you share could be used for other purposes. For example, some Facebook memes included lists of firsts or important life events, such as “where did you first meet your spouse.” This is a gold mine of OSINT for attackers who are trying to access websites that still use security questions as additional factors.
Each individual will need to consider their own threat model for the information they share. It isn’t easy to anticipate where OSINT data can be collected and how it could be abused, and this is where developers come in as the first line of defense. As Vanderbilt University's James Hazel told Business Insider in this excellent piece on anonymized DNA data, "Data is data — once it's out there, it's very hard to control." Treating user data as a liability protects your users by making that data more difficult to obtain and to use maliciously in other places you might not expect.
Back to our geofencing problem. Let’s say your company wants to know the following information:
- Adoption rate: how many people are using the application in the store
- Conversion rate: whether a person used the app and then went to the store within a certain period
There are a lot of inputs available to answer these questions:
- GPS location
- Nearby WiFi
- Nearby Bluetooth
- IP Address
- Unique ID of the device
- User ID
- Items searched
We could store all of this data as a single record and send it to the server every time the user logs in to the app. If the record falls within the geofence, we can process whether or not it was a conversion and increase the adoption count. This seems like a lot of information for a couple of reasonably simple questions (resist that feeling of “but we may need this data in the future”). Don’t forget to call in compliance and security on this one. In other words, we’re carrying a lot of ballast in this ship.
Instead, what if we stored a record that included the following in the client application:
These two data points record whether the user was within the geofence the last time they logged in. To determine if the user is within the geofence, our application would store the geofence boundary and request permission to use location services when it is running. In this case, we don’t really need to save the inGeofence information on the server; we can store it temporarily and use it to determine whether the next request is a conversion.
Side note: Let’s acknowledge that client-side calculations are not always appropriate; make sure to consider your threat model.
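A minimal sketch of how the client might derive such a record. It assumes the two data points are an inGeofence flag and a timestamp (named `checkedAt` here, a hypothetical name), and it assumes a circular geofence checked with the haversine formula. The type and function names are illustrative only; a real app would get the position from the platform's location services.

```typescript
// Hypothetical record shape: `inGeofence` comes from the article,
// `checkedAt` is an assumed name for the time of the last check.
interface LocalGeofenceRecord {
  inGeofence: boolean; // was the user inside the geofence at the last login?
  checkedAt: number;   // when the check ran, in epoch milliseconds
}

interface LatLng {
  lat: number;
  lng: number;
}

// Assumption: the geofence boundary is a circle around the store.
interface CircularGeofence {
  center: LatLng;
  radiusMeters: number;
}

const EARTH_RADIUS_METERS = 6371000;

const toRadians = (degrees: number): number => (degrees * Math.PI) / 180;

// Great-circle distance between two points (haversine formula).
function distanceMeters(a: LatLng, b: LatLng): number {
  const dLat = toRadians(b.lat - a.lat);
  const dLng = toRadians(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_METERS * Math.asin(Math.sqrt(h));
}

// Reduce a raw position to the two fields we keep locally.
// The coordinates themselves are never stored or returned.
function toLocalRecord(
  position: LatLng,
  fence: CircularGeofence,
  now: number
): LocalGeofenceRecord {
  return {
    inGeofence: distanceMeters(position, fence.center) <= fence.radiusMeters,
    checkedAt: now,
  };
}
```

The point of the design is that the raw GPS fix only lives for the duration of the function call. What remains on the device is a boolean and a timestamp, which is far less useful to an attacker than a location history.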
When the user logs in to the application, the application can check the previous inGeofence record as well as calculate a new one. Then, if the user is within the geofence, the application could send the following to the servers:
UniqueId would serve as a way to uniquely identify our new data point. In this case, we could choose to use a random string.
isConversion is inferred as true if the last login was not geofenced and occurred within our specified timeframe. We’ll increment our adoption count as aggregated data based on the number of unique requests. We have managed to take rich real-time user data and only transfer and persist the results that we want to measure. We have accomplished our business needs while minimizing risk, and can follow this same process when new features are needed. This process will help us purposefully select the data that we need rather than use whatever we can collect.
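The inference and aggregation described above could be sketched as follows. The payload fields uniqueId and isConversion follow the text; everything else is an assumption made for illustration: the function names, the `checkedAt` timestamp, the `conversionWindowMs` parameter, the `randomId` helper, and the choice to deduplicate by uniqueId on the server as one reading of "unique requests".

```typescript
// Hypothetical local record: `checkedAt` is an assumed field name.
interface LocalGeofenceRecord {
  inGeofence: boolean;
  checkedAt: number; // epoch milliseconds
}

// The only data that leaves the device.
interface StoreVisitEvent {
  uniqueId: string;
  isConversion: boolean;
}

// Illustrative ID helper only; a production app would more likely use a
// UUID from a cryptographically secure source.
const randomId = (): string =>
  Math.random().toString(36).slice(2) + Date.now().toString(36);

// Build the payload, or return null when the user is not in the store.
function buildStoreVisitEvent(
  previous: LocalGeofenceRecord | null,
  current: LocalGeofenceRecord,
  conversionWindowMs: number,
  makeId: () => string = randomId
): StoreVisitEvent | null {
  if (!current.inGeofence) return null; // nothing to report outside the geofence

  // Conversion: the last login was outside the geofence and recent enough.
  const isConversion =
    previous !== null &&
    !previous.inGeofence &&
    current.checkedAt - previous.checkedAt <= conversionWindowMs;

  return { uniqueId: makeId(), isConversion };
}

// Server side: keep aggregated counts rather than per-user rows.
interface StoreVisitTotals {
  adoptionCount: number;
  conversionCount: number;
}

function aggregate(events: StoreVisitEvent[]): StoreVisitTotals {
  const seen = new Set<string>();
  let conversionCount = 0;
  for (const event of events) {
    if (seen.has(event.uniqueId)) continue; // skip duplicate deliveries
    seen.add(event.uniqueId);
    if (event.isConversion) conversionCount += 1;
  }
  return { adoptionCount: seen.size, conversionCount };
}
```

Note that neither the payload nor the totals contain a user ID, device ID, or coordinates. If the server were breached, the attacker would learn how many store visits and conversions occurred, and nothing that helps build a profile of any one person.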
The design scenarios that you encounter will likely be much more complex than the geofencing example above. Consider the following user data security principles to kickstart the discussion on your next project, remembering that each piece of user data you collect is a liability:
- Clarify your objective and identify the minimum amount of data necessary to meet it
- Design the use of data in your applications to address only the specific need
- Consult compliance, security, and privacy experts on your data choices
- Transfer and store aggregated, calculated, or inferential data over raw user data
  - Aggregated: adoption count vs. sending the user’s identifying data
  - Calculated: inGeofence vs. sending the user’s GPS location
  - Inferential: isConversion vs. sending the user’s login history, timestamp, and GPS location
- Resist the urge to collect data “because I might need it”
- Consider how the data could be abused when combined with other sources, such as what happened to Candice
On your next design, take on the engineering challenge of minimizing user data, just as you would consider scalability or performance. Sometimes during the design process, you won’t be able to reduce your user data liabilities. At a minimum, incorporating the security principles above helps you identify those liabilities and handle them appropriately (some boats need a lot of ballast just to stay afloat). The next time you are ready to “ship it,” check your ballast and you will be better prepared for rough waters. If you'd like to learn more about how Auth0 can help you securely collect only the authentication and authorization data that you need, please reach out to an Auth0 resource.