
Imputation and when Listwise deletion might make sense


{update 2022: I’m not sure I’m explaining this concept accurately. Beware…}

Imputation is the practice of filling in missing values in a data set with estimates. It takes care, because an imputed value is a guess, not an observation. For instance, you might use linear regression to predict a small number of missing values from the data you do have.
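To make that concrete, here is a minimal sketch of regression-based imputation in plain Python. The weekly numbers, the function names, and the choice of week index as the only predictor are all assumptions made up for illustration; real data would rarely be this tidy.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def impute_with_regression(values):
    """Fill None entries with the fitted line; also return which entries were filled."""
    observed = [(i, v) for i, v in enumerate(values) if v is not None]
    slope, intercept = fit_line([i for i, _ in observed], [v for _, v in observed])
    filled = [v if v is not None else slope * i + intercept
              for i, v in enumerate(values)]
    imputed_flags = [v is None for v in values]
    return filled, imputed_flags


# Hypothetical weekly "local" dollars; the value at index 3 was lost.
weekly_local = [40.0, 42.0, 44.0, None, 48.0, 50.0]
filled, imputed = impute_with_regression(weekly_local)
print(filled[3])   # approximately 46.0, since the observed points lie on one line
print(imputed)     # flags mark which values are estimates
```

Returning the flags alongside the filled values is a deliberate choice: it keeps the guess labeled as a guess wherever the data goes next.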

Consider sales data for our grocery chain. One store had a server error for an hour or two.  Dang!  That clunky legacy data processing script we used wasn’t capturing metadata about the items sold during that time. Oh no! We don’t have any other good redundancy for going back over that data and filling in the actual numbers. (I’m sure this never happens in real life, but play along.)

Let’s say the metadata was the distance the food traveled to reach the store. Every time an item sells, the transaction affects the average distance food travels to be sold at that store. To make it more complicated, most items don’t have this attribute. Only about 30% have it and need to be tracked this way, hence the separate tracking system that failed above.

Getting into the meat of the impact of our data loss: imagine the purpose of that data is to show our customers what percentage of the food they buy over the month comes from within 100 miles of the store, and to give each shopper a monthly report. They care about these things, and they want to know they’re impacting our local farmers with their dollars.

Enter our amazing fictitious customer Joni. She shops with us once a week, and shops big. Her life hack is to treat that metric as a game and share it with her friends, to make sure she’s putting her money into both her own health (buying a lot of fresh, local food) and her local economy.

Unfortunately, Joni shopped during that hour we were down, and we only have the total she spent. We lost how much of it was “local”. Now we have a dilemma. What techniques will we use to correct the problem? Well, to me, it seems like there are quite a few factors to consider. If we are presenting individualized data to shoppers, it is dishonest to use imputation to fill in what is essentially our guess at what they spent. In that case, it would probably be more ethical to leave that trip out of any totals and note the error.
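Here is a rough sketch of what “leave it out and note the error” could look like for a shopper’s monthly report. The trip records, the field names (`total`, `local`), and the `monthly_local_report` function are hypothetical; the point is only that the excluded trip is counted and reported rather than silently dropped.

```python
def monthly_local_report(trips):
    """Compute the local share from trips with known local spend only.

    Each trip is a dict with 'total' and 'local'; 'local' is None when lost.
    """
    known = [t for t in trips if t["local"] is not None]
    missing = [t for t in trips if t["local"] is None]
    known_total = sum(t["total"] for t in known)
    known_local = sum(t["local"] for t in known)
    share = known_local / known_total if known_total else None
    return {
        "local_share": share,
        "trips_counted": len(known),
        "trips_excluded": len(missing),
        "excluded_dollars": sum(t["total"] for t in missing),
    }


# Hypothetical month for one shopper; the second trip hit the outage.
trips = [
    {"total": 100.0, "local": 40.0},
    {"total": 120.0, "local": None},
    {"total": 80.0, "local": 50.0},
    {"total": 100.0, "local": 50.0},
]
report = monthly_local_report(trips)
print(report)
```

The report can then say something like “one trip ($120) is excluded because of a system error” next to the local share, which keeps the number honest about what it covers.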

To smooth things over, we might include averages, using listwise deletion to remove the shopping trip completely from the average local shopping dollars per trip. Did I mention Joni also uses her data from us to track her average spending per week at our store? Let’s say the table that stores her local sales also stores her total sales, which are not missing. If we choose to just drop that row, what computations does that make sense for? A total sum seems like a bad choice, since the dropped trip simply vanishes from it. Imputation, using a linear regression over the last two years and maybe even the rest of the month that followed, might be a good option there. If we’re only reporting averages, though, dropping the row might be better.
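A small sketch of why the choice depends on the computation. Listwise deletion drops the whole row whenever any field is missing, so the intact total goes out along with the missing local amount. The rows and column names below are invented for illustration.

```python
def listwise_delete(rows):
    """Drop every row that has at least one missing (None) field."""
    return [r for r in rows if all(v is not None for v in r.values())]


# Hypothetical month: week 2's local amount was lost, its total was not.
rows = [
    {"week": 1, "total": 100.0, "local": 40.0},
    {"week": 2, "total": 120.0, "local": None},
    {"week": 3, "total": 80.0, "local": 50.0},
    {"week": 4, "total": 100.0, "local": 60.0},
]

complete = listwise_delete(rows)

# Average local dollars per trip, over the trips we actually observed.
avg_local = sum(r["local"] for r in complete) / len(complete)

# Total spend computed after deletion loses week 2 entirely,
# even though that week's total was never missing.
sum_total_after_drop = sum(r["total"] for r in complete)
sum_total_all = sum(r["total"] for r in rows)

print(avg_local)             # 50.0
print(sum_total_after_drop)  # 280.0
print(sum_total_all)         # 400.0
```

The average local spend per trip is still a fair description of the trips we observed, but the total after deletion is short by the full $120 trip. That is the case where computing the sum from the untouched total column, or imputing the missing local amount and labeling it as an estimate, looks more defensible than dropping the row.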

There are no simple answers. Depending on the meaning of the data, we can smooth out mistakes, but it must be done while staying true to the intent of the data. If Joni feels ownership over this data, which she may access for her own needs, she probably wouldn’t like it if we replaced the gap with an average and purported that it was the actual amount spent. The store leaders, on the other hand, would probably forgive an assumed average for a small portion of the sales rather than see a strange dip in the numbers. The key is how well the flaw is communicated. Transparency is always your friend in these cases.

