Clarity on COVID-19 data

It’s now the 20th of July and the COVID-19 cases data is up to the 14th of July. The Tests data is up to the 13th. Basically the data is now about a week off from currency.

Given how topical this information is at the moment, it’s hard to understand why this data has proven to be, and continues to be, so unreliable.

Is there any way to get some better understanding of what’s going on behind the scenes with this data?

I, and others, have posted repeatedly about individual issues, but there seems to be an underlying problem with currency and things breaking. That makes relying on this data impossible, and it leaves those of us using it to present to wider audiences embarrassed when it looks like what we’re doing is broken.

If there is some underlying issue, it’d be great if you could bring us into the loop and give us some context, because right now the situation is just constantly frustrating.

Thanks

Hi Evan,

Prior to publishing the COVID-19 open datasets, the NSW Chief Data Scientist conducts an assessment to measure the risk of identifying an individual and to measure the information gained if it was known that an individual was in the dataset. Based on this assessment, the Chief Data Scientist makes a recommendation on whether the data is safe to release or needs to be treated to mitigate these risks. Treatment can mean suppressing the data or delaying its release.

The methodology is quite complicated and the process is still quite manual. The benefit of using this methodology is that it has allowed us to safely publish the unit level open data. You can learn more about the process here if you wish: https://www.acs.org.au/insightsandpublications/reports-publications/privacy-preserving-data-sharing-frameworks.html

Happy to answer more specific questions as well.

Thanks for the response Lance.

I guess my first question revolves around the disparity between the information being released in the dataset and the exact same or even more detailed information being released in press conferences etc.

This was obvious when we had a smaller number of cases: the dataset was days out of date, but the same information was already publicly available.

So is it the privacy determination itself that’s causing the delay, or the manual process of making that determination?

And how can we tell? From the end users’ point of view, without any further information, the data just looks out-of-date.

© Data.NSW