Are your data too slow?
Not everything can be Big Data. Not everything should be, either. But some data do need a kick in the pants, so to speak. Are the data you produce or use real time, coming down the pipe as a feed every day, or are you stuck with years-old data for your planning and analysis purposes? If you are in the latter group, don’t feel bad: you’re not alone.
For those tracking Ebola outbreaks in West Africa, the stream of data is steady but not real time, yet decisions that impact people’s lives are being made every day about resourcing and responding to this crisis. In the USA there are similarly important data needs: many infections and diseases are notifiable and require direct notification of the Centers for Disease Control and Prevention. However, regular hospital visits, treatments and surgeries go through a very big, very slow pipeline from local clinics and hospitals up to the state agency level, and after processing, refining and some magical treatment, these data flow back to local public health and research agencies some years later. Traditionally this timeline was “all we could do” because of technology limitations and other reasons, but as we rely more and more on access to near real-time data for so many decisions, health data often stand out as a slouch in the race for data-driven decisions.
In a different vein, campaign finance data for political donations is sometimes surprisingly fast. In California all donations to campaigns require the filing of a Form 460 declaring who gave the funds, their employer and ZIP code. Campaigns are supposed to file these promptly, but in practice filings often wait until certain deadlines. Nevertheless, these data contain valuable insights for voters and for campaigns alike. These data get submitted as a flow, but they then end up in a complex format that is not accessible to average people, until someone changes that. A volunteer team at OpenOakland created a very powerful automation process that takes these data and reformats them in a way that makes them accessible and understandable to everyone at http://opendisclosure.io. Yet even this system of automated data processing and visualization suffers from a lack of perfectly updated data on a daily basis: the numbers shown each day only reflect the data filed to date, so big donations or changes in patterns do not show up until they are filed, often at a somewhat arbitrary deadline.
Unfortunately, not all data are filed frequently, and many do not come with an easy-to-use API that lets developers and researchers connect to them directly. Take crime data: very important information, in high demand for all sorts of decisions at local levels. Your police force may publish good crime data each day, or maybe just each month, which is useful for real estate people and maybe good for analysts and crime investigations, but how do we know if our local efforts have successfully impacted crime? We go to national data. The Federal Bureau of Investigation (FBI) collects data from most law enforcement agencies in the country and publishes them as the Uniform Crime Reports (UCR). Unfortunately, these data are published years after the fact. There is a convoluted process for local agencies to format and filter their reports, and then these data take years to get published.
We recently created a violent crime fact sheet using the latest available (and recently published) UCR data, which cover 2012. This lag means that county supervisors and other officials trying to evaluate the impact of crime prevention efforts can’t even compare their outcomes with other cities: we have to wait two more years to see whether these indicators changed in other comparable cities, or whether our interventions did have a measurable impact. With this sort of time lag, no local officials have good comparable data in a reasonable time frame, which makes it a poor system for modern policy makers to rely on. The FBI is slowly implementing a newer system, but it is not clear that the lag will improve.
Every agency responsible for collecting data for operational purposes MUST start thinking about how it can make these data safely available to decision makers and to the public through an expedited process. The technology to support this is now very accessible, and if necessary we should be considering bifurcated approaches: the old, slow feed to state and federal agencies and a new, agile feed for local use. Privacy standards and quality are simply things that guide how we can do this; they are not actual barriers unless we choose to let them be.
Government is a business, albeit one with a monopoly on the services it provides, and it’s not cool for government to be making decisions using years-old data when the private sector is increasingly data-driven and real time. We can do this!
* First published over at GovLoop