In this blog post we share an interview with Avinash Celestine & Avinash Singh, co-founders of How India Lives. Covering an array of topics, they share their experiences and thoughts on handling public data and the nuances of data visualisation that have to be kept in mind when dealing with complex datasets.
Tell us about how “How India Lives” came into being and the gaps that the platform is trying to address?
How India Lives was founded by five senior journalists because we felt that reliable and relevant data, especially demographic data, was missing in India. Often, we would spend a tremendous amount of time collecting and cleaning data ourselves. We started the venture three years ago, and it was incubated at the Tow-Knight Center at the City University of New York. We received a grant from the Tow-Knight Foundation to kick-start operations. The vision is not to duplicate existing data products, but to create products through which reliable data can be easily accessed.
How India Lives is a platform to distribute all kinds of public data in a searchable form, aiming to help people make informed business and policy decisions. Currently, data-driven decision making by a range of users – marketers, researchers, retailers, governments etc. – is driven by customer profiles created on the basis of sample surveys conducted by a number of companies. While these surveys have their uses, we feel that public data, organised under the Census, the National Sample Survey and other sources, can dramatically improve the way companies understand their customers. Its coverage is comprehensive and highly granular (the Census, for example, provides data on hundreds of variables down to every village and ward in India) in a way that many private consumer-oriented surveys are not. Rich possibilities also reside in several datasets beyond the well-known ones. Some of these datasets can be used as proxies to answer questions about the consumer landscape or to inform public policy interventions.
However, organising public data so that it is useful to users requires a major effort. Much public data is scattered across different databases (in many cases, it is not even organised into a database) that don't 'talk' to each other, or is in unusable formats (e.g. PDF). Our aim is to reorganise public data into a common, machine-readable format, in a way that lets users search for data and compare data from disparate sources for a single geography. We also aim for this platform to be visual in nature, presenting data via maps and other appropriate visualisations.
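The "common format" idea above can be sketched in a few lines. This is a minimal illustration, not How India Lives' actual pipeline: the district names are real but the figures and metric names are hypothetical, standing in for values already parsed out of their original sources.

```python
# Hypothetical snippets of two public datasets, keyed by geography.
# Figures are illustrative, not actual census or health statistics.
census = {"Gurgaon": {"literacy_rate": 84.7}, "Jaipur": {"literacy_rate": 76.4}}
health = {"Gurgaon": {"infant_mortality": 30}, "Jaipur": {"infant_mortality": 41}}

def to_common(source):
    """Flatten a {geography: {metric: value}} mapping into uniform rows."""
    return [
        {"geography": geo, "metric": metric, "value": value}
        for geo, metrics in source.items()
        for metric, value in metrics.items()
    ]

# Disparate sources collapse into one machine-readable shape...
combined = to_common(census) + to_common(health)

# ...so all data for a single geography can be pulled at once,
# regardless of which source each row came from.
gurgaon = [row for row in combined if row["geography"] == "Gurgaon"]
print(gurgaon)
```

Once every source contributes rows in the same (geography, metric, value) shape, searching and comparing across sources for one location is a simple filter rather than a format-wrangling exercise.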
Howindialives.com is presently in beta and is scheduled to become a paid service in early 2017. At that point, we will have 250 million data points, more than 5,500 metrics, and at least 600 data points for each of the 715,000 geographical locations in India.
In addition, we also offer data and technology consultancy services to companies, media outfits and non-profits. Some of our clients: Mint, HT Media, Confederation of Indian Industry (CII), Daksh India, CBGA India, Centre for Policy Research (CPR), ABP News, TARU Leading Edge, Arghyam, and Founding Fuel.
A lot of our common understanding of government programmes and other public initiatives is driven primarily by the data points that media dailies choose to write on, especially in the context of large datasets. Given your experience as a journalist across media entities, what is your take on this, and is there room for improvement?
When the media covers the release of new data, the coverage is often superficial and unable to take into account the complexity of a dataset. The classic example here is the census, which has been releasing data since 2012 or so. Often we have found that when these data releases are covered, it is only at the level of state or national data.
We feel that this is insufficient to take into account the vast geographical complexities of a country such as India. Often we have found that drilling down to a greater depth (e.g. down to district level) gives us greater insights, since disparities within states can often be as dramatic as those between states (e.g. see this link and the map below for an example of how, on one measure, disparities within states are often equally important). This is just one example. Another approach is to look at relationships between variables – how one variable can 'cut' another – and to explore the interrelationships between them. Our exploration of differing education levels among dalit castes is an example.
Another weakness is simply a lack of awareness of what data is out there, and a failure to think creatively about which datasets can be used to address a particular question.
Then there is the problem that many datasets, while publicly available, are difficult to access – e.g. they may be in PDF format and/or scanned images. For instance, we explored the relationship between real estate expansion in Gurgaon and the political regime, using a dataset that was in the public domain but only in PDF form, which all but ensured it would go unused by journalists.
It’s important to stress that these weaknesses are not necessarily due to the incapacity of journalists themselves – indeed, if that were the main problem, it could be more easily addressed through training etc. The problem is deeper and is related to the way in which many media organisations work. Because of the pressure of deadlines, the need to publish on a regular basis, and the extremely short cycle of news, most journalists simply don’t have the time to sit with a dataset and understand its complexity (this is true of all news reporting – not just ‘data journalism’). The role of senior editors in giving their journalists the time they need to report on and explore a piece of data, and insulating them to an extent from the daily news cycle, is crucial.
How would you characterize the demand for Data Visualization driven Journalism as opposed to the more traditional forms such as anecdotal evidence and story-telling?
The demand for data-visualisation journalism is certainly high, driven by the increasing availability of large datasets and of the tools to explore and visualise them (e.g. R for data analysis and d3 for JavaScript-driven visualisations). These are supply-side factors. The proliferation of online news outlets, social media etc. has provided the demand-side push.
Given the vast volumes of data that are now both available and accessible, data visualisation is increasingly becoming an ideal way to digest and understand such data. In your experience of working on large datasets and visualising them, what, according to you, are the fundamentals of a good visualisation effort? Are there any nuances to keep in mind when balancing the breaking down of complex data with its visual presentation?
The best data journalism combines all forms of story-telling and does not restrict itself to any one. Good data journalism explores a dataset, covers the views of people who understand the field or area relevant to that data, and makes it clear what the data can and cannot tell us. (An excellent example of reporting that does this is the series on rape cases done in The Hindu.)
Note: Also see the answer to the following question.
When visualising a large dataset, deciding which data points not to focus on is as important as choosing which ones to highlight. Is there a method or an optimal way to make these choices? How much of this decision is based on the target audience and the end impact you want to have – say, on public policy, for instance?
Any good data visualisation is driven by a clear viewpoint, developed by exploring the data in an iterative process of posing broader questions and seeing what the data throws up. Simply putting the data out there and assuming that it ‘speaks for itself’, as many claim, will almost guarantee that people’s engagement with it is low.
The argument is often made that ‘imposing’ your viewpoint on the data is a no-no since it introduces bias. This ignores the fact that the very act of selecting how the data is to be shown, and what to display and what not to, introduces bias anyway (otherwise we could just dump a giant Excel file on users and ask them to figure it out for themselves). It’s better to take a clear line on what your data shows, and make your assumptions and line of argument clear. Readers are often smart enough to reach their own conclusions on whether your arguments pass muster.
Once you have a clear line on what you want to say, you would necessarily organise your data in a way that makes the point. It’s also often helpful to have a few paras of introductory text, talking about the visualisation, and the argument, since this sets the context in which users ‘read’ the viz.
If the target audience is a layperson, with less domain knowledge of the relevant field, it’s more important to take them through the viz and the arguments you are making, especially if the field itself is complex. If the audience does have domain knowledge, you can certainly assume some familiarity with the subject.
How India Lives has taken an interest in disseminating data relating to socio-economic issues. What have been some of your personal experiences in working on public datasets with respect to the Indian context? Can you give some examples?
- Data is in forms and formats that make it difficult to parse (e.g. PDF). Indeed, we have seen a case where the data for download – from a government site – was an Excel file which, when opened, contained only a scanned JPEG image of a data table. The site admin evidently had to fulfil a requirement to disseminate the data in Excel format, but had his own creative interpretation of what that entailed.
- Data is geographically incompatible. For instance, census data is based on districts as they stood at Census 2011. Since 2011, however, new districts have been carved out, which means that adding newer data is difficult without knowing how the new districts map onto the old ones. Further, the very concept of a ‘district’ differs depending on the public authority. For instance, police forces can have their own definition of a district, usually the jurisdiction of an SP or DCP, and this differs from what the civil administration regards as a district. Mapping crime data onto other socio-economic data thus becomes a challenging exercise.
- Lack of clear GIS data. In India, there is no official, publicly available and easily accessible source of GIS maps for the country that is kept updated to reflect the latest geographical boundaries, both internal and external (e.g. has the government changed its maps to reflect the recent treaty with Bangladesh? If so, have the new maps been released?).
- Data is in silos. Data released by one government department doesn’t necessarily map onto data released by another in geographic terms. (See point 2 above)
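The district-mapping problem in the list above is usually handled with a "crosswalk" from new districts back to their Census 2011 parents. Below is a minimal sketch under stated assumptions: the parent-district relationships shown (Charkhi Dadri carved out of Bhiwani; Gurugram as the renamed Gurgaon) are real, but the figures are purely illustrative.

```python
# Crosswalk from post-2011 districts to their Census 2011 parents.
# A district that is unchanged maps to itself.
new_to_2011_parent = {
    "Charkhi Dadri": "Bhiwani",  # carved out of Bhiwani in 2016
    "Bhiwani": "Bhiwani",
    "Gurugram": "Gurgaon",       # renamed; same 2011 boundaries
}

# Recent data reported on post-2011 boundaries (illustrative numbers).
recent_counts = {"Charkhi Dadri": 120, "Bhiwani": 300, "Gurugram": 450}

# Re-aggregate onto 2011 census districts so the newer figures can be
# compared with census data on a consistent geography.
counts_2011 = {}
for district, value in recent_counts.items():
    parent = new_to_2011_parent[district]
    counts_2011[parent] = counts_2011.get(parent, 0) + value

print(counts_2011)  # Bhiwani now combines its own figure with Charkhi Dadri's
```

The hard part in practice is not this aggregation step but building and maintaining the crosswalk itself, since splits are often announced in gazette notifications rather than in any machine-readable form.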
Despite this, our experience of working with public data has been hugely rewarding. The data can be complex, often confusing and maddeningly so. But taking the time to understand its complexity yields rich rewards in terms of understanding diverse socio-economic phenomena.
What is your take on visualising some of the new and non-traditional data inputs that are currently available? Also we are witnessing the movement towards a more “open data” architecture driven by the government, for instance through data.gov.in, that provides vast volumes of public data, what is your take on this?
Any tool, new or not, is only useful when it provides a clear perspective on the questions the user or client is concerned with. Such tools include dashboards that allow the user to ‘cut’ the data in various ways – a very useful technique for exploring complex datasets. They also include statistical techniques which, if used with an understanding of their underlying assumptions, can throw light on patterns within the data that are not immediately apparent.
As for the movement towards open data, this is a great move, and data.gov.in certainly stands out among the range of government sites in terms of ease of use. But individual departments and ministries should have a clear policy on releasing data at periodic intervals. Until this happens, the open data policy of the government will be implemented only partially.
With the increasing digitisation of public services, citizen level data trails are now being created and captured in government-created databases. What, if any, of these kinds of data do you think should be in the public sphere and what are the measures to be taken for data protection and privacy?
Data at the level of the individual citizen, such as names, mobile numbers etc., is obviously highly sensitive and should not be released, except under restricted circumstances (e.g. to researchers, with the stipulation that they publish data only in aggregated form). If released to the public, the data must be anonymised in a way that makes it difficult to trace the original identity of the citizen. Such data can also be released in more highly aggregated ways – e.g. at the level of a tehsil or district.
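The release-by-aggregation idea above can be sketched simply. This is an illustrative example, not a complete anonymisation scheme: the records, district names and the threshold of 3 are all hypothetical, and real releases would also need safeguards against re-identification across multiple tables.

```python
from collections import Counter

# Hypothetical individual-level records, with direct identifiers
# (names, mobile numbers etc.) already stripped.
records = [
    {"district": "Jaipur", "beneficiary": True},
    {"district": "Jaipur", "beneficiary": True},
    {"district": "Jaipur", "beneficiary": False},
    {"district": "Kota", "beneficiary": True},
]

MIN_CELL = 3  # suppress any aggregate built from fewer than 3 records

# Roll individual rows up to district level...
counts = Counter(r["district"] for r in records)

# ...and suppress small cells, so a figure of "1 beneficiary in Kota"
# cannot be used to single out an individual.
released = {
    district: count if count >= MIN_CELL else None  # None = suppressed
    for district, count in counts.items()
}
print(released)
```

The same pattern scales up: aggregate to tehsil or district, then withhold any cell whose underlying count falls below the chosen threshold.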