A day in the life of a data journalist might look like poring over spreadsheets and presenting information in a meaningful way. But as several contributors to the Data Journalism Handbook note, data journalism matters for the following reasons:
- It helps filter the flow of data
- It brings new approaches and techniques to storytelling
- It is a distinct form of journalism, just as photojournalism is
- It may be the future of consuming content on the web
- It updates your skillset
- It offers a remedy for processing information
- It is an answer to data-driven PR
- It provides independent interpretations of official information
- It helps deal with the data deluge
- It saves time
- and more…
Idrees Kahloon, a recent Harvard graduate in applied mathematics, is a data journalist at The Economist. He works with beat journalists, section editors, developers and designers to source and produce data visualisations, cartography and infographics that support journalists’ stories, ensuring the best representation of data in every format (print, app and web), with a view to developing longer-term editorial products and stories.
Given his role at this interesting intersection, Idrees ran a live Q&A session on Quora on 27 January. Below is an outline of the session and a summary of the questions and answers.
- Data journalism – a typical day in the office
- How The Economist crunches data to cover stories
- Polling and polling errors
- Some of the stories I’ve worked on, including:
- Modeling the results of Brexit
- Working out whether newspaper readership could predict support for Donald Trump
- Data Journalism Career Advice
Data journalism – a typical day in the office
First, here is the life cycle of a data story:
- Idea generation
- Identifying existing data sources
- Cleaning and wrangling the data into shape
- Exploring the data, often a bit aimlessly
- Testing your hypotheses for interesting conclusions or building a statistical model (usually just explanatory; predictive models are much harder)
- Writing up your findings, which is always supplemented with conventional reporting
- Last of all, responding to editors and fact-checkers before publishing
In a typical day, a data journalist won’t do all of these things—but he or she will be doing a few of them.
The most challenging assignment I’ve taken on is probably building our tournament model. One of my colleagues developed the framework for the model, which takes into account such things as hot streaks and weather effects, in an Excel sheet no less; I then had to translate the prototype into Python. Next we had to figure out how to simulate tournaments under this model, which was not trivial. After a week or two of battling, we had the program working well enough to simulate past tournaments 10,000 times. Despite my best efforts, Python, which is an interpreted language, wasn’t giving us nearly the speed we needed. So we turned to a colleague with a physics PhD, who translated my Python into C++, improving our speed by an order of magnitude or more. Very fun.
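The actual model is not public, but the general shape of a Monte Carlo tournament simulation is easy to sketch. The team names, strength ratings and bracket structure below are all invented for illustration; the real model's inputs (hot streaks, weather effects) are not reproduced here.

```python
import random

# Hypothetical per-team strength ratings (invented for this sketch).
strengths = {"A": 1.2, "B": 1.0, "C": 0.9, "D": 0.8}

def play_match(team1, team2, rng):
    """Winner drawn in proportion to relative strength."""
    p1 = strengths[team1] / (strengths[team1] + strengths[team2])
    return team1 if rng.random() < p1 else team2

def simulate_knockout(teams, rng):
    """Single-elimination bracket: pair off survivors until one remains."""
    survivors = list(teams)
    while len(survivors) > 1:
        survivors = [play_match(survivors[i], survivors[i + 1], rng)
                     for i in range(0, len(survivors), 2)]
    return survivors[0]

def win_probabilities(teams, n_sims=10_000, seed=42):
    """Estimate each team's title chances by repeated simulation."""
    rng = random.Random(seed)
    wins = {t: 0 for t in teams}
    for _ in range(n_sims):
        wins[simulate_knockout(teams, rng)] += 1
    return {t: w / n_sims for t, w in wins.items()}

probs = win_probabilities(["A", "B", "C", "D"])
print(probs)  # the strongest team, "A", should win most often
```

Running 10,000 simulations like this in pure Python is exactly the kind of inner loop that becomes painfully slow, which is why a C++ rewrite paid off.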
A lot of work goes into our charts before the visualisation magic happens (the data gathering and processing in R and Python that I’ve mentioned). Once the cleaned data is ready, we have two bespoke charting tools that we use to create charts: an Excel script and an Adobe Illustrator script that converts the data into an actual chart.
How The Economist crunches data to cover stories
So, once I have a promising data set in hand, I clean it up and get it into analysable shape using Python’s pandas library or R, which is the more popular choice among the data journalists here. Once the data is tidy, I will usually explore a bit: look at averages, check whether any values are missing or weird, graph some trends. From there, we decide on the right charts to accompany the story. I mock these up on my machine and then pass them to a data visualiser to bring them into our famous style.
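That cleaning-and-exploring step might look something like the sketch below. The data set and column names are invented; the point is only the pattern of flagging missing or implausible values before computing any summaries.

```python
import pandas as pd

# A toy stand-in for a scraped data set; columns are invented.
raw = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "turnout_pct": [64.2, None, 58.9, 71.5, 260.0],  # one missing, one implausible
})

# Flag missing values before analysing anything.
print(raw["turnout_pct"].isna().sum())  # count of missing entries

# Drop rows with values outside the plausible 0-100 range
# (between() also excludes NaN, since comparisons with NaN are False).
clean = raw[raw["turnout_pct"].between(0, 100)]

# Quick exploration: group averages as a sanity check.
print(clean.groupby("region")["turnout_pct"].mean())
```

Only after this kind of sanity-checking is the data trusted enough to chart.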
What makes The Economist unique is, first, that there isn’t a single data journalism section in the business; it is everywhere. Second, as a weekly paper, we have luxurious deadlines compared with our friends at the dailies. Producing data stories usually takes quite a bit of time, in part because of the time it takes to clean and process messy data. We’re lucky enough to be able to take our time with stories and give them a properly rigorous treatment before publishing.
Comments on polling and polling errors
The basic answer, to put it a bit boringly, is biased and unrepresentative samples. Polling works if, and only if, the sample represents the whole population. There are all kinds of problems that get in the way of this gold standard—non-response bias (certain people are likelier to respond to your questions than others) or self-selection bias (conducting a poll in a country club would skew your sample, for example).
The raw data that most pollsters work with is usually quite skewed. For example, the sample might be 60% male when the actual population is more like 50%. To fix this, pollsters apply weighting, which would make the female responses worth more. This works pretty well unless there are sudden realignments along uncontrolled axes in politics, which might be what happened last year.
Another area for improvement might be turnout projections, which usually lazily rely on exit polls from previous elections or self-reported likelihoods. Fancier models, involving individualised predictions, are probably needed. Campaigns in America already have a head start on this sort of work—often backed up by very clever data scientists—and pollsters might do well to learn from them.
Examples of the stories Idrees Kahloon has worked on
Modeling the results of Brexit
The biggest difficulty of modeling Brexit was that there was no analog we could use to train on. My colleague James Fransham and I got around this by looking at polling microdata to get a clear sense of the best predictors for voting Leave or Remain. Immediately, we could see that education and social class were incredibly good, whereas predictors of political behavior that had worked well in the past (like party affiliation) did exceptionally poorly. Once we’d identified the most important factors, we used census numbers to project the final tallies. We also modeled turnout using a similar procedure.
The election-night model used all of this number-crunching as a base prediction (a Bayesian prior). As the results came in, we wrote a script that dynamically adjusted the underlying model, making it increasingly accurate as the night went on. Unfortunately for the United Kingdom, but fortunately for our model, we were predicting a Brexit within an hour of results coming in. You can see a bit more, including the glorious statistical details,
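The Economist's actual script is not public, but the idea of starting from per-area priors and adjusting as results report can be sketched with a simple uniform-swing correction. The areas, prior Leave shares and reported results below are all invented.

```python
# Simplified sketch of the election-night idea (not the actual model):
# demographic-based prior Leave shares per area, shifted by the average
# error observed in areas that have already reported.
priors = {"Area1": 0.48, "Area2": 0.52, "Area3": 0.45, "Area4": 0.55}

def update(priors, reported):
    """Apply the mean observed error (a 'uniform swing') to unreported areas."""
    errors = [actual - priors[a] for a, actual in reported.items()]
    swing = sum(errors) / len(errors)
    return {a: (reported[a] if a in reported else priors[a] + swing)
            for a in priors}

# Two areas report Leave shares a couple of points above their priors,
# so the projections for the remaining areas shift up by the same swing.
estimates = update(priors, {"Area1": 0.51, "Area2": 0.54})
print(estimates)
```

Each new batch of results tightens the projection, which is how an early-night Brexit call becomes possible.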
Could newspaper readership predict support for Donald Trump?
It does, and well. If you ask voters how trustworthy they rate several newspapers, you can predict their vote with 88% accuracy. That’s without incorporating any other helpful information like race, party affiliation or education level. While it might be a triumph for statistics, I think it’s a bit dispiriting that attitudes toward the media are so strongly polarised along partisan lines.
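The evaluation itself is just classification accuracy on held survey responses. The toy data and decision rule below are invented for illustration; the 88% figure comes from The Economist's real survey data, not from this sketch.

```python
# Invented toy version of the exercise: predict vote from how much a
# respondent trusts two hypothetical outlets (scores 0-10).
respondents = [
    # (trust_outlet_a, trust_outlet_b, actual_vote)
    (9, 2, "R"), (8, 1, "R"), (7, 3, "R"), (2, 9, "D"),
    (1, 8, "D"), (3, 7, "D"), (6, 4, "R"), (4, 6, "D"),
]

def predict(trust_a, trust_b):
    """Simple rule: predicted vote follows whichever outlet is trusted more."""
    return "R" if trust_a > trust_b else "D"

correct = sum(predict(a, b) == vote for a, b, vote in respondents)
accuracy = correct / len(respondents)
print(accuracy)  # 1.0 on this contrived, perfectly polarised sample
```

That a contrived sample is trivially separable is itself the dispiriting point: real survey data apparently is not far off.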
What is the best way to prepare for a career in data journalism?
Knowledge of three things is needed to be a good data journalist: statistics, computer science and writing. Writing broadly, and journalism specifically, is best learned by doing. If you’re interested in journalism, the best way to prepare is to intern at your local newspaper and write for your school’s magazine or campus paper. Another avenue is the trade press, where you specialise in a niche field but pick up all of the basic skills needed to write on any subject. It’s much easier to learn from experienced journalists than to read up on this stuff. Most of the staff at The Economist never formally studied journalism, for example.
Statistics and computer science are best learned in the classroom, from an experienced instructor who can iron out mistakes before they’re too deeply ingrained. If you’ve already completed your formal education, there’s no shortage of online materials and courses that can help you. For a rigorous introduction to statistics, I’d recommend reading Joe Blitzstein and Jessica Hwang’s excellent Introduction to Probability (and working through the problems!). With that base, you’ll find that a lot of topics, like econometrics and machine learning, become much more accessible.
Most coders are self-taught these days. As with writing, the most important thing here is doing. Pick a language (Python tends to be easiest for beginners), set things up, and try building simple programs. The more you force yourself to write code, the more natural it will become.