There are many different types of sites that provide a wealth of free, freemium, and paid data that can help audience developers and journalists with their reporting and storytelling efforts. The team at State of Digital Publishing would like to acknowledge these, as derived from manual searches and recognition from our existing audience.
Kaggle is a site that allows users to discover machine learning while writing and sharing cloud-based code. Relying primarily on the enthusiasm of its sizable community, the site hosts dataset competitions for cash prizes, and as a result it has massive amounts of data compiled into it. Whether you’re looking for historical data from the New York Stock Exchange, an overview of candy production trends in the US, or cutting-edge code, this site is chock-full of information.
It’s impossible to be on the Internet for long without running into a Wikipedia article. With articles that range from fully sourced and referenced historical biographies to timelines of the near and far future, it’s safe to say that Wikipedia has cemented its status as a free web-based encyclopedia. Between the entry that serves as the general overview of the subject and the many books and online references the site provides, Wikipedia is a writer’s best friend in many respects.
3. Common Crawl
As can be surmised from the name of the website, Common Crawl searches or “crawls” the web for data that it then stores and builds into an open repository that users can access. Two examples of what is possible with this site, virtual patent markers and comprehensive lists of websites offering RSS feeds, provide a small sampling of how powerful this application is. If there are data or site comparisons that you want to make, this is an accessible tool for creating original information.
EDRM, short for Electronic Discovery Reference Model, is a site for legal professionals dedicated to realizing the potential of e-discovery and the rules and expectations surrounding how information is governed. In addition, EDRM members work together to develop collaborative standards, software, and educational tools that are designed to further the community’s goals. To glean information about the ways in which technology can and has been changing the procedural and administrative aspects of legal practice, this is the site you want to visit.
Mahout focuses on a piece of software by the same name that attempts to figure out the logistics of building an environment that’s capable of creating high-performing machine learning applications that can be scaled and created quickly. For researchers who wish to compile and manipulate their own datasets or try their hand at machine learning applications, this piece of software is especially useful. This site will have individuals well on their way to proficiency with this software.
The Lemur Project is a database that focuses on supporting research on information retrieval and human language technologies. With roughly 1 billion web pages in 10 languages collected from January 2009 to February 2009, the sheer amount of material makes it an excellent resource for researchers. Between all of this and the added support that can be found on the site, anyone who has an interest in technology and human languages will have plenty to work with here.
Project Gutenberg is a directory that features public domain novels, papers, and other works. The site’s 54,000+ eBook collection ranges from well-known materials such as the likes of Shakespeare, Mark Twain, and Jane Austen to lesser-known works by more obscure names like Henri Bergson and Samuel Butler. Whether grabbing a classic novel for the sake of being well-read or doing research on how people experienced life in the 19th century, Project Gutenberg is an excellent resource.
The Million Song Dataset is a website that houses a full dataset containing the audio features and metadata of approximately 1 million popular songs. In addition to the primary million song dataset, there are also a number of datasets that the community has contributed in related categories such as cover songs, genre labels, and lyrics, among others. Music historians, hobbyists, or researchers who want this information will be able to sort the data with relative ease. This may very well be the most extensive dataset on this subject matter on the entire Internet.
Everyone knows Amazon as a digital retailer, but did you know that Amazon also hosts free public datasets that are open for anyone to access without having to either store or download anything on their own devices? With data that spans from weather, space environment, and meteorological information to imagery focused on developing algorithms that aid in computer vision, there’s no shortage of options for those who want a more convenient way to analyze massive amounts of data.
In the interests of promoting more transparency, getting more citizens to engage, and encouraging dialogue, the Government of Canada offers extensive data as part of its Open Government initiative. On this site you can find datasets on government-related issues such as the capacity levels of the homeless shelters in Canada as well as regional numbers on the participation levels of Anglophones and Francophones in the public sector. With access to datasets of this nature, there’s no need to depend on other people’s statistics to find information.
11. Data Catalogs
Data Catalogs, now Data Portals, offers users a convenient site for browsing open data portals from all over the world. With the portals being assessed and curated by various levels of governments, a number of NGOs, and even the World Bank, the data available for analyzing is extremely high quality. Users have the option of browsing or contributing data portals. From the standpoint of research, the variety of subject matter and information makes this site an especially convenient place to begin a search for information.
Data.gov.uk is a site that allows individuals to find and access data that various public bodies, government departments, local authorities, and government agencies publish. Here researchers can find information on the economic climate for small businesses, trade, imports, industry, and exports or even do research on payments over £25,000 made by government departments. With the site explicitly stating that the data can be used for research, the information covered here may even generate more ideas as researchers go through it.
This site is where the US Government provides open data that the public can have access to in the form of datasets. On top of the raw data, the site also offers a number of tools that can be used to make data visualizations as well as build applications for the web and mobile. Make no mistake. The data is immense with information ranging from credit card complaints to federal student loan program data in over 197,000 datasets. This site offers plenty of opportunities for innovation and comprehensive analysis.
DataSF offers hundreds of datasets relating to both the City and County of San Francisco. Interested in seeing what local and regional lobbyists have been pushing for? Do you need statistics on crime? Browse the Showcase tab to see what people have accomplished with the data or use the form to make a contribution. Made with Open Data and offering an academy, a blog, and a number of other tools, this site is driven in large part by collaboration and community. This makes it an asset for researchers.
DataFerrett is different from a lot of sites in that it isn’t a repository or directory so much as a tool that allows users to customize data from local, state, and federal sources through data analysis and extraction. This tool allows users to create customized and comprehensive spreadsheets and then turn the same information into a map or a graph without having to download or enable any other software. Organizing massive data inputs and turning them into something that’s easy to read has never been easier.
Through the University of Maryland, Inforum makes US economic data available to the public. Many US government agencies have contributed to this site to the point where the site now holds thousands of “economic time series”, as it calls them, and these contain numbers on industrial production, price indices, labor statistics, and business indicators. The data is freely available and can be accessed with either a personal laptop or desktop. Researchers who want to get a good look at the raw, economic data have a resource in Inforum.
According to the site’s own numbers, Europeana’s collections account for a total of over 50 million records. Using the curated datasets here, researchers can find the information they’re looking for in less time. The datasets here include categories such as 3D models, Italian World War I maps, and even a collection of over 20,000 historic photos from Lithuanian museums among others. For either general historic searches or as a starting point for going through Europeana’s massive records, this is an excellent resource to have.
On top of its non-stop coverage of breaking news and events, the Guardian also has an entire section devoted to data blurbs. The pieces here range from serious topics like the effectiveness of housing policies on homelessness to slightly more light-hearted subjects like which countries have the most Nobel prize winners. Journalists and researchers have no shortage of information to use in their own projects from this site. With the help of a quick search, it’s possible to find data on just about anything.
Hosted by the National Center for Biotechnology Information, the Gene Expression Omnibus is a site that contains “public functional genomics data” that’s compliant with MIAME (Minimum Information About a Microarray Experiment) standards. The site also accepts data that is arrayed or sequenced while providing the tools necessary to find and download the information. Those interested in studying genomes or acquiring information on the subject will have all the data they need here and then some.
Long recognized for its contributions to innovation and progress in the social sciences, the University of Chicago’s Center for Spatial Data Science (CSDS) explores the next frontier with its foray into spatial analysis and technology. The work of the CSDS has applications in virtually any field that has to contend with space in approaching its issues. Consequently, fields like environmental economics, public health, and criminology have all benefitted from these applications. The dedication of CSDS to open source software and distribution of its information makes the data it provides even more accessible.
Through the use of data collected by the University of Koblenz-Landau’s Institute of Web Science and Technologies, KONECT (Koblenz Network Collection) offers research done in the field of network science and its related subjects. The project uses a series of its own software network analysis tools to crunch numbers and produce drawn plots and algorithms. KONECT then hosts the results of its analytic work directly on the website. With over 200 datasets to choose from, this is a resource that’s worth exploring.
MLdata is a site that acts as a repository for data that’s meant to be used for machine learning. These datasets can range from a compilation of human facial expressions to more scientific topics like predicting how molecules will bond. With entries split into categories that offer access to raw data, tutorials in the material and methods section, as well as learning tasks and challenges, this site allows researchers to parse through the repository for datasets that are of interest.
The NASDAQ is a world-famous stock exchange that has long been an excellent resource for journalists and researchers in search of data from the world of finance and business. Here you’ll find information on IPOs, historic price data, and the breaking financial news that makes this site a go-to online destination for financial data. NASDAQ Composite offers paid data options as well for those who wish to do a deeper analysis. This is a very respected and well-established resource.
Thanks to the moon landing, by now everybody’s heard of this government agency and its forays into outer space. Of interest to journalists, however, is how NASA is also a valuable source of data through its Space Science Data Coordinated Archive. Here, researchers are able to find space science mission data in categories such as astrophysics, image resources, and heliophysics, among others. In addition, there are also numerous white papers available on the site to go with the new data being submitted.
Socrata is a site that takes the government data that’s available and puts it into a format that makes it easier for people to analyze, click through, and find the information they’re looking for. Designed specifically with the needs of non-technical individuals such as public policy wonks, researchers, entrepreneurs, and concerned citizens in mind, Socrata uses the cloud to compile data from a variety of sources. For journalists trying to understand the effectiveness of different policies, this is a useful platform.
Quandl is a site that offers primarily economic and financial data formatted with the needs of investment professionals in mind. Relying on over 500 information sources from credible organizations like CLS Group, the UN, central banks, and Zacks, among others, to aggregate its data, this data source is perfect for researchers and journalists who want to get the big picture at a glance. Thanks to the site’s Excel add-in, accessing the data directly has never been easier for users.
Carnegie Mellon University has a well-deserved reputation as an excellent academic institution. What many people don’t know is that Carnegie Mellon’s StatLib is a useful resource for journalists in search of data. This dataset archive includes data on issues such as the MLB salaries of North American players in 1986 as well as data that’s designed for use in evaluating the accuracy of statistics software. In exchange for acknowledgment, these datasets are available for public use.
The UC Irvine Machine Learning Repository, referred to as UCI, is a site that stores a ton of interesting data that journalists can use. Home to 394 datasets as of this writing, the site has the added advantage of having an interface that’s easy to search. Some of the more popular datasets include information on “Human Activity Recognition Using Smartphones”, wine, and bank marketing, among other subjects. In exchange for using all of this data, the site merely asks for a citation.
If you’re a journalist who is looking into the development of machine learning, then the UCR Time Series Classification/Clustering page will make for some excellent reading. The site provides a helpful briefing document that will provide you with all of the background information you need to know. Along with an overview of what the information contains, the site also offers the ability to download the data directly. Just remember to use the citation format the site asks for if you use these datasets.
30. US Census
Need statistics on population wealth? Want to know what the exact gender breakdown of a particular field happens to be? The US Census site has all of this data and more available for public viewing. Sort data by year or region, and you’ll quickly be able to find statistics that most people didn’t even know were factored into the US census. These numbers are available in Excel and Microsoft Word formats, which makes the data even more accessible for journalists.
31. Wolfram Alpha
Wolfram Alpha is actually a computational engine that allows users to input the data they want to know and receive a calculation. The engine does statistical data and analysis, chemistry, dates and times, and even words and linguistics among other things. For users who are attempting to uncover new ways of handling data, this is especially useful because of how it’s able to just spit out new calculations at the press of a button. Journalists in particular stand to gain a lot by using this as a supplementary resource.
It turns out that Yelp is more than just restaurants and user business reviews. This user-driven review site also keeps a dataset that gives researchers access to reviews, user data, and businesses for “personal, educational, and academic purposes”. Going by the company’s count, that’s 4.7 million reviews and 156,000 businesses in 12 metropolitan areas included in the dataset. With those numbers, the materials and trends researchers could potentially discover in this data might be a pleasant surprise.
33. Data World
Want to have a list of removed Facebook pages? How does being able to sort US economic data by county sound? Data World is a site that allows people to share, host, collaborate, and keep track of data. The site even includes a section for journalists outlining the reasons why Data World is useful for members of the profession, pointing out its hosting capabilities, a streamlined FOIA predictor, and pages designed to help with organizing. All in all, this is a solid mix of data and data hosting.
Run and operated by the CIA, the World Factbook gives you information on the societal structures, history, military, and economic situations for 267 countries along with maps, flags, and a set of time zones following the materials in the world map. The site offers a thorough and in-depth look at the subject matter in a way that goes beyond the basics. In short, this is a data source that should be in every journalist’s arsenal.
Managed by the US Department of Health & Human Services, HealthData.gov offers the public access to “high value health data” in hopes of capturing the attention of entrepreneurs, policy makers, and researchers. In the areas of product and service development at least, people have been able to examine this data and get results. Journalists who want to be on the cutting edge of health data or who are vetting a statement that a health-care official has put out can use this site to find answers.
This is a site that lends instant credibility to journalists who use the information it offers. The statistics that UNICEF covers include those relating to issues of health and human rights such as education, maternal health, child poverty, water and sanitation, and child disability among many other categories of statistics that are kept. It’s useful for researchers because it’s up to date and backed by one of the most well-known organizations on the planet. Journalists can’t go wrong citing this data source.
The World Health Organization is an international organization that gathers health statistics and information throughout the world. Aside from the information that can be found directly on the homepage, the site also offers data through the Global Health Observatory. This data includes information on the steps countries are taking towards universal health care, health research and development among other categories. Journalists will find lots of information on outbreaks, health emergencies, and healthcare coverage from an international perspective here.
With the availability of Google Public Data, journalists are clearly able to rely on Google in more ways than one. The search engine juggernaut has public data available and out there for analyzing with over 100 public datasets to its name. Data subject matter ranges from the extremely serious with World Development Indicators and Human Development Indicators all the way to the interesting with data on the most dangerous roads in Europe. All a researcher has to do is run a search and see what Google Public Data has.
39. Gapminder
Gapminder offers data on a number of local and national indicators along with links and information on all of the data providers. Using this site, researchers can see information such as how old women are when they marry for the first time, statistics on alcohol consumption, and causes of death in children. For journalists who are writing with an international slant or who are doing comparative analysis, this is an excellent resource. This is a useful source of data regardless.
40. Google Trends
Google Trends is a tool that gives researchers insight into what people are searching for right now. Researchers can compare the data to trends that have occurred in the past and can also use the tool to estimate what searches will look like in the future, for example ahead of the holiday season. Google Trends offers graphs, hot topics, and plenty of opportunities to uncover the news before it’s officially news.
41. Google Finance
Google Finance offers a quick and easy opportunity to do a more in-depth search on a company that investors have been raving about. It provides easy ways to filter technical indicators and review the latest news about the company in one simple, straightforward window that allows you to sort information even further. In addition, it’s free. For journalists who want to research the finances of a traded company, Google Finance offers an intuitive interface with which to access this information. Unfortunately, Google has recently discontinued some of the core features such as finance portfolio. Here are some alternatives to Google Finance.
Anyone who’s ever wished for an easier way to run Wikipedia searches has reason to be excited about DBpedia. Powered by the commitment of the community, this site seeks to make it possible to run more sophisticated searches against Wikipedia content. With the English version boasting 4.58 million entries with classifications and associated categories, the site is well on its way to offering comprehensive coverage based off of the information in Wikipedia. Journalists can’t go wrong with this data source.
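As a rough illustration of what a “more sophisticated search” against Wikipedia content can look like, the sketch below builds a request URL for DBpedia’s public SPARQL endpoint at `https://dbpedia.org/sparql`. The `build_sparql_url` helper, the example query, and the choice of the `dbo:Newspaper` class are assumptions made for illustration rather than anything taken from the site; the code only constructs the URL and never contacts the network.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_sparql_url(query: str, endpoint: str = DBPEDIA_ENDPOINT) -> str:
    """Return a GET URL that asks the endpoint for JSON results.

    Hypothetical helper: it only URL-encodes the query and the
    requested result format.
    """
    params = {
        "query": query,
        "format": "application/sparql-results+json",
    }
    return f"{endpoint}?{urlencode(params)}"


# Illustrative query (an assumption): ten English-labelled resources
# typed as newspapers in the DBpedia ontology.
EXAMPLE_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?paper ?label WHERE {
  ?paper a dbo:Newspaper ;
         rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 10
"""

if __name__ == "__main__":
    print(build_sparql_url(EXAMPLE_QUERY))
```

Pasting the printed URL into a browser, or fetching it with any HTTP client, should return the matching rows, assuming the endpoint is reachable and the class exists under that name.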
43. Pew Research
For many, Pew Research is in the upper echelon where surveys, reports, and research data are concerned. The site covers topics that range from political opinions to social trends and developments in various workplace industries. Pew Research also has a search function that makes it easier than ever to access information. Journalists who want up-to-date statistics and findings that come from a source that is trusted and reputable can’t go wrong with turning to Pew Research.
44. Broad Institute
For journalists who want to find out the latest news in relation to cancer, Broad Institute’s datasets could be the perfect place to find the information. This also includes information on additional subjects such as Bioinformatics & Computational Biology as well as brain cancer and molecular pattern discovery. In short, this site gives journalists a leg up in terms of finding in-depth data on cancer to make stories out of the data provided by Broad Institute.
UNdata offers information on different countries around the world. This includes data such as technical indicators, social indicators, and economic indicators for each country involved. For journalists that are working on human interest stories or stories that could benefit from being substantiated by some additional statistics and data, UNdata is the ideal choice. The accuracy of the data as well as the UN’s reputation make this a data source that journalists can count on while doing research.
46. Google Scholar
Imagine if instead of scrolling through websites, it were possible to pull up a search that had nothing but peer-reviewed papers and academic materials. Google Scholar makes it possible for people to find journal articles, white papers, and publications by the world’s leading scholars. As is usually the case for this company, Google Scholar is as intuitive as it gets with the user merely being required to enter a keyword to get the ball rolling. Searching for academic papers has never been so straightforward.
Known most commonly as “the front page of the Internet”, Reddit is one of the most popular websites on the Internet. On top of being an accurate gauge of what’s happening online, the site also has a subreddit, or subforum, that’s devoted to datasets. Users are able to request datasets, post resources, and have discussions on working with data through formats like JSON. Researchers stand to gain a lot from perusing this data source.
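For readers curious about what working with JSON from Reddit might involve, here is a minimal sketch that pulls post titles and links out of a listing payload. It assumes the commonly seen listing layout (`data` → `children` → `data`); the `extract_posts` helper and the sample payload are hypothetical placeholders, not real posts, and no request is made to Reddit.

```python
import json
from typing import Dict, List


def extract_posts(listing_json: str) -> List[Dict[str, str]]:
    """Return the title and URL of each post in a listing payload.

    Hypothetical helper; assumes the layout
    {"data": {"children": [{"data": {...}}, ...]}}.
    """
    listing = json.loads(listing_json)
    children = listing.get("data", {}).get("children", [])
    posts = []
    for child in children:
        post = child.get("data", {})
        posts.append({
            "title": post.get("title", ""),
            "url": post.get("url", ""),
        })
    return posts


# Placeholder payload for illustration only (not real Reddit data).
SAMPLE_LISTING = json.dumps({
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "Example request",
                                    "url": "https://example.com/a"}},
            {"kind": "t3", "data": {"title": "Example resource",
                                    "url": "https://example.com/b"}},
        ]
    },
})

if __name__ == "__main__":
    for post in extract_posts(SAMPLE_LISTING):
        print(post["title"], "->", post["url"])
```

Missing fields fall back to empty strings, so a partially filled payload does not raise an error.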
Qlik DataMarket makes it possible for you to collect and handle data from external sources. This platform allows users to draw on several datasets with the option of cross-referencing them against the data they already possess in order to refine their sense of greater context. Better yet, even though this is a paid platform depending on the subject matter, there’s a free option with Qlik DataMarket as well. Journalists exploring the data can do so to their heart’s content.
Hubspot has always been a thought leader in the who’s who of marketing for business. From the standpoint of doing research, this is a site that will tell researchers everything about what’s going on in the industry as well as what people within the marketing industry are talking about right now in real time. Journalists are able to use this site to learn more about the trends. On that note, Hubspot is a great resource for researchers.
Perhaps unsurprisingly, the Bureau of Justice Statistics keeps a ton of statistics. At the Bureau’s website you can find numbers on arrests, inmate deaths, execution by capital punishment, law enforcement statistics, and censuses of the jails. The criminal justice system is a subject of constant fascination for both the public and the people involved with it. That’s what makes the Bureau’s statistics even more useful for journalists who are doing research into the criminal justice system.
The Uniform Crime Report is a collection of statistics on property crime and violent crime that’s gathered by the FBI. While law enforcement agencies from throughout the US have been reporting this data since 1930, the published findings date back to 1958. Journalists who are looking to explore the crime data have the option of accessing and using the UCR data tool to explore the information that’s available on this site.
Uniform Crime Reporting is the result of a program that was thought up by the International Association of Chiefs of Police in 1929. The numbers gathered by the FBI here are published four times a year. On top of the information provided by the UCR program, the site also includes reports on hate crime statistics, Law Enforcement Officers Killed and Assaulted (LEOKA), as well as the results and numbers provided by the National Incident-Based Reporting System.
NACJD, or the National Archive of Criminal Justice Data, is a site that draws information from datasets such as the Uniform Crime Reports (UCR) and the National Crime Victimization Survey (NCVS) and then stores and distributes the statistics. Designed to be curated, stored, and maintained for ultimate accessibility, the data comes in several forms including experimental, qualitative, and longitudinal. Ultimately, this offers journalists and other researchers another way to visualize and access criminal justice statistics.
54. First Databank
First Databank is a site that deals with drug data. The site seeks to promote more efficient and more data-driven decision-making in the area of pharmaceuticals. This allows doctors and clinicians to begin thinking about pharmaceutical drugs in a different way through First Databank’s innovative use of technology. From a professional standpoint, this site is especially useful because of how its data can help teams adjust as new information comes in. At the least, this is a useful resource for journalists writing in the pharmaceutical space.
The FDA, known as the Food and Drug Administration, is the agency that’s responsible for protecting public health through the supervision and approval of drugs, food products, supplements, vaccines, and cosmetics among other consumer products. As a resource, the FDA has datasets available for the public to peruse while also providing technical data for people who are comfortable working with spreadsheets and analyzing the information that comes from the datasets. This is definitely a useful resource for journalists.
Ever wondered about exactly how much the country pays in the wake of a drug epidemic? Are there rumors of people consuming drugs differently than before? Drugbase offers a database that’s chock-full of statistics on the trends and the usage of drugs in the United States. There are infographics as well as publications on topics like comorbidity of addiction and mental illness or facts on drugged (not drunk) driving. This is a resource that provides enough information to spot trends and make comparisons against past data.
UNODC, or the United Nations Office on Drugs and Crime, has a website devoted to furthering its goal of helping member states adopt stronger standards of research, data collection, and forensics. On this site, researchers can find numerous statistics and publications that cover subjects like data collection, trend analysis, and research programs where possible. This is a resource that’s full of information on a variety of forensic-related topics as well as the science of the subject.
58. Drug War Facts
Drug War Facts is a site that offers extensive discussion of the war on drugs as well as the consequences of the policy. This includes statistics on details like the cost of treatment as opposed to the cost of relying on law enforcement, numbers on drug control spending estimates, and a slew of information on just about every topic related to the war on drugs. For many people, this is the most comprehensive site on the web with respect to the war on drugs.
The National Center for Education Statistics, often referred to as NCES, is the place to go for any and all education-related statistics. This site has statistics on the state of student lending and projections of education trends, along with datasets and comparison tools that can be used for doing more in-depth analysis. Journalists can use this resource to uncover trends, verify public statements, review the National Center for Education Statistics’ publications, and find new stories in the data.
60. World Bank
The World Bank hosts numerous statistics and data compiled by the Development Data Group on the financial sector as well as macroeconomics. It’s possible to sort through data by using hashtags. Users can choose between a variety of indicators and make a selection by country in order to review the different measures of developmental progress. As such, this is a resource that anyone looking into the financial or economic state of member countries can benefit from having access to.
The Bureau of Labor Statistics is a journalist’s go-to source for numbers and statistics as they relate to current working conditions, what’s happening in the labor market, as well as how prices change and affect the US economy. With the agency’s statistical work dating back to 1884, there’s no shortage of economic data there for researchers to peruse. The site stores the information in a user-friendly interface and constantly updates the data that’s available for searching. This is a data source worth exploring.
62. The Numbers
Blockbuster releases get a lot of media attention, but it’s hard to tell how well a company has actually done without numbers. Enter “The Numbers”. This website offers research and data for the film and entertainment industry. Researchers can explore revenue estimates, expectations for upcoming releases, and other investment data via OpusData’s SQL-based search engine capacity. The Numbers is the first place for researchers to visit for reliable statistics on movies and films. That’s what makes it an excellent resource.
63. Film Forever
Film Forever is a site that researchers can visit for market intelligence and data for the movie industry in the United Kingdom. Here users can find weekly box office numbers for the top 15 UK releases, audience research, reports, case studies, and the organization’s flagship Statistical Yearbook. In addition, the site also has a calendar that keeps viewers informed about when the next statistics will be released. Film Forever’s niche makes it a particularly worthwhile data source.
IFPI is a site that prides itself on having a finger on the pulse of the worldwide recording industry. Users will find published reports full of insights into recorded music, national and global sales data, as well as reports on the business side of the music industry that show how companies are investing in music. This site will keep researchers up to date on what’s happening in the music industry in real time.
Statista is a search engine like Google, only instead of webpages the site returns data and statistics. With a single push of a button, users can get immediate access to over one million statistics and facts. Users will find infographics and statistics on China, the food industry, and consumer markets, and, for a fee, dossiers and industry reports as well. Whether looking for information on the economy, social media, or the Big Mac, this is the place to do it.
The EPA, short for the United States Environmental Protection Agency, is the government agency responsible for protecting people and the environment by enforcing the laws passed by Congress. On the EPA’s website, users can look through a number of datasets on topics that range from agriculture to subjects as narrow as annual releases of toxic chemicals and waste management methods. This site is an excellent choice for journalists who want access to raw environmental data.
This website for the Centers for Disease Control and Prevention bills itself as a “one-stop shop for environmental public health data”. At this site, researchers will find references to and lists of data systems that receive national funds while tracking and storing information that relates to concerns of environmental public health. With a focus on programs that operate at a national level and accessibility through direct download capabilities, this is a resource that can be counted on for the latest and most accurate information on the web.
Established through the merger of three previously independent agencies, the National Centers for Environmental Information is the place to go for high-quality information on the environment. With comprehensive data that ranges from ocean data to ice records from millions of years ago, chances are that if an issue involves the environment, this website will have information on it. The agency’s commitment to accuracy and excellence in its stewardship of one of the largest archives of its kind also makes it one of the few sites online that possesses, updates, and maintains this type of data.
The National Oceanic and Atmospheric Administration’s National Weather Service will tell researchers everything they need to know about the weather. This site offers data searches that include information on categories like warnings and forecasts, climate, geographical forecasts and more. In addition, the site comes with an intuitive, easy-to-follow map with tabs that can be clicked on for different results. Whether reviewing what happened locally or finding the forecast for a city in a different state, this site will uncover information quickly.
Wunderground is a resource that’s dedicated to making sure that information on the weather is available to everyone around the world with attention also being paid to areas that don’t receive a lot of coverage. Wunderground explicitly states that it has taken steps to ensure that the user experience is excellent on multiple digital platforms. This means that the site is accessible through mobile as well as through PC, making it an ideal resource for journalists who are on the go.
Weatherbase provides information on current conditions, averages, climate information, and travel conditions for over 40,000 cities around the world with the help of a simple search bar. Use the companion site that Weatherbase links to for additional travel information such as currency converters, coordinates, and county information, among other fun facts. Weatherbase can also be used to find places to vacation purely on the basis of what the weather will be like. Happy searching!
72. Energy Atlas
Published under the International Energy Agency, the Energy Atlas presents researchers with the ability to see the world through energy statistics. Designed from its inception to be a complementary data source, the site boasts an animated Sankey flow energy balance as well as several databases to go with the publications that can be perused on the International Energy Agency’s statistics page. Researchers will find both this site and its companion sites extremely useful while researching the ways in which countries and cities use energy.
The Bureau of Economic Analysis, or BEA for short, publishes a broad range of useful information that allows researchers to keep their proverbial fingers on the pulse of the nation’s economy. On this site, there are figures on US economic accounts that cover consumer spending, GDP, and fixed assets among other useful data. Researchers can search by region or industry as well as by level with international, national, and regional search options. Try the interactive data page to find out more about the bureau.
The website of the National Bureau of Economic Research, or NBER, is a data source that approaches economics from an analytical standpoint. It hosts data on a wide range of economic topics with such entries as the Index of African Governance, the Official Business Cycle, Experimental Coincident, Leading and Recession Indexes, and the Macro History Database. NBER has official datasets published and compiled under its own name as well as indexes compiled by other publishers.
The United States Securities and Exchange Commission is an agency that acts as a watchdog of sorts in promoting transparency, fairness, and efficiency in the markets. Interestingly enough, the site has a financial statement dataset dating from January 2009 to October 2017 with updates being made every quarter. Researchers can rely on this site to stay on top of the latest filings and what they reveal about companies and the state of their finances.
The International Monetary Fund, also known as the IMF, is a well-established organization in the international economic and financial sector. On the website, researchers can find a host of data on those subjects. Users are able to search datasets by indicator and country and browse the charts and maps while doing research. Popular datasets include Direction of Trade, Primary Commodity Prices, Financial Soundness Indicators, surveys, and International Financial Statistics among other items of valuable information.
Originally conceived by Harvard, the Atlas is an online tool that allows people to visualize and interact with a country’s trade situation. Atlas will then take the information and propose different products that the country could potentially manufacture in order to improve its economy. This is a tool that’s used by policymakers, businesspeople, investors, and engaged members of the public who want to have a better understanding of the economic climate of a given country. Questions of trade and national economies have never been more accessible.
78. Doing Business
Doing Business is the result of an effort to make objective evaluations of business regulations. The site examines nearly 200 economies and numerous cities, measuring economic indicators and ranking the ease of doing business. This site allows users to examine the effects of various types of business regulations between countries and hosts reports as well as extensive qualitative data. The site also makes it possible to make comparisons over time.
Originally a project of the United Kingdom’s Department for Business, Energy, and Industrial Strategy in conjunction with the Department for International Trade, Comtrade is an excellent resource. Borrowing data from the United Nations’ Comtrade Data, the site provides users with an interactive chart that can be used to search, compare, and analyze the exact numbers for the trade in goods between countries. Just select the reporting country, choose a partner country, and narrow the remaining selections as far as possible.
Global Financial Data is a source that doesn’t just compile standard financial data; it gathers financial information dating from the 1200s to now. This information is derived from a variety of sources including books, archived materials, academic journals, and news periodicals. In addition, the site has datasets that utilize the chain linking statistical method. The end result, from the user’s perspective, is a resource that’s like no other on the Internet by virtue of its exclusive data.
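For readers unfamiliar with chain linking, the short Python sketch below shows the general idea: two series measured on different bases are spliced into one continuous series by rescaling the older one at a period they share. This is only an illustration of the generic technique, not Global Financial Data’s actual procedure; the function name `chain_link` and the sample figures are made up for the example.

```python
def chain_link(older, newer):
    """Splice two series (dicts of period -> value) into one continuous series.

    Hypothetical helper for illustration: the older series is rescaled so that
    it matches the newer series at the first period the two have in common.
    """
    overlap = sorted(set(older) & set(newer))
    if not overlap:
        raise ValueError("the two series must share at least one period")
    link_period = overlap[0]
    # Ratio that converts values on the old basis to the new basis.
    factor = newer[link_period] / older[link_period]
    # Keep rescaled older values only for periods before the link period.
    linked = {p: v * factor for p, v in older.items() if p < link_period}
    # From the link period onward, the newer series is used unchanged.
    linked.update(newer)
    return dict(sorted(linked.items()))


# Invented sample data: an old index (base 2000 = 100) and a rebased one (2000 = 50).
old_index = {1998: 80.0, 1999: 90.0, 2000: 100.0}
new_index = {2000: 50.0, 2001: 55.0}

for year, value in chain_link(old_index, new_index).items():
    print(year, value)
```

The point of the technique is that year-over-year growth rates inside each original series are preserved, so a long historical series can be assembled from sources that never shared a common base.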
Visualizing Economics is less a resource in the data discovery sense of the term and more a service that focuses on designing information graphics and interactive dashboards. In addition, Visualizing Economics also does analysis and design for the express purpose of making economic data easier to understand. Through this site, journalists have a legitimate opportunity to work with a professional who has years of experience translating economic data into something more accessible to the general public.
The EU Open Data Portal is a project that was set up in the aftermath of a decision made by the European Commission. On this site, EU institutions offer data for public viewing and use, free of charge and without copyright restrictions. Datasets include the CORDIS reference data, the transparency register, and even a full list of the people, entities, and groups the EU has issued financial sanctions against. In addition, the data is available in a number of digital formats.
83. Open Data Network
The Open Data Network is a site that allows users to look up data by region and city. The site sports a clear and intuitive homepage where researchers can search by data category, city, and even by sample questions. On each page, after going through either the data categories or the sample questions, there are convenient links to even more datasets as well. The organization of data alone makes the Open Data Network a site that’s well worth exploring.
The Landmatrix is a site that offers an online database for land deals with the intention of promoting more transparency on acquisitions. Essentially, this tool can be used to visualize and make sense of the various land deals. The data is always improving, changing, and being adjusted in order to improve the accuracy of the information made available. To date, the Landmatrix has information on over 1,000 deals. It’s a source worth exploring for researchers.
The United Nations Development Programme hosts a lot of useful data on human development around the world for the public to explore. With dates generally spanning from 1990 to 2015 in a lot of these datasets, the indexes include full tables such as trends in the human development index, the gender inequality index, and the life-course gender gap. Researchers can search the data directly through the search bar or browse the chart by country.
The OECD, short for the Organisation for Economic Co-operation and Development, has a site that’s focused on aiding governments in anti-poverty initiatives and prosperity through economic stability and growth. On this site, researchers will find peer-reviewed materials and publications, as well as standards and arguments in favor of setting standards. The OECD also hosts a factbook that provides a solid economic reference tool to go with a number of surveys and predictions on economic outlook that can be found on its pages.
The US Department of Health & Human Services operates a site that provides information on the President’s Council on Fitness, Sports, and Nutrition. Its facts and data are compiled with the assistance of several experts in related fields, such as chefs and athletes. In addition, the site has a host of statistics. Researchers can find facts on the physical activity of children, the muscle-strengthening habits of adults, the dieting habits of the public, and obesity, among numerous other facts and statistics.
Partners in Information Access for the Public Health Workforce is a project that came about as a result of a collaboration among public health organizations, US government agencies, and libraries specializing in health science. Topic pages on this site include such subjects as grants and funding, health promotion and health education, and literature and guidelines. Through the Public Health Topics section, there’s also data on subjects such as bioterrorism, public health genomics, and dental public health, to name a few.
For the last three decades, the United Health Foundation has been providing information on health rankings for use as a means of measuring public health. The site hosts numerous publications, including reports on the health of those who have served, senior reports, women and children’s health, annual reports, and even briefs on topics important to the field. Use the interactive map to explore by region and learn more. There’s also a search bar for further navigation if researchers are looking for something more specific.
In the United States, Medicare is the primary means that a lot of people rely on for health insurance and access to medical treatment. Along with the services it offers in real time, Medicare also offers data on standards and quality of treatment across facilities and hospitals via its comparison chart and rule. It’s the official dataset used by the Hospital Compare website and it’s full of data that can be downloaded into Excel for further ease of access.
Surveillance, Epidemiology and End Results, also known as SEER, has a site that’s especially useful as a source of cancer statistics. It hosts statistical summaries that allow for research on the numbers associated with cancer that can be sorted by the site of the cancer, the ethnicity, race, age, sex, and even by data type. The site also hosts publications, datasets, and software that can be used by researchers for even deeper analysis.
Amnesty International is an organization that has long been an advocate for human rights and justice around the world. It also happens to host a lot of data on the status of human rights around the world as well as information on specific atrocities and crimes against humanity at different points as part of its annual report. Researchers can use the information to make comparisons between different years and to see how different countries have evolved or regressed in the area of human rights.
Since its conception 25 years ago, the Human Rights Data Analysis Group has been applying scientific principles to human rights violations in different countries around the world. The site hosts publications, sorted by year, that have appeared in reputable media outlets such as the Washington Post and in formal publications through Macmillan publishers. Along with its organized publications going back years, there are also projects occurring all over the world. For a more technical look at human rights violations, this is a great place to search.
This site hosts databases compiled by numerous reputable organizations, universities, and even government agencies. Examples of these would be the Manifesto Project, the Minorities at Risk Project, the Comparative Welfare States Data Set, and the Armed Conflict Database. There are some projects like the Polity IV Project that go back to the 1800s. Meanwhile, projects like the Stockholm International Peace Research Institute (SIPRI) measure arms transfers, international military spending, and security trends. The best way to appreciate the data would be to head to the site and explore.
The Uppsala Department of Peace and Conflict Research, often referred to as UCDP, hosts a massive database called the UCDP Conflict Encyclopedia. This is a site that allows users to click through and explore the data the department has already disaggregated. The data can be explored on the website and also downloaded for further manipulation and analysis. This is a resource that can be counted on and referenced for quality information distributed in an accessible manner.
The United States Department of Labor hosts a lot of economic data on employment and unemployment. Naturally, its databases cover mass layoff statistics, employment projections, job openings and workplace turnover, national employment statistics, and even international labor comparison statistics. The site provides information that’s up to date and accurate while the Department of Labor keeps track of it all. This is a reputable resource with government backing for the purposes of research.
The Small Business Administration has long been a proven resource for established and aspiring entrepreneurs. This site hosts a ton of statistics on employment as well as information that allows researchers to do market research and competitive analysis. Here researchers can find numbers, statistics, and tools that can be used to uncover additional data. For information on small business statistics from an employer and business perspective, this is an excellent resource that journalists can turn to at any time.
Crowdpac is a platform that allows political candidates to fundraise and organize. Drawing heavily from the idea that there are a number of congressional candidates each election that basically run unopposed, this site allows engaged citizens to organize support. With articles discussing relevant political issues like gerrymandering to go with additional topics like civil rights and national security, this site represents an excellent opportunity to understand and find out what’s happening in the grassroots political scene.
This site is home to the famed Gallup polls. Gallup specializes in analytics that allow organizational decision-makers to solve problems through a data-based approach. Furthermore, the advice offered by Gallup is often useful for driving solutions. This is a source that has recognition as the gold standard in data and advanced analytics. Just browse the site to explore reports on everything from the state of the global workplace to discussion of US productivity.
100. Berkeley Library
Berkeley Library hosts a full compilation of statistics and data for Political Science research on its site. On this page researchers will find a ton of links that provide access to a number of datasets as well as the capacity to build their own. Among these are the Historical Statistics of the United States (HSUS) Millennial Edition, Data Planet, ProQuest Statistical Insight, and the Inter-university Consortium for Political and Social Research. There are several hours of data to get through.
101. RAND State Services
For those who don’t know, the RAND Corporation is an organization that specializes in research into public policy challenges. With clients and a portfolio that spans all levels of government, the corporation is a source of quality research for the purposes of decision-making. The US branch of the corporation hosts a set of database statistics on its website. Here researchers can find information on K-12 education, health, business, and economics among other categories that address issues that are relevant to the public good.
Housed at Cornell University, the Roper Center for Public Opinion Research specializes in collecting, distributing, and preserving public opinion data. As an example of the sort of information the Roper Center can uncover, researchers have access to data from the US election as well as a link to an archive of over 23,000 datasets. Whether journalists are looking for public reaction to politics or to a recent health scare, this site is almost certain to have information.
103. Transportation Gov
Operated by the Bureau of Transportation Statistics, this site has data that spans a broad range of transportation-related subjects. Resources on this site include reports on energy, passenger safety, system performance, transportation economics, infrastructure and freight transport. Users can even sort and access the data offered on this site by location and geography. This is a site that allows researchers to find out everything they could possibly expect to know about transportation-related topics.
104. Travel Trade
Travel Trade is a site that hosts data concerning US citizen departures dating from 1996 to 2016 as of this writing. The stated goal of this resource is to help interested members of the general public process and understand how international tourism has operated over the years. Available both for download and for online viewing, this is an accessible source of information. Researchers can easily use this data to find trends and make comparisons.
Skift is a site that focuses on providing intelligence and data to the travel industry. Among other sources of data, the company hosts research, conferences, and informative newsletters for subscribers and researchers to choose from. Skift examines topics that individuals in the travel sector would want to know such as where people are increasingly travelling, identification of new markets, and a lot of additional information on travel technology that researchers have the option of exploring.
Geoba.se is the perfect site for people who want the facts and nothing but the facts about a city or location. Using the search engine on the homepage, finding coordinates, information for travel, weather, and even local webcam footage is just a few simple keystrokes away. The site also hosts a page that provides information on world rankings that can be narrowed down by region and country. In short, this is a resource that’ll provide pure data and statistics.
107. US Travel
US Travel hosts a site that’s operated and maintained by the US Department of State. The stated mission of the site is to protect the lives of US citizens who are going abroad. As such, this source hosts statistics, information, and reports on such topics as US passports, US visas, intercountry adoptions, deaths overseas, and international parental child abductions. The information can be used while planning trips but can also be used to identify long-term trends with the statistics spanning from 1996 to 2016.
108. UK Data Service
Financed by the Economic and Social Research Council, the UK Data Service publishes a broad range of data. This site has information ranging from business data to cross-national surveys, surveys sponsored by the UK government, and even UK census data. Basically, the website was designed with the needs of students and researchers in mind. In addition, there are guides, resources, and instructional materials that’ll help researchers understand and use the tools on this site quickly.
Run by the Australian government, Data.gov.au offers easy access to, and searching of, open data. This site explicitly points out that the government data can be used to develop tools and applications that in turn can be used for the benefit of Australians. Not only is there access provided to the open datasets, but there’s also unpublished data that can be accessed for a fee. For researchers who want to perform an even deeper analysis, the site also offers a Data Toolkit.
Everybody knows Twitter for its fast-paced conversations, short messages, and its status in popular culture as a hub for breaking news. What a lot of people don’t know, however, is that Twitter also has developer tools that make it easier to filter and discover information. These tools even allow researchers to view trends and filter by geography. Whether reading up on trending hashtags or exploring the developer tools, Twitter is a resource journalists have been using for quite some time.
Instagram isn’t purely for liking cute cat pics and adorable baby photos. Or at least, it doesn’t have to be. The app has a surprisingly sophisticated set of developer tools that make it easy to understand and do research on the audience. In addition, hashtags and the clues revealed by the photos people post as well as the individuals who get tagged in them can be a treasure trove of information. Instagram is a useful way to uncover what’s trending in different sectors.
112. Foursquare
For the type of research where location matters, Foursquare is a useful data source because of its massive database and all of the information that it has compiled. On the surface, it has a city guide that provides recommendations for users on the strength of the community. Foursquare also has developer tools that allow for additional information access through the Places Database. Journalists can use this to learn more about specific locations and about the people who use the app.
113. New York Times
The New York Times is considered by many to be an esteemed member of the Fourth Estate, and there are very few journalists who haven’t heard of it. What’s often overlooked, however, is the use of the New York Times as a data source through its API. Researchers can find articles dating back to 1851 by month, search articles, and even find book reviews. This API allows for searching based on views, shares, and emails and even for finding and accessing comments.
The Associated Press has a permanent place in popular culture as a source of timely and accurate news. Thanks to its developer tools, it’s also a useful source of data for journalists. As of this writing, researchers can use these tools to create their own editing while downloading pictures and videos. The level of content appears to depend on the type of plan researchers are using, but the Associated Press API nonetheless allows users to take the research process to another level.
115. FiveThirtyEight
Journalists may already be familiar with Nate Silver, FiveThirtyEight, and his statistical model thanks to his sometimes unexpected but usually correct predictions. FiveThirtyEight has a GitHub repository that hosts datasets as well as code that has been used over the course of the site’s history. The datasets feature amusing subjects like bad drivers, the Avengers, and a survey on flying etiquette. At the same time, there are also files that address more serious matters like airline safety and hate crimes.
IMDb is considered by many to be the most comprehensive site on the web with respect to the film and acting industry. If there’s a movie coming out and people want to know who’s acting in it or to see the general reaction of the movie-going public, chances are they’re going to land on this site at some point during their search. IMDb also hosts a number of datasets that are refreshed every day and are available for commercial and non-commercial use.
KAPSARC is a data portal that hosts a total of 923 datasets with specific information on energy. These sets are divided into a few general themes: energy use, energy supply, and other relevant factors like policies, demographics, the environment, trade, water, and economic information. For researchers who are interested in energy and how it’s used across different industries and sectors, KAPSARC is one of the most comprehensive energy data sources on the web.
118. Asset Macro
Asset Macro is a site that provides historical financial data and macroeconomic indicators. This data covers more than 75,000 stocks, currencies, commodities, and bonds the world over. In addition, the site has more than 120,000 macroeconomic indicators users can use to explore the financial data of different countries. On top of all of this financial market data, the site also discusses investment strategies. This source is unique because of the sheer volume of information that can be found.
The US Government Web Services and XML Data Sources are hosted on a site called USGovXML.com. Here, users can browse through the different XML data sources and web services that the US government has provided. This simple act of preservation keeps those web sources transparent and accessible to the public. For researchers who are regularly monitoring this index in general, it’s possible to find a story in the data in the event that there’s a sudden change to the XML data.
Figshare is a site that hosts over 5,000 pieces of scientific content available for academic research and citation. On top of the information there, the site is designed to offer researchers a single location for the purposes of compiling, uploading, storing, and managing the research that they find. Mathematics, health sciences, engineering, chemistry, biological sciences, and social sciences are all listed as featured categories. This site is a great source for journalists in search of more academic resources to cite.
LinkedData is a site that’s dedicated to the idea of finding new ways to connect Internet data that wasn’t linked before. Here, users will find tutorials, guides, and datasets that will get the story going. The datasets all focus on the topic of getting involved with the linked data community, and besides the linked data shopping list, most are categorized as dereferenceable URIs either with or without the complementary RDF format. To learn more about this community, this site is a must-see.
122. The Web Miner
The Web Miner is the perfect place for researchers who want to collect all the generic data they can find with the program. This site hosts example databases such as US restaurants, SWIFT codes from banks around the world, US gas stations, American tourist attractions, and Google Play apps among other massive lists. If nothing else, it’s a site that’ll make it easier and faster for journalists to sift through and uncover massive amounts of data in significantly less time.
123. Data Hub
Data Hub prides itself on being a place where users can find and publish data as quickly and efficiently as possible. The site itself hosts a number of data sets. The House Price Index (Case-Shiller), the monthly price of gold, and the Current Trends in Atmospheric Carbon Dioxide are the three most popular. In addition to the data, the site also hosts a number of tutorials that users can go through in order to learn more about navigating the various types of data available.
124. Enigma Public
On its site, Enigma Public dubs itself “the broadest collection of public data” available on the web. The datasets fall into one of four broad categories: FOIA, Essentials, Newsworthy, and Under the Radar. Some of the data on this site includes White House employee salaries and Active Federal Firearm Licenses. After making a free account, users are able to access any one of the categories of data that are there for the viewing.
Most web users are familiar with the name Yahoo due to the likes of Yahoo! News and Yahoo! Finance among the company’s many online properties. Of interest to researchers and journalists is the fact that Yahoo also hosts a vast number of datasets including Yahoo! Music User Ratings of Songs with Artist, Album, and Genre Meta Information, v. 1.0 and the Yahoo! Movies User Ratings and Descriptive Content Information, v.1.0 to name two. Journalists in search of new statistics can’t go wrong with this source.
126. 1000 Genomes
1000 Genomes is home to a project of the same name that ran from 2008 to 2015. The purpose of the project was to find every genetic variation that occurs in at least 1% of the populations being studied. Along with the publications that came about because of this project, there were also massive datasets that included separate databases of variant calls, raw sequence files, and sample availability. This data can be either browsed or downloaded.
CBOE is a futures exchange that focuses primarily on volatility futures. In particular, the site features plenty of materials concerning the futures based on its trademarked Volatility Index. The site hosts market data of all sorts including historical data, daily market statistics, and VX Futures Daily Settlement Prices. For journalists who are seeking quality market data, CBOE is a site that can provide that information in a format that’s easy to follow and understand.
128. St. Louis Fed
The Federal Reserve Bank of St. Louis is one of the most important financial centers in its region, if not the most important. On the website, researchers can peruse working papers, economic data, publications, and information services directly. In other words, there’s no shortage of information on the current and past policy thinking of the St. Louis Fed, which also makes it possible to evaluate the bank’s effectiveness. For business, finance, and economic journalists, this is a top-notch resource for information.
OANDA is a popular online trading platform that deals primarily in CFDs and foreign exchange. On top of the many features added to the platform to attract online traders, OANDA also hosts a lot of historical rates data as well as historical information from the currency converter on the site. Along with all of this data, the site also offers information on investment strategies along with news and market analysis. An account isn’t even necessary for accessing most of this data.
The Australian Bureau of Statistics, or ABS, not unlike its American counterpart, offers objective data, economic information, and research on a broad range of topics that are relevant to the country. Directly on the site itself, researchers can look up statistical data on business indicators, health care, housing, finance, international trade, and mental health, as well as price indexes and inflation. Journalists can run searches to find older surveys and information, and can also sort the results by region.
131. London Database
Originally conceived and operated by the Greater London Authority, the London Database is the city’s attempt to make its data more accessible to the public. The end goal is to give people access to this information while encouraging them to use it for free in whatever way they want. On this website, users can search data by topics such as Arts & Culture, Crime and Community Safety, Education, and Health. Journalists who are interested in this type of data can now get it directly from the local government.
132. Stats NZ
The government of New Zealand hosts a ton of statistics and data for researchers to dig into and analyze on this site. This information can be sifted through using the search bar at the top, by filtering for location and region, as well as by topic. Some of the topics include economic indicators, health, income and work, industry sectors, environment, and business. Between the additional news sources and releases highlighting various findings and statistics, journalists will uncover all sorts of New Zealand-specific statistics through this site.
Run and operated ultimately by the Government of Australia, the Australian Bureau of Meteorology’s website features weather information as it pertains to the various cities and regions of Australia. Per the site, this agency was established as a means of helping Australians cope with the climate around them through a combination of warnings and advice. Here researchers will be able to find seasonal outlooks, water storage, rainfall forecasts, climate variability, and seasonal streamflow forecasts. At this site you will find accurate and reputable coverage of Australian weather.
This site is on the web courtesy of GroupLens of the University of Minnesota’s Department of Computer Science and Engineering. The site offers publications as well as datasets for research purposes. There’s a total of about six datasets, among them Book-Crossing, MovieLens, and HetRec 2011. In short, this is a useful resource for journalists who are seeking to better understand how to use the data provided.
135. KD Nuggets
KD Nuggets is a site that focuses primarily on data science, business analytics, machine learning, and data mining. One page on the site has a complete list of datasets that people can use to explore data mining and big data further. Bioassay Data, Asset Marco, DataMarket, Casualty Workbench, Data Ferrett, and Datamob are all linked to there. This is a fantastic resource for journalists who prefer having all the information on one page.
Everybody who’s used a PC or a laptop has probably heard of Microsoft at least in passing. Interestingly enough, on top of PCs, laptops, and software, Microsoft also hosts a lot of research and publications. This includes breakthroughs such as the company’s quest to create literate machines as well as cloud-based data science. There’s also additional information on tools Microsoft is developing like Visual Studio Code Tools and the developments in AI that they represent.
Exactly like it says on the tin, R Datamining is a resource on R and data mining. The site provides numerous examples and documents that give an in-depth perspective on data mining and data mining with R. In addition, there are also links to training courses such as the short course offered by the University of Canberra. There are links to free datasets and presentations, including datasets that cover subjects like airplane, airline, and route data, as well as links to sites like GeoDa.
Collaborative Research in Computational Neuroscience, also known as CRCNS, has a number of datasets that can be accessed through its site. The datasets are categorized by brain region and subject, with the visual cortex, the hippocampus, the motor cortex, avian, eye movements, and aplysia as just a few examples. These folders also include challenges, tools, simulations, and methods. The ability to share this data makes it an even better resource to use for research.
Per its website, the Protein Data Bank archive has been a premier resource on nucleic acids, the 3D structures of proteins, and complex assemblies since 1971. The archive was formed with the explicit mission of keeping this information in the public domain, and researchers can go here to view validation reports and data dictionaries online. There are also data growth and usage statistics available for web-based sorting and analysis as well as for download. Best of all, the site is always adding new information.
141. The PubChem Project
PubChem as an official project was designed to inform the public about what small molecules are able to do from a biological standpoint. The site links three databases: PubChem Compound, PubChem Substance, and PubChem BioAssay. In addition, the site also makes it possible to search for the similarities between different proteins. For researchers taking their data analysis to the next level, the site also offers free coding and tips.
142. Coremine Medical
Coremine Medical is an invaluable resource for anyone searching for information on biology, health, and medicine. Now that the biomedical text mining capability of PubGene has been rolled into its current form, Coremine is also one of the most flexible sources of biomedical information around. This site will display links between concepts and ideas that may not have been noticed otherwise, in a visually engaging, easily understood format. It’s easily one of the most comprehensive biomedical data sources available to journalists.
143. Tu Tiempo
Tu Tiempo is an incredible source of weather and climate data for every country in the world. Using this resource, it’s easy to find annual, monthly, and daily averages for virtually every city and region in the world. In addition, anyone can search through the database of over 115 million historical records. Depending on the region being searched, it’s possible to find data that goes as far back as 1929.
This is a site that provides access to quite a bit of the data that was first used in its computer-based experiments. The full list of datasets covers types of data including news graphs, biological graphs, citation graphs, collaboration graphs, engineered graphs, and semantic graphs. The page also links to a list of sources that contain a lot of information, such as the dataset that examined roughly 3 million US patents. The page also boasts an impressive compilation of Complex Network datasets.
Scopus is a tool that allows individuals to quickly and easily find research and academic citations. The site offers an incredibly extensive database of research that has occurred around the world in a number of fields, including medicine, technology, the social sciences, and the arts and humanities. Use Scopus to capture academic sources that might’ve been overlooked. After all, in many circles, the quality of an academic source can be almost as important as the information it provides.
Stanford’s reputation as a prestigious academic institution didn’t come out of the blue. The excellence shows through in its programming-related courses. The site also hosts a number of datasets that include details such as social network information. There are datasets centering on social circles on Facebook, Wikipedia admin requests, Twitter social circles, and Google+. Communication networks and the Amazon Product Network also have their own datasets.
147. University of Milano
The University of Milano’s Department of Information Sciences runs and operates a web page known as the Laboratory for Web Algorithms. This site is home to plenty of datasets that are there for the exploring. These include graphs in relation to social networks, Facebook graphs, snapshots from the DELIS project, and a short list of miscellaneous data. The information available here can be viewed online or downloaded, making this one of the most accessible dataset collections of its kind on the web.
The UCI Network Data Repository is a site that’s dedicated to taking a scientific approach to the study of networks. On the resources page, researchers will find links to dataset directories selected by research organizations and groups as well as by individuals. It also has a collection of datasets that would typically be used for social media analysis. Those digging into the data will be pleased to find that these sets are also available for download.
CAIDA, or the Center for Applied Internet Data Analysis, collects a wide range of data from a number of different locations, often with the assistance of different organizations and individuals. The datasets hosted on this site include AS Relationships, DDOS Attacks, and Telescope along with its related sets, among other data. The categories include traffic, topology, security, worm summary, and traffic summary statistics. Some datasets require an access request, but many, if not most, are public.
Crawdad, or the Community Resource for Archiving Wireless Data At Dartmouth, is unique because of its focus on providing wireless data to researchers and others who may have an interest in the subject. The site offers a number of tools as well as access to numerous datasets. Among the sets listed are those filed under Educational Use, Bit Error Characterization, Network Diagnosis, Opportunistic Connectivity, Location-Aware Computing, and more. Researchers will appreciate this resource the more they dive into it.
Often referred to as the EIA, the US Energy Information Administration is in the business of providing annual electricity utility data to the public. The information in this data covers fossil fuel stocks, fuel consumption, monthly and annual information on electricity generation, and environmental data among other options. The data available for analysis covers the years 2001 to 2017. All researchers have to do is navigate to the site and download the information.
Funded by the Natural Environment Research Council, British Oceanographic Data is one of the most accessible sources of marine data on the Internet. Its extensive database touches on currents, CTD profiles, international sea level data, and even historical bottom pressure recorder data. In addition, there are datasets to be found in the Published Data Library, which offers additional access to the catalogue. This is quite possibly one of the most extensive sources of marine information available online.
Factual provides location data for advertising and for use on mobile platforms. Of particular interest to researchers are the developer tools that include the Engine Mobile SDK and the full professional and research applications of the Observation Graph as well as the Local Validation Stack. With a website moniker that emphasizes the company’s passion for taking data from around the world and finding new ways to put it in context, Factual has a clear commitment to data and to finding new and unorthodox opportunities to use it.
Global Administrative Areas is a geodatabase that shows where the various administrative areas in the world are situated. The data gathered from this type of database is then typically used in geographic information systems. The areas covered include countries, which are further divided into provinces, counties, and departments among others. The good news for journalists is that all of this data is available for free for academic and general non-commercial use.
Geonames is a site that’s home to a geographical database with millions of entries, unique features, and alternative names. Offering both an export option and access through a variety of web services, this is a database that processes approximately 150 million requests each day. Thanks to the database’s wiki capabilities, users are able to make adjustments and changes to the database entries with relative ease. This is a great resource for the multi-language hosting capabilities alone.
156. Natural Earth Data
Natural Earth Data is a map dataset that’s available in the public domain and full of information designed for use in map-making software for the creation of state-of-the-art maps. The visuals of the final product are neat and well-organized and the data can be used immediately. This dataset includes intelligence data and various cultural and physical vector data themes, along with raster data. Originally made with the needs and preferences of cartographers in mind, this dataset is useful to anyone with an interest in geography.
157. OpenStreetMap
OpenStreetMap is less a website and more a collaboration between users that now provides mapping services to apps, sites, and various hardware devices. This site acquires new data when users enter information on lesser-known landmarks such as railway stations, roads, and trails. The full dataset is available free of charge on the site and can be downloaded either in full or in part. For those opting to do a partial download of the data, it’s possible to download by region as well.
158. City of Chicago
The City of Chicago is the home of Michael Jordan’s championship Bulls and its own unique style of pizza, and it also has a full data portal of its own. Dataset categories span a variety of topics that include Administration & Finance, Ethics, Health & Human Services, Parks & Recreation, Public Safety, and Historic Preservation. In short, the City of Chicago’s data portal hosts virtually anything that would be of interest to researchers, policymakers, and local journalists.
CKAN is essentially the online home of the City of Glasgow’s open data project. This site has datasets on numerous subjects that are useful for entrepreneurs, policymakers, academic researchers, and app developers. Out of the 360 datasets hosted here, some are related to city governance, like the house stock by tenure dataset, while others, like the cycling dataset, are of particular interest to local residents. There are all sorts of information here for journalists who are covering a more local beat.
160. Government of India
The Government of India has a website that covers analytics and data resources in its version of the Open Data Project. Currently, there are roughly 137,940 resources that have been viewed millions of times on the site. The vast majority of these files are also available for downloading on the site. Whether looking for numbers on the government budget or searching for datasets that address health and family welfare, chances are this site will have resources to offer.
161. Stats SA
This site is full of up-to-date statistics, publications, and data gathered by the South African government. Here researchers will uncover information on everything from food and beverage surveys to economic indicators, employment statistics, population numbers, and important health statistics. It’s possible to search the numbers by city, theme, and indicator depending on what’s needed. This site hosts a lot of information on the census while also releasing statistical publications, questionnaires, codes and classifications, and pricing policy.
This site is published under the umbrella of the U.S. Department of Housing and Urban Development’s Office of Policy Development & Research. It publishes a large number of case studies, bi-annual publications, and periodicals each year. It also offers a large number of datasets that journalists would be interested in, with Fair Market Rents, Income Limits, and Renewal Funding Inflation Factors being just a few of the sets the public has access to on this site.
At Vital Net Health Data, researchers will find plenty of large health-related datasets. Rather than hosting all of these sets, the site offers links to sets that people can visit and find information through. This curated list links to resources like CDC Wonder, Eurocat, and Health Data All Star, and also to the work of charitable organizations such as the North American Association of Central Cancer Registries. This is hands down one of the most comprehensive health dataset resources out there.
164. Analytic Bridge
Analytic Bridge is a resource that’s dedicated to business intelligence. Here researchers will find discussion on machine learning and AI, links to webinars and conferences, and even a job search tab. The site also hosts Data Science Central, which is the part of the site that focuses on big data. With its active and engaged community and its commitment to providing news and information, journalists with an interest in the implications of data for business stand to gain a lot from this.
Known primarily for its efforts to become an online public library, archive.org is home to numerous published works as well as a substantial dataset collection. The site boasts results from the 2012 Internet Census as well as Dark Net Market archives from 2011 to 2015, and even a dataset of public Reddit comments. There are data dumps from Music Brainz and a dataset that contains audio cover images. Between its publications and data, archive.org has plenty of material for journalists to go through.
166. Academic Torrents
This website refers to itself as a system designed for making it easier to share and download huge datasets. Making use of torrent technology to simplify the distribution of data, Academic Torrents prides itself on allowing researchers to download everything they need quickly. The site also hosts papers, courses, and collections for viewing. A quick search through the resources available will reveal that there are tons of datasets and collections available for downloading here.
The best way to approach Dataverse is to think of it like another type of library. Here, researchers can search for, discover, and cite data with ease while simultaneously using this site as a repository for their own information. The subject matter covered includes fields such as the social sciences, the agricultural sciences, medicine, health, and life sciences, as well as the earth and environmental sciences. Big names with publications on this site include Gallup and the US Department of Commerce, Bureau of Census, Geography Division.
168. UC DATA
Operating in conjunction with UC Berkeley’s Social Science Data Lab, UC Data is the university’s biggest and most well-known archive. This site provides offerings in the areas of statistics and social science data. On this site researchers can access the papers, reports, and working papers produced by the UC Data researchers. The raw data covers numerous research areas that include Health Care, Welfare and Social Insurance, Demographics, Voting, and Information Technology among a host of other topics.
169. Joke Camp
Joke Camp offers a full guide to finding soccer and football data and APIs for the purposes of data analysis. If researchers follow the links provided on the page, there’s open source data available through GitHub as well as access to free and commercial APIs for the purposes of easier access. Since the data and code are available on a well-recognized site like GitHub, getting a hold of this sort of data has never been easier.
170. Sean Lahman
Sean Lahman isn’t necessarily a name people are hearing every day, but his site is home to one of the most comprehensive and in-depth collections of batting and pitching statistics on the Internet. With numbers covering the period from 1871 to 2016, the data goes back well over a century. Data is free to access and use under the Creative Commons Share Alike 3.0 license and can be downloaded directly in formats including SQL and Microsoft Access. The statistics can also be downloaded via GitHub.
171. Retro Sheet
Retro Sheet is one of the most extensive sources on the Internet for baseball statistics and data. The site includes details like annual rosters and identification of umpires, players and coaches. For the years that it was relevant, the data for the all-star game was included in the event files along with a set of event files for the post-season and a small discrepancy file. Retro Sheet even has identifications for ball parks for each season. How’s that for thorough?
For those who aren’t as familiar with the program, the Hubway is the name of the bike-share based in the metropolitan area of Boston. Of course, the system didn’t record and release identifying information, but the Hubway nonetheless has the basic information on every trip that was ever taken between July 2011 and September 2012. This included details like the start and end of the trip as well as the pick-up station to name a few categories.
173. Open Flights
Open Flights is a database that has information on more than 10,000 ferry terminals, airports, and train stations around the world. Researchers can find the Excel-compatible .csv version through GitHub and can also download the data directly on the website. Using the map on the homepage, it’s possible to see which specific places are on the list, and the site even goes so far as to have route information available. The site owners can be contacted for even more updated information.
MLVIS is a data repository that combines visual analytics with data mining in real time. This makes it possible to explore more intuitive understandings of data even while working with huge datasets. Benchmark data and non-relational machine learning data, along with different data types such as attributed and heterogeneous, are among the many features and options available through this site. For the added convenience of users, this information can also be downloaded in a single consistent format.
175. Open Data Inception
Open Data Inception is a site that offers links to well over 2600 data portals. By making use of the search bar on top, researchers can search for portals and datasets by category and by theme. In addition, it’s also possible to use the site as a means of finding the most up-to-date version of the dataset being searched for. Take advantage of the ability to view data portals in list format or in interactive visual form and start finding the necessary data.
Available in French, English, and German, OpenDataSoft is a source that offers access to 480 million records, 4 million API calls, and 9,284 datasets. Using the search bar in the middle of the homepage, researchers can enter a keyword or category and find the most appropriate dataset from there. For journalists, this is a faster way to find the most relevant datasets needed to complete the research in question. Visit the site to learn more.
NationMaster is a source of fully compiled data from over 300 countries that has been organized in over 5,000 categories. The data includes figures on the percentage of deaths that have been registered, World War 2 statistics, and even information on nuclear war and testing. Researchers will also find tables, graphs, and pie charts that will allow for further visualization of the data. Put simply, there are so many subjects covered that there’s always something new to find in the data.
Twitter has long been a popular social media site for breaking news and finding trending stories. Followerwonk allows users to take their Twitter usage to the next level. This includes finding Twitter users to connect with, studying current followers, and planning Twitter activity for maximum results. These days there are a lot of reporters and journalists on Twitter who are using the site for networking and getting stories out there. Followerwonk makes Twitter users more productive on the site.
Infochimps is a site that offers cloud-based services that can be scaled back for the purposes of getting the most out of big data. It’s useful when it comes to deploying and integrating big data technology and applications. When researchers are searching through massive amounts of data or evaluating trends in big data, this is an invaluable resource to have. There are also numerous white papers and cases available for researchers to view on the site.
Founded in 2006, Archive-It is a service provided by the Internet Archive. This service helps organizations and businesses create digital collections and as a result it has had opportunities to work with non-profits, colleges, universities, and governments. Researchers can search a few of the different archives on the site such as websites from the 2014 congressional candidate race, the Alabama State Archives, and the Canadian Government Information PLN Web Archive. This site is a treasure trove of information for enterprising journalists.
181. Civic Commons
Civic Commons has a page that lists the various government open data initiatives. This searchable list of resources is organized by country, city, and region, and even makes mention of the resources made available by intergovernmental organizations. For journalists, this site represents a faster way to find out which governments are participating in the Open Data Project. This site also grants access to pieces of localized data that wouldn’t necessarily come up in a simple Google search.
The Guardian is a famous name in the world of journalism thanks to its reputation for breaking news. What fewer people realize is that the site has a section that offers data on and about governments around the world. There are articles on the impact of homelessness numbers, discussion on cyber-security, and even thoughtful discussion on the role that data and statistics have to play in the current political and social climate. The Guardian’s World Government section is capable of jumpstarting discussion and finding angles for stories.
This site belongs to a group operating under the Open Knowledge Foundation with the goal of encouraging and supporting the continued development of open government data. Here, users will discover links to one of the most extensive lists of open data catalogues available. Among the additional goals mentioned on the site, the group also seeks to gather information on policy, best practices, and guidelines. It provides journalists with extensive access to more and better information.
This website is the online home of the open data project offered by the Government of France. It’s possible to dig into the data by searching under categories such as employment, agriculture, education, and travel and tourism. The data allows for a more nuanced understanding of what the numbers actually say while also leaving room for comparisons based on the historical information. Basically, journalists have every reason to be excited about going through this data.
This site stores the research data available through the University of Notre Dame’s use of SourceForge.net. The data is offered through relational databases. The monthly data dumps also make it possible to gain a better understanding of open source software and its applications. In order to access this information, requests for access must be made in writing over email. The catch, however, is that scholarly and academic researchers are the only ones eligible for access to the data.
186. UFO Reports
The National UFO Reporting Center has an online database detailing people’s experiences with unidentified flying objects. Researchers can streamline their database search by using any of four categories: the event date, the shape of the UFO, the posted date, and the state. UFOs are unique because they never fail to capture the imagination of the public. If there have been any recent encounters of the third kind happening nearby, this is the place to find out what people have been saying.
Notorious in the media for its controversies and for what its leaks have revealed about the inner workings of governments and other famous and powerful figures in society, WikiLeaks has a reputation that precedes it. Although the data dumps are rarely dropped quietly, the accuracy of the information is seldom questioned. For journalists in search of stories that will instantly draw interest, WikiLeaks is a proven source. If nothing else, it’ll make for interesting reading.
188. The Washington Post
The paper is already known as an excellent source of breaking news and opinion pieces, but few people know that the Washington Post grants access to the raw data that’s often mentioned in its articles. On the data page, researchers can find data in categories such as education, the census, health and safety, transportation and development, historical World Cup databases, and even numbers pertaining to government and politics. Put simply, having access to these numbers helps people develop a more concrete understanding of the issues in the news.
189. Climate Data
Climate Data is a dataset that provides comprehensive information on global temperature. In the current format, users can see every important piece of climate information through the grids while also being able to see what the averages are. For those searching for the companion data, it’s possible to get access to the same information for land and ocean as well. This information can be downloaded, but for the sake of convenience, it can also be viewed directly on the site.
190. Protein Structure
Protein Structure is a source that seeks to examine how computer networks can be used in conjunction with biology. The page hosts a repository with data that can be accessed through the links provided. Of particular interest for members of the research community is how the site incorporates several ideas like model analysis and executable biology into its pursuit of this goal. For journalists, this site is well worth looking at to observe progress and examine data.
With the help of this site, users can take a course in analyzing survey data without having to pay for the privilege. Analyze Survey Data Free, with its detailed Table of Contents, includes sections sporting titles like Maps and Art of Survey – Weighted Maintenance, Balancing Respondent Confidentiality with Variance Estimation Precision, Structural Equation Models (SEM), and Complex Survey Data. The site offers a great refresher for those who anticipate handling more statistical data in the future.
At the UCLA wiki site, researchers will find a number of datasets available for the purposes of demonstration. There’s plenty of simulated and observed data to choose from. Using these resources it’s possible for people to use this resource to uncover climate data, population data, biomedical data, neuroimaging data, US census data, election data, and economic data among numerous other categories. Ultimately, these datasets are a resource that a lot of people can benefit from using.
On its site page, the University of Toronto offers researchers access to what it calls the Delve Datasets. These collections of data were part of a larger product designed for the purpose of making comparisons between the learning methods. Ultimately, this information is there for the development and evaluation of the different approaches to learning. In short, this is a solid source for researchers who want to better understand how to analyze and handle datasets.
The Natural Resources Conservation Service has a site that concentrates on promoting conservation while offering information on the different mosses, hornworts, vascular plants, lichens, and liverworts present within the United States. The site hosts a full database of plants, complete with images and tons of supporting information. Researchers can download the database and explore topics such as alternative crops. Essentially, this website has everything folks need to know about plants.
As can be surmised from the name of the agency, this service handles the research needs for the US Department of Agriculture. Whenever an agricultural problem is discovered, this is the part of the government that most likely helped find a solution. The site hosts a number of datasets that can be accessed and downloaded directly. Journalists can also use this site to find all the latest news in relation to the issues affecting agriculture.
196. Cell Image Library
This site hosts a public library of resources, information, images, and animations portraying cells and cellular processes. The library is designed with the dual purpose of research and education in mind, and the information here is almost always relevant during discussions of public health and disease. The materials come from a combination of sources including historical and modern publications. For a thorough explanation that simplifies complicated biological processes, journalists can’t go wrong with the Cell Image Library.
197. Complete Genomics
This is the site of a company that considers itself an established part of the biotech space in the area of human genome sequencing. Interestingly enough, Complete Genomics has made quite a few of its whole genome sequences available to the public. Ultimately, this offers all kinds of useful insights on DNA and the sequenced human genome. The only condition on this material is that researchers who are using this information take care to give Complete Genomics proper references.
198. Array Express
Array Express is a repository that stores information from the results of genomics experiments that required massive amounts of sequencing or processing. On this site, users will find the results of over 70,000 experiments to go with more than 2 million assays inside multiple terabytes of data stores. Better yet, this information is free for reuse for research purposes. This is a great resource for all the latest information on genomics and the progress being made in the field.
The Encyclopedia of DNA Elements, or the ENCODE Consortium, is the result of research groups from around the world who are working in collaboration with one another. Ultimately, the goal is to compile a list of all the functional parts of the genome, which involves the close examination of RNA levels, protein, the elements that regulate cells, and the activity of genes. There’s data that can be searched through as well as an encyclopedia that offers further information.
200. Ensembl Genomes
Ensembl Genomes is a site established in 2000 that deals with the genomes of vertebrates. Over the years this resource has added companion information on invertebrate metazoa, plants, bacteria, and fungi. The data on all of these subjects can be found and accessed by clicking through the links available on the site. This site has tutorials, datasets on all of the topics covered, and a collection of documents to browse through. All of these factors make Ensembl Genomes a fantastic data source for journalists.
Gene Ontology is a site that exists for the express purpose of finding a way to represent the current understanding of how genes operate by computer. It has numerous publications as well as additional documentation that people can read. There are annotations hosted directly on the site. The good news for researchers who want to take a closer look at the numbers and raw data is that there are files available for download directly on the website.
Harvard Medical School LINCS Center exists for the purpose of helping the research community and the general public learn more about how human cells react when they’ve been perturbed by drugs. Using the HMS LINCS database and the project explorer tool, researchers can find publications and project summaries as well as general resources. Journalists can also use this site to get a hold of all the latest news and information that comes out of this research.
The Human Genome Diversity Project has been making a lot of progress through the efforts of the Stanford Human Genome Center. The site hosts data covering thousands of samples and markers. These can be downloaded and thoroughly analyzed simply by following the links that have been provided on the page. This is a great resource for journalists who want to understand the information coming from the research community.
204. JCB DataViewer
JCB DataViewer allows those interested in what the Journal of Cell Biology has to say to see the image data associated with the articles published there. The site has a full gallery that people can scroll through in order to see the materials. In addition, viewers also have the option of being able to do further analysis of the data as they peruse the site. Put simply, this site is perfect for understanding the references and figures present in the journal’s articles.
The GDC Data Portal is a platform that’s designed to help researchers and those in the bioinformatics field conduct cancer research more efficiently. There’s an archive, an API, and documentation available for reading. Access to this site means being able to see the same information that cancer researchers are using to conduct their own research. Here, journalists will be able to find all the data they’re looking for and then some.
openSNP is a community-powered project designed for the purpose of sharing genotypes. People who have been typed using FamilyTreeDNA, 23andMe, or deCODEme can upload that information onto the site. The purpose of asking people to do this is so that the site can focus on seeing if connections can be found between genotypes and SNPs, or single nucleotide polymorphisms. What’s of unique interest to journalists is that there’s data available for people to download and enjoy looking through.
Pathguide is a site that’s dedicated to providing information on metabolic and signaling pathways as well as the interactions between proteins at the molecular level. This page hosts a list of approximately 697 resources related to the main subject. The databases that are linked to on this site are all generally free to access. Most of the references provided on this resource list focus primarily on protein to protein interaction. This site is an invaluable resource for biology enthusiasts.
208. RCSB PDB
This is a site that’s dedicated to informing academics and the public at large about all things related to nucleic acids and proteins. The RCSB Protein Data Bank offers access to various tools designed to make this aspect of biology more understandable including visualization tools, 3D structure viewing, and a fully searchable archive that can be categorized by organism category. In addition, this site offers updated news on all of the latest developments in this field.
The Psychiatric Genomics Consortium is the result of collaboration between investigators and scientists from around the world who are working on research concerning the genetic component of psychiatric disorders. Ultimately, this project was able to produce 17 main papers and an additional 31 development papers offering secondary analysis and method with a single landmark paper that came out of it all. The Consortium offers tools, downloads, and access to the findings via the data access portal by request.
210. Pub Chem
PubChem is a respected name in the field of medical and biological research and has been for a very long time. Offering the ability to search structures as well as the Compound, BioAssay, and Substance databases, researchers can’t go wrong with this site. In addition, there are millions of entries present in each of these databases. This information can be viewed through tools like the 3D conformer tools and the BioAssay tools. The data is also available for download.
As the name would suggest, the Catalogue of Somatic Mutations In Cancer, or COSMIC, is dedicated to chronicling and exploring the effects of somatic mutations in cancer. The site makes it possible to search COSMIC by cancer type, gene, and mutation. There are tools on COSMIC such as the genome browser and the cancer browser. In addition, there’s also data on gene curation, drug resistance, genome screens, mutational signatures, and gene fusion curation available on the site for download.
The Genomics of Drug Sensitivity in Cancer is dedicated to finding biomarkers that can help doctors identify the type of anticancer drug that patients are more likely to respond to. Journalists can use the news tabs to stay up to date on the presence of new data or changes to the site. In addition, there’s also a compilation of data on cell lines, a database chronicling the features of cancer, and even a list of compounds all available for viewing on the site.
The Stowers Institute for Medical Research’s website offers researching members of the public free access to the data that its scientists, research scientists, and genomics scientists have used for their publications. For the public at large, the institute takes pains to point out that the Stowers Original Data Repository is typically free to access. That being said, some of the largest files in the database may not be accessible directly over the Internet and may require additional arrangements.
214. SSBD Database
The Systems Science of Biological Dynamics database, typically referred to as the SSBD database for short, provides a suite of tools and resources to be used for the purpose of examining microscopic images and evaluating quantitative biological data. The images found on this site came from a variety of sources and include objects such as cells, single-molecules, and gene expression nuclei. With the data on this site being acquired from computational simulation and experiments, journalists can rest assured that the information here is constantly being refined and updated.
The Personal Genome Project is a site that’s focused on the creation of health, genome, and trait data that’s open and available to the public. Largely continuing the project with the assistance of individuals who have volunteered to make their genomic information public, this site offers the data it has found and successfully acquired to the public for free. Science aside, this project offers journalists an interesting look at the effects of creating a public record of personal genome information.
216. UCSC Genome Browser
The UCSC Genome Browser allows individuals to view genome assemblies. In addition to online viewing, the site also provides links that can be used to download the sequences and annotations for those same genome assemblies. These links are divided into the categories of human, mammals, other vertebrates, deuterostomes, insects, nematodes, other genomes, and other downloads. The tools and directories on this site are also free for personal and non-commercial use. Journalists can benefit from the thoroughness and the accessibility of this information.
The Universal Protein Resource, known by the name UniProt, is the place to go for information on protein sequencing and annotation. Drawing from the information provided by three databases in the UniProt Reference Clusters, the UniProt Knowledgebase, and the UniProt Archive, this site is equipped with peptide and cluster searches among other features. Journalists can use this site to verify, discover, and learn more about new discoveries in the area of protein sequencing and annotation.
The Actuaries Climate Index, or the ACI for short, gives the general public and decision-makers information on climate trends and the effect of climate change in Canada and the US. Researchers can peruse the information provided by this educational tool to discover massive changes to sea and weather. It’s possible to narrow the search by regions and components. This site has decades of data at its disposal and it updates its information quarterly.
The Aviation Weather Center provides accurate, timely, and up-to-date information on weather that the airspace system can rely on. On this site users can view the various graphs, forecasts, and observations on weather framed in a way that aviators can appreciate. It’s possible to view the information provided by the site’s data sources in real time through either .csv or XML output. Researchers can manipulate and observe the raw numbers more closely through this download option.
The Climatic Research Unit’s website is there for the express purpose of performing research on the effects of climate change in the past, studying the causes, and finding solutions to issues of climate change in the present. Here, people can read up on the results of the research, can get an overview of the subject through the information sheets, read publications, and even access the raw data. Journalists in search of raw weather datasets are in luck with this resource.
On the European Climate Assessment & Dataset’s website, the public is able to discover information on extreme changes in either the climate or the weather. Researchers have the option of using the project’s research tool, called the KNMI Climate Explorer, to verify data, examine seasonal forecasts, and even take a closer look at the effects of El Nino among other applications. Since this dataset is updated daily, journalists using this source will be the first to know about any signs of extreme climate change.
Global Imagery Browse Services, GIBS for short, is an essential part of EOSDIS in its role as a provider of imagery services that are responsive and based on community standards. Put another way, GIBS allows regular people to interact with satellite images taken from virtually anywhere on earth in high definition. Since EOSDIS GIBS is made available through NASA’s earth science data, it’s easier than ever for journalists and researchers to learn more about the world in real time.
Operating under the United States Government’s National Oceanic and Atmospheric Administration, this website tells journalists and researchers everything they could ever want to know about how the Bering Sea’s climate and ecosystem is reacting to the changes that have occurred in the Arctic. Here, it’s possible to read essays, review projections, and use the online data tool to see how various climate indexes, biological, atmospheric, ocean data, and wildlife are doing in the Bering Sea.
The NCEI, or the National Centers for Environmental Information, is the final result after the merging of the National Oceanic and Atmospheric Administration’s three data centers. This page hosts a series of links divided into 22 categories that will link users to different resources, pages of interest, and climate and weather datasets. Journalists in search of information on climate, storms, precipitation, and a host of other weather concerns, will likely find what they’re looking for here.
The National Oceanic and Atmospheric Administration’s Global Monitoring Division provides information on the long-term trends of the forces of climate change on earth through its monitoring of key atmospheric metrics. Among these would be carbon monoxide, methane, nitrous oxide, and carbon dioxide by way of example. These metrics are then used to measure things like long-term ozone depletion, carbon dioxide sources and levels, as well as sinks. This is a climate change resource that journalists can use.
Ever wanted a better way to visualize climate data? WorldClim is a software provider of free data that can then be applied to spatial modeling as well as for creating maps. The current version of this free software can only be applied to the current climate while the old version allows access to climate data from the past and the current and also lets users see the state of future climate predictions. Journalists can just follow the link and download the software.
The Knowledge Discovery Laboratory is a site that’s dedicated to the development of innovative technologies, the basics of machine learning, and the application of that knowledge in practical areas like network science, fraud detection, and analysis of scientific data. The site hosts a fairly sizable dataset in the DBLP with 1.2 million objects and 2.48 million links included in the set. For researchers with an interest in the Knowledge Discovery Laboratory’s goals, this dataset is an excellent resource.
The website of the 9th Implementation Challenge is about helping researchers learn how to solve shortest path problems. The creators built the site with two goals in mind: first, to find the best reproducible solutions, and second, to make it easy for researchers to collaborate and discover more effective solutions. Researchers who are interested in seeing how much progress has been made can review the papers and the datasets on the site.
229. Network Repository
The Network Repository is a site where scientific data is stored alongside interactive visual tools that users can use to analyze it. This site holds the dual distinction of being both the first repository of its kind and also the largest one on the web. Thanks to its graph data and intuitive, visually engaging images, making comparisons and finding new ways to contextualize data is a lot easier. Journalists can use this source to find stories within the scientific data.
230. Pajek datasets
Pajek Datasets is a page that provides a dataset that addresses the interactions between proteins found in budding yeast. After offering a short background explanation on the impact of finding new methods of detecting interactions along with the reasons why being able to discern the importance of various protein to protein interactions is essential, the site links viewers directly to a dataset that’s available for download. To learn more, researchers can click on the link at the bottom in order to read the paper published on the subject.
231. Mejn Network Data
This site seeks to share links to the network data sets that the site owner has used and compiled. The themes of the datasets range from American College football, political blogs, and books on American politics, to social networks, Les Miserables, and high-energy theory collaborations. Individuals with an interest in exploring these will have plenty of interesting dataset themes to choose from. In addition, the data is free to use as long as there are references.
The Stanford GraphBase is the name that’s been given to a bunch of datasets and programs by Stanford’s Donald Knuth. When used in combination with each other, these programs and datasets are able to manipulate and generate graphs and networks. On this site, the materials required are available for downloading through the links. In these files, researchers will find football score data, dictionary data, data that concerns the reconstruction of the Mona Lisa, and many others.
Formerly known as the University of Florida Sparse Matrix Collection, the SuiteSparse Matrix Collection is a collection of matrices that have real life implications. According to the site, this particular collection is used more often than not for numerical linear algebra in developing and refining sparse matrix algorithms. Users tend to like the collection for its usefulness in running and testing the results of experiments. The datasets and matrix benchmarks are available to download directly on the site.
234. Graph Datasets
This is a set of datasets that the creators of this web page believed to be either relational or able to translate well to graph representation. Graph Datasets offers datasets such as the Predictive Toxicology Challenge data, IMDb data, mutagenesis data, MovieLens data, collaborative filtering, and proteins data to researchers who want to learn how to work with the raw numbers. The files are made available on this page and are primarily available to download in XML format.
235. Big Data News
Big Data News is a site that’s focused on big data and the fundamentals of data science. This site is home to the latest news and includes discussion of deep learning and Artificial Intelligence. In addition, Big Data News is also home to a massive dataset that contains a total of approximately 3.5 billion web pages. These are all separated by levels that are referred to as page-level graph, subdomain-level graph, first-level subdomain graph, and pay-level-domain graph respectively.
CNetS, or the Center for Complex Networks and Systems Research, operates under the umbrella of the Indiana University Network Science Institute and the School of Informatics and Computing. The site is intended to be a resource in the fields of data science, computational social science, and complex networks and systems with information on mining and traffic patterns online. In addition, CNetS also hosts a dataset containing approximately 53.5 billion network requests made by Indiana University users.
237. OONI Explorer
OONI Explorer, a part of the Open Observatory of Network Interference, is a project dedicated to providing free and open source software. Users can then run the software to detect the blocking of websites and messaging apps, among other applications. Of particular interest to people who follow this technology is the availability of free access to the raw data that OONI has collected. Just enter the information into OONI Explorer and interact with the data from there.
Challenges in Machine Learning is dedicated to the research and development of machine learning. On this site readers will find links to software, books, machine learning challenges, as well as notifications of upcoming workshops. The site even provides links to challenges that allow for post-date submissions. For journalists who are interested in seeing if machine learning can perform tasks like financial prediction or web page classification, this is a site that’s well worth a visit.
Currently working under the umbrella of CrowdANALYTIX, DataX is the machine part of a community-driven initiative that harnesses the power of the collective to create custom Artificial Intelligence, machine learning, and Natural Language Processing applications. The role of DataX in the process is maintenance and deployment, which in turn serve to make these solutions scalable. For journalists who have tons of text, video, and data to sort through, the bots available through CrowdANALYTIX and DataX can cut research time considerably.
240. Driven Data
Driven Data combines crowdsourcing with data science in a way that almost no other site does. Emphasizing its role in providing assistance to organizations who are tackling different social challenges, this site offers help by putting its substantial data science community to work creating statistical models that solve predictive questions. Driven Data appears to work primarily with nonprofits, but it’s potentially useful for anyone who has raw data in need of refining. Journalists can benefit from keeping this source in mind.
241. Open Big Data
Dandelion API is an application that handles semantic text analysis for big data. What this means for people who have data that requires processing is that this program will take disorganized text and find a way to put it in context. Journalists who are parsing through a lot of documents can definitely benefit from that capability. Dandelion API also has Open Big Data under the categories of Milano, Trento, and Europe. Although this API is paid, there’s a daily amount of text that can be analyzed for free.
242. Earth Models
Earth Models focuses on sharing and storing software and datasets as they relate to the earth. The modeling tools mentioned on the site include simulation software and processing as well as virtual data that borrows heavily from specific areas of study like tectonics and seismology. Journalists and researchers who wish to refine their knowledge on the subject can use the publications and articles on this site to do so before diving in with the modeling and visualization tools.
The Socioeconomic Data and Applications Center, or SEDAC, is one of the data centers associated with NASA’s EOSDIS system. On this site, readers will find datasets that offer numbers on climate change or gridded demographic information. The datasets can also be searched by themes such as Governance, Agriculture, Land Use, Health, Conservation, Climate, Water, Remote Sensing, and Poverty. There are maps, galleries, guides that give more context to the data, and additional resources and tools that researchers can access on the site as well.
244. AODN Portal
The AODN Portal, run by the Australian Ocean Data Network, offers access to Australian climate science and marine data. Researchers who access this information will also have access to data and metadata from IMOS, a research framework that multiple institutions, including the Australian Government, support. Researchers who opt to use the AODN Portal can expect the ocean data to be delivered through an intuitive interface.
245. Planet OS
Planet OS offers a big data framework with an emphasis on renewable energy. This choice of niche coupled with the site’s proficiency has made it popular with energy companies in search of new ways to visualize and contextualize their data. Additionally, the site also has what’s called the DataHub present where it hosts a substantial collection of over 2,000 datasets. These datasets include open data through NASA and Copernicus and the data is often updated on a regular basis.
The Smithsonian has long been a respected academic name and is appreciated around the world for its commitment to research and history. In some respects it’s only natural that the Smithsonian would have a website that offers some of the best information on volcano research online. The site publishes reports, research links, and databases that include narrowed volcano, eruption, emission, and deformation searches to go with the Holocene volcano list and spreadsheet. Journalists won’t need another source on volcanic activity.
247. Earthquake Catalog
Updated and maintained by the US government’s Earthquake Hazards Program, the Earthquake Catalog allows researchers to see when and where an earthquake has last occurred. While search results are limited to 20,000, the catalog search is capable of filtering results by magnitude, date and time, and even by geographic region. This level of flexibility makes this resource particularly helpful for journalists who are covering a natural disaster or a local earthquake and are looking for some background information.
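For readers who prefer to script these searches, the filters described above can be sketched as a query URL. This is a minimal sketch, not an official client: the endpoint and the parameter names (`starttime`, `endtime`, `minmagnitude`, and the latitude and longitude bounds) are assumptions based on the program's public FDSN event web service and should be checked against the official documentation before use. The helper name `build_query_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Earthquake Hazards Program's event web service.
BASE_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"

# The catalog caps a single search at 20,000 results.
MAX_RESULTS = 20000


def build_query_url(start, end, min_magnitude, bbox=None, limit=MAX_RESULTS):
    """Build a search URL filtered by date range, magnitude, and region.

    start, end: ISO dates such as "2024-01-01".
    bbox: optional (min_lat, max_lat, min_lon, max_lon) tuple.
    """
    if limit > MAX_RESULTS:
        raise ValueError(f"limit must not exceed {MAX_RESULTS}")

    params = {
        "format": "geojson",
        "starttime": start,
        "endtime": end,
        "minmagnitude": min_magnitude,
        "limit": limit,
    }
    if bbox is not None:
        min_lat, max_lat, min_lon, max_lon = bbox
        params.update(
            minlatitude=min_lat,
            maxlatitude=max_lat,
            minlongitude=min_lon,
            maxlongitude=max_lon,
        )
    return BASE_URL + "?" + urlencode(params)


# Example: magnitude 5.0 and above during January 2024, worldwide.
example_url = build_query_url("2024-01-01", "2024-01-31", 5.0)
print(example_url)
```

Pasting the printed URL into a browser, or fetching it with any HTTP client, should return the matching events. Narrowing the date range or raising the minimum magnitude is the simplest way to stay under the 20,000-result cap.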
The American Economic Association provides researchers with macroeconomic data for the US and other countries around the world. This site doesn’t appear to produce economic data so much as it does curate a short list of the most dependable sources. However, in light of the many sites offering economic information on the Internet, this is a resource that journalists can expect to have credibility. Just go to the site and click through the categories of economic data accordingly.
Historicalstatistics.org is an incredibly useful site for finding the type of economic information that presents an interesting contrast to the present. For example, the site’s historical currency converter allows researchers to find out how much a person with $10 USD in 1923 could buy today. It also hosts publications and papers that ask questions about the metrics used in the field of economic history along with price indices and information on money supply that can be filtered by country.
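The arithmetic behind a comparison like this is usually a simple ratio of price index values. The sketch below illustrates that general technique only; it is not the site's own method, and the index numbers are made-up placeholders rather than real statistics. The function name `adjust_for_inflation` is hypothetical.

```python
def adjust_for_inflation(amount, index_then, index_now):
    """Scale an amount by the ratio of two price index values."""
    if index_then <= 0:
        raise ValueError("index_then must be positive")
    return amount * index_now / index_then


# HYPOTHETICAL index values, for illustration only. Real figures
# would come from a published price index series.
HYPOTHETICAL_INDEX = {1923: 100.0, 2023: 1800.0}

value_today = adjust_for_inflation(
    10.0, HYPOTHETICAL_INDEX[1923], HYPOTHETICAL_INDEX[2023]
)
print(f"$10 in 1923 is roughly ${value_today:.2f} in 2023 (placeholder index)")
```

With real index values swapped in, the same ratio works for any pair of years covered by the series.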
250. DB Nomics
What if all the public economic data on the Internet could be accessed and searched from one, single, navigable platform? Db.nomics is an economic database aggregator that seeks to do exactly that. The data is available in formats such as HTML, JSON, and CSV and automatically updates while previous revisions are archived accordingly. Economic sources include the Federal Reserve, the Bureau of Economic Analysis, the International Monetary Fund and others. Researchers looking for reputable economic data can’t go wrong with Db.nomics.
Developed through the combined work of the Bank for International Settlements, the Organization for Economic Cooperation and Development (OECD), the International Monetary Fund, and the World Bank, the Joint External Debt Hub makes information on debt data and international creditors and debtors accessible to the public. Journalists who are looking into the finances of different nations and attempting to get a deeper understanding of the international financial landscape will find virtually everything they want to know here.
Put together with the full input of a leading economic expert in Jon Haveman, this page on International Trade Data hosts data that can be downloaded and further analyzed. The datasets include tariff data, Penn World tables, utilities, import data, manufacturing productivity, goods classifications, Rauch Product Differentiation Codes, NBER data, the 1997 commodity flow survey, trade and immigration, and the useful gravity model. UNIX is the operating system used to compile these, but the site notes that PCs should have access to the data as well.
253. Open Corporates
On OpenCorporates, researchers have the luxury of searching and finding information in one of the largest open databases of companies around the world. This information is then utilized by different groups around the world such as banks, investigators, NGOs, and journalists in their search for intelligence and information. Journalists have the added benefit of being able to access the data in real time with the help of the OpenCorporates API as well as through the bulk core data or other core datasets.
254. Our World in Data
Our World in Data takes information from a number of sources in a variety of areas and presents quantified data on it. From numbers on the participation of women in the workforce to information on general corruption perception in the public sector and global income inequality, if the subject can be discussed in terms of data, this website just might have an entry for it. Journalists can use this source to find statistics and numbers as they relate to social issues.
255. Science Po
Sciences Po, or as it’s known more commonly, the Institute of Political Studies is a school that has undeniable influence in the social sciences. In this case, Thierry Mayer’s page includes data files that feature gravity and military conflicts regressions data from “Make Trade Not War” as well as datasets on market potentials among several others. Journalists looking to better understand the conclusions reached in academic journals will uncover a lot of information while browsing this site.
Ever since making its debut in 1999, the Center for International Data has been dedicated to its mission of collecting, creating, improving, and distributing international economic data both offline and online. On the website, readers can access information such as US tariff data, world and US imports and exports, and even information from the Penn World Table. With this information being made available to the public for education and research, journalists can access and use this data for free.
The Observatory of Economic Complexity, commonly referred to as the OEC, allows researchers, students, economists, and anyone else to visualize international trade data. With its eye-catching themes and interactive interface, this site gives researchers a legitimate opportunity to explore international trade information in ways that have never really been seen before. For journalists who like being able to see economic data come to life as they search for it, the OEC is an invaluable resource.
Higher education is a hot-button topic in many circles, with students and families alike wanting to know how schools stack up and how well students are learning. Through its College Scorecard data, the US Department of Education gives educators and students all of this information and more. These numbers cover 1996 to 2016 and include current data, scorecard data, post-school earnings, and new National Student Loan Data System information. It’s a source of up-to-date post-secondary education data that journalists should be sure to use.
As a dataset that deals primarily with energy, COMBED has an automatic claim to uniqueness. Throw in the fact that its data is renewed multiple times a minute while coming from a commercial building, and it becomes clear that this dataset is one of a kind. For journalists, this information is incredibly useful to have when preparing a piece on energy consumption. Accessing COMBED’s data is as easy as downloading and opening an Excel spreadsheet.
260. DRED Dataset
DRED stands for Dutch Residential Energy Dataset. As the title suggests, this data measures how much energy a single household in the Netherlands consumes. Ambience, occupancy, electricity, and general household information were all monitored in this dataset from July 5th, 2015 to December 5th, 2015. Any journalists researching energy consumption would benefit from checking out the raw numbers provided here. Instructions for downloading the data in CSV can be found directly on the website.
261. ECO Dataset
ECO, which stands for Electricity Consumption and Occupancy, is a project run by the Distributed Systems Group. In this project, researchers monitored the electrical loads and the detected occupancy of six Swiss households over the course of 8 months. This site offers access to that information as well as instructions and links to related publications. Thanks to the site’s visually interactive dashboard, journalists should have no problems translating the research into something engaging.
iAWE, which stands for Indian Dataset for Ambient Water and Energy, was created with the goal of monitoring the energy usage of a New Delhi home with electricity measurements from appliances, the electricity meter, and the circuit panel. Due to outages, differences in water supply, packet drops, and voltage fluctuation, iAWE ran into problems that were unique to tracking energy usage in India. This is incredibly useful data for journalists to have when discussing energy usage patterns.
UK Domestic Appliance-Level Electricity, or UK-DALE, is the name given to a dataset that monitors and records how much power is demanded in a group of five households. Every six seconds, UK-DALE measures the demand from both the main power supply and the individual appliances in the house. Journalists with an interest in seeing how UK households use energy will find UK-DALE useful. The data is accessible and there’s a paper describing the system available for reading.
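For readers who want to turn six-second power readings like these into a familiar energy figure, the arithmetic can be sketched as follows. This is a minimal illustration, not UK-DALE's own tooling: the function name `energy_kwh` and the sample readings are hypothetical, and the only figure taken from the description above is the six-second sampling interval.

```python
# Assumption: each reading is average power in watts over one sample period.
SAMPLE_PERIOD_S = 6  # the dataset is described as taking one reading every six seconds


def energy_kwh(power_watts, sample_period_s=SAMPLE_PERIOD_S):
    """Approximate energy use by treating each reading as constant for one period."""
    joules = sum(p * sample_period_s for p in power_watts)
    return joules / 3_600_000  # 1 kWh = 3.6 million joules


# Hypothetical data: a 1000 W appliance running for one hour (600 six-second samples).
readings = [1000.0] * 600
print(energy_kwh(readings))
```

Summing power multiplied by the sample period is a simple rectangle-rule estimate; gaps or dropped readings in real data would need handling before this kind of total is reported.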
ArcGIS Hub is a platform that organizations and individuals can use to accomplish goals through site-wide initiatives. With page templates, step-by-step guides, and examples available for viewing, this site is an excellent resource for ambitious social movers. Meanwhile, the Open Data tab hosts hundreds of datasets in the “Disaster” category alone. Whether journalists are researching data or contributing it, ArcGIS Hub is useful in more ways than one.
265. Cambridge GIS
Cambridge GIS is the City of Cambridge, Massachusetts’s open data repository. With the exception of the files that are too big to be downloaded through this GitHub repository, most of the city’s datasets can be found on this page. The individual datasets available here include commercial districts, easements, zoning districts, census results, cemeteries, and other landmarks and features that can be quantified by data. A journalist digging for local information will find everything they’re looking for and more on this site.
As a resource, Geo-Wiki is a site that’s dedicated to what it dubs the “citizen science movement”. Here, citizens are encouraged to try their hand at monitoring the environment. Researchers can find the latest news in the sidebar along with the names of the publications and free dataset and software downloads. The tools that the site makes accessible include maps, personal data uploads, validations, and hackathons that can be downloaded in Excel format and zip files.
The OpenStreetMap data extracts come from the OpenStreetMap project, the ongoing online attempt to create a map of the world through the edits and efforts of the global community of users. In order to get started with this data source, all content publishers need to do is choose their preferred continent and then find their preferred country after that point. There are no fees for downloading these extracts from Geofabrik GmbH, and the data is updated daily as a general rule.
268. HIFLD Open Data
Operated and maintained by the Department of Homeland Security, HIFLD, for Homeland Infrastructure Foundation-Level Data, places geospatial data in the public domain. This data is distributed to support research and preparation in the community. It can be downloaded as Shapefile and CSV, and it can also be viewed on the web. For publishers, HIFLD Open Data makes geospatial data more visual and engaging than ever.
OpenAddresses specializes in address infrastructure and collection. Powered largely by the efforts of the community, this site uses GitHub as its development platform. Here, people can place addresses on a map after adding them to the database, or they can take the data and begin geocoding right away. With all of the data and addresses open and requiring mere attribution, the regular data updates and the potential for geocoding advances make OpenAddresses a very interesting project.
270. Open Data LMU
Open Data LMU relies heavily on data from OpenStreetMap to aid in the development of the Fast Reverse Geocoder. What this means is that the application is capable of quickly taking a location on a map and finding a full address based on that point. This could potentially be applied to neighborhoods and counties as well. This web page hosts a bunch of links related to the application that include source code, datasets, and OpenStreetMap lookup tables and resolutions.
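The idea of reverse geocoding, taking a point on a map and returning the closest known place, can be sketched in a few lines. This is only an illustrative brute-force version, not the Fast Reverse Geocoder's actual method: the function names, the `PLACES` list, and its labels and coordinates are all hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def reverse_geocode(lat, lon, places):
    """Return the label of the known place nearest to (lat, lon)."""
    return min(places, key=lambda p: haversine_km(lat, lon, p[1], p[2]))[0]


# Hypothetical lookup table of (label, latitude, longitude) entries.
PLACES = [
    ("Place A", 48.15, 11.58),
    ("Place B", 52.52, 13.40),
    ("Place C", 50.11, 8.68),
]

print(reverse_geocode(48.2, 11.6, PLACES))
```

Checking every entry is fine for a small table; a tool built for speed over millions of addresses would presumably rely on a spatial index rather than a linear scan.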
With the Environmental Data Explorer, journalists, researchers, and students can download and explore the very same datasets that the United Nations Environment Programme uses along with its affiliated organizations and partners. Searches can be narrowed down by region and made using any or all of 500 filters. In addition, the datasets include categories such as health, GDP, climate, emissions, and freshwater that can be viewed directly on the site in graph, table, or map form.
The African Development Bank Group’s site is a journalist’s first stop when looking for statistical information and indicators as they relate to the continent of Africa. Users scrolling through the data catalog can filter datasets by source, topic, and region. For anyone who is looking for deeper knowledge of the subject at hand, the site also offers links to an impressive list of publications that include such titles as the African Economic Outlook and the African Statistical Yearbook.
The NCI’s Genomic Data Commons is home to one of the most thorough cancer data repositories on the web that focuses on the area of cancer genomic studies. This site’s data portal hosts thousands of cases and covers 38 types of disease to go with 39 projects and is free to access. With data submissions being made primarily by institutions and researchers, the accuracy of this information makes it a data source that’s ideal for journalists to have in their back pocket.
274. PhysioBank Databases
The PhysioBank databases make physiological data available to individuals via the public domain. These databases are divided into two larger categories: waveform and clinical. Among the waveform subcategories there are image, interbeat interval, synthetic, gait and balance, ECG, and multi-parameter databases. A taste of the information available here includes bedside vital sign data, oxygen saturation, and even cardiovascular disease. Journalists and individuals researching the human body may find some information here.
The Medicare Coverage database, maintained by the Centers for Medicare & Medicaid Services through the Department of Health and Human Services, offers researchers full access to a ton of information on medical services. There’s information on chronic conditions, drug spending, electronic clinical templates, the debt collection system, and research and demonstration grants to read up on. For data taken directly from the source, this is the most thorough and comprehensive site of its kind online.
276. Open Payments Data
When most people go to the doctor, they typically don’t sit down to think about whether or not their primary care physician is benefiting financially after working with health care manufacturers. Open Payments informs the public about any meals, research, gifts, speaking fees, and travel expenses that the doctor or hospital has received from companies. Journalists in search of a hard-hitting story can either use the data explorer to view the information or click on the tab to download the data directly.
It’s partially written in the name, but FlickrLogos refers to a dataset that consists of company logos that have been photographed in a variety of different positions. Maintained by Augsburg University’s Multimedia Computing and Computer Vision Lab, this collection was originally compiled with the intent of training computers to recognize logos and text. To stay abreast of any progress made with this newsworthy program, it can’t hurt to download this dataset and see what it’s about.
ImageNet is a database full of pictures that have been organized by WordNet. There are annual challenges on the site that can be viewed even after closing and are focused on the creation of algorithms that are able to perform specified tasks. ImageNet is also home to numerous publications, citations, and slides. Tech-oriented content publishers would have every reason to use the explorer option to make sense of the WordNet structure as well as the cloud map.
The Stanford Dogs dataset contains tons of pictures and images of different dog breeds. With 120 different breeds of dogs included along with over 20,000 individual images, this ImageNet-powered database gives researchers plenty of pictures to work with while teaching machines how to recognize each dog breed. On the site, links to different publications discussing the use of datasets to teach computers about image recognition can be clicked on and read along with the dataset download.
280. SUN Database
The SUN database is the site of a project put together for the research community to make strides in areas like computer vision and graphics, data mining, machine learning, and neuroscience among others. Boasting over 131,000 images and almost 4,000 categories of objects in its indexes, this site is as comprehensive as it gets. For publishers who are interested in this database and what researchers have been able to do while using it, this is a data source that’s worth exploring.
The Oxford-IIIT Pet Dataset is a site that acts as a complement to a paper that was published at the 2012 IEEE Conference on Computer Vision and Pattern Recognition and hosts the original dataset that was used for the purposes of the paper. These images have been organized into 37 pet categories, with roughly 200 images associated with every class. Furthermore, this data can be downloaded directly through the links on the web page.
The Visual Genome API is the end result of the hard work done by several students and associate professors from Stanford University. With several papers to its name in the quest to create an API capable of evaluating and describing images, the program has successfully answered over a million questions while evaluating over 100,000 images. This API represents progress in the area of computer science and its related fields and the dataset can be downloaded directly on the site.
283. YouTube Faces
The YouTube Faces Database is focused on developing a solution to the issue of automatic facial recognition in videos. Altogether, the dataset has over 3,000 videos taken from YouTube of almost 1600 individuals at an average length of approximately 181 frames. Ultimately, the goal is to create an algorithm capable of creating labels that identify the person who is in the video. The data along with information on errors as well as the description methods are all available for viewing directly on the site.
The KEEL dataset repository contains the datasets for KEEL, Java-based open source software that’s designed to assist in various types of knowledge data discovery. KEEL, short for Knowledge Extraction based on Evolutionary Learning, offers methods for handling missing values, hybrid models, and statistical methods for evaluating experiments, among a number of other tasks. The datasets as well as a complete listing of the algorithms featured in KEEL can be downloaded directly from the site.
285. Lending Club
The Lending Club’s claim to fame is its status as a peer-to-peer lender that allows borrowers to receive loans even when they don’t necessarily have the credit score to borrow from traditional lenders. Along with the novelty aspects of how the site operates, it also provides statistics that include platform highlights, declined loan information, investor performance numbers, and even a data dictionary that contains historical data. These numbers generally start from 2007 and can be downloaded in CSV.
The Natural History Museum is considered by many to be one of the most recognizable museums in the world, but the digitization of its catalogue and the ability to examine it through the open data portal would probably be news to a lot of people. With 91 datasets that include microfossil and fossil collections as well as index lot records among other materials, the data is open to the public and free to download in multiple formats.
This site is perfect for journalists and publishers who want to stay within certain style guidelines when describing and categorizing certain art, artist names, architecture, materials, and geographic names. This site links users directly to the controlled vocabulary databases that researchers and catalogers need to know about in order to meet international standards. Here at least, there’s no beating the Getty Vocabularies. The datasets can be explored through text or SPARQL and can be downloaded through the site.
The CLiPS Stylometry Investigation Corpus probably isn’t what most people think of when they hear the letters CSI, but the CSI corpus is nonetheless a dataset composed of student reviews and student essays. Besides the text itself, the metadata and information embedded in the documents are noted by the site to have multiple uses. Offered and distributed under a Creative Commons license, all that the corpus asks for in exchange for using the dataset is attribution.
Universal Dependencies v2 refers to the second updated version of the Universal Dependencies project, an effort to develop a treebank annotation that can be used consistently with several different languages. In the updated version researchers will find dozens of UD treebanks for different languages including Afrikaans, ancient Greek, Japanese, Dutch, Finnish, and Chinese on top of English. The newest version of Universal Dependencies can be found and downloaded near the bottom of this web page.
Webhose is a top-notch source of datasets taken directly from the site’s repositories and opened up to the public. Researchers can sort news articles by language, with Arabic, French, and Dutch being just a few of the languages with more than 100,000 articles each. In addition, the English news articles are further broken down into categories like entertainment and sports to go with the review and forum posts. Digital publishers and researchers can benefit from exploring these datasets.
291. Wikidata
Wikidata is an underrated source of content and ideas for publishers and researchers alike. According to this page, there are a number of ways to access the material in the data dumps, although JSON is the format that the site recommends most strongly. All of the data available here can be accessed and downloaded free of charge for both non-commercial and commercial use under the Creative Commons license.
292. Wiki Links
Situated comfortably within the framework of Google Code, Wiki Links is an open source project that seeks to provide individuals with access to that particular, unique dataset. On this web page, researchers are able to download the README texts, data files, and the Creative Commons license altogether. Publishers or generally tech-oriented individuals have a lot to look forward to when looking through this massive dataset. Just navigate through the site, click, and start downloading the files.
WordNet is an English lexicon in which nouns, verbs, adjectives, and adverbs are categorized into distinct groupings that each express a particular idea. The end result is a useful tool that categorizes words by how they’re used and what they mean rather than how they sound when they’re pronounced. The applications of WordNet in linguistic programming are noted along with the numerous publications and statistics available on the site.
294. Allen Brain Atlas
The Allen Brain Atlas, created by the Allen Institute for Brain Science Resources, is a tool for studying and learning more about the human brain and how it responds when the human body is healthy as well as when there is disease. Using the atlas, researchers can learn about the human brain and its development as well as glioblastoma and the effect of cancer on the brain. Journalists covering these topics can visit this site for datasets and information.
The NITRC, or the Neuroimaging Informatics Tools and Resources Clearinghouse, is where journalists and publishers can go for neuroimaging data. Put together and promoted as an initiative for sharing neuroimaging data, this site is home to data from several projects such as the 1000 Connectome Project, the Addiction Connectome Preprocessed Initiative, and the INDI-Prospective and Retrospective projects. Individuals are free to download the data through the website. Neuroimages taken at various stages have never been so accessible.
296. HCP Young Adult
The Human Connectome Project Young Adult project is a continuation of the ongoing effort to create an accurate map of the human connectome as it would be seen in most normal adults. Through two phases, 1200 healthy adults were scanned through a combination of techniques in resting-state fMRI and diffusion imaging. Journalists and publishers in search of information on the brain won’t find another site with more data on the human connectome in healthy young adults.
297. NIMH Data Archive
The NIMH Data Archive, or NDA for short, isn’t so much an independent data source as it is a platform for distributing and storing data. On this website, there’s data that has been collected over the course of multiple papers and research projects as well as the provision of methods and tools that enable better analysis and collaboration. Data summaries are freely available and content providers reporting the latest in science can use this information to break news.
NeuroData is dedicated to conducting research on the unique relationship between the mind and the brain. Thanks to the site’s commitment to open science and reproducible research, content providers have access to a publication and several datasets that can be accessed by following the links on the web page. Of particular interest to those who wish to see the data for themselves is the availability of free code and analysis tools that make exploring NeuroData’s work even more straightforward.
The NeuroElectro Project is designed with the intention of collecting the various electrophysiological characteristics associated with different neuron types and aggregating them into a single database. This project seeks to study the relationship between neurons in an effort to understand the differences between various neuron types. The site links to articles and lists the neuron types and electrophysiology properties discovered so far. Content providers can rely on this site as a source of data on neuron-to-neuron relationships.
The Open Access Series of Imaging Studies, also known as OASIS, is a project that was designed with the goal of making datasets of brain MRIs accessible to the scientific community at large. Journalists and researchers can access publications that compare MRI data between adults as well as a fact sheet from OASIS’s comprehensive paper comparing and contrasting results from over 400 subjects. The information and tools can be downloaded from the website in several formats.
For journalists who want to access MRI datasets without any of the hassle associated with some other sites, OpenfMRI.org’s focus on making MRI datasets accessible to the public is a positive development. Coming directly from the researchers themselves, this site hosts a variety of datasets such as the classification learning dataset, the mixed-gambles task, and the balloon analog risk-taking task. The variety and quantity of data make it possible for researchers to find new avenues of inquiry through this site.
Taking its name from the famous movie Forrest Gump, studyforrest seeks to understand what the brain is capable of when it has to perform at a higher level while contending with natural but equally complex stimulation. Even so, the site acknowledges that the amount of fMRI data collected from these studies has broader applications than it would seem at first. Content providers can browse through the 19 publications that have utilized studyforrest data and can even access the data directly.
As would naturally be expected from the title, the Crystallography Open Database is a collection of 385,697 crystal structures of organic, inorganic, and metal-organic compounds and minerals, with the notable exception of biopolymers. Content providers looking to learn more about crystal structures can search by structural formula or run a matched search query with the option to browse. There’s also software and data on this site that make it especially valuable in the field of chemistry.
Long considered one of the premier sources of information on outer space, NASA continues its tradition of being an invaluable resource with its Exoplanet Archive. This site hosts a series of interactive tools and software such as the Transit and Ephemeris Service, the Periodogram, the Confirmed Planets Plotting Tool, and the ability to interactively upload files and search tables. Content providers searching for unique insights can turn to this data the next time exoplanets happen to be making big news.
Anyone can create three-dimensional maps of the Universe with the help of the Sloan Digital Sky Survey, or SDSS. Reporters and content providers alike can access algorithms, imaging data, datasets, tutorials, and visual materials developed for the purposes of educating the public both formally and informally. The site also explicitly discusses making its data accessible to the public via news and social media. This is an invaluable visual tool for content providers.
Statsci.org offers a comprehensive list of resources that the public can access and make use of depending on their particular needs. Some of the information includes the Electronic Dataset Service and case studies compiled by UCLA. Along with the raw datasets, there are also textbooks linked to on this page. These include titles such as the Handbook of Small Data Sets and Case Studies in Biometry. Content providers in search of statistical data can’t go wrong giving this a look.
ERIC, the Education Resources Information Center sponsored by the Institute of Education Sciences, is a resource that acts as a search engine for anyone who is looking for information on the field of education. The preliminary search even provides the option of filtering exclusively for peer-reviewed information as well as for ERIC-based full texts. In light of how often education budgets and teaching methods seem to be in the news, this is a data source that journalists should keep in mind.
Created shortly after the conclusion of World War II, the NTIS (National Technical Information Service) was formed with the goal of helping federal agencies make informed, data-based decisions. This agency was originally the US government’s data repository in the area of scientific research. Today, the site hosts millions of publications on a myriad of subjects. The historical information alone makes this site well deserving of a place as a data source for journalists and publishers.
The website of the ODI (Open Data Institute) is home to what’s called the Open Data Certificate, which is a free tool available online that was developed for the express purpose of critically examining and recognizing the quality of open data. From the perspective of a publisher or a journalist, the site also hosts numerous datasets on subjects ranging from lists of grants to allergy alerts along with lists that can be downloaded in CSV format.
310. GitHub Archive
GitHub is easily one of the most popular and well-known data repositories and archives on the Internet due to its ease of collaboration, archive capacities, and accessibility where coding is concerned. Whether it involves attempts to create bots that can perform certain tasks or developing applications, GitHub is a site where content publishers and journalists can easily stumble upon potentially newsworthy products. The archive can be accessed by following the tutorial instructions for either JSON or BigQuery.
SocioPatterns is a project that’s focused on finding the patterns in human activity and social dynamics through data. As is expected with such a broad stated goal, the site’s information has been utilized in publications addressing a range of subjects from the spread of disease to case studies on the differences between the online and offline personas of individuals. The datasets are available for viewing as are the published papers containing information that publishers will likely find relevant at one point or another.
312. Indie Map
Indie Map is the result of taking information from over 2,000 of IndieWeb’s busiest sites and making the data available in forms such as a Social Graph API, a dataset with SQL query capabilities, and the raw information that was crawled from a total of 5.7 million web pages. Digital publishers interested in open source software and what this data might say about these online communities can access the information directly from the website.
Simon Fraser continues to build on its status as a reputable university with the availability of its dataset concerning the “Statistics and Social Network of YouTube Videos”. Drawing information from a crawler that used YouTube’s API to find videos, the files contain data on millions of videos and user information datasets. The site specifies that dataset downloads are for academic purposes only, but it may be possible to find journalistic sources and references on this project through the site.
ACLED, or the Armed Conflict Location and Event Data Project, offers public data concerning protest and political violence in the developing world. The information given here includes numbers on fatalities, information on the dates and places of the violence or protest, names of the relevant groups, and data on riots and violent clashes that have occurred. ACLED provides access to regular reports, publications, and visuals as well as to data that’s available for download on the site.
The Canadian Legal Information Institute, or CanLII as it’s commonly called, is a site that provides free access to statutes and their regulations, case law from courts of various jurisdictions including the Supreme Court of Canada, the provincial and appellate courts, and Queen’s Bench, along with rulings from various administrative tribunals and statutorily-created bodies. With 301 case law databases and over 140,000 court decisions available for viewing, journalists and digital publishers who are covering legal topics would benefit from bookmarking this site.
The Center for Systemic Peace, or CSP, is an organization that has dedicated itself to analyzing global systems for the purpose of addressing the issue of political violence. On this site, journalists who are doing research in this area will have access to analysis on conflict in Africa and global conflict trends, among other topics, along with summaries of the organization’s primary publications, such as Third World War and CSP’s Global Report, and its Virtual Library.
The focus of the Correlates of War, or COW, project is to aid in the gathering and distribution of quantitative data in the area of international relations. In keeping with its commitment to applying scientific principles to international relations data, COW makes its datasets freely accessible to the public. These databases contain numbers that journalists and researchers can use on subjects such as militarized conflicts between states, state system members, national material capabilities, and formal alliances.
The European Social Survey, or the ESS, is a survey that’s conducted across Europe with the goal of measuring the various behavior patterns, attitudes, and beliefs of different populations in various nations. Since journalists are often interested in getting the public’s opinion, the ESS Topline Series, which covers subjects that range from the personal and social well-being of Europeans to attitudes towards welfare and even the presence of ageism in the UK, can bolster a story in more ways than one.
319. Fund for Peace
The Fund for Peace is an organization that focuses on preventing conflict and promoting security through the development of tools that can be used to mitigate conflict. Over the course of the FFP’s history, it has worked in partnership with journalists, NGOs, local organizations and their international counterparts, as well as governments. Whether looking for trends, comparative analysis, or global data, journalists and publishers can find the materials they want through the site’s in-browser data exploration tool.
The work of the General Social Survey (GSS) focuses on gathering information on various aspects of modern American society as a means of keeping abreast of attitudinal and behavioral trends and patterns in the population. With this practice of trend-tracking going back to 1972, the historical data alone is a goldmine for journalists who want to explore trends. Using the GSS Data Explorer, researchers and journalists are able to download, examine, and even evaluate data.
GESIS is a German infrastructure institute that’s dedicated to the social sciences. It offers research and services across the social sciences, ranging from survey methodology to applied computer science, data collection, study planning, and data analysis, to name some of what GESIS does. Journalists who are interested in subjects like the GESIS approach to methodology, the utilization of over-qualified immigrants, and more can find a lot of information in the publication section.
From abortion to sex education, religion is so pervasive that even in largely secular countries a person’s position on difficult social subjects can be influenced by religious belief. In Global Religious Futures, Pew Research Center examines trends in people’s attitudes and beliefs as they relate to global religions. Journalists who are looking to examine details like the influence of Evangelicalism in politics or attitudes towards stoning can use the Data Explorer to find answers in the numbers.
The Index for Risk Management, also known as INFORM, is a place that researchers and journalists can go to for risk assessments in situations where there’s a risk of a humanitarian crisis or disaster. The organization offers links to data on topics such as child mortality rates, the gender inequality index, and drought frequencies among other numbers. It’s also possible to see and download INFORM’s data or to access it through the site’s interactive map.
INED, or the French Institute for Demographic Studies, is one of the most prolific sources of data and statistics that a journalist can find on the Internet. It researches and publishes over 70 publications each year, alongside graphs of the world population, statistics on questions such as fertility differences between the sexes, news, and resources on methods. Researchers in pursuit of accurate findings from a reputable source will find them with INED.
Princeton’s International Networks Archive offers a unique combination of publications, with research like the Human Development Report 1999 and Global Networks: A Journal of Transnational Affairs available for reading on the site. Journalists will be able to pore over the archive’s comprehensive public data, both historical and up to date. Healthcare, arms, books, music, migration, regions, Internet, politics, and transportation are just a sample of the subjects that researchers can peruse here.
Founded as a means of collaboration between people of different nations, the International Social Survey Programme (ISSP) has conducted surveys each year on a variety of subjects that are important in the area of social sciences. Topics over the years have included social inequality, national identity, citizenship, social networks, and work orientation among other matters. The ISSP’s findings can be searched by year or by topic and can be downloaded accordingly on the website.
For journalists or other researchers who are looking for informative, intellectual discussion of transnational, international, and global matters, the International Studies Association (ISA) brings together the expertise of researchers, academics, and policy experts, among others. The ISA’s Encyclopedia of International Studies features peer-reviewed essays that are full of in-depth discussion on topics concerning research in this field as well as essays that provide information from a more historical perspective.
Wesleyan University Professor James W. McGuire’s incredibly useful page, appropriately titled Cross-National Data on the Web, is a resource page full of links to relevant economic and global development data. Among the datasets journalists will be able to discover here are data on family planning, educational achievement, undernourishment, water and sanitation, free-market orientation, and information specific to Latin America and the Caribbean, to name just a few from the list. The data sources listed here are impeccable.
The Norwegian Center for Research Data is an institution that supports and aids researchers in different areas of performing empirical research such as privacy, data collection and analysis, methodology evaluation, and ethics in research. Here, researchers will find software and tools developed and recommended by the center to go with an extensive collection of regional, individual, and institutional data that can be accessed for free. The center’s findings in the Research and Privacy Annual Report are also always an interesting read.
IPUMS isn’t necessarily a data source in the sense that most people would expect upon hearing the term, but it’s nonetheless a useful source for journalists because of its role as an aggregator, archiver, and organizer of the data that other entities provide it. Case in point, IPUMS USA acts as a data repository of sorts for US census microdata, with data going as far back as 1790 and dating all the way to the present.
The ND-GAIN Country Index, an initiative that was arranged by the University of Notre Dame Global Adaptation Initiative, measures a country’s resilience to climate change and other forces of globalization. This index includes vulnerability scores in areas like food, health, infrastructure, and ecosystems as well as readiness scores for as many as 500,000 data points. Containing two decades of data from the years 1995 through 2014 in the form of CSV files, this information is available for download.
332. Police UK
At this site, journalists and publishers can access open data concerning the state of policing and crime in the United Kingdom. The data contains useful statistics on neighborhood teams, individual police forces, stop and search numbers, as well as statistics on crime and outcomes. Using this site, journalists can run comparisons between police forces, and spot trends in criminal justice. Getting information is as simple as choosing the date range, choosing the police force, and then waiting for the file.
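Once a file has been downloaded, a first pass at spotting trends can be as simple as counting records per crime type. The following is a minimal sketch in Python, assuming a CSV export with `Month` and `Crime type` columns. Those column names and the sample rows are illustrative assumptions, so check them against the header of the actual file.

```python
import csv
import io
from collections import Counter

# Illustrative sample standing in for a downloaded file. The column
# names and rows are assumptions, not real police records.
SAMPLE_CSV = """Month,Crime type
2017-09,Burglary
2017-09,Vehicle crime
2017-10,Burglary
2017-10,Burglary
"""


def count_by_crime_type(csv_file):
    """Count rows per crime type in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Crime type"] for row in reader)


counts = count_by_crime_type(io.StringIO(SAMPLE_CSV))
for crime_type, total in counts.most_common():
    print(f"{crime_type}: {total}")
```

For a real download, the `io.StringIO(...)` call would be replaced with something like `open(path, newline="")`, where `path` points at the saved file.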
Paul Hensel’s General International Data Page is a series of links that are grouped under the headings States and the International System, International Geographic Data, State Capabilities, Social Science Data Collections, and Alliances, Treaties, and Organizations. Each resource listed on this web page contains state of the art data that will automatically give credibility to a journalist’s work. These sources can include anything from software to datasets and archives, but every link included here is useful.
In the post-9/11 world, terrorism and its devastating effects on local populations have gotten a lot of attention in the media. TRAC, at trackingterrorism.org, provides researchers with extensive analysis and information concerning these subjects. This site contains information on several thousand different terrorist groups. The single-user price of $500 may be a little steep up front but is well worth it for those who are writing on violence and the war on terror.
Interested in the inner workings of the Texas Criminal Justice system? Curious to find out about who is currently on death row? The State of Texas’s Department of Criminal Justice has plenty of information, from the last statements of prisoners before execution to death row statistics by gender and race, as well as further execution statistics and factsheets. The Texas Department of Criminal Justice is as reliable a source for journalists as it gets.
The integrated Civil Society Organizations System, or iCSO, is designed to make it easier for civil society organizations to communicate with the Department of Economic and Social Affairs. In addition to the effectiveness of the robust framework, the web page includes datasets and categories for further information. The data can be sorted by sector, the type of organization, the region involved, and the organization’s ECOSOC status. This is a data source that journalists can definitely use for finding sources.
Universities Worldwide is a database of universities around the globe that can be searched for further information. The search can be made by world listing or filtered exclusively to United States universities, and users are also able to add their own links in the process as well. Data publishers with a sizable student contingent, or even journalists who are looking to verify a fact, can all benefit from being able to access a database like this one on-demand.
This is the website for the Uppsala Conflict Data Program, which is one of the most well-known providers of information concerning organized violence. Over the course of the program’s last 40 years, it has also established itself through its work in collecting data on civil war. Journalists can use this data source to search for information on specific conflicts and the actors in those conflicts, and can also download the data.
339. World Pop
The WorldPop project, the end result of merging the AsiaPop, AfriPop, and AmeriPop projects, is dedicated to the archiving of spatial demographic datasets, which in turn have applications in supporting disaster relief efforts. Content publishers and journalists who are involved with social justice causes or who are otherwise researching efficient disaster relief opportunities would stand to be interested in this project. Researchers can download the data or review the case studies online.
340. Draft Express
DraftExpress is perhaps best known on the Internet for its research, pre-draft scouting reports, mock draft picks, and its meticulous maintenance of player heights and measurements to go with its historical data. The prospects mentioned on this site hail from the NCAA, high school, and even international leagues. Sports journalists or content providers who are intending to offer basketball-related commentary can turn to DraftExpress statistics while discussing players and events as they are occurring within the sport.
Betfair is perhaps best known as the site to go to for sports bets. Of interest to content providers and digital publishers in particular, however, is the availability of detailed historical information on the site’s pricing data and history. The data can be accessed and downloaded with or without the detailed time-stamp, and it includes extensive data on horseracing and the site’s other market offerings from June 2004 to October 2017.
Cricsheet offers a similar service to a number of other sports data hubs, but it’s a site that specializes exclusively in providing cricket data. The site offers stats and ball-by-ball data from a number of leagues including the Indian Premier League, one-day internationals, as well as numbers for men’s and women’s teams to name just a few of the larger categories. Content publishers in search of historical cricket data can download the data in either CSV or XML format.
With political, economic, legislative, and domestic conflict data spanning over 200 years and over 200 nations around the world, the Cross-National Time-Series Data Archive is one of the most comprehensive datasets on the Internet. The data is stored, most conveniently, in a Google Drive sheet that opens automatically upon clicking on the category of the file. In exchange for a citation, this information can be viewed in part or in whole depending on the researcher’s needs.
344. Ford GoBike
Ford GoBike is the name of the bike share program being used in the Bay Area. While many people are undoubtedly using this program to stay fit and avoid using fossil fuels, the bike share system has been keeping track of the trip data. Of course, this information doesn’t include anything that could identify the riders, but details like the bike number, the start time, end time, rider type, and arrival and departure stations are included in the data. That’s worth exploring.
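As one way to start exploring trip data of this kind, the sketch below computes the average trip length for each rider type. It is a minimal Python example under stated assumptions: the column names (`bike_id`, `start_time`, `end_time`, `user_type`, and the station fields), the timestamp format, and the sample rows are all hypothetical and should be adjusted to match the published files.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows; column names and values are assumptions
# made for illustration, not taken from the real trip files.
SAMPLE_TRIPS = """bike_id,start_time,end_time,user_type,start_station,end_station
101,2017-10-01 08:00:00,2017-10-01 08:15:00,Subscriber,A,B
102,2017-10-01 09:00:00,2017-10-01 09:45:00,Customer,B,C
101,2017-10-01 17:30:00,2017-10-01 17:55:00,Subscriber,B,A
"""

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def average_minutes_by_rider_type(csv_file):
    """Return the mean trip duration in minutes for each rider type."""
    durations = defaultdict(list)
    for row in csv.DictReader(csv_file):
        start = datetime.strptime(row["start_time"], TIME_FORMAT)
        end = datetime.strptime(row["end_time"], TIME_FORMAT)
        minutes = (end - start).total_seconds() / 60
        durations[row["user_type"]].append(minutes)
    return {
        rider_type: sum(values) / len(values)
        for rider_type, values in durations.items()
    }


averages = average_minutes_by_rider_type(io.StringIO(SAMPLE_TRIPS))
for rider_type, minutes in sorted(averages.items()):
    print(f"{rider_type}: {minutes:.1f} minutes on average")
```

The same grouping approach could be pointed at departure or arrival stations instead of rider type to see which stations are busiest.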
345. Marine Traffic
Marine Traffic is a company that traces and keeps track of the movements of vessels and ocean trips using big data. The information covered through Marine Traffic’s AIS API services includes vessels, information on their voyages, and related data such as expected arrivals, incidents, photos, vessel particulars, and voyage forecasts. There are pricing plans on this site, however, so it’s unclear how much research can be done for free.
Bixi bike share programs are perhaps some of the most well-known bike share programs in some of the biggest cities in North America. Interestingly enough, the brand also releases open data that provides information on things like trip history and station status as well as comparisons that could be made between members and occasional users. Local journalists who are looking into how individuals are using and fitting bike share programs into their lives have every reason to jump into this data.
347. Accident Database
From Amelia Earhart to Indonesia’s AirAsia Flight 8501, flight and airplane accidents are a topic that attracts people’s attention. The Accident Database archives and stores data on aviation accidents that occurred between the years 1920 and 2017. Accidents counted in this database include civil and military airship accidents, accidents that involved the deaths of celebrities or someone famous, helicopter accidents with 10 or more deaths, and scheduled and non-scheduled passenger flights that ended in death.
348. Transport for London
Transport for London is the government body responsible for overseeing public transportation in the Greater London area. There are Tube and rail maps available on the website along with a trip planning guide. In addition to all of these practical services, the site also hosts a lot of open data covering cycling, air quality, the Tube, and even Oyster cards. Anyone with an interest in seeing how residents of Greater London are using public transport can benefit from having access to this data.
CMAP is responsible for doing regional planning and organization in the DuPage, Lake, McHenry, Kendall, Kane, Cook, and Will counties in Illinois. As can be seen from the website, CMAP’s responsibilities extend to addressing issues like community development, taxes and economic indicators, and even roads and transit. This is why the open data concerning areas like regional indicators and travel are useful to a journalist trying to understand the big picture where the region’s future is concerned.
Brought about thanks to a collaboration between the Bureau of Transportation Statistics and the Federal Highway Administration, the Freight Analysis Framework collects data that is then used to assess the general health and performance of the freight system. This software collects information on details like commodity type and tonnage as compared to the departure and arrival stations and that data has in turn been made available for the public to access and download either in full or in summary form.
351. Mozilla Science
Mozilla Science is an open-source, open-practice, collaborative software initiative that’s there to aid in the development and distribution of different data sources and research findings. Transparency interests aside, the decision to open up programs and crowdsource the refinement process makes it easier to improve upon the programs already there. With projects available in a variety of fields including life sciences and medicine among others, there are software solutions here that may represent newsworthy progress in the field.
352. Cool Datasets
The attraction of Cool Datasets is apparent from the name of the site. On this page, the datasets fall under six general categories: government, entertainment, science, user submissions, miscellaneous, and machine learning. Journalists who would like to explore the data and mine for stories stand to gain the most from checking out what this website has to offer. There’s an option to explore the datasets and, if possible, to contribute datasets as well.
353. Open Data Monitor
OpenDataMonitor is a platform that takes public datasets and presents them in a way that’s more intuitive and easy for individuals to follow. Researchers can go to the platform to see a summary of what the open data resources are capable of, and they’ll be able to see the existing data presented to them in a more visually engaging fashion. The site explains its methodology and benchmarks, and publishers should have no problems finding data that’s worth publishing.
Crunchbase is perhaps best known for its emphasis on statistics and its commitment to getting its readers access to the best data available. Business-minded individuals come to this site every day to learn about the latest trends in investment and industry. Here, journalists and publishers will find news and fresh discussion of the latest business trends. Meanwhile, as part of its paid option, the site also hosts extensive datasets that can be analyzed using Crunchbase’s software tools.
Index is a platform with a unique selling proposition because it has something to offer everyone from startups and investors to analysts and corporations. The site also hosts information on over 100,000 companies in the tech sector while simultaneously offering users the ability to sort, build, and export spreadsheets. Publishers and journalists skimming headlines for potential story angles can get ahead of business news through this site. Index may be unusual, but there’s no question that it’s a useful resource.
SEMrush prides itself on being one of the most well-established search intelligence tools available to online marketers. Between the academy and the webinars, researchers have every opportunity to learn the fundamentals. However, the blog and news sections contain enough information that publishers and journalists can easily stay on top of all the latest news in online advertising and SEO. SEMrush’s services do come at a price, but there’s plenty of quality information that they provide for free.
Ahrefs is perhaps best summed up as a suite of marketing tools that are potentially useful to anyone who is publishing content online. The site offers a combination of services such as content research, web monitoring, keyword research, and backlink research to help users reverse engineer the success of competitors. Probably most useful to online publishers in the grand scheme of things, the Ahrefs blog alone represents excellent value for those who like their marketing done with a data-based slant.
358. Angel List
AngelList, with its cleverly chosen name, is basically two parts Craigslist and one part LinkedIn with its emphasis on bringing investors, job-seekers, and startups together in one place. Along with this interesting site concept, there are plenty of opportunities for enterprising journalists to discover the hottest startups and the newest investment trends before they become mainstream. With its straightforward interface and its strong business orientation, this site is useful for professionals in more ways than one.
In pretty much all sectors, a business acquisition can change an entire industry virtually overnight. Acquired is a site that takes on the task of keeping members of the public informed when an acquisition has been made in the technology sector. Full of statistics that can be viewed on the site to go with the ability to filter searches, journalists who write about technology stand to gain the most from making Acquired a regular part of their daily web browsing.
Mattermark is a paid service that makes life easier for company decision-makers by producing quality customer lists that take both companies and their key employees into account. Providing comprehensive company profiles, flexibility with its API, and even export capabilities for the purposes of making updates, Mattermark pulls out all the stops. Businesses that want to target their outreach better while also tracking the results of their campaigns stand to gain quite a bit from signing up for this.
FintechStartupsCo serves as a type of aggregator that keeps track of how much different companies have been able to raise in their IPOs. Sporting a minimalist design that switches between the “startups” and the “news” tabs, this is a quick and easy way for journalists to stay on top of breaking fundraising news. Meanwhile, publishers searching for fast content and quick stories also stand to gain from taking another look at what this site has to offer.
Just in case the name didn’t give it all away, DataFox is a company that strives to give its business clients the information they need to maximize their CRMs and to generally make data-driven decisions. Offering services that include conference and company intelligence, APIs, and company signals, this company is effectively a one-stop shop for businesses that are revamping their sales and outreach work. Digital publishers and content providers may want to check it out as well.
OpenSpending is a free platform, accessible virtually anywhere in the world, that allows users to search and examine financial data in the public domain. For non-hacking members of society, this is a powerful tool for visualizing and analyzing that data. Journalists, in particular, can use OpenSpending to find interesting new insights and pursue story angles, as the company suggests directly on its site. As an added bonus, journalists are among the professions specifically requested on the forum.
364. ESPN Sports API
Not content with just being the most popular sports channel on cable, ESPN is expanding its influence into coding and APIs. In the Developer Center, the site offers publishers their pick of several APIs that include research notes, power rankings, draft picks, calendars, and headlines. There’s even an API that loads athlete profiles, biographies, and statistics in all of the major sports. Journalists who are preparing to write a story with sports content can only benefit from browsing through these.
365. Sports Reference
For number-crunching sports lovers who like their advanced statistics and resources all in one place, Sports Reference is one of the best sites on the web. With historical data that includes team and player statistics on offense and defense, sports researchers can go as broad or as narrow as they want with this site. In addition, there are numbers for virtually every major sport from basketball to baseball and separate data for different college sports as well.
The aptly named “Million Songs Collection” accounts for a full 28 datasets worth of metadata and information on the audio features of exactly one million songs. Largely the result of Columbia University’s LabROSA work alongside the Echo Nest, this information is accessible and hosted on Amazon’s AWS system. Users can run searches for the information through Infochimps which makes it even more accessible for journalists in search of obscure trivia as well as content publishers.