What is Good Data and Where Do You Find It?

  • Bad data is worse than no data at all.
  • What is “good” data and where do you find it?
  • Best practices for data analysis.

There’s no such thing as perfectly good data, but there are a number of factors that qualify data as good [1]:

  • It’s readable and well-documented.
  • It’s available. For instance, it’s accessible through a trusted digital repository.
  • The data is tidy and re-usable by others, with a focus on ease of (re-)executability and reliance on deterministically obtained results [2].

Following a few best practices will ensure that any data you collect and analyze is as good as it can get.

1. Collect Data Carefully

Good data sets will include flaws, and these flaws should be readily apparent. For instance, a reliable data set will have any errors or limitations clearly noted. However, it’s really up to you, the analyst, to make an educated decision about the quality of the data once you have it in hand. Use the same due diligence you would take in making a major purchase: once you’ve found your “perfect” data set, perform additional web searches with the goal of uncovering any flaws.

Some key questions to consider [3], with a quick first-pass check sketched after the list:

  • Where did the numbers come from? What do they mean?
  • How was the data collected?
  • Is the data current?
  • How accurate is the data?
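None of these questions require sophisticated tooling to start answering. Below is a minimal first-pass sketch in Python using pandas; the filename “your_dataset.csv” and the “date” column are placeholders for whatever data set you are evaluating.

```python
# A quick first pass at the questions above, assuming the data arrived as a CSV.
# The filename and column names are placeholders, not a specific data set.
import pandas as pd

df = pd.read_csv("your_dataset.csv")

# How was the data collected, and what do the columns mean?
# Start by inspecting the structure you actually received.
df.info()                                       # column names, types, non-null counts

# How accurate is the data? Missing values and duplicates are the cheapest checks.
print(df.isna().sum())                          # missing values per column
print(df.duplicated().sum(), "duplicate rows")  # accidental repeats

# Is the data current? Check the span of dates it actually covers.
dates = pd.to_datetime(df["date"], errors="coerce")
print("Date range:", dates.min(), "to", dates.max())
```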

Three good sources to collect data from

US Census Bureau

U.S. Census Bureau data is available to anyone free of charge. To download a CSV file (a sketch for reading the file into Python follows the steps):

  • Go to data.census.gov [4]
  • Search for the topic you’re interested in.
  • Select the “Download” button.
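Once the CSV is downloaded, loading it for analysis is straightforward. The sketch below assumes a pandas workflow and a file named “census_table.csv” (a placeholder); Census exports sometimes include a second, human-readable header row, so check your particular file before skipping rows.

```python
# A minimal sketch for reading a table downloaded from data.census.gov.
# The filename is a placeholder; the skiprows adjustment is only needed if
# your export includes a second, descriptive header row.
import pandas as pd

census = pd.read_csv("census_table.csv", skiprows=[1])

print(census.shape)    # rows x columns
print(census.head())   # peek at the first few records
```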

The range of excellent data held by the Census Bureau is staggering. For instance, I typed “Institutional” to bring up the population in institutional facilities by sex and age, while data scientist Emily Kubiceka used U.S. Census Bureau data to compare hearing and deaf Americans [5].

Data.gov

Data.gov [6] contains data from many different US government agencies, covering topics such as climate, food safety, and government budgets. There’s a staggering amount of information to be gleaned. As an example, I found 40,261 datasets for “covid-19” (the catalog can also be searched programmatically, as sketched after the list), including:

  • Louisville Metro Government estimated expenditures related to COVID-19. 
  • State of Connecticut statistics for Connecticut correctional facilities.
  • Locations offering COVID-19 testing in Chicago.
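If you would rather search from code than from the website, Data.gov’s catalog is built on CKAN, which exposes a package_search endpoint. The sketch below is a minimal example; the result count you see will differ from the figures quoted above.

```python
# Minimal sketch of querying the Data.gov CKAN catalog for "covid-19" datasets.
import requests

response = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={"q": "covid-19", "rows": 5},
    timeout=30,
)
response.raise_for_status()
result = response.json()["result"]

print("Total matching datasets:", result["count"])
for package in result["results"]:
    print("-", package["title"])
```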

Kaggle

Kaggle [7] is a vast repository for public and private data. It’s where you’ll find data from the University of California, Irvine’s Machine Learning Repository, data on the Zika virus outbreak, and even data on people attempting to buy firearms. Unlike the federal government websites listed above, you’ll need to check the license information for re-use of a particular dataset. Plus, not all data sets are wholly reliable: check your sources carefully before use.
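Kaggle also offers an official API (the `kaggle` Python package) for searching and downloading datasets. The sketch below assumes an API token is configured in ~/.kaggle/kaggle.json; the dataset reference "cdc/zika-virus-epidemic" is an illustrative placeholder, not a verified slug, so search first and confirm the license before downloading.

```python
# A minimal sketch using the official kaggle package (pip install kaggle).
# Assumes an API token is configured in ~/.kaggle/kaggle.json.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

# Search public datasets mentioning the Zika virus; each result prints as an
# owner/slug reference you can pass to the download call.
for dataset in api.dataset_list(search="zika"):
    print(dataset)

# Download and unzip one dataset by reference (placeholder slug shown here).
api.dataset_download_files("cdc/zika-virus-epidemic", path="data/", unzip=True)
```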

2. Analyze with Care

So, you’ve found the perfect data set, and you’ve checked it to make sure it’s not riddled with flaws. Your analysis is going to be passed along to many people, most (or all) of whom aren’t mind readers. They may not know what steps you took in analyzing your data, so make sure your steps are clear with the following best practices [3] (several of them are illustrated in the sketch after the list):

  • Don’t use X, Y or Z for variable names or objects. Do use descriptive names like “2020 prison population” or “Number of ice creams sold.”
  • Don’t guess which models fit. Do perform exploratory data analysis, check residuals, and validate your results with out-of-sample testing when possible.
  • Don’t create visual puzzles. Do create well-scaled and well-labeled graphs with appropriate titles and labels. Other tips [8]: use readable fonts, small and neat legends, and avoid overlapping text.
  • Don’t assume that regression is a magic tool. Do check for linearity and normality, transforming variables if necessary.
  • Don’t pass along a model unless you know exactly what it means. Do be able to explain the logic behind the model, along with any assumptions made.
  • Don’t omit uncertainty. Do report your standard errors and confidence intervals.
  • Don’t delete your modeling scratch paper. Do leave a paper trail, like annotated files, for others to follow. Your successor (once you’ve moved along to greener pastures) will thank you.
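Several of these practices can be seen together in one small sketch: descriptive variable names, a residual check instead of blind trust in the fit, and reported confidence intervals rather than bare point estimates. The file “ice_cream_sales.csv” and the column names are illustrative, and the sketch assumes pandas, statsmodels, and matplotlib.

```python
# A minimal sketch of the practices above, under the stated assumptions.
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

sales = pd.read_csv("ice_cream_sales.csv")   # placeholder filename

# Descriptive names instead of X and Y.
daily_high_temp = sm.add_constant(sales["daily_high_temp"])
ice_creams_sold = sales["ice_creams_sold"]

model = sm.OLS(ice_creams_sold, daily_high_temp).fit()

# Report uncertainty: standard errors and confidence intervals, not just point estimates.
print(model.summary())
print(model.conf_int(alpha=0.05))

# Check residuals rather than assuming the linear fit is adequate.
fig, ax = plt.subplots()
ax.scatter(model.fittedvalues, model.resid)
ax.axhline(0, linestyle="--")
ax.set_xlabel("Fitted values (ice creams sold)")
ax.set_ylabel("Residuals")
ax.set_title("Residuals vs. fitted values")
plt.show()
```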

3. Don’t Be the Weak Link in the Chain

Bad data doesn’t appear from nowhere. The data set you started with was created by someone, probably several people, at several different stages. If they too have followed these best practices, then the result may well be a useful piece of data analysis. But if you introduce error, and fail to account for it, those errors are going to be compounded as the data gets passed along.

References

Data set image: Pro8055, CC BY-SA 4.0, via Wikimedia Commons

[1] Message of the day

[2] Learning from reproducing computational results: introducing three principles and the Reproduction Package

[3] How to avoid trouble: principles of good data analysis

[4] United States Census Bureau

[5] Better data lead to better forecasts

[6] Data.gov

[7] Kaggle

[8] Twenty tips for good graphics