Using Existing Data

The rise of the Internet has coincided with an unprecedented expansion in the generation, storage, and collection of data. The Internet contains large amounts of historical records: both archived events of the physical world (e.g. news stories) and human behaviors taking place on the Internet itself (e.g. social network behaviors). This collation of data in one place makes found data accessible to individuals in ways never before imaginable.

However, this accessibility comes with concerns and responsibilities: using ‘found data’ often obscures the origins and context of that data. Do the individuals in the dataset consent to how the data are being used? Do the unknown origins and mechanisms behind the dataset invalidate the conclusions being drawn from it?

Is it ethical to use a dataset that contains individuals who don’t consent to its use? The records contained in datasets represent or describe humans and human activity. It’s important to recognize and respect this reality.

Example: For an analysis of a famous example of this, the Pima Indians Diabetes dataset, see article.

Example: Datasets of convenience are often used to build models. However, correct conclusions and accurate predictions require a representative sample of the population in question. Being skeptical of a dataset’s origins helps prevent poor conclusions from being drawn from it. This problem often occurs when data scientists use someone else’s model to draw conclusions. For example, see article.
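One simple way to act on this skepticism is to compare the makeup of a found dataset against reference figures for the population of interest. The sketch below is only an illustration: the function name `proportion_gaps`, the age groups, and every number in it are hypothetical, and a real check would rely on trustworthy reference figures (e.g. census data) and a proper statistical test.

```python
from collections import Counter

def proportion_gaps(sample_labels, population_shares):
    """Return, for each group, the sample share minus the reference population share."""
    if not sample_labels:
        raise ValueError("sample_labels must not be empty")
    counts = Counter(sample_labels)
    total = len(sample_labels)
    gaps = {}
    for group, expected_share in population_shares.items():
        observed_share = counts.get(group, 0) / total
        gaps[group] = observed_share - expected_share
    return gaps

# Hypothetical found dataset: the age group of each of 100 records.
sample = ["18-34"] * 70 + ["35-64"] * 25 + ["65+"] * 5

# Hypothetical reference shares for the population in question.
population = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}

gaps = proportion_gaps(sample, population)
for group, gap in gaps.items():
    print(f"{group}: sample share differs from population share by {gap:+.2f}")
```

Large gaps, such as the over-representation of the youngest group in this made-up sample, suggest that conclusions drawn from the dataset may not generalize to the population in question.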
