Data Science in Practice


Lecture notes for DSC 80 at UCSD.


By Aaron Fraenkel

WARNING: THIS BOOK IS UNDER DEVELOPMENT. Regular changes to content will occur; typos are rampant!

Description of course

Students leave the course:

  • equipped to “bring their own dataset” to apply materials they are learning in upper-division classes and outside projects.

  • able to process and manipulate real-world data, understanding the statistical implications of their processing decisions.

  • with a “big picture” understanding of why data science projects are structured the way they are.

  • with an understanding of the daily work of a data scientist and the context of that work in the greater world.

Prerequisites

  • Knowledge of basic Python.

  • Knowledge of basic probability and statistics is helpful.

Contents

Understanding (Tabular) Data

The first section focuses on the basics of understanding data and the processes that generate it. A quantitative and statistical understanding of a dataset requires clearly defining what is being measured and in what ways. A tabular data structure is the natural framework for working toward these goals.

The chapters in this section cover:

  • How tabular data structures describe measurements of real-world phenomena.

  • How to compute with tabular data structures to generate quantitative descriptions of data.

  • Understanding the connection between data in tables and the processes that generate them, in order to make inferences about events in the real world.
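
As a rough illustration of computing with a tabular structure, the sketch below represents a table as a list of rows, where each row is one observation and each key is one measured attribute, and then summarizes a measurement by group. The table contents and the helper name `group_mean` are made up for this example and are not taken from the course materials; in practice a library such as pandas would likely be used instead.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical table: each dict is one observation (row),
# each key is one measured attribute (column). Values are invented.
rows = [
    {"city": "San Diego", "month": "Jan", "temp_f": 66},
    {"city": "San Diego", "month": "Jul", "temp_f": 76},
    {"city": "Seattle", "month": "Jan", "temp_f": 47},
    {"city": "Seattle", "month": "Jul", "temp_f": 76},
]


def group_mean(table, key, value):
    """Return the mean of column `value` for each distinct entry of column `key`."""
    groups = defaultdict(list)
    for row in table:
        groups[row[key]].append(row[value])
    return {group: mean(values) for group, values in groups.items()}


mean_temp_by_city = group_mean(rows, "city", "temp_f")
print(mean_temp_by_city)
```

The point of the sketch is that once "what is an observation" and "what is a measurement" are fixed, quantitative descriptions reduce to simple operations over rows and columns.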

Collecting Data and Extracting Information

The second section focuses on typical scenarios of data collection. Data collection consists either of designing experiments to generate new data for building models or of using existing data to build models.

While science has traditionally focused on experimental design, recent large-scale, indiscriminate collection of data has made using existing data for new problems more common. This section focuses on the process of finding data and transforming it into a tabular structure analyzable using the techniques of Part 1 of the course.

The chapters in this section cover:

  • Considerations for assessing and handling existing datasets

  • Collecting data from the internet (over HTTP)

  • Transforming non-tabular data into tabular data (i.e. what are the observations?)

  • Transforming text data into quantitative features (i.e. what are the measurements?)
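
One common way to turn text into quantitative features is a bag-of-words count, sketched below under simple assumptions: each document is one observation, and each vocabulary word is one measurement. The tokenizer, the function names, and the sample sentences are illustrative choices for this sketch, not the method prescribed by the course.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, matrix


# Made-up example documents.
docs = ["The data is tabular.", "Text data is not tabular data."]
vocabulary, matrix = bag_of_words(docs)

print(vocabulary)
for row in matrix:
    print(row)
```

The result is itself a table, so the techniques from the first section apply to it directly.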

Modeling with Data

The last section focuses on building models to answer questions and solve problems using data. These models may be used for either statistical inference or prediction. The material in this section focuses on the overall structure of a modeling pipeline and observations on how to assess the quality of a given model.

The chapters in this section cover:

  • Statistical models and modeling pipelines.

  • Bias and Variance (choosing a good model).

  • Evaluating a model for fairness.
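
To make the idea of a modeling pipeline concrete, here is a minimal sketch assuming the simplest possible setting: fit a least-squares line on training data, predict on held-out data, and score the predictions with root-mean-square error. The data values and the function names `fit_line`, `predict`, and `rmse` are hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, xs):
    """Apply a fitted (slope, intercept) model to new inputs."""
    slope, intercept = model
    return [slope * x + intercept for x in xs]


def rmse(predicted, actual):
    """Root-mean-square error between predictions and observed values."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return mean(squared_errors) ** 0.5


# Made-up data, split into a training set and a held-out test set.
train_x, train_y = [1, 2, 3, 4], [2, 4, 6, 8]
test_x, test_y = [5, 6], [10, 12]

model = fit_line(train_x, train_y)
test_error = rmse(predict(model, test_x), test_y)
print(model, test_error)
```

Scoring the model on data it was not fit to is what makes the error a fair estimate of quality; comparing training and held-out error is one entry point to the bias and variance discussion.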

License for this book

All content in this book (i.e., any files and content in the content/ folder) is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Icon made by monkik from www.flaticon.com, licensed under CC BY 3.0.