# Data Science in Practice
---
Lecture notes for [DSC 80](https://dsc80.com) at UCSD.
---
By **Aaron Fraenkel**
**WARNING: THIS BOOK IS UNDER DEVELOPMENT. Regular changes to content will occur; typos are rampant!**
## Description of course
Students leave the course:
* equipped to “bring their own dataset” and
apply the material they are learning in upper-division classes and
outside projects.
* able to process and manipulate real-world data while understanding the statistical implications of
their processing decisions.
* with a “big picture” understanding of why data science projects are
structured the way they are.
* with an understanding of the daily work of a data scientist and the context of
that work in the greater world.
## Prerequisites
* Knowledge of basic Python.
* Knowledge of basic probability and statistics is helpful.
## Contents
### Understanding (Tabular) Data
The first section focuses on the basics of understanding
data and the processes that generate it. Quantitative and statistical
understanding of a dataset requires clearly defining what is being
measured and in what ways. A tabular data structure is the natural
framework for working toward these goals.
The chapters in this section cover:
* How tabular data structures describe measurements of real-world
phenomena.
* How to compute with tabular data structures to generate quantitative
descriptions of data.
* Understanding the connection between data in tables and the processes
that generate them, in order to make inferences about events in the
real world.
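
As a rough illustration of the first two points, the sketch below treats a table as a list of rows, where each row is one observation and each key is one measurement. The column names (`city`, `temp_c`), the values, and the helper `mean_by_group` are all hypothetical and invented for this example; plain Python stands in for whatever table library the chapters themselves use.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical table: one row per observation, one key per measurement.
rows = [
    {"city": "San Diego", "temp_c": 21.0},
    {"city": "San Diego", "temp_c": 23.0},
    {"city": "Seattle", "temp_c": 12.0},
    {"city": "Seattle", "temp_c": 16.0},
]


def mean_by_group(table, group_col, value_col):
    """Return the mean of value_col for each distinct value of group_col."""
    groups = defaultdict(list)
    for row in table:
        groups[row[group_col]].append(row[value_col])
    return {key: mean(values) for key, values in groups.items()}


# A quantitative description of the data: average temperature per city.
summary = mean_by_group(rows, "city", "temp_c")
print(summary)
```

Stating what a row is (here, one temperature reading) before computing anything is what makes a summary like this interpretable.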
### Collecting Data and Extracting Information
The second section focuses on typical scenarios of data
collection. Data collection consists of either designing experiments to
generate new data for building models or using existing data to
build models.
While science has traditionally focused on experimental design, recent
large-scale, indiscriminate collection of data has made using existing
data for new problems more common. This section focuses on the process
of finding data and transforming it into a tabular structure
analyzable using the techniques of Part 1 of the course.
The chapters in this section cover:
* Considerations for assessing and handling existing datasets
* Collecting data from the internet (over HTTP)
* Transforming non-tabular data into tabular data (i.e. what are the
observations?)
* Transforming text data into quantitative features (i.e. what are the
measurements?)
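
The last two points can be sketched with a small, made-up example. The nested JSON string below is a hypothetical stand-in for data fetched over HTTP, and the field names and the helpers `to_rows` and `word_counts` are assumptions for illustration only. The sketch first decides what an observation is (one post per row), then what the measurements are (word counts).

```python
import json
from collections import Counter

# Hypothetical nested record, standing in for a response fetched over HTTP.
raw = (
    '{"user": "ana", "posts": ['
    '{"id": 1, "text": "Data is data"}, '
    '{"id": 2, "text": "Clean the data"}]}'
)


def to_rows(payload):
    """Flatten the nested record so that each post becomes one row."""
    record = json.loads(payload)
    return [
        {"user": record["user"], "post_id": post["id"], "text": post["text"]}
        for post in record["posts"]
    ]


def word_counts(text):
    """Turn free text into quantitative features: lowercase word counts."""
    return Counter(text.lower().split())


rows = to_rows(raw)                                     # the observations
features = [word_counts(row["text"]) for row in rows]   # the measurements
```

Splitting on whitespace is the simplest possible tokenizer; real text usually needs more careful handling of punctuation and casing.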
### Modeling with Data
The last section focuses on building models to answer questions and
solve problems using data. These models may be used for either
statistical inference or prediction. The material in this section
focuses on the overall structure of a modeling pipeline and
observations on how to assess the quality of a given model.
The chapters in this section cover:
* Statistical models and modeling pipelines.
* Bias and Variance (choosing a good model).
* Evaluating a model for fairness.
## License for this book
All content in this book (i.e., any files and content in the `content/` folder)
is licensed under the [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/)
(CC BY-SA 4.0) license.