Course Description
An increasing amount of data is now generated in a variety of disciplines,
ranging from finance and economics to the natural and social sciences.
Making use of this information requires both statistical tools and an
understanding of how the substantive scientific questions should drive
the analysis. In this hands-on course, we learn to explore and analyze
real-world datasets. We cover techniques for summarizing and describing data,
methods for statistical inference, and principles for effectively communicating results.
Prerequisites:
MS&E 120 or equivalent,
and CS 106A or equivalent
Instructors
Sharad Goel (email)
Alex Chohlas-Wood (TA) (email)
Josh Grossman (TA) (email)
Jerry Lin (TA) (email)
Schedule
Class: Tuesdays & Thursdays @ 10:30 AM - 11:50 AM PT (online)
Discussion Section: Thursdays @ 12:30 PM - 1:50 PM PT (online)
[ Zoom links posted on Canvas. ]
Office Hours
Mondays @ 4 PM - 6 PM PT (Alex)
Tuesdays @ 4:30 PM - 6:30 PM PT (Sharad)
Wednesdays @ 10 AM - 12 PM PT (Jerry)
Wednesdays @ 6:30 PM - 8:30 PM PT (Josh)
Lectures and discussion sections will be recorded to ensure all students have access to the materials, regardless of timezone and internet connectivity. But please make every effort to attend lecture (and, ideally, discussion section) if you are able. Both the lectures and the discussion sections are interactive, so the learning experience is best when you attend, and regular attendance also helps foster a sense of community.
If an extenuating circumstance (e.g., timezone constraints or personal obligations) prevents you from attending the lectures live, please submit a short explanation.
If you would like to request some music to play at the beginning of class, please fill out this form!
Office hours are a great opportunity to discuss not only topics directly related to the course,
but also anything else that's on your mind, such as
questions about career trajectories and
research opportunities in MS&E and in the Computational Policy Lab.
Please note that there are no regular office hours during the first week of class, but
feel free to schedule an appointment if you would like to meet.
Communication
We use Piazza to manage course questions and discussion, and you can sign up here.
It is our intent that students from all backgrounds and perspectives be well served by this course, that students' learning needs be addressed both in and out of class, and that the diversity that students bring to this class be viewed as a resource, strength, and benefit. We aim to present materials and conduct activities in ways that are respectful of this diversity. Your suggestions are encouraged and appreciated. Please let us know ways to improve the effectiveness of the course for you personally or for other students or student groups.
You may use our (anonymous) comment box to let us know which aspects of the class
are going well and which could be improved.
We encourage you to work together in groups to solidify your understanding of the course material.
If you would like assistance forming a study group, please complete this form by Thursday, January 14, 9:00 pm PT.
Our goal is to form the study groups the following day,
so students can begin discussing the first homework assignment.
[ Optional ] Textbooks
All of Statistics by Larry Wasserman (available online)
R for Data Science by Garrett Grolemund and Hadley Wickham
Statistics by David Freedman, Robert Pisani, and Roger Purves
Natural Experiments in the Social Sciences by Thad Dunning
All of the key resources for this class are available online, free of charge.
However, please note that the department has created a new
Opportunity Fund
through which students may request financial assistance to purchase any necessary course materials.
Computing Environment
We primarily use R (RStudio is the recommended interface),
including the suite of tidyverse packages.
Evaluation
6 homework assignments (50%)
2 quizzes (25%)
Project proposal + final project (20%)
Attendance and participation (5%)
Resources
Syllabus
Week 1: Data Exploration & Visualization
January 11-15, 2021
Summary statistics, data manipulation, group-wise operations, joins, principles of plotting
Week 2: Statistical Inference I
January 18-22, 2021
Chapter 6 of All of Statistics
Sampling distributions, statistical estimators, confidence intervals
Week 3: Statistical Inference II
January 25-29, 2021
Selected topics from Chapters 7, 8 & 9 of All of Statistics
Maximum likelihood estimation, method of moments, the bootstrap
Week 4: Linear Regression I
February 1-5, 2021
Part III of Statistics, and selected topics from Chapter 13 of All of Statistics
Correlation, simple linear regression, confidence & prediction intervals
Week 5: Career Panel and Quiz
February 8-12, 2021
Panel on data science careers in government, academia and non-profits; take-home quiz
Week 6: Linear Regression II
February 15-19, 2021
Selected topics from Chapter 13 of All of Statistics
Multiple regression, feature generation, model evaluation, normal equations
Week 7: Logistic Regression
February 22-26, 2021
Selected topics from Chapter 13 of All of Statistics
Logistic regression, multinomial logistic regression, model evaluation
Week 8: Bias-Variance Tradeoff
March 1-5, 2021
Overfitting, underfitting, cross-validation, regularization
Week 9: Causal Inference
March 8-12, 2021
Rubin causal model, response surface modeling, instrumental variables, diff-in-diff
Week 10: Review and Quiz
March 15-19, 2021
Review of course material; final project due; take-home quiz
Assignments and Quizzes
Unless otherwise stated, assignments are to be done individually.
You are welcome to work with others to master the principles and approaches used to
solve the homework problems, but the work you turn in should be your own.
Under no circumstances should you seek out or look at solutions to assignments given in previous years.
Late homework will not be accepted, but your lowest homework grade will be dropped.
Assignment 0:
Due Date: Thursday, January 14, noon PT
In preparation for the first discussion section,
install RStudio
(which in turn requires installing R).
Please also sign up for Piazza.
Assignment 1:
Due Date: Thursday, January 21, 9:00 pm PT
Exploring and visualizing data with dplyr and ggplot2 in R.
Details here.
Assignment 2:
Due Date: Thursday, January 28, 9:00 pm PT
Statistical estimators and confidence intervals.
Details here.
Assignment 3:
Due Date: Monday, February 8, 9:00 pm PT
The bootstrap, MLEs, and the method of moments.
Details here.
Assignment 4: Project proposal
Due Date: Tuesday, February 9, 9:00 pm PT
In teams of 2-5 people, please complete this project proposal form.
You are free to pursue any topic related to applied statistics.
In previous years, teams have considered
athletic performance, gender inequality, farming practices, restaurant quality,
music success, gentrification, and standardized testing, just to name a few.
Any data-driven investigation is fair game.
At the end of the quarter, each team will prepare either a 10-page written report
or a 10-minute video presentation.
To help assess the feasibility and suitability of your project,
please discuss your idea with the teaching staff before submitting your proposal.
Your group should sign up for a 15-minute meeting slot. It is ideal if all members of your group can attend the meeting but, at a minimum, two members should be present. (This meeting is a required part of the assignment.)
Once the project is complete, we'll ask each student to evaluate the contributions of their team members, and
we'll consider these peer reviews when determining final grades.
Quiz
24-hour take-home quiz
12:01 am - 11:59 pm PT on Thursday, February 11
The quiz is open book/computer.
It consists of 5 true/false and 15 multiple-choice questions that
test the conceptual ideas presented during weeks 1-3.
The quiz is designed to take approximately 1 hour.
Assignment 5:
Due Date: Thursday, February 25, 9:00 pm PT
Linear regression.
Details here.
Assignment 6:
Due Date: Thursday, March 4, 9:00 pm PT
Logistic regression.
Details here.
Assignment 7:
Due Date: Thursday, March 11, 9:00 pm PT
Bias-variance trade-offs, cross-validation, and regularization.
Details here.
Final Project
Due Date: Monday, March 15, 9:00 pm PT
10-minute video presentation or 10-page single-spaced paper
Peer evaluations: due on Thursday, March 18, 9:00 pm PT.
Your final paper or presentation should clearly state and motivate your research question, summarize
the related literature, describe your methods, detail your results
(and include the appropriate plots), and discuss the implications
of your findings.
Quiz
24-hour take-home quiz
12:01 am - 11:59 pm PT on Thursday, March 18
The quiz is open book/computer.
It consists of 5 true/false and 15 multiple-choice questions that
test the conceptual ideas presented throughout the course, focusing on material presented in weeks 4-9.
The quiz is designed to take approximately 1 hour.
Lectures
Lecture 1: Data exploration
Lecture 2: Visualization
Week 1 discussion section: Intro to R
Lecture 3: Intro to statistical inference
Lecture 4: Confidence intervals
Week 2 discussion section: Estimators
Lecture 5: The bootstrap
Lecture 6: Parametric inference
Week 3 discussion section: The bootstrap
Lecture 7: Correlation & regression
Lecture 8: Simple linear regression
Week 4 discussion section: MLE and method of moments
Lecture 9: Uncertainty in regression
Lecture 10: Model evaluation & feature generation
Week 5 discussion section: Linear regression
Lecture 11: Logistic regression
Lecture 12: Multinomial logistic regression
Week 6 discussion section: Logistic regression
Lecture 13: Bias-variance tradeoff
Lecture 14: Regularization
Week 7 discussion section: Regularization
Lecture 15: Intro to causal inference
Lecture 16: Causal inference with observational data