## Data Science Track

Our curriculum is now organized into modules rather than tracks, as described here. For reference, we include below a description of the Data Science track as it existed before the reorganization.

### Methods

#### Visualization and exploratory data analysis

• Insights that are hard to gain without proper visualization
• Useful vs. misleading visualizations, identifying misleading statements that superficially seem to be supported by data, judging data reliability
• 2D and 3D visualizations

#### Datasets

• Data structures, representations, and transformations
• Time complexity of operations on datasets and related algorithm theory
• Conceptual issues with data cleaning and dealing with incomplete data
• Web scraping
• Practical data security issues

#### Regression problems

• Linear regression and its variants
  — Standard linear regression with homoskedasticity
  — Standard linear regression with heteroskedasticity
  — Linear regression with regularization
    ➢ Interpretation of sparse coefficients
    ➢ Computational considerations related to different regularization methods
  — Interpretation and treatment of outliers
• Logistic regression and its variants
• Avoiding a common confusion about causal interpretation of regression coefficients
• Non-linear regressions and non-parametric methods, identifying model misspecification
• Regressions with cutoffs: Heckman-type estimators
• Differences-in-differences estimators and their pitfalls
• Regression discontinuity design
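To illustrate the regularization topics above, here is a minimal NumPy sketch (with synthetic data) of closed-form ridge regression; the shrinkage of coefficients relative to ordinary least squares is the basic effect regularization produces:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])  # synthetic "true" coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)

def ridge(X, y, lam):
    # Closed-form ridge solution: (X'X + lam*I)^{-1} X'y; lam=0 gives OLS
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

b_ols = ridge(X, y, 0.0)
b_reg = ridge(X, y, 100.0)
# Regularization pulls the coefficient vector toward zero
print(np.linalg.norm(b_ols), np.linalg.norm(b_reg))
```

The ridge penalty guarantees that the norm of the regularized coefficient vector never exceeds the OLS norm; lasso-style (L1) penalties instead produce the exactly-zero sparse coefficients mentioned above, at higher computational cost since no closed form exists.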

#### Classification problems

• Use of logistic regression for classification problems
• Multinomial classification
  — Example: Multinomial classification on top of a pre-trained neural network
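The multinomial step on top of pre-trained features reduces to a softmax over class scores. A minimal sketch, with hypothetical logits standing in for the output of a linear layer:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: subtract the row max before exponentiating
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits for two examples over three classes
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
probs = softmax(logits)
pred = probs.argmax(axis=1)
print(pred)  # predicted class index per example
```

Each row of `probs` sums to one, so the outputs can be read as class probabilities.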

#### Shapes of probability distributions and hypothesis testing

• Gaussian distributions and their convenient properties
• The importance of fat-tailed distributions and their counterintuitive nature
• Single hypothesis testing
• Multiple hypothesis testing
  — Example: Problems in the scientific literature caused by p-value hacking
• Hypothesis testing vs. using a validation data set

#### Nuances of probability theory

• Common statistics paradoxes/misconceptions and their resolution
• Strategies for solving problems in probability theory and combinatorics

#### Performance evaluation, result significance, and common mistakes

• Performance metrics and their relationship to the problem’s practical objective
• Dealing with imbalanced datasets
• Choosing the right baselines
• Ablation studies
• Significance of results
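The link between metrics and imbalanced datasets can be shown in a few lines. A sketch with hypothetical labels: a degenerate classifier that always predicts the majority class scores high accuracy while being useless by the metric that matters:

```python
import numpy as np

# Hypothetical imbalanced labels: 95% negative, 5% positive
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # a "classifier" that always predicts 0

accuracy = (y_pred == y_true).mean()        # 0.95, looks impressive
recall = (y_pred[y_true == 1] == 1).mean()  # 0.0, catches no positives
print(accuracy, recall)
```

This is also why the majority-class predictor is often the right baseline to report: any model must beat 95% accuracy here before the number means anything.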

#### Supervised machine learning

• Methods of traditional machine learning: k-nearest neighbors, logistic regression, decision trees, random forests, support vector machines, and others
• Overfitting and underfitting
• Regularization methods, bias vs. variance
• Non-traditional regularization methods: early stopping, dropout
• Training sets, validation sets, and test sets
• Designing a loss function reflecting the project's practical objective
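Overfitting versus underfitting can be demonstrated with polynomial fits of increasing degree on synthetic data; the training error necessarily falls as model capacity grows, while held-out error need not:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 30)
y_train = np.sin(3 * x_train) + rng.normal(scale=0.3, size=30)  # noisy samples
x_val = np.linspace(-1, 1, 100)
y_val = np.sin(3 * x_val)  # noiseless held-out targets

def mse(degree):
    # Fit a polynomial of the given degree and report train/validation MSE
    coeffs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    return train, val

t1, v1 = mse(1)    # underfit: a line cannot capture sin(3x)
t10, v10 = mse(10) # flexible: lower train error, validation may degrade
print(t1, v1, t10, v10)
```

Because the degree-10 model nests the degree-1 model, its training error is guaranteed to be no larger; the validation error is what reveals whether the extra capacity fit signal or noise.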

#### Optimization methods

• Stochastic gradient descent, momentum, adaptive optimizers, results of optimizer architecture search
• Backpropagation through a computation graph
• Second order methods
• Learning rate choice methods for faster optimizations
• Non-gradient-based methods, including evolutionary methods
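The momentum update above is a few lines of NumPy. A sketch on a simple ill-conditioned quadratic (chosen here for illustration, since its gradient is exact):

```python
import numpy as np

def sgd_momentum(grad, x0, lr=0.1, beta=0.9, steps=200):
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v - lr * grad(x)  # velocity accumulates past gradients
        x = x + v
    return x

# Minimize f(x) = x'Ax, whose unique minimum is the origin
A = np.diag([1.0, 10.0])  # condition number 10: momentum helps here
grad = lambda x: 2 * A @ x
x_min = sgd_momentum(grad, [5.0, 5.0])
print(x_min)  # should approach the origin
```

The velocity term damps oscillation along the steep axis while accelerating progress along the shallow one, which is the motivation for momentum on ill-conditioned problems.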

#### Standard neural network architectures and related training methods

• Neural networks based on fully connected layers
• Convolutional neural networks
• Recurrent neural networks
• Neural networks with entity embedding layers
• Architectures for natural language processing

#### Practical aspects of training machine learning models

• Data normalization and pre-processing
• Data augmentation
• Using imbalanced datasets
• Transfer learning
• Semi-supervised learning
• Hyperparameter search
• Implications of the extent of hyperparameter search for comparing the performance of different methods

#### Improving the performance of ML models

• Heuristic inspection of data
• Dealing with a mismatch between collected data and real-world deployment data
• Avoiding data leakage
• Model ensembles, stacking, bagging, and boosting
• Strategies for debugging and interpreting machine learning models
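The simplest ensemble from the list above is averaging the predicted probabilities of several models. A sketch with hypothetical per-model predictions:

```python
import numpy as np

# Hypothetical probability-of-positive predictions from three models
# on the same five examples (rows = models, columns = examples)
preds = np.array([
    [0.9, 0.2, 0.6, 0.4, 0.8],
    [0.7, 0.3, 0.5, 0.6, 0.9],
    [0.8, 0.1, 0.7, 0.3, 0.7],
])
ensemble = preds.mean(axis=0)              # simple averaging ensemble
labels = (ensemble > 0.5).astype(int)      # threshold the averaged scores
print(labels)
```

Averaging reduces the variance component of the error when the models' mistakes are not perfectly correlated; stacking generalizes this by learning the combination weights from validation data.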

#### Unsupervised learning

• Clustering
• Principal component analysis and its relationship to singular value decomposition; independent component analysis
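The PCA/SVD relationship can be verified directly on synthetic data: eigenvalues of the sample covariance matrix equal the squared singular values of the centered data matrix, scaled by n − 1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)  # PCA requires centered data

# PCA via eigendecomposition of the sample covariance matrix
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals = np.linalg.eigh(cov)[0][::-1]   # eigh returns ascending; flip to descending

# The same variances via SVD of the centered data matrix
s = np.linalg.svd(Xc, full_matrices=False)[1]
svd_vals = s ** 2 / (len(Xc) - 1)

print(np.allclose(eigvals, svd_vals))  # True
```

In practice the SVD route is preferred: it avoids forming the covariance matrix explicitly, which squares its condition number.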

#### Time-series analysis

• Autoregressive processes, vector autoregression
• Common mistakes when analyzing time-series data
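A minimal autoregressive example: simulate an AR(1) process and recover its coefficient by regressing each observation on its predecessor (no intercept, since the process is mean-zero by construction):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 0.8   # true AR(1) coefficient
n = 5000
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()  # x_t = phi * x_{t-1} + noise

# Least-squares estimate of phi from the lagged regression
phi_hat = (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])
print(phi_hat)  # close to 0.8
```

Note that this estimator is biased in small samples, one of the common pitfalls when analyzing time-series data.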

#### Causal inference

• Instrumental variables: Two-stage least squares
• Instrumental variables: Non-linear models/machine-learning models
• Causal calculus ("do-calculus")

#### Functional programming

• Functional programming concepts and their usefulness in parallel computing
• Functional programming for writing bug-free code and for code verification
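The core functional ideas are available directly in Python. A sketch using pure functions with `map`, `filter`, and `reduce`: because a pure function's output depends only on its inputs and it has no side effects, the individual calls are independent and can safely run in parallel:

```python
from functools import reduce

square = lambda x: x * x          # pure: no state, no side effects
is_even = lambda x: x % 2 == 0    # pure predicate

data = range(1, 11)
evens_squared = list(map(square, filter(is_even, data)))
total = reduce(lambda a, b: a + b, evens_squared)
print(evens_squared, total)
```

The same purity that enables parallelism also simplifies verification: each function can be tested in isolation, with no hidden dependencies on program state.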

### Computational aspects

#### Command line interfaces and operating systems

• Data science environments on Linux, macOS, and Windows
• Basics of the Ubuntu Linux operating system
• Basic Bash commands, Bash scripts, lesser-known but highly convenient Bash commands

#### GPU computing

• GPU architectures and types of processing units inside GPUs
• GPU performance metrics and tradeoffs between them
• Types of code that can and cannot be accelerated by GPUs
• Low-level and high-level frameworks for GPU computing
• GPU computing in the cloud
• Systolic arrays, tensor cores, and TPUs

#### Optional: Building one's own GPU-powered computer

• Price vs. performance considerations, component choice, assembly

#### Python: language structure

• Guidance on code structure, style conventions, and naming of variables
• Variable types and their properties
• Object-oriented programming
• Python packages, libraries, and frameworks
• Python language pitfalls and common mistakes

#### Python: libraries

• Libraries for data analysis and numerical computations on CPUs
• RAPIDS for computations on GPUs
• Visualization libraries
• Deep learning libraries TensorFlow, Keras, and PyTorch
• Web development frameworks
• Web scraping libraries

#### Python: computation speed

• Data structures and related tradeoffs
• Broadcasting between variables of different dimensions
• Vectorization, "Single Instruction, Multiple Data" (SIMD), multiple cores, multiple processors
• Python libraries designed to make computation speed comparable to C code
• Dealing with insufficient memory
• Profiling (processors and memory)
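The payoff of vectorization is easy to measure: the same reduction written as a Python loop and as a single NumPy call differ by orders of magnitude, because the latter dispatches once to compiled, SIMD-capable code:

```python
import time
import numpy as np

n = 1_000_000
a = np.arange(n, dtype=np.float64)

# Pure-Python loop: one interpreted iteration per element
t0 = time.perf_counter()
s_loop = 0.0
for v in a:
    s_loop += v * v
t_loop = time.perf_counter() - t0

# Vectorized: a single dot product over the whole array
t0 = time.perf_counter()
s_vec = float(a @ a)
t_vec = time.perf_counter() - t0

print(f"loop {t_loop:.3f}s, vectorized {t_vec:.4f}s")
```

The exact speedup is machine-dependent, but the vectorized version is reliably faster by a large factor; profiling, as listed above, is how such hotspots are found in real code.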

#### Data formats

• Transforming and processing data in various standard and less standard formats
• Building data pipelines

#### Docker

• Virtual environments in general
• Docker properties and related tradeoffs
• Multi-container Docker applications

#### Production tools

• Spark: Running parallel computations on a Spark cluster
• Kubernetes: Basics of container orchestration with Kubernetes

#### Cloud computing

• Comparison of cloud offerings of major providers
• Building scalable web applications

#### C language

• Language structure and common practices