## Data Science Track

Here we provide a list of topics covered by the Data Science track, split into methods and computational aspects. The ordering of topics does not reflect the order in which they will be introduced.

As explained in the overview of courses, the track consists of four levels. If you would like to understand how the curriculum is reflected by each of the four levels, please contact us, and we'll be happy to explain.

### Methods

#### Visualization and exploratory data analysis

- Insights that are hard to gain without proper visualization
- Useful vs. misleading visualizations, identifying misleading statements that superficially seem to be supported by data, judging data reliability
- 2D and 3D visualizations
- Dynamic and interactive visualizations, web-based visualizations, visualizations powered by GPUs

#### Datasets

- Data structures, representations, and transformations
- Time complexity of operations on datasets and related algorithm theory
- Conceptual issues with data cleaning and dealing with incomplete data
- Web scraping
- Practical data security issues

#### Regression problems

- Linear regression and its variants
  - Standard linear regression with homoskedasticity
  - Standard linear regression with heteroskedasticity
  - Linear regression with regularization
    - Interpretation of sparse coefficients
    - Computational considerations related to different regularization methods
  - Interpretation and treatment of outliers
- Logistic regression and its variants
- Avoiding a common confusion about causal interpretation of regression coefficients
- Non-linear regressions and non-parametric methods, identifying model misspecification
- Regressions with cutoffs: Heckman-type estimators
- Differences-in-differences estimators and their pitfalls
- Regression discontinuity design
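As an illustration of the confusion about causal interpretation of regression coefficients, here is a minimal sketch on synthetic data. The variable names, the data-generating process, and the helper `ols` are assumptions made for this example and are not part of the course material. In the simulation, `x` has no causal effect on `y`; both are driven by a confounder `z`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical data-generating process (an assumption for illustration):
z = rng.normal(size=n)            # unobserved confounder
x = z + rng.normal(size=n)        # x is driven by z
y = 2.0 * z + rng.normal(size=n)  # y is driven by z only; x has no causal effect


def ols(columns, target):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


# Regressing y on x alone gives a clearly non-zero slope (about 1 here),
# even though changing x would not change y.
naive_slope = ols([x], y)[1]

# Including the confounder as a regressor drives the slope on x toward zero.
adjusted_slope = ols([x, z], y)[1]

print(f"slope on x without z: {naive_slope:.3f}")
print(f"slope on x with z:    {adjusted_slope:.3f}")
```

The non-zero slope in the first regression reflects correlation induced by `z`, not an effect of `x` on `y`.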

#### Classification problems

- Use of logistic regression for classification problems
- Multinomial classification
  - Example: Multinomial classification on top of a pre-trained neural network

#### Shapes of probability distributions and hypothesis testing

- Gaussian distributions and their convenient properties
- The importance of fat-tailed distributions and their counterintuitive nature
- Single hypothesis testing
- Multiple hypothesis testing
  - Example: Problems in the scientific literature caused by p-value hacking
- Hypothesis testing vs. using a validation dataset
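The multiple hypothesis testing problem can be sketched with a small simulation. This is an illustrative example under stated assumptions, not course code: every null hypothesis is true by construction, the test is a two-sided z-test with known unit variance, and the Bonferroni correction is just one of several possible adjustments.

```python
import math

import numpy as np

rng = np.random.default_rng(42)
n_tests, n_samples, alpha = 1000, 50, 0.05

# Every null hypothesis is true: each sample is drawn with mean 0 and std 1.
data = rng.normal(loc=0.0, scale=1.0, size=(n_tests, n_samples))

# Two-sided z-test for "mean == 0" with known standard deviation 1.
z_scores = data.mean(axis=1) * math.sqrt(n_samples)
p_values = np.array([math.erfc(abs(z) / math.sqrt(2)) for z in z_scores])

# Testing each hypothesis at level alpha yields roughly alpha * n_tests
# false positives, since no real effect exists anywhere.
naive_hits = int((p_values < alpha).sum())

# Bonferroni correction: compare each p-value against alpha / n_tests.
bonferroni_hits = int((p_values < alpha / n_tests).sum())

print(f"'significant' results without correction: {naive_hits}")
print(f"'significant' results with Bonferroni:    {bonferroni_hits}")
```

Reporting only the uncorrected "significant" results from such a batch is one way p-value hacking produces findings that do not replicate.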

#### Nuances of probability theory

- Common statistical paradoxes and misconceptions, and their resolution
- Strategies for solving problems in probability theory and combinatorics

#### Performance evaluation, result significance, and common mistakes

- Performance metrics and their relationship to the problem's practical objective
- Dealing with imbalanced datasets
- Choosing the right baselines
- Ablation studies
- Significance of results
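A minimal sketch of why the choice of metric and baseline matters on an imbalanced dataset. The class ratio and the trivial "always negative" classifier are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical imbalanced labels: 10 positives, 990 negatives.
y_true = np.array([1] * 10 + [0] * 990)

# Trivial baseline that always predicts the majority (negative) class.
y_pred = np.zeros_like(y_true)

# Accuracy looks excellent because negatives dominate the dataset.
accuracy = float((y_pred == y_true).mean())

# Recall on the positive class reveals that no positive case is ever found.
true_positives = int(((y_pred == 1) & (y_true == 1)).sum())
recall = true_positives / int((y_true == 1).sum())

print(f"accuracy: {accuracy:.2f}, recall: {recall:.2f}")
```

Any model evaluated on such data should be compared against this kind of trivial baseline before its accuracy is taken as evidence of usefulness.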

#### Supervised machine learning

- Traditional machine learning vs. deep learning: performance, advantages and disadvantages
- Methods of traditional machine learning: k nearest neighbors, logistic regression, decision trees, random forests, support vector machines, and others
- Overfitting and underfitting
- Regularization methods, bias vs. variance
- Non-traditional regularization methods: early stopping, dropout
- Training sets, validation sets, and test sets
- Designing a loss function reflecting the project's practical objective

#### Optimization methods

- Stochastic gradient descent, momentum, adaptive optimizers, results of optimizer architecture search
- Backpropagation through a computation graph
- Second order methods
- Methods for choosing the learning rate for faster optimization
- Non-gradient-based methods, including evolutionary methods

#### Standard neural network architectures and related training methods

- Neural networks based on fully connected layers
- Convolutional neural networks
- Recurrent neural networks
- Neural networks with entity embedding layers
- Architectures for natural language processing

#### Practical aspects of training machine learning models

- Data normalization and pre-processing
- Data augmentation
- Using imbalanced datasets
- Transfer learning
- Semi-supervised learning
- Hyperparameter search
- Implications of the extent of hyperparameter search for comparing the performance of different methods

#### Improving the performance of ML models

- Heuristic inspection of data
- Dealing with a mismatch between collected data and real-world deployment data
- Avoiding data leakage
- Model ensembles, stacking, bagging, and boosting
- Strategies for debugging and interpreting machine learning models

#### Unsupervised learning

- Clustering
- Principal component analysis and its relationship to singular value decomposition, independent component analysis
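The relationship between principal component analysis and singular value decomposition can be checked numerically. The sketch below uses synthetic data and a mixing matrix that are assumptions for illustration. It compares the eigendecomposition of the sample covariance matrix with the SVD of the centered data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 correlated features (illustrative only).
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.1, 0.2, 0.3]])
X = rng.normal(size=(200, 3)) @ mixing
Xc = X - X.mean(axis=0)  # PCA operates on centered data

# Route 1: eigendecomposition of the sample covariance matrix.
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_variances = S**2 / (len(Xc) - 1)

# Explained variances agree, and the principal directions agree up to sign.
variances_match = np.allclose(eigvals, svd_variances)
directions_match = np.allclose(np.abs(eigvecs.T @ Vt.T), np.eye(3), atol=1e-6)

print(variances_match, directions_match)
```

The rows of `Vt` are the principal directions, and the squared singular values divided by `n - 1` are the variances along them.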

#### Autoencoders, generative models

#### Time-series analysis

- Autoregressive processes, vector autoregression
- Common mistakes when analyzing time-series data

#### Causal inference

- Instrumental variables: Two-stage least squares
- Instrumental variables: Non-linear models/machine-learning models
- Causal calculus ("do-calculus")

#### Functional programming

- Functional programming concepts and their usefulness in parallel computing
- Functional programming for writing bug-free code and for code verification

### Computational aspects

#### Command line interfaces and operating systems

- Data science environments on Linux, macOS, and Windows
- Basics of the Ubuntu Linux operating system
- Basic Bash commands, Bash scripts, lesser-known but highly convenient Bash commands

#### GPU computing

- GPU architectures and types of processing units inside GPUs
- GPU performance metrics and tradeoffs between them
- Types of code that can and cannot be accelerated by GPUs
- Low-level and high-level frameworks for GPU computing
- GPU computing in the cloud
- Systolic arrays, tensor cores, and TPUs

#### Optional: Building one's own GPU-powered computer

- Price vs. performance considerations, component choice, assembly

#### Python: language structure

- Guidance on code structure, style conventions, and naming of variables
- Variable types and their properties
- Object-oriented programming
- Python packages, libraries, and frameworks
- Python language pitfalls and common mistakes
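One widely known example of a Python pitfall is the mutable default argument. It is shown here as an illustration and is not necessarily one of the cases the course covers. The function names are hypothetical.

```python
def append_buggy(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # and is shared by every call that omits `bucket`.
    bucket.append(item)
    return bucket


def append_fixed(item, bucket=None):
    # Using None as a sentinel creates a fresh list on each call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_buggy(1)
second = append_buggy(2)  # mutates the same list that `first` refers to

print(first is second)  # both names point at the shared default list
print(append_fixed(1), append_fixed(2))
```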

#### Python: libraries

- Libraries for data analysis and numerical computations on CPUs
- RAPIDS for computations on GPUs
- Visualization libraries
- Deep learning libraries: TensorFlow, Keras, and PyTorch
- Web development frameworks
- Web scraping libraries

#### Python: computation speed

#### Data formats

- Transforming and processing data in various standard and less standard formats
- Relational databases, NoSQL databases, their advantages and disadvantages
- Building data pipelines

#### Docker

#### Production tools

#### Cloud computing

#### C language

- Language structure and common practices
- Advantages and disadvantages of writing code in C

#### Cryptography

- Encryption and decryption, public and private keys
- Basic understanding of issues related to post-quantum cryptography