Data Science Track
Our curriculum is now organized into modules rather than tracks. For reference, the description of the Data Science track as it stood before the reorganization is kept below.
Methods
Visualization and exploratory data analysis
- Insights that are hard to gain without proper visualization
- Useful vs. misleading visualizations, identifying misleading statements that superficially seem to be supported by data, judging data reliability
- 2D and 3D visualizations
- Dynamic and interactive visualizations, web-based visualizations, visualizations powered by GPUs
Datasets
- Data structures, representations, and transformations
- Time complexity of operations on datasets and related algorithm theory
- Conceptual issues with data cleaning and dealing with incomplete data
- Web scraping
- Practical data security issues
Regression problems
- Linear regression and its variants
  - Standard linear regression with homoskedasticity
  - Standard linear regression with heteroskedasticity
  - Linear regression with regularization
    - Interpretation of sparse coefficients
    - Computational considerations related to different regularization methods
  - Interpretation and treatment of outliers
- Logistic regression and its variants
- Avoiding a common confusion about causal interpretation of regression coefficients
- Non-linear regressions and non-parametric methods, identifying model misspecification
- Regressions with cutoffs: Heckman-type estimators
- Differences-in-differences estimators and their pitfalls
- Regression discontinuity design
Classification problems
- Use of logistic regression for classification problems
- Multinomial classification
  - Example: Multinomial classification on top of a pre-trained neural network
Shapes of probability distributions and hypothesis testing
- Gaussian distributions and their convenient properties
- The importance of fat-tailed distributions and their counterintuitive nature
- Single hypothesis testing
- Multiple hypothesis testing
  - Example: Problems in the scientific literature caused by p-value hacking
- Hypothesis testing vs. using a validation data set
Nuances of probability theory
- Common statistics paradoxes/misconceptions and their resolution
- Strategies for solving problems in probability theory and combinatorics
Performance evaluation, result significance, and common mistakes
- Performance metrics and their relationship to the problem’s practical objective
- Dealing with imbalanced datasets
- Choosing the right baselines
- Ablation studies
- Significance of results
Supervised machine learning
- Traditional machine learning vs. deep learning: performance, advantages and disadvantages
- Methods of traditional machine learning: k nearest neighbors, logistic regression, decision trees, random forests, support vector machines, and others
- Overfitting and underfitting
- Regularization methods, bias vs. variance
- Non-traditional regularization methods: early stopping, dropout
- Training sets, validation sets, and test sets
- Designing a loss function reflecting the project's practical objective
Optimization methods
- Stochastic gradient descent, momentum, adaptive optimizers, results of optimizer architecture search
- Backpropagation through a computation graph
- Second order methods
- Methods for choosing the learning rate to speed up optimization
- Non-gradient-based methods, including evolutionary methods
Standard neural network architectures and related training methods
- Neural networks based on fully connected layers
- Convolutional neural networks
- Recurrent neural networks
- Neural networks with entity embedding layers
- Architectures for natural language processing
Practical aspects of training machine learning models
- Data normalization and pre-processing
- Data augmentation
- Using imbalanced datasets
- Transfer learning
- Semi-supervised learning
- Hyperparameter search
- Implications of the extent of hyperparameter search for comparing the performance of different methods
Improving the performance of ML models
- Heuristic inspection of data
- Dealing with a mismatch between collected data and real-world deployment data
- Avoiding data leakage
- Model ensembles, stacking, bagging, and boosting
- Strategies for debugging and interpreting machine learning models
Unsupervised learning
- Clustering
- Principal component analysis and its relationship to singular value decomposition, independent component analysis
- Autoencoders, generative models
Time-series analysis
- Autoregressive processes, vector autoregression
- Common mistakes when analyzing time-series data
Causal inference
- Instrumental variables: Two-stage least squares
- Instrumental variables: Non-linear models/machine-learning models
- Causal calculus ("do-calculus")
Functional programming
- Functional programming concepts and their usefulness in parallel computing
- Functional programming for writing bug-free code and for code verification
Computational aspects
Command line interfaces and operating systems
- Data science environments on Linux, macOS, and Windows
- Basics of the Ubuntu Linux operating system
- Basic Bash commands, Bash scripts, lesser-known but highly convenient Bash commands
GPU computing
- GPU architectures and types of processing units inside GPUs
- GPU performance metrics and tradeoffs between them
- Types of code that can and cannot be accelerated by GPUs
- Low-level and high-level frameworks for GPU computing
- GPU computing in the cloud
- Systolic arrays, tensor cores, and TPUs
Optional: Building one's own GPU-powered computer
- Price vs. performance considerations, component choice, assembly
Python: language structure
- Guidance on code structure, style conventions, and naming of variables
- Variable types and their properties
- Object-oriented programming
- Python packages, libraries, and frameworks
- Python language pitfalls and common mistakes
Python: libraries
- Libraries for data analysis and numerical computations on CPUs
- RAPIDS for computations on GPUs
- Visualization libraries
- Deep learning libraries: TensorFlow, Keras, and PyTorch
- Web development frameworks
- Web scraping libraries
Python: computation speed
Data formats
- Transforming and processing data in various standard and less standard formats
- Relational databases, NoSQL databases, their advantages and disadvantages
- Building data pipelines
Docker
Production tools
Cloud computing
C language
- Language structure and common practices
- Advantages and disadvantages of writing code in C
Cryptography
- Encryption and decryption, public and private keys
- Basic understanding of issues related to post-quantum cryptography