Recall

Mon 01 January 0001


Transformers

Attention

Query, Key, Value

Head

A head computes an attention-weighted representation of the input. A transformer has multiple heads, and each head is a separate attention mechanism. The idea is that each head can learn to pay attention to different types of information in the input data.
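A minimal sketch of what a single head computes, assuming the common scaled dot-product form (softmax of query-key scores, then a weighted average of the values). The function names and the list-of-vectors layout are illustrative choices, and the learned projections that produce queries, keys, and values are left out.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(queries, keys, values):
    """Scaled dot-product attention for one head (inputs are lists of vectors)."""
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return outputs

# Two identical keys receive equal weight, so the output is the mean of the values.
out = attention_head([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 0.0]])
print(out)
```

A multi-head layer would run several such heads on differently projected inputs and concatenate their outputs.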

Math

Greek Alphabet

Alpha (Α, α)
Beta (Β, β)
Chi (Χ, χ)
Delta (Δ, δ)
Epsilon (Ε, ε)
Eta (Η, η)
Gamma (Γ, γ)
Iota (Ι, ι)
Kappa (Κ, κ)
Lambda (Λ, λ)
Mu (Μ, μ)
Nu (Ν, ν)
Omega (Ω, ω)
Omicron (Ο, ο)
Phi (Φ, φ)
Pi (Π, π)
Psi (Ψ, ψ)
Rho (Ρ, ρ)
Sigma (Σ, σ/ς)
Tau (Τ, τ)
Theta (Θ, θ)
Upsilon (Υ, υ)
Xi (Ξ, ξ)
Zeta (Ζ, ζ)

Softmax

Softmax is a function that takes as input a vector of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers.
output_i = e^(x_i) / sum_j(e^(x_j))
example: [1, 2, 3] -> [0.09, 0.24, 0.67]
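The softmax example can be reproduced with a short sketch. Subtracting the maximum before exponentiating is an added numerical-stability step, not part of the definition; it leaves the result unchanged.

```python
import math

def softmax(xs):
    # Shifting by the max avoids overflow in exp() for large inputs.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1, 2, 3])
print([round(p, 2) for p in probs])  # [0.09, 0.24, 0.67]
```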

Sigmoid

Sigmoid is a function that takes as input a real number and squashes it into the range (0, 1). The output can be read as the probability p of the positive class in a binary problem, with 1 - p left for the other class.
output = 1 / (1 + e^-x)
example: 0 -> 0.5
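A small sketch of the sigmoid formula. Branching on the sign of x is an added precaution so that exp() is never called with a large positive argument; both branches are algebraically the same function.

```python
import math

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Equivalent form for negative x: e^x / (1 + e^x).
    z = math.exp(x)
    return z / (1.0 + z)

print(sigmoid(0))  # 0.5
```

Because sigmoid(x) + sigmoid(-x) = 1, the two class probabilities always sum to one.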

Cross Entropy

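Cross Entropy is a loss function that measures the performance of a classification model whose outputs are probabilities between 0 and 1. Here y is the true distribution (usually one-hot), y_hat is the predicted distribution, and log is the natural logarithm. The loss grows as the probability assigned to the true class falls.
output = -sum(y * log(y_hat))
example: y = [0, 1, 0], y_hat = [0.09, 0.24, 0.67] -> -log(0.24) = 1.43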
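The cross-entropy formula can be sketched as below, using the natural logarithm. The `eps` clamp is an assumption added to avoid log(0) and is not part of the formula itself.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    # Only classes with non-zero true probability contribute to the sum.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

loss = cross_entropy([0, 1, 0], [0.09, 0.24, 0.67])
print(round(loss, 2))  # 1.43
```

With a one-hot target the loss reduces to the negative log of the probability assigned to the true class.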

Logarithms

Logarithm Basics

A logarithm is the inverse of exponentiation. It answers the question: to what exponent must we raise a base number to produce a given number? Expressed as log_b(x) for base b and number x.

Natural Logarithm

The natural logarithm, denoted as ln(x), uses the base e (Euler's number, approximately 2.71828). It's crucial in calculus and mathematical modeling.

Logarithm Properties: Product Rule

Logarithm of a product: log_b(xy) equals the sum of the logarithms of x and y, i.e., log_b(x) + log_b(y).

Logarithm Properties: Quotient Rule

Logarithm of a quotient: log_b(x/y) equals the logarithm of x minus the logarithm of y, i.e., log_b(x) - log_b(y).

Logarithm Properties: Power Rule

Logarithm of a power: log_b(x^y) equals y times the logarithm of x, i.e., y * log_b(x).

Change of Base Formula

To change the base of a logarithm: log_b(x) = log_c(x) / log_c(b), where c is any positive number other than 1 (usually 10 or e).
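The product, quotient, and power rules and the change of base formula can be checked numerically. This is a sketch; `log_base` is a helper name introduced here for illustration.

```python
import math

def log_base(x, b, c=math.e):
    """log_b(x) computed through an intermediate base c (change of base)."""
    return math.log(x, c) / math.log(b, c)

x, y, b = 8.0, 32.0, 2.0
# Product, quotient, and power rules.
assert math.isclose(log_base(x * y, b), log_base(x, b) + log_base(y, b))
assert math.isclose(log_base(x / y, b), log_base(x, b) - log_base(y, b))
assert math.isclose(log_base(x ** 3, b), 3 * log_base(x, b))
# The choice of intermediate base does not change the result.
assert math.isclose(log_base(x, b, c=10), log_base(x, b, c=math.e))
print(round(log_base(8, 2), 6))  # 3.0
```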

Solving Logarithmic Equations

Solving logarithmic equations often involves rewriting them in exponential form or using logarithm properties for simplification.
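As a worked illustration (the equation is chosen here as an example), the product rule followed by a rewrite in exponential form solves log_2(x) + log_2(x - 2) = 3:

```latex
\begin{align*}
\log_2(x) + \log_2(x - 2) &= 3 \\
\log_2\bigl(x(x - 2)\bigr) &= 3 && \text{(product rule)} \\
x(x - 2) &= 2^3 = 8 && \text{(exponential form)} \\
x^2 - 2x - 8 &= 0 \\
(x - 4)(x + 2) &= 0
\end{align*}
```

Only x = 4 is a solution, because x = -2 makes log_2(x) undefined. Check: log_2(4) + log_2(2) = 2 + 1 = 3.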

Logarithmic Scale

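A logarithmic scale is non-linear: equal distances along it represent equal ratios rather than equal differences, which makes it useful for values that span many orders of magnitude. Examples include the Richter scale for earthquakes and the decibel scale for sound intensity.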

Instance-based Algorithms

K-Nearest Neighbors Algorithm (KNN)

KNN is a non-parametric, lazy learning algorithm: it stores the training data and defers all computation to prediction time. It classifies a new case by a majority vote among the k most similar training cases, using a similarity measure (e.g., a distance function). KNN has long been used in statistical estimation and pattern recognition.
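A minimal KNN classifier sketch, assuming Euclidean distance and a simple majority vote. The data layout (a list of `(features, label)` pairs) and the toy points are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; query: a feature list."""
    # "Lazy" learning: all work happens here, at prediction time.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    top_labels = [label for _, label in by_distance[:k]]
    return Counter(top_labels).most_common(1)[0][0]

train = [([0, 0], "a"), ([0, 1], "a"), ([1, 0], "a"),
         ([5, 5], "b"), ([5, 6], "b"), ([6, 5], "b")]
print(knn_predict(train, [0.2, 0.1]))  # a
print(knn_predict(train, [5.5, 5.5]))  # b
```

Choosing an odd k for two-class problems avoids tied votes.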

Learning Vector Quantization (LVQ)

LVQ is a type of artificial neural network and a prototype-based supervised classification algorithm. It works by finding a set of prototypes representing each class in the feature space, and classifies vectors based on the closest prototype.
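The prototype idea can be sketched with the basic LVQ1 update rule, which is one common variant and is assumed here: the prototype nearest to a training sample moves toward it when their classes match and away from it otherwise. The function names, data layout, and learning rate are illustrative.

```python
import math

def lvq1_step(prototypes, x, label, lr=0.1):
    """One LVQ1 update. prototypes: list of [vector, class_label], updated in place."""
    idx = min(range(len(prototypes)), key=lambda i: math.dist(prototypes[i][0], x))
    vec, proto_label = prototypes[idx]
    # Attract the winner if it has the right class, repel it otherwise.
    sign = 1.0 if proto_label == label else -1.0
    prototypes[idx][0] = [v + sign * lr * (xi - v) for v, xi in zip(vec, x)]
    return idx

def lvq_classify(prototypes, x):
    # Predict the class of the closest prototype.
    return min(prototypes, key=lambda p: math.dist(p[0], x))[1]

protos = [[[0.0, 0.0], "a"], [[4.0, 4.0], "b"]]
lvq1_step(protos, [1.0, 1.0], "a", lr=0.5)
print(protos[0][0])                      # [0.5, 0.5]
print(lvq_classify(protos, [3.0, 3.0]))  # b
```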

Self-Organizing Map (SOM)

A SOM is a type of unsupervised learning algorithm used to produce a low-dimensional (typically two-dimensional), discretized representation of higher-dimensional data, preserving topological properties. It's useful for visualization and exploring complex data structures.

Regression Analysis

Logistic Regression

Logistic Regression is used for binary classification problems. It models the probability of a binary outcome by applying the sigmoid function to a linear combination of one or more predictor variables.
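A one-feature sketch of logistic regression, fitted by plain gradient descent on the cross-entropy loss. The learning rate, epoch count, and toy data are assumptions made for illustration; real implementations usually use more robust optimizers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors (p - y) drive both gradients.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(sigmoid(w * 1.0 + b) < 0.5, sigmoid(w * 4.0 + b) > 0.5)
```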

Ordinary Least Squares Regression (OLSR)

OLSR is a type of linear regression that estimates the parameters by minimizing the sum of the squared differences between observed and predicted values.
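For a single predictor, the least-squares estimates have a closed form. The sketch below assumes one feature plus an intercept; the function name and data are illustrative.

```python
def ols_fit(xs, ys):
    """Slope and intercept minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

print(ols_fit([0, 1, 2, 3], [1, 3, 5, 7]))  # (2.0, 1.0)
```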

Linear Regression

Linear Regression is a basic form of regression analysis that models the relationship between a dependent variable and one or more independent variables using a linear function.

Stepwise Regression

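Stepwise Regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure: at each step a variable is added to or removed from the model according to a prespecified criterion.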

Multivariate Adaptive Regression Splines (MARS)

MARS is a non-parametric regression technique that models nonlinear relationships and interactions that are not well captured by traditional linear or polynomial models. It builds the fit from piecewise-linear hinge functions.

Ridge Regression

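Ridge Regression is a method of estimating the coefficients of multiple-regression models in scenarios where independent variables are highly correlated. It adds an L2 penalty on the coefficients, which shrinks them toward zero and stabilizes the estimates.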

Least Absolute Shrinkage and Selection Operator (LASSO)

LASSO is a regression analysis method that performs both variable selection and regularization to enhance the prediction accuracy and interpretability of the statistical model. It uses an L1 penalty, which can shrink some coefficients exactly to zero.

Elastic Net

Elastic Net is a regularized regression method that linearly combines the L1 and L2 penalties of the LASSO and Ridge methods.
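The effect of the L1 and L2 penalties is easiest to see with a single feature and no intercept, where the minimizer has a closed form. The sketch assumes the objective 0.5 * sum((y - w*x)^2) + l1 * |w| + 0.5 * l2 * w^2; the scaling of the penalties and the function names are choices made here for illustration.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within [-lam, lam] becomes 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def penalized_slope(xs, ys, l1=0.0, l2=0.0):
    """One-feature minimizer of 0.5*sum((y - w*x)^2) + l1*|w| + 0.5*l2*w^2."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return soft_threshold(sxy, l1) / (sxx + l2)

xs, ys = [-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0]
print(penalized_slope(xs, ys))                  # 2.0   (no penalty)
print(penalized_slope(xs, ys, l2=2.0))          # 1.0   (Ridge: shrinks)
print(penalized_slope(xs, ys, l1=1.0))          # 1.5   (LASSO: shrinks)
print(penalized_slope(xs, ys, l1=5.0))          # 0.0   (LASSO: drops the feature)
print(penalized_slope(xs, ys, l1=1.0, l2=2.0))  # 0.75  (Elastic Net: both)
```

The L2 term only rescales the coefficient, while the L1 term can set it exactly to zero, which is why LASSO and Elastic Net perform variable selection and Ridge does not.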

Least-angle Regression (LARS)

LARS is a regression algorithm for high-dimensional data, efficiently computing the entire regularization path for the LASSO.