Study Hub — CSCI 443 · Advanced Data Science (Midterm Prep, 70+ Questions) and CSCI 112 · Intro to Java (Exam 1 Prep, 55+ Questions)

CSCI 443 Quiz

Advanced Data Science · Midterm Prep · Every Homework Question


Types of Data

How data is classified determines which statistical methods are appropriate to use.

Two top-level categories: Categorical and Numeric. Categorical splits into Ordinal and Nominal. Numeric splits into Continuous and Discrete.

Categorical Data

Categorical Data
Data divided into groups or categories. Values are labels or names, not measurements. Arithmetic (+, ×) on categories is meaningless.
Species (cat, dog), colors (red, blue), garment sizes (S, M, L), ice cream flavors.
Ordinal Categorical Data
Categorical data with a meaningful natural ordering. You can say one value comes before or after another, but differences between levels aren't necessarily equal and arithmetic still doesn't apply.
{S, M, L, XL}, {January, February, …, December}, {Strongly Disagree → Strongly Agree}, {1st, 2nd, 3rd place}.
Nominal Categorical Data
Categorical data with no meaningful ordering. Labels are purely names — you cannot say one is greater than another in any inherent sense.
{Cat, Dog, Fish}, {Vanilla, Chocolate, Strawberry}, {Truck, Car}, {Red, Blue, Green}.
Common trap: Just because you CAN alphabetize categories doesn't make them ordinal. "Chocolate" before "Vanilla" alphabetically is not a meaningful ordering. Ordinal requires an ordering inherent in the concept (e.g., size, rank, time).

Numeric Data

Continuous Numeric Data
Can take any real value within a range — infinitely many possible values between any two points. Measured, not counted. Described by a Probability Density Function (PDF).
Height (1.7312 m), temperature (−3.5°C), weight, time elapsed, electrical voltage.
Discrete Numeric Data
Can only take countable, distinct values — typically non-negative integers. You cannot have a fractional value. Described by a Probability Mass Function (PMF).
Number of customers (0, 1, 2, …), die roll result (1–6), emails per day, defective items in a batch.

Quick Reference Table

Type                | Ordered?         | Arithmetic? | Examples
Ordinal Categorical | Yes — meaningful | No          | S/M/L, months, survey ratings
Nominal Categorical | No               | No          | Colors, flavors, pet species
Discrete Numeric    | Yes              | Yes         | Counts, die rolls, number of heads
Continuous Numeric  | Yes              | Yes         | Height, temperature, time

Random Experiments

The foundation of probability — understanding how uncertainty arises and is formalized.

Random Experiment
A process whose outcome is uncertain before it is performed, with several possible results. Must be repeatable (in principle) under the same conditions.
Rolling a die, drawing a card, flipping a coin, measuring tomorrow's temperature.
Trial
A single execution of the random experiment. Each trial produces exactly one outcome.
One flip of a coin = one trial. Each card draw from a deck = one trial.
Outcome
The result of a single trial. It is one element of the sample space.
Rolling a 3 on a die is an outcome. "Heads" on a coin flip is an outcome.
Sample Space (S or Ω)
The complete set of all possible outcomes of a random experiment. Every conceivable result must be listed.
Coin flip: S = {Heads, Tails}. Six-sided die: S = {1, 2, 3, 4, 5, 6}. Card draw: S = all 52 cards.
NOT a random experiment: Computing the area of a 1 m × 1 m square (result = 1 m² always — deterministic). For a random experiment, the outcome must be genuinely uncertain before the trial.
A computed mean is NOT an outcome: A class's average satisfaction score is a statistic aggregated from many measurements — not the result of a single trial.

Random Variables

A mathematical bridge that maps experimental outcomes to numbers on the real line.

Random Variable (X)
A function that maps outcomes to real numbers. For numeric data the mapping is direct (die shows 3 → X=3). For categorical data we assign numeric codes. We use capital letters (X, Y, Z) to denote random variables.

Discrete vs Continuous

Discrete Random Variable
Takes a countable set of distinct values (usually integers). There are gaps between possible values. Described by a PMF.
Number of heads in 10 flips (0–10), die roll (1–6), daily customer count (0, 1, 2, …).
Continuous Random Variable
Can take any real value in an interval — infinitely many possibilities. The probability of any exact value is 0; probabilities are computed over intervals. Described by a PDF.
Height, time between arrivals, temperature, weight, electrical resistance.
Quick test: Can the variable take fractional values meaningfully? Yes → continuous. Is it always a whole number by nature? → discrete. "3.7 customers" is impossible → discrete. "3.7 minutes" is perfectly valid → continuous.

Probability Functions

PMF — Probability Mass Function
For discrete variables. \(P[X=i]\) = probability X equals exactly i. Properties: P[X=i] ≥ 0 for all i, and \(\sum_{i \in S} P[X=i] = 1\).
PDF — Probability Density Function
For continuous variables. f(x) is probability density, not probability. Probability is the area under the curve: \(P[a \leq X \leq b] = \int_a^b f(x)\,dx\). Total area must equal 1.
CDF — Cumulative Distribution Function
\(F(x) = P[X \leq x]\). The probability that the variable is at most x. Always non-decreasing from 0 to 1. The CDF of the standard normal is written Φ(x).
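To make the PMF/PDF distinction concrete, here is a minimal plain-Python sketch (no libraries, example distributions chosen for illustration): a fair-die PMF must sum to 1, while for a density the probability is the area over an interval, not the height of f.

```python
# PMF of a fair six-sided die: each probability is non-negative and they sum to 1
pmf = {i: 1 / 6 for i in range(1, 7)}
assert all(p >= 0 for p in pmf.values())
assert abs(sum(pmf.values()) - 1.0) < 1e-12

# PDF of a hypothetical Uniform[0, 2]: f(x) = 1/(b-a) is a density, not a probability
a, b = 0.0, 2.0
density = 1 / (b - a)                # 0.5 everywhere on [0, 2]
prob = (1.5 - 0.5) * density         # P[0.5 <= X <= 1.5] = area under f = 0.5
print(prob)                          # 0.5
```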

Events

Any subset of the sample space we care about assigning a probability to.

Event (E)
Any subset of the sample space. An event "occurs" when the outcome of a trial falls within that subset.
Die roll S = {1,2,3,4,5,6}. Event "roll even" = {2,4,6}. Event "roll a 3" = {3}. Event "roll ≥ 4" = {4,5,6}.
Special events:
• Empty set ∅ — the impossible event, P[∅] = 0.
• The full sample space S — the certain event, P[S] = 1.
• Events can be infinite: "more than 100 customers" = {101, 102, 103, …} is a valid event.
Event ≠ Outcome. An outcome is a single element of S. An event is a subset (could contain one element, many, or none). "Rolling a 3" as an outcome is just 3; as an event it is {3}.

Axioms of Probability

Three Basic Axioms (Kolmogorov)
1. P[E] ≥ 0 for every event E.
2. P[S] = 1 (something always happens).
3. For mutually exclusive events E₁, E₂ (no overlap): P[E₁ ∪ E₂] = P[E₁] + P[E₂]. (This extends to any countable sequence of pairwise disjoint events.)

Distributions

How probability is spread across the possible values of a random variable.

Distribution
An assignment of probabilities to all possible outcomes. Requirements: (1) each probability ≥ 0, (2) all probabilities sum (or integrate) to 1.

Uniform Distribution

Discrete Uniform — all outcomes equally likely
For n outcomes: P[X = i] = 1/n for every outcome i.
Fair six-sided die: P[X=1] = … = P[X=6] = 1/6.
Continuous Uniform[a, b]
Constant density over [a, b], zero outside. PDF: \(f(x) = \tfrac{1}{b-a}\) for \(a \leq x \leq b\).
\[\mu = \frac{a+b}{2} \qquad \sigma^2 = \frac{(b-a)^2}{12}\]
Uniform[20,40]: mean = (20+40)/2 = 30, variance = (20)²/12 = 400/12 ≈ 33.33.
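The Uniform[20, 40] numbers above can be checked empirically with NumPy; the seed and sample size are arbitrary choices for the demonstration.

```python
import numpy as np

a, b = 20, 40
mean = (a + b) / 2                 # 30.0
var = (b - a) ** 2 / 12            # 400/12 ≈ 33.33

# Empirical check with a large sample (seed chosen arbitrarily)
rng = np.random.default_rng(0)
x = rng.uniform(a, b, 1_000_000)
print(mean, var)                   # closed-form values
print(x.mean(), x.var())           # close to the closed-form values
```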

Gaussian (Normal) Distribution N(μ, σ)

Gaussian / Normal Distribution
The classic "bell curve." Two parameters: mean μ (center/location) and standard deviation σ (spread). Symmetric around μ. Total area under curve = 1.
\[f(x \mid \mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]
Key Properties of the Gaussian
• Increasing σ → curve becomes wider and flatter (same total area = 1).
• Increasing μ → curve shifts right; shape unchanged.
• Empirical rule: ~68% of data within 1σ, ~95% within 2σ, ~99.7% within 3σ.
• Notation: N(μ, σ) means std dev = σ, so variance = σ².
As sample size grows (100 → 1,000 → 10,000), a relative-frequency histogram converges to the true PDF — this is the Law of Large Numbers in action.

Exponential Distribution (λ)

Exponential Distribution
Models time between events in a Poisson process (arrivals, radioactive decay). Rate parameter λ. PDF: \(f(x) = \lambda e^{-\lambda x}\) for x ≥ 0.
\[\mu = \frac{1}{\lambda} \qquad \sigma^2 = \frac{1}{\lambda^2}\]
f(x) = 0.1e^{−0.1x} → λ = 0.1, mean = 1/0.1 = 10.
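A quick NumPy check of the λ = 0.1 example; note that NumPy parameterizes the exponential by scale = 1/λ rather than by λ itself.

```python
import numpy as np

lam = 0.1
rng = np.random.default_rng(1)
# NumPy's exponential sampler takes scale = 1/lambda
x = rng.exponential(scale=1 / lam, size=1_000_000)
print(x.mean())   # close to 1/lambda = 10
print(x.var())    # close to 1/lambda**2 = 100
```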

Triangular Distribution Tri(a, b, c)

Triangular Distribution
Defined on [a, b] with peak (mode) at c. PDF rises linearly from a to c, falls linearly from c to b.
\[\mu = \frac{a+b+c}{3} \qquad \sigma^2 = \frac{a^2+b^2+c^2-ab-ac-bc}{18}\]

Absolute vs Relative Frequency

Absolute Frequency
Raw count of samples in a bin.
30 of 200 samples fall in bin [10,15] → absolute frequency = 30.
Relative Frequency
Count ÷ total samples = proportion. Approximates probability density as sample size → ∞.
30 / 200 = 0.15 = 15% relative frequency for that bin.

Descriptive Statistics

Measures that summarize the center and spread of a dataset.

Range

\[\text{Range} = \max(S) - \min(S)\]
S=[−1,3,−4,2,6]: Range = 6 − (−4) = 10.

Mean (Arithmetic Average)

Mean x̄
Sum of all values divided by n. Sensitive to outliers — a single extreme value can pull it far from where most data sits.
\[\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\]
S=[−1,3,−4,2,6]: Sum = 6, Mean = 6/5 = 1.2.

Median

Median
Middle value after sorting. Robust to outliers — extreme values in the tails do not affect it.

• n odd: median = element at index ⌊n/2⌋ (0-indexed).
• n even: median = average of elements at indices n/2−1 and n/2.
Sorted [−4,−1,2,3,6] (n=5 odd) → Median = 2.   Sorted [−1,0,7,7,9,10] (n=6 even) → Median = (7+7)/2 = 7.
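Both median cases can be verified with NumPy, which sorts the data internally before applying the odd/even rule.

```python
import numpy as np

odd = [-1, 3, -4, 2, 6]            # sorted: [-4, -1, 2, 3, 6]
even = [-1, 0, 7, 7, 9, 10]        # already sorted
print(np.median(odd))              # 2.0  (middle element, n odd)
print(np.median(even))             # 7.0  (average of the two middle 7s)
```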

Percentile — Linear Interpolation Method

p-th Percentile (NumPy "linear" default)
Steps:
1. Sort the data into S′.
2. Fractional index (0-indexed): \(i = \tfrac{p}{100} \times (n-1)\)
3. If i is an integer → P_p = S′[i].
4. If i is fractional: \(P_p = S'[\lfloor i\rfloor] + (i - \lfloor i\rfloor)\cdot(S'[\lceil i\rceil] - S'[\lfloor i\rfloor])\)
S′=[−4,−1,2,3,6], n=5. 20th %ile: i = 0.20×4 = 0.8. Between S′[0]=−4 and S′[1]=−1: −4 + 0.8×(−1−(−4)) = −4 + 2.4 = −1.6.
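NumPy's default percentile method ("linear") is exactly this interpolation, so it reproduces the worked example (up to floating-point rounding).

```python
import numpy as np

s = [-1, 3, -4, 2, 6]
# i = 0.20 * (5 - 1) = 0.8 → interpolate between sorted[0] and sorted[1]
p20 = np.percentile(s, 20)
print(p20)   # ≈ -1.6
```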

Trimmed Mean

Trimmed Mean
Remove a fixed proportion from each tail of the sorted data, then compute the mean of what remains. Robust to extreme outliers in the tails. scipy.stats.trim_mean(data, p) removes proportion p from each end.
S = [−1,0,3,6,7,7,8,9,10,1000], 10% trimmed: remove 1 from each end → [0,3,6,7,7,8,9,10], mean = 50/8 = 6.25.
Robustness ranking (most to least robust): Median ≈ Trimmed Mean > Mean. Extreme percentiles (90th, 95th) are unreliable when outliers sit in the tails.
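The same worked example in code, using the scipy.stats.trim_mean function mentioned above:

```python
import numpy as np
from scipy.stats import trim_mean

s = [-1, 0, 3, 6, 7, 7, 8, 9, 10, 1000]
print(trim_mean(s, 0.10))   # 6.25: one value trimmed from each tail
print(np.mean(s))           # 104.9: untrimmed mean dragged up by the outlier
```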

Measures of Dispersion

How spread out the data is — and how robust each measure is to outliers.

Why not use the sum of deviations? It always equals zero:
\(\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}) = \bar{X} - \bar{X} = 0\)
Positive and negative deviations cancel perfectly, so we need a measure in which every deviation contributes a non-negative amount.

Mean Absolute Deviation

\[\text{Mean Abs Dev} = \frac{1}{n}\sum_{i=1}^n |X_i - \bar{X}|\]
Average of absolute deviations from the mean. Uses absolute values so nothing cancels. Less sensitive to large errors than variance, but not as robust as Median Absolute Deviation.

Median Absolute Deviation (MAD)

\[\text{MAD} = \text{median}(|x_i - \text{median}(x)|)\]
The median of the absolute deviations from the median. Very robust — uses the median twice (once to center, once to summarize). Outliers have minimal impact.
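MAD is a one-liner in NumPy; here it is applied to the outlier-laden dataset used elsewhere in these notes.

```python
import numpy as np

x = np.array([-1, 0, 3, 6, 7, 7, 8, 9, 10, 1000])
mad = np.median(np.abs(x - np.median(x)))   # median of |x_i - median(x)|
print(mad)   # 2.5: the outlier 1000 barely moves it
```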

Variance & Standard Deviation

Population Variance σ² (use when you have ALL the data)
Divide by n (the full population size).
\[\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2\]
Sample Variance s² — Bessel's Correction (use for a sample)
Divide by (n−1). Corrects the underestimation bias introduced by using x̄ (estimated from the same data) instead of the true μ.
\[s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2\]
Units of variance: If data is in °C, variance is in °C². This is why variance is hard to interpret — the unit is squared. Standard deviation (√variance) restores the original unit.

Interquartile Range (IQR)

\[\text{IQR} = Q_3 - Q_1 = P_{75} - P_{25}\]
Range of the middle 50% of data. Most robust of all dispersion measures — extreme values in the tails have zero effect because they aren't used in the computation.
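With NumPy, the IQR is two percentile calls; on the outlier dataset from earlier, the 1000 has no effect at all.

```python
import numpy as np

x = [-1, 0, 3, 6, 7, 7, 8, 9, 10, 1000]
q1, q3 = np.percentile(x, [25, 75])
print(q3 - q1)   # 5.0: replacing 1000 with any large value would not change it
```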

Robustness Comparison

Measure              | Robustness to Outliers | Why
IQR                  | Most robust            | Uses only middle 50%; ignores tails
Median Abs Dev (MAD) | Very robust            | Uses median twice
Mean Abs Dev         | Moderate               | Uses the mean, which outliers pull
Sample Std Dev (s)   | Sensitive              | Squared deviations amplify outliers
Variance (s²)        | Most sensitive         | Squaring makes one outlier dominant

Box-and-Whisker Plot (Tukey Convention)

Box Plot
• Box spans Q1 to Q3 (the IQR).
• Line inside the box = median (Q2).
• Whiskers extend to the last data point within 1.5×IQR of Q1 (lower) and Q3 (upper).
• Points beyond whiskers are plotted as individual dots — these are flagged as outliers.

Outliers & Skew

How extreme values distort statistics and reveal the shape of a distribution.

Skewness

Right Skew (Positive Skew)
Long tail on the right. Large outliers pull the mean right of the median. Rule of thumb: mean > median → right skew.
S=[3,8,6,9,−1,10,1000,7,7,0]: mean=104.9, median=7. Mean ≫ median → strongly right skewed.
Left Skew (Negative Skew)
Long tail on the left. Small outliers pull the mean below the median. Rule of thumb: mean < median → left skew.
Symmetric (No Skew)
Mean ≈ median. The Gaussian distribution is perfectly symmetric.

Effect of One Outlier on Each Statistic

Statistic                | Outlier Effect                        | Robust?
Mean                     | Strongly pulled toward outlier        | No
Median                   | Little to no effect                   | Yes
Trimmed Mean (10%)       | Removed if outlier is in the tail     | Yes
Variance / Std Dev       | Heavily inflated (squared deviations) | No
IQR                      | No effect (outlier is in the tail)    | Yes
Extreme %iles (90th+)    | Heavily distorted by interpolation    | No
Median percentile (50th) | Same as median — robust               | Yes
Extreme percentiles near outliers are unreliable. If the 90th percentile calculation interpolates between a normal data point and an outlier, the result is wildly inflated. Be cautious with percentiles above ~85th when outliers may exist.
Worked example — S = [3,8,6,9,−1,10,1000,7,7,0], sorted: [−1,0,3,6,7,7,8,9,10,1000]
Mean = 1049/10 = 104.9  |  Median = (7+7)/2 = 7
80th %ile: i = 0.8×9 = 7.2 → 9 + 0.2×(10−9) = 9.2
90th %ile: i = 0.9×9 = 8.1 → 10 + 0.1×(1000−10) = 109 ← distorted by outlier!
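The worked example can be reproduced with NumPy (floating-point output may differ in the last digits):

```python
import numpy as np

s = [3, 8, 6, 9, -1, 10, 1000, 7, 7, 0]
print(np.mean(s))              # ≈ 104.9
print(np.median(s))            # 7.0
print(np.percentile(s, 80))    # ≈ 9.2  (interpolates between ordinary points)
print(np.percentile(s, 90))    # ≈ 109  (interpolates into the outlier)
```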

Population vs Sample Statistics

The critical difference in formulas — and why Bessel's correction exists.

Population
The complete set of all subjects of interest. You know the true parameters (μ, σ²). Use N in the denominator.
Sample
A subset drawn to make inferences about the whole. Sample statistics estimate population parameters. Use (n−1) in variance (Bessel's correction).

Formulas Side by Side

Statistic | Population                               | Sample
Mean      | \(\mu = \frac{1}{N}\sum X_i\)            | \(\bar{x} = \frac{1}{n}\sum X_i\)
Variance  | \(\sigma^2 = \frac{\sum(X_i-\mu)^2}{N}\) | \(s^2 = \frac{\sum(X_i-\bar{x})^2}{n-1}\)
Std Dev   | \(\sigma = \sqrt{\sigma^2}\)             | \(s = \sqrt{s^2}\)
Bessel's Correction — Why n−1?
When computing sample variance we use x̄ (estimated from the same data) instead of the true μ. The sample mean always sits at the center of the sample, making deviations from x̄ systematically smaller than deviations from the true μ. Dividing by (n−1) instead of n corrects this downward bias, making s² an unbiased estimator of σ².
Worked — S = {4, 7, 9, 12, 15, −9}, mean = 38/6 ≈ 6.33
Sum of squared deviations ≈ 355.33
Population variance (÷6) ≈ 59.22  |  Sample variance (÷5) ≈ 71.07
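NumPy exposes both denominators through the ddof argument (ddof=0 → divide by n, ddof=1 → divide by n−1), which reproduces the worked example:

```python
import numpy as np

s = np.array([4, 7, 9, 12, 15, -9])
print(np.var(s, ddof=0))   # ≈ 59.22, population variance (divide by n)
print(np.var(s, ddof=1))   # ≈ 71.07, sample variance with Bessel's correction
```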

Z-scores, Φ (Phi) & Erf

Computing probabilities under any Gaussian by standardizing to N(0,1).

Z-Score

Z-Score (Standardized Score)
Converts a raw value x from X~N(μ,σ) to a standard normal score: how many standard deviations x is above or below the mean.
\[Z = \frac{x - \mu}{\sigma}\]
Test: μ=500, σ=100, x=620 → Z = (620−500)/100 = 1.2. John is 1.2 standard deviations above the mean.

Φ (Phi) — CDF of N(0,1)

Φ(z) — Phi Function
\(\Phi(z) = P[Z \leq z]\) where Z~N(0,1). It's the cumulative area under the standard normal bell curve from −∞ to z. To use for any Gaussian: compute Z first, then look up Φ(Z).
Φ(0) = 0.500, Φ(1) ≈ 0.841, Φ(1.2) ≈ 0.885, Φ(2) ≈ 0.977.

Error Function (erf)

\[\text{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt\]
\[\Phi(z) = \frac{1}{2}\!\left[1 + \text{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right]\]

Computing Probabilities

P(X < x) — One-tailed left
Compute Z = (x−μ)/σ, then P = Φ(Z).
P(score < 620): Z=1.2, Φ(1.2) ≈ 88.5%.
P(X > x) — One-tailed right
P(X > x) = 1 − Φ(Z).
P(a < X < b) — Interval probability
P(a < X < b) = Φ(Z_b) − Φ(Z_a), where Z_a=(a−μ)/σ and Z_b=(b−μ)/σ.
P(|X−μ| > k·σ) — Two-tailed outside ±kσ
P = 1 − erf(k/√2).
Steel rods: μ=100, σ=2. P(X<96 or X>104): k=2. P = 1−erf(√2) ≈ 1−0.9545 = 4.55%.
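Both worked examples follow from the Φ/erf identity above, using only Python's math module:

```python
from math import erf, sqrt

def phi(z):
    # CDF of N(0,1) expressed through the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

# One-tailed: P(score < 620) with mu=500, sigma=100
z = (620 - 500) / 100
print(phi(z))                  # ≈ 0.885

# Two-tailed: steel rods, P(|X - 100| > 2*sigma) with k=2
print(1 - erf(2 / sqrt(2)))    # ≈ 0.0455
```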

Key Φ Values to Memorize

Z     | Φ(Z) ≈ | Meaning
0     | 0.500  | 50% below mean
1.0   | 0.841  | ±1σ covers 68.3% of data
1.2   | 0.885  | John's exam example
1.645 | 0.950  | 95th percentile of N(0,1); 90% CI boundary (two-sided)
1.96  | 0.975  | 95% CI boundary (two-sided)
2.0   | 0.977  | ±2σ covers 95.4% of data
3.0   | 0.9987 | ±3σ covers 99.7% of data

Covariance & Pearson Correlation

Measuring how two variables move together — direction and strength.

Covariance

\[\text{Cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})\]
Interpreting the Sign of Cov(X,Y)
Cov > 0: When X is above its mean, Y tends to be above its mean (move together).
Cov < 0: When X is high, Y tends to be low (move opposite).
Cov ≈ 0: No consistent linear relationship.
Covariance is scale-dependent — its magnitude depends on the units of X and Y. A covariance of 425 between height(cm) and weight(kg) is not comparable to 425 between two other variables. That is why we normalize to get Pearson r.

Pearson Correlation Coefficient (r)

\[r = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y}\]
Normalizes covariance by the product of standard deviations. Always in [−1, +1]. Scale-independent (unit-free).
r value  | Interpretation
+1       | Perfect positive linear relationship
0.7 to 1 | Strong positive correlation
0 to 0.3 | Weak positive correlation
≈ 0      | No linear relationship (may have nonlinear!)
−1 to 0  | Negative correlation
−1       | Perfect negative linear relationship
r = 0 does NOT mean no relationship! Y = X² gives r ≈ 0 (symmetric parabola — positive and negative halves cancel). Pearson r only detects linear relationships. Always plot the data with a scatter plot.
Y = constant → r is undefined. σ_Y = 0 means division by zero in the formula. A constant has no linear relationship with anything.
Worked — X=[1,3,5,7], Y=[8,6,5,3], μ_X=4, μ_Y=5.5
Deviations X: −3,−1,+1,+3  |  Deviations Y: +2.5,+0.5,−0.5,−2.5
Products: −7.5, −0.5, −0.5, −7.5 → sum = −16
Cov = −16/3 ≈ −5.33  |  σ_X=√(20/3), σ_Y=√(13/3)
r = −16/√260 ≈ −0.992 (strong negative).
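NumPy reproduces the worked covariance and correlation (np.cov uses the n−1 denominator by default):

```python
import numpy as np

x = np.array([1, 3, 5, 7])
y = np.array([8, 6, 5, 3])
cov = np.cov(x, y)[0, 1]        # off-diagonal entry of the covariance matrix
r = np.corrcoef(x, y)[0, 1]
print(cov)   # ≈ -5.333
print(r)     # ≈ -0.992
```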

Bias & Error Types

Understanding systematic and random sources of error in data collection.

Random Error
Unpredictable, unsystematic variation in measurements. Averages out over many trials. Not a flaw — every measurement has some random component.
A scale gives slightly different readings each time you weigh the same 100 g object due to vibrations.
Systematic Error (Bias)
Consistent, repeatable error in the same direction. Does not average out. Indicates a flaw in the study design, instrument, or data collection process.
A miscalibrated scale that always reads 5% too high.

Types of Bias

Measurement Bias
Systematic error in how data is measured — the instrument or procedure introduces consistent errors in one direction.
A laser rangefinder miscalibrated by ×1.2 → all reported distances are 20% too large.
Observer Bias
The researcher's expectations or prior beliefs influence how they collect, record, or interpret data. Also called experimenter bias or confirmation bias.
A police officer who expects a high-crime neighborhood interprets ambiguous behavior as suspicious.
Selection Bias
The sample is not representative of the target population because some groups are systematically over- or under-represented in the selection process.
A pollster calls only landlines during daytime → misses mobile-only users and working people → biased sample.
Non-Response Bias
Participants who do not respond differ systematically from those who do, making the final sample non-representative even if selection was random.
University survey with only 30% response rate — students with strong opinions (positive or negative) are more likely to respond.
Self-Reporting Bias
Participants inaccurately report their own behaviors or attitudes due to memory errors, misunderstanding, or discomfort with truthful answers.
Participants in a weight-loss study self-report monthly weight loss — imperfect memory leads to inaccuracies.
Social Desirability Bias
Participants respond in ways they believe are socially acceptable or will reflect well on them, rather than truthfully.
People over-report exercise frequency and charitable donations; under-report alcohol use and unhealthy eating.
Multiple biases can co-occur. Self-reporting bias and social desirability bias often appear together. Selection bias and non-response bias can affect the same study simultaneously.

Quick Reference

Bias Type           | Who is biased?       | Key Signal
Measurement         | The instrument       | Consistent systematic offset in all readings
Observer            | The researcher       | Prior expectations alter observations
Selection           | The sampling process | Some population groups excluded
Non-response        | The respondents      | Low response rate; non-responders differ
Self-reporting      | The participant      | Inaccurate recall or reporting
Social desirability | The participant      | Reporting what sounds good, not what is true

Sampling Distributions & CLT

Why sample means behave predictably — even when individual data doesn't.

Sampling Distribution
The distribution of a statistic (e.g., the sample mean x̄) across many repeated samples from the same population. Draw a sample of n observations and compute x̄; repeat this m times → the collection of m sample means follows the sampling distribution of the mean.

Central Limit Theorem (CLT)

Central Limit Theorem
For i.i.d. samples from any distribution with finite mean μ and variance σ², the distribution of the sample mean x̄ approaches Normal as n → ∞, regardless of the underlying distribution's shape.
\[\bar{X} \xrightarrow{d} N\!\!\left(\mu,\; \frac{\sigma^2}{n}\right) \quad \text{as } n \to \infty\]
Key CLT facts:
• Works for ANY underlying distribution (log-normal, exponential, uniform, …) as long as it has finite variance.
• As n (samples per mean) increases → distribution of x̄ gets narrower and more bell-shaped.
• The mean of the sampling distribution = population mean μ (unbiased).
• More trials m (more sample means computed) → smoother histogram. More samples per mean n → narrower bell.
CLT applies to sample MEANS, not individual samples. Individual samples always follow the underlying distribution. Only the distribution of x̄ becomes approximately Normal.

Standard Error of the Mean

\[\text{SE} = \frac{\sigma}{\sqrt{n}}\]
The standard deviation of the sampling distribution of x̄. Measures how much sample means vary across repeated experiments. As n grows, SE shrinks → your estimate of μ becomes more precise.
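A small simulation illustrates both claims at once: sample means of a skewed (exponential) population are centered at μ with spread σ/√n. The seed and the sizes n and m are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, n, m = 10.0, 100, 5000   # exponential with lambda=0.1, so mu = sigma = 10

# m sample means, each computed from n i.i.d. draws of a skewed distribution
means = rng.exponential(scale=mu, size=(m, n)).mean(axis=1)

print(means.mean())          # ≈ mu = 10 (sampling distribution is centered at mu)
print(means.std(ddof=1))     # ≈ sigma/sqrt(n) = 10/10 = 1 (the standard error)
```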

Standard Deviation vs Standard Error

         | Standard Deviation (σ)                        | Standard Error (SE)
Measures | Spread of individual data points              | Spread of sample means across experiments
Formula  | \(\sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}}\)    | \(\sigma/\sqrt{n}\)
As n → ∞ | Stays the same (individual variation is real) | → 0 (means converge to μ)
Use when | Describing how variable individuals are       | Describing precision of the mean estimate

Spark & PySpark

Distributed computing for datasets too large for a single machine.

Core Concepts

SparkSession
The entry point to all Spark functionality. Wraps a SparkContext (which coordinates with the cluster). In PySpark: spark = SparkSession.builder.getOrCreate(). Use it to read data, execute SQL, and create DataFrames.
RDD — Resilient Distributed Dataset
Spark's fundamental data structure:
Immutable: cannot be changed after creation; transformations produce new RDDs.
Distributed: partitioned across cluster nodes for parallel processing.
Fault-tolerant: lost partitions can be recomputed from lineage (transformation history).
DataFrames are built on top of RDDs.
DataFrame
A distributed dataset organized into named, typed columns (like a table). The preferred API in modern PySpark. Enables the Catalyst optimizer to work most effectively.

Lazy Evaluation

Lazy Evaluation
Transformations are not executed when called — they add nodes to a logical query plan. Execution is deferred until an Action is called. This allows Spark to optimize the entire pipeline before running any computation.

Transformations (Lazy — do not trigger execution)

Method                                 | Purpose
df.select("col1", "col2")              | Choose specific columns
df.filter("condition") / df.where(...) | Filter rows by condition
df.groupBy("col").agg(...)             | Group rows and aggregate
df.withColumn("new", expr)             | Add or replace a column
df.orderBy("col")                      | Sort rows
df.join(other, on, how)                | Join two DataFrames

Actions (Trigger full execution)

Method       | Purpose
df.show(n)   | Print first n rows to screen
df.count()   | Count total rows
df.collect() | Return all rows to the driver as a Python list
df.first()   | Return first row
df.write.*() | Write results to storage

Catalyst Optimizer

Catalyst Optimizer
Spark's query optimizer that processes the logical plan before execution:
Predicate pushdown: moves filter conditions close to the data source — reads less data.
Projection pruning: drops unused columns early.
Join reordering: reorders joins for efficiency.
Generates multiple physical plans, picks the cheapest (cost estimation), and compiles to JVM bytecode.

NumPy vs Pandas vs Spark

Library | Best For                                           | Limitation
NumPy   | Fast math on arrays; SIMD-vectorized; C backend    | Single machine only
Pandas  | Tabular data manipulation on one machine; rich API | Single machine; limited by RAM
Spark   | Datasets too big for one machine; cluster-scale    | Overhead: scheduling, network, serialization — overkill for small data
Python is slow for raw computation (heap allocation per object, GIL, no SIMD). But NumPy, Spark, and PyTorch do the heavy math in compiled C/Scala/CUDA — Python is just the glue/API layer.

Common PySpark Patterns

from pyspark.sql.functions import col  # needed for col() below

# Select specific columns
df.select("PassengerId", "Survived", "Pclass").show(5)

# Filter rows (two equivalent ways)
df.filter((col("Survived")==1) & (col("Pclass")==1)).show()
df.filter("Survived == 1").filter("Pclass == 1").show()

# Add a new column
df.withColumn("FamilySize", col("SibSp") + col("Parch")).show()

# Group and aggregate
df.groupBy("Pclass").count().orderBy("count", ascending=False).show()

# Read a CSV file
df = spark.read.csv("/path/file.csv", header=True, inferSchema=True)

Object Classes

A regular object class in Java has exactly 4 types of methods.

4 types: Constructors · Accessors (getters) · Mutators (setters) · toString

Constructors

Constructor
Initializes a new object. No return type (not even void). Same name as the class. A class can have multiple constructors (overloading). A no-arg constructor takes no parameters.
public Person(String name, int age){ this.name = name; this.age = age; }

Accessors (Getters)

Accessor / Getter
Returns the value of a private field. Return type matches the field type. No parameters. Name convention: getFieldName().
public String getName(){ return name; }
public int getAge(){ return age; }

Mutators (Setters)

Mutator / Setter
Sets/updates the value of a private field. Return type is always void. Takes one parameter matching the field type. Name convention: setFieldName(type param).
public void setName(String name){ this.name = name; }
public void setAge(int age){ this.age = age; }

toString

toString()
Returns a String representation of the object. Return type is always String. No parameters. Called automatically when you print an object.
public String toString(){
  return "Person: " + name + ", Age: " + age;
}

Quick Reference

Method Type | Return Type        | Parameters         | Purpose
Constructor | none (not void)    | 0 or more          | Initialize object
Accessor    | matches field type | none               | Get field value
Mutator     | void               | 1 (matching field) | Set field value
toString    | String             | none               | Text representation

The this Keyword & Shadowing

Shadowing is one of the most common bugs in object classes.

Shadowing
Occurs when a local variable or parameter has the same name as an instance field. The local name "shadows" (hides) the field — assignments go to the local variable, not the field.
// BUG — shadowing! name = name assigns param to itself
public void setName(String name){ name = name; }

// FIX — use this.name
public void setName(String name){ this.name = name; }
Two ways to prevent shadowing:
1. Use the this. keyword to explicitly refer to the instance field.
2. Use different names for the parameter and the field (e.g., field: name, param: n).
The this keyword
Refers to the current instance of the class. this.fieldName always means the instance field, not a local variable. Also used to call another constructor: this(args).

Method Overloading & Return Types

Same name, different parameters — Java picks the right version automatically.

Method Overloading
Defining multiple methods with the same name but different parameter lists (different number, type, or order of parameters). Return type alone cannot distinguish overloads.
public void print(int x){ ... }       // version 1
public void print(String s){ ... }    // version 2 — different param type
public void print(int x, int y){ ... } // version 3 — different param count
Return type alone cannot overload. Two methods with the same name and same parameters but different return types will cause a compile error.
Return Types & return keyword
Every non-void method must return a value with the return statement. The return type in the method signature must match exactly. void methods do not return values (but can use bare return; to exit early).
public int add(int a, int b){ return a + b; }   // returns int
public String greet(){ return "Hello"; }         // returns String
public void print(){ System.out.println("Hi"); } // void, no return value

The static Keyword

Static members belong to the class, not any particular instance.

static field / variable
A single shared memory location for all instances of the class. Regardless of how many objects are created, there is only one copy of a static field. Accessed via the class name: ClassName.fieldName.
public class Counter {
  public static int count = 0;  // one copy, shared
  public Counter(){ count++; }  // every new object increments the same count
}
static method
Called on the class (not an object). Cannot access instance fields (non-static). Good for utility functions that don't depend on object state.
public static int square(int n){ return n * n; }
// called as: ClassName.square(5);
When to use static: constants (static final), counters shared across all objects, utility/helper methods like Math.max().
A static method cannot use this or directly call a non-static method of its class, because there is no object instance in scope (it would need an explicit object reference).

Aggregation (Has-A Relationship)

One class uses another class as a field — modeling real-world ownership.

Aggregation
When a class contains an object of another class as a field. This models a "has-a" relationship. Example: a University has a Person array.
public class University {
  private String name;
  private Person[] faculty;  // aggregation: University HAS-A Person[]
  private int count;
  ...
}
Has-A relationship: The containing class "has" objects of another class. Distinguish from "Is-A" (inheritance) — e.g., "a Dog is an Animal."
Add Method in Intermediate (Aggregation) Class
To add an object to the internal array, you instantiate the object inside the method and store it at the logical index (counter), then increment the counter. The getter for the array is not always needed — but an add method is.
public void addPerson(String name, int age){
  faculty[count] = new Person(name, age);
  count++;
}

Arrays

Fixed-size, ordered collection of elements of the same type.

Declaration & Instantiation
type[] name = new type[size]; — allocates a fixed block of memory. Size cannot change after creation.
int[] scores = new int[5];       // size 5, all zeros
String[] names = new String[3];  // size 3, all null
Default Values
Numeric types default to 0, booleans to false, objects/Strings to null.
Physical vs Logical Length
Physical length = arr.length — total slots allocated.
Logical length = a separate counter variable tracking how many elements have actually been added. Always ≤ physical length.
int[] data = new int[10];  // physical = 10
int count = 0;             // logical starts at 0
data[count] = 42; count++; // logical becomes 1
ArrayIndexOutOfBoundsException — thrown when you access index < 0 or ≥ arr.length. Always use logical length as the loop bound when the array isn't full.
Iterating
// Print all elements (full array):
for(int i = 0; i < arr.length; i++){
  System.out.println(arr[i]);
}
// Enhanced for-loop:
for(int val : arr){ System.out.println(val); }

ArrayList

Dynamic array that grows and shrinks automatically. From java.util.ArrayList.

Declaration & Instantiation
import java.util.ArrayList;
ArrayList<String> list = new ArrayList<String>();

Key Methods

Method | Purpose
list.add(element) | Append element to end
list.add(index, element) | Insert at position
list.get(index) | Get element at index
list.set(index, element) | Replace element at index
list.remove(index) | Remove element at index
list.size() | Number of elements (logical size)
list.contains(obj) | Check if element exists
ArrayList vs Array: ArrayList is dynamic (resizes automatically), can only hold objects (not primitives — use wrappers). Arrays have fixed size, can hold primitives, use .length. ArrayList uses .size().
ArrayList cannot hold primitives directly. Use ArrayList<Integer> not ArrayList<int>. Autoboxing converts automatically.
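The key methods can be exercised in one short, self-contained demo (the values are arbitrary):

```java
import java.util.ArrayList;

public class ArrayListDemo {
    public static void main(String[] args) {
        ArrayList<String> list = new ArrayList<String>();
        list.add("apple");             // append to end
        list.add("cherry");
        list.add(1, "banana");         // insert at index 1, shifting later elements
        System.out.println(list);      // [apple, banana, cherry]
        System.out.println(list.get(0));  // apple
        list.set(2, "grape");          // replace element at index 2
        list.remove(0);                // remove element at index 0
        System.out.println(list);      // [banana, grape]
        System.out.println(list.size());            // 2
        System.out.println(list.contains("banana")); // true
    }
}
```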

Wrapper Classes

Object versions of Java primitives. Required for collections like ArrayList.

Primitive | Wrapper Class
int | Integer
double | Double
char | Character
boolean | Boolean
long | Long
float | Float

Useful Methods

Integer.parseInt("42")     // String → int
Integer.toString(42)       // int → String
Double.parseDouble("3.14") // String → double
Character.isLetter('a')    // true
Character.toUpperCase('a') // 'A'
Autoboxing & Unboxing
Autoboxing: Java automatically converts a primitive to its wrapper when needed (e.g., adding int to ArrayList<Integer>).
Unboxing: automatic conversion from wrapper back to primitive.
ArrayList<Integer> nums = new ArrayList<>();
nums.add(5);      // autoboxing: int 5 → Integer(5)
int x = nums.get(0); // unboxing: Integer(5) → int 5

String Methods

Strings are objects in Java — immutable and full of useful methods.

Method | Returns | Description
str.length() | int | Number of characters
str.charAt(i) | char | Character at index i
str.substring(a,b) | String | Chars from index a up to (not including) b
str.toUpperCase() | String | All uppercase
str.toLowerCase() | String | All lowercase
str.trim() | String | Remove leading/trailing whitespace
str.equals(other) | boolean | Content equality (use instead of ==)
str.equalsIgnoreCase(other) | boolean | Case-insensitive equality
str.contains(s) | boolean | True if s is a substring
str.indexOf(s) | int | First index of s, or -1
str.replace(old,new) | String | Replace all occurrences
str.split(regex) | String[] | Split into array by delimiter
String.valueOf(x) | String | Convert any type to String
Concatenation: Use + to join strings. "Hello" + " " + "World""Hello World". Any type concatenated with a String is automatically converted.
Use .equals() not == to compare String content. == compares references (memory addresses), not values.
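A quick self-contained tour of the methods above (the sample string is arbitrary):

```java
public class StringDemo {
    public static void main(String[] args) {
        String str = "  Hello World  ";
        String s = str.trim();                   // "Hello World"
        System.out.println(s.length());          // 11
        System.out.println(s.charAt(0));         // H
        System.out.println(s.substring(0, 5));   // Hello (index 5 excluded)
        System.out.println(s.toUpperCase());     // HELLO WORLD
        System.out.println(s.contains("World")); // true
        System.out.println(s.indexOf("World"));  // 6
        System.out.println(s.replace("l", "L")); // HeLLo WorLd
        String[] parts = s.split(" ");           // {"Hello", "World"}
        System.out.println(parts[1]);            // World
        System.out.println(s.equals("hello world"));           // false
        System.out.println(s.equalsIgnoreCase("hello world")); // true
    }
}
```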

Formatting Output

Three ways to control decimal precision and field width.

printf / String.format Format Specifiers

Specifier | Meaning | Example
%d | Integer | printf("%d", 42) → 42
%f | Floating point | printf("%f", 3.14) → 3.140000
%.2f | 2 decimal places | printf("%.2f", 3.14159) → 3.14
%8d | Width 8, right-align | pads with spaces on left
%-8d | Width 8, left-align | pads with spaces on right
%s | String | printf("%s", "hi") → hi
%n | Newline | platform-independent newline
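The specifiers above can be tried directly with System.out.printf (values are arbitrary):

```java
public class PrintfDemo {
    public static void main(String[] args) {
        System.out.printf("%d%n", 42);        // 42
        System.out.printf("%.2f%n", 3.14159); // 3.14
        System.out.printf("%8d%n", 42);       // "      42" (right-aligned in width 8)
        System.out.printf("%-8d|%n", 42);     // "42      |" (left-aligned in width 8)
        System.out.printf("%s is %d%n", "age", 20); // age is 20
    }
}
```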
DecimalFormat
From java.text.DecimalFormat. Formats numbers using a pattern string.
import java.text.DecimalFormat;
DecimalFormat df = new DecimalFormat("0.00");
System.out.println(df.format(3.14159)); // → 3.14

DecimalFormat df2 = new DecimalFormat("#,##0.00");
System.out.println(df2.format(1234567.8)); // → 1,234,567.80
String.format
Same specifiers as printf but returns a String instead of printing.
String s = String.format("%.2f", 3.14159); // s = "3.14"
String t = String.format("%-10s|", "Hi");  // t = "Hi        |"

Primitives vs Objects & Pass by Value/Reference

Understanding how Java passes data to methods.

The 8 Primitive Types

byte · short · int · long · float · double · boolean · char
Primitive vs Object
Primitive variables store the value directly in memory. Object variables store a reference (address) to an object on the heap. Strings are objects, not primitives.

Pass by Value (Primitives)

Pass by Value
A copy of the value is passed to the method. Changes inside the method do NOT affect the original variable.
void addTen(int x){ x += 10; } // only changes the copy
int n = 5;
addTen(n);
System.out.println(n); // still 5!

Pass by Reference (Objects)

Pass by Reference
A copy of the reference (address) is passed, so technically this is still pass by value. The method accesses the same object through that copy, and changes to the object's fields ARE visible outside the method.
void rename(Person p){ p.setName("Bob"); }
Person person = new Person("Alice", 20);
rename(person);
System.out.println(person.getName()); // "Bob" — object was changed!
Java always passes by value. For objects, the value passed is the reference (address). Reassigning the parameter inside the method doesn't affect the caller's variable — but modifying the object via that reference does.
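This sketch contrasts reassigning the parameter with mutating the object it refers to. The Person class here is a stripped-down stand-in for the one used in the examples above:

```java
// Minimal stand-in for the Person class used earlier (assumed fields/methods).
class Person {
    private String name;
    private int age;
    Person(String name, int age) { this.name = name; this.age = age; }
    String getName() { return name; }
    void setName(String name) { this.name = name; }
}

public class PassDemo {
    static void reassign(Person p) {
        p = new Person("Carol", 30); // only the local copy of the reference changes
    }

    static void mutate(Person p) {
        p.setName("Bob");            // follows the shared reference to the one object
    }

    public static void main(String[] args) {
        Person person = new Person("Alice", 20);
        reassign(person);
        System.out.println(person.getName()); // Alice (reassignment invisible to caller)
        mutate(person);
        System.out.println(person.getName()); // Bob (mutation is visible)
    }
}
```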

Random Class & Nested Loops

Generating random numbers and using nested iteration.

Random Class
import java.util.Random;
Random rng = new Random();

rng.nextInt(n)         // random int: 0 to n-1 (inclusive)
rng.nextInt(10)        // 0–9
rng.nextInt(10) + 1    // 1–10 (shift range up)
rng.nextInt(6) + 1     // 1–6 (simulate die roll)

// General range [min, max] inclusive:
rng.nextInt(max - min + 1) + min
Formula for range [a, b] inclusive: rng.nextInt(b - a + 1) + a
Nested For Loops
An inner loop runs completely for each iteration of the outer loop. Total iterations = outer count × inner count.
for(int i = 1; i <= 3; i++){
  for(int j = 1; j <= 3; j++){
    System.out.print(i * j + "\t");
  }
  System.out.println();
}
// Output: multiplication table 1..3

JUnit Testing

Testing individual methods in isolation.

Test Structure
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;

public class PersonTest {
  @Test
  public void testGetName(){
    Person p = new Person("Alice", 20);
    assertEquals("Alice", p.getName());
  }
}

Common Assert Methods

Assert | What it checks
assertEquals(expected, actual) | Two values are equal
assertNotEquals(a, b) | Two values are not equal
assertTrue(condition) | Condition is true
assertFalse(condition) | Condition is false
assertNull(obj) | Object is null
assertNotNull(obj) | Object is not null
Always test getters right after construction. Test setters by setting a value then getting it back. Test toString for expected string output.
Each test method must be annotated with @Test. If an assertion fails, the test immediately stops and reports failure. One assert per test is a good practice.