Study Hub — CSCI 443 · Advanced Data Science (Midterm Prep, 70+ Questions) and CSCI 112 · Intro to Java (Exam 1 Prep, 55+ Questions)

CSCI 443 Quiz

Advanced Data Science · Midterm Prep · Every Homework Question


Types of Data

How data is classified determines which statistical methods are appropriate to use.

Two top-level categories: Categorical and Numeric. Categorical splits into Ordinal and Nominal. Numeric splits into Continuous and Discrete.

Categorical Data

Categorical Data
Data divided into groups or categories. Values are labels or names, not measurements. Arithmetic (+, ×) on categories is meaningless.
Species (cat, dog), colors (red, blue), garment sizes (S, M, L), ice cream flavors.
Ordinal Categorical Data
Categorical data with a meaningful natural ordering. You can say one value comes before or after another, but differences between levels aren't necessarily equal and arithmetic still doesn't apply.
{S, M, L, XL}, {January, February, …, December}, {Strongly Disagree → Strongly Agree}, {1st, 2nd, 3rd place}.
Nominal Categorical Data
Categorical data with no meaningful ordering. Labels are purely names — you cannot say one is greater than another in any inherent sense.
{Cat, Dog, Fish}, {Vanilla, Chocolate, Strawberry}, {Truck, Car}, {Red, Blue, Green}.
Common trap: Just because you CAN alphabetize categories doesn't make them ordinal. "Chocolate" before "Vanilla" alphabetically is not a meaningful ordering. Ordinal requires an ordering inherent in the concept (e.g., size, rank, time).

Numeric Data

Continuous Numeric Data
Can take any real value within a range — infinitely many possible values between any two points. Measured, not counted. Described by a Probability Density Function (PDF).
Height (1.7312 m), temperature (−3.5°C), weight, time elapsed, electrical voltage.
Discrete Numeric Data
Can only take countable, distinct values — typically non-negative integers. You cannot have a fractional value. Described by a Probability Mass Function (PMF).
Number of customers (0, 1, 2, …), die roll result (1–6), emails per day, defective items in a batch.

Quick Reference Table

Type                | Ordered?         | Arithmetic? | Examples
Ordinal Categorical | Yes — meaningful | No          | S/M/L, months, survey ratings
Nominal Categorical | No               | No          | Colors, flavors, pet species
Discrete Numeric    | Yes              | Yes         | Counts, die rolls, number of heads
Continuous Numeric  | Yes              | Yes         | Height, temperature, time

Random Experiments

The foundation of probability — understanding how uncertainty arises and is formalized.

Random Experiment
A process whose outcome is uncertain before it is performed, with several possible results. Must be repeatable (in principle) under the same conditions.
Rolling a die, drawing a card, flipping a coin, measuring tomorrow's temperature.
Trial
A single execution of the random experiment. Each trial produces exactly one outcome.
One flip of a coin = one trial. Each card draw from a deck = one trial.
Outcome
The result of a single trial. It is one element of the sample space.
Rolling a 3 on a die is an outcome. "Heads" on a coin flip is an outcome.
Sample Space (S or Ω)
The complete set of all possible outcomes of a random experiment. Every conceivable result must be listed.
Coin flip: S = {Heads, Tails}. Six-sided die: S = {1, 2, 3, 4, 5, 6}. Card draw: S = all 52 cards.
NOT a random experiment: Computing the area of a 1 m × 1 m square (result = 1 m² always — deterministic). For a random experiment, the outcome must be genuinely uncertain before the trial.
A computed mean is NOT an outcome: A class's average satisfaction score is a statistic aggregated from many measurements — not the result of a single trial.

Random Variables

A mathematical bridge that maps experimental outcomes to numbers on the real line.

Random Variable (X)
A function that maps outcomes to real numbers. For numeric data the mapping is direct (die shows 3 → X=3). For categorical data we assign numeric codes. We use capital letters (X, Y, Z) to denote random variables.

Discrete vs Continuous

Discrete Random Variable
Takes a countable set of distinct values (usually integers). There are gaps between possible values. Described by a PMF.
Number of heads in 10 flips (0–10), die roll (1–6), daily customer count (0, 1, 2, …).
Continuous Random Variable
Can take any real value in an interval — infinitely many possibilities. The probability of any exact value is 0; probabilities are computed over intervals. Described by a PDF.
Height, time between arrivals, temperature, weight, electrical resistance.
Quick test: Can the variable take fractional values meaningfully? Yes → continuous. Is it always a whole number by nature? → discrete. "3.7 customers" is impossible → discrete. "3.7 minutes" is perfectly valid → continuous.

Probability Functions

PMF — Probability Mass Function
For discrete variables. \(P[X=i]\) = probability X equals exactly i. Properties: P[X=i] ≥ 0 for all i, and \(\sum_{i \in S} P[X=i] = 1\).
PDF — Probability Density Function
For continuous variables. f(x) is probability density, not probability. Probability is the area under the curve: \(P[a \leq X \leq b] = \int_a^b f(x)\,dx\). Total area must equal 1.
CDF — Cumulative Distribution Function
\(F(x) = P[X \leq x]\). The probability that the variable is at most x. Always non-decreasing from 0 to 1. The CDF of the standard normal is written Φ(x).
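To make the PMF/PDF distinction concrete, here is a minimal plain-Python sketch (no libraries, example distributions chosen for illustration): a fair-die PMF must sum to 1, while for a density the probability is the area over an interval, not the height of f.

```python
# PMF of a fair six-sided die: each probability is non-negative and they sum to 1
pmf = {i: 1 / 6 for i in range(1, 7)}
assert all(p >= 0 for p in pmf.values())
assert abs(sum(pmf.values()) - 1.0) < 1e-12

# PDF of a hypothetical Uniform[0, 2]: f(x) = 1/(b-a) is a density, not a probability
a, b = 0.0, 2.0
density = 1 / (b - a)                # 0.5 everywhere on [0, 2]
prob = (1.5 - 0.5) * density         # P[0.5 <= X <= 1.5] = area under f = 0.5
print(prob)                          # 0.5
```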

Events

Any subset of the sample space we care about assigning a probability to.

Event (E)
Any subset of the sample space. An event "occurs" when the outcome of a trial falls within that subset.
Die roll S = {1,2,3,4,5,6}. Event "roll even" = {2,4,6}. Event "roll a 3" = {3}. Event "roll ≥ 4" = {4,5,6}.
Special events:
• Empty set ∅ — the impossible event, P[∅] = 0.
• The full sample space S — the certain event, P[S] = 1.
• Events can be infinite: "more than 100 customers" = {101, 102, 103, …} is a valid event.
Event ≠ Outcome. An outcome is a single element of S. An event is a subset (could contain one element, many, or none). "Rolling a 3" as an outcome is just 3; as an event it is {3}.

Axioms of Probability

Three Basic Axioms (Kolmogorov)
1. P[E] ≥ 0 for every event E.
2. P[S] = 1 (something always happens).
3. For mutually exclusive events E₁, E₂ (no overlap): P[E₁ ∪ E₂] = P[E₁] + P[E₂]. (This extends to any countable sequence of pairwise disjoint events.)

Distributions

How probability is spread across the possible values of a random variable.

Distribution
An assignment of probabilities to all possible outcomes. Requirements: (1) each probability ≥ 0, (2) all probabilities sum (or integrate) to 1.

Uniform Distribution

Discrete Uniform — all outcomes equally likely
For n outcomes: P[X = i] = 1/n for every outcome i.
Fair six-sided die: P[X=1] = … = P[X=6] = 1/6.
Continuous Uniform[a, b]
Constant density over [a, b], zero outside. PDF: \(f(x) = \tfrac{1}{b-a}\) for \(a \leq x \leq b\).
\[\mu = \frac{a+b}{2} \qquad \sigma^2 = \frac{(b-a)^2}{12}\]
Uniform[20,40]: mean = (20+40)/2 = 30, variance = (20)²/12 = 400/12 ≈ 33.33.
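The Uniform[20, 40] numbers above can be checked empirically with NumPy; the seed and sample size are arbitrary choices for the demonstration.

```python
import numpy as np

a, b = 20, 40
mean = (a + b) / 2                 # 30.0
var = (b - a) ** 2 / 12            # 400/12 ≈ 33.33

# Empirical check with a large sample (seed chosen arbitrarily)
rng = np.random.default_rng(0)
x = rng.uniform(a, b, 1_000_000)
print(mean, var)                   # closed-form values
print(x.mean(), x.var())           # close to the closed-form values
```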

Gaussian (Normal) Distribution N(μ, σ)

Gaussian / Normal Distribution
The classic "bell curve." Two parameters: mean μ (center/location) and standard deviation σ (spread). Symmetric around μ. Total area under curve = 1.
\[f(x \mid \mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]
Key Properties of the Gaussian
• Increasing σ → curve becomes wider and flatter (same total area = 1).
• Increasing μ → curve shifts right; shape unchanged.
• Empirical rule: ~68% of data within 1σ, ~95% within 2σ, ~99.7% within 3σ.
• Notation: N(μ, σ) means std dev = σ, so variance = σ².
As sample size grows (100 → 1,000 → 10,000), a relative-frequency histogram converges to the true PDF — this is the Law of Large Numbers in action.

Exponential Distribution (λ)

Exponential Distribution
Models time between events in a Poisson process (arrivals, radioactive decay). Rate parameter λ. PDF: \(f(x) = \lambda e^{-\lambda x}\) for x ≥ 0.
\[\mu = \frac{1}{\lambda} \qquad \sigma^2 = \frac{1}{\lambda^2}\]
f(x) = 0.1e^{−0.1x} → λ = 0.1, mean = 1/0.1 = 10.
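A quick NumPy check of the λ = 0.1 example; note that NumPy parameterizes the exponential by scale = 1/λ rather than by λ itself.

```python
import numpy as np

lam = 0.1
rng = np.random.default_rng(1)
# NumPy's exponential sampler takes scale = 1/lambda
x = rng.exponential(scale=1 / lam, size=1_000_000)
print(x.mean())   # close to 1/lambda = 10
print(x.var())    # close to 1/lambda**2 = 100
```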

Triangular Distribution Tri(a, b, c)

Triangular Distribution
Defined on [a, b] with peak (mode) at c. PDF rises linearly from a to c, falls linearly from c to b.
\[\mu = \frac{a+b+c}{3} \qquad \sigma^2 = \frac{a^2+b^2+c^2-ab-ac-bc}{18}\]

Absolute vs Relative Frequency

Absolute Frequency
Raw count of samples in a bin.
30 of 200 samples fall in bin [10,15] → absolute frequency = 30.
Relative Frequency
Count ÷ total samples = proportion. Approximates probability density as sample size → ∞.
30 / 200 = 0.15 = 15% relative frequency for that bin.

Descriptive Statistics

Measures that summarize the center and spread of a dataset.

Range

\[\text{Range} = \max(S) - \min(S)\]
S=[−1,3,−4,2,6]: Range = 6 − (−4) = 10.

Mean (Arithmetic Average)

Mean x̄
Sum of all values divided by n. Sensitive to outliers — a single extreme value can pull it far from where most data sits.
\[\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\]
S=[−1,3,−4,2,6]: Sum = 6, Mean = 6/5 = 1.2.

Median

Median
Middle value after sorting. Robust to outliers — extreme values in the tails do not affect it.

• n odd: median = element at index ⌊n/2⌋ (0-indexed).
• n even: median = average of elements at indices n/2−1 and n/2.
Sorted [−4,−1,2,3,6] (n=5 odd) → Median = 2.   Sorted [−1,0,7,7,9,10] (n=6 even) → Median = (7+7)/2 = 7.
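Both median cases can be verified with NumPy, which sorts the data internally before applying the odd/even rule.

```python
import numpy as np

odd = [-1, 3, -4, 2, 6]            # sorted: [-4, -1, 2, 3, 6]
even = [-1, 0, 7, 7, 9, 10]        # already sorted
print(np.median(odd))              # 2.0  (middle element, n odd)
print(np.median(even))             # 7.0  (average of the two middle 7s)
```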

Percentile — Linear Interpolation Method

p-th Percentile (NumPy "linear" default)
Steps:
1. Sort the data into S′.
2. Fractional index (0-indexed): \(i = \tfrac{p}{100} \times (n-1)\)
3. If i is an integer → P_p = S′[i].
4. If i is fractional: \(P_p = S'[\lfloor i\rfloor] + (i - \lfloor i\rfloor)\cdot(S'[\lceil i\rceil] - S'[\lfloor i\rfloor])\)
S′=[−4,−1,2,3,6], n=5. 20th %ile: i = 0.20×4 = 0.8. Between S′[0]=−4 and S′[1]=−1: −4 + 0.8×(−1−(−4)) = −4 + 2.4 = −1.6.
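NumPy's default percentile method ("linear") is exactly this interpolation, so it reproduces the worked example (up to floating-point rounding).

```python
import numpy as np

s = [-1, 3, -4, 2, 6]
# i = 0.20 * (5 - 1) = 0.8 → interpolate between sorted[0] and sorted[1]
p20 = np.percentile(s, 20)
print(p20)   # ≈ -1.6
```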

Trimmed Mean

Trimmed Mean
Remove a fixed proportion from each tail of the sorted data, then compute the mean of what remains. Robust to extreme outliers in the tails. scipy.stats.trim_mean(data, p) removes proportion p from each end.
S = [−1,0,3,6,7,7,8,9,10,1000], 10% trimmed: remove 1 from each end → [0,3,6,7,7,8,9,10], mean = 50/8 = 6.25.
Robustness ranking (most to least robust): Median ≈ Trimmed Mean > Mean. Extreme percentiles (90th, 95th) are unreliable when outliers sit in the tails.
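The same worked example in code, using the scipy.stats.trim_mean function mentioned above:

```python
import numpy as np
from scipy.stats import trim_mean

s = [-1, 0, 3, 6, 7, 7, 8, 9, 10, 1000]
print(trim_mean(s, 0.10))   # 6.25: one value trimmed from each tail
print(np.mean(s))           # 104.9: untrimmed mean dragged up by the outlier
```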

Measures of Dispersion

How spread out the data is — and how robust each measure is to outliers.

Why not use the sum of deviations? It always equals zero:
\(\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}) = \bar{X} - \bar{X} = 0\)
Positive and negative deviations cancel perfectly, so we need a measure in which every deviation contributes a non-negative amount.

Mean Absolute Deviation

\[\text{Mean Abs Dev} = \frac{1}{n}\sum_{i=1}^n |X_i - \bar{X}|\]
Average of absolute deviations from the mean. Uses absolute values so nothing cancels. Less sensitive to large errors than variance, but not as robust as Median Absolute Deviation.

Median Absolute Deviation (MAD)

\[\text{MAD} = \text{median}(|x_i - \text{median}(x)|)\]
The median of the absolute deviations from the median. Very robust — uses the median twice (once to center, once to summarize). Outliers have minimal impact.
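MAD is a one-liner in NumPy; here it is applied to the outlier-laden dataset used elsewhere in these notes.

```python
import numpy as np

x = np.array([-1, 0, 3, 6, 7, 7, 8, 9, 10, 1000])
mad = np.median(np.abs(x - np.median(x)))   # median of |x_i - median(x)|
print(mad)   # 2.5: the outlier 1000 barely moves it
```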

Variance & Standard Deviation

Population Variance σ² (use when you have ALL the data)
Divide by n (the full population size).
\[\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2\]
Sample Variance s² — Bessel's Correction (use for a sample)
Divide by (n−1). Corrects the underestimation bias introduced by using x̄ (estimated from the same data) instead of the true μ.
\[s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2\]
Units of variance: If data is in °C, variance is in °C². This is why variance is hard to interpret — the unit is squared. Standard deviation (√variance) restores the original unit.

Interquartile Range (IQR)

\[\text{IQR} = Q_3 - Q_1 = P_{75} - P_{25}\]
Range of the middle 50% of data. Most robust of all dispersion measures — extreme values in the tails have zero effect because they aren't used in the computation.
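With NumPy, the IQR is two percentile calls; on the outlier dataset from earlier, the 1000 has no effect at all.

```python
import numpy as np

x = [-1, 0, 3, 6, 7, 7, 8, 9, 10, 1000]
q1, q3 = np.percentile(x, [25, 75])
print(q3 - q1)   # 5.0: replacing 1000 with any large value would not change it
```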

Robustness Comparison

Measure              | Robustness to Outliers | Why
IQR                  | Most robust            | Uses only middle 50%; ignores tails
Median Abs Dev (MAD) | Very robust            | Uses median twice
Mean Abs Dev         | Moderate               | Uses the mean, which outliers pull
Sample Std Dev (s)   | Sensitive              | Squared deviations amplify outliers
Variance (s²)        | Most sensitive         | Squaring makes one outlier dominant

Box-and-Whisker Plot (Tukey Convention)

Box Plot
• Box spans Q1 to Q3 (the IQR).
• Line inside the box = median (Q2).
• Whiskers extend to the last data point within 1.5×IQR of Q1 (lower) and Q3 (upper).
• Points beyond whiskers are plotted as individual dots — these are flagged as outliers.

Outliers & Skew

How extreme values distort statistics and reveal the shape of a distribution.

Skewness

Right Skew (Positive Skew)
Long tail on the right. Large outliers pull the mean right of the median. Rule of thumb: mean > median → right skew.
S=[3,8,6,9,−1,10,1000,7,7,0]: mean=104.9, median=7. Mean ≫ median → strongly right skewed.
Left Skew (Negative Skew)
Long tail on the left. Small outliers pull the mean below the median. Rule of thumb: mean < median → left skew.
Symmetric (No Skew)
Mean ≈ median. The Gaussian distribution is perfectly symmetric.

Effect of One Outlier on Each Statistic

Statistic                | Outlier Effect                        | Robust?
Mean                     | Strongly pulled toward outlier        | No
Median                   | Little to no effect                   | Yes
Trimmed Mean (10%)       | Removed if outlier is in the tail     | Yes
Variance / Std Dev       | Heavily inflated (squared deviations) | No
IQR                      | No effect (outlier is in the tail)    | Yes
Extreme %iles (90th+)    | Heavily distorted by interpolation    | No
Median percentile (50th) | Same as median — robust               | Yes
Extreme percentiles near outliers are unreliable. If the 90th percentile calculation interpolates between a normal data point and an outlier, the result is wildly inflated. Be cautious with percentiles above ~85th when outliers may exist.
Worked example — S = [3,8,6,9,−1,10,1000,7,7,0], sorted: [−1,0,3,6,7,7,8,9,10,1000]
Mean = 1049/10 = 104.9  |  Median = (7+7)/2 = 7
80th %ile: i = 0.8×9 = 7.2 → 9 + 0.2×(10−9) = 9.2
90th %ile: i = 0.9×9 = 8.1 → 10 + 0.1×(1000−10) = 109 ← distorted by outlier!
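The worked example can be reproduced with NumPy (floating-point output may differ in the last digits):

```python
import numpy as np

s = [3, 8, 6, 9, -1, 10, 1000, 7, 7, 0]
print(np.mean(s))              # ≈ 104.9
print(np.median(s))            # 7.0
print(np.percentile(s, 80))    # ≈ 9.2  (interpolates between ordinary points)
print(np.percentile(s, 90))    # ≈ 109  (interpolates into the outlier)
```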

Population vs Sample Statistics

The critical difference in formulas — and why Bessel's correction exists.

Population
The complete set of all subjects of interest. You know the true parameters (μ, σ²). Use N in the denominator.
Sample
A subset drawn to make inferences about the whole. Sample statistics estimate population parameters. Use (n−1) in variance (Bessel's correction).

Formulas Side by Side

Statistic | Population                               | Sample
Mean      | \(\mu = \frac{1}{N}\sum X_i\)            | \(\bar{x} = \frac{1}{n}\sum X_i\)
Variance  | \(\sigma^2 = \frac{\sum(X_i-\mu)^2}{N}\) | \(s^2 = \frac{\sum(X_i-\bar{x})^2}{n-1}\)
Std Dev   | \(\sigma = \sqrt{\sigma^2}\)             | \(s = \sqrt{s^2}\)
Bessel's Correction — Why n−1?
When computing sample variance we use x̄ (estimated from the same data) instead of the true μ. The sample mean always sits at the center of the sample, making deviations from x̄ systematically smaller than deviations from the true μ. Dividing by (n−1) instead of n corrects this downward bias, making s² an unbiased estimator of σ².
Worked — S = {4, 7, 9, 12, 15, −9}, mean = 38/6 ≈ 6.33
Sum of squared deviations ≈ 355.33
Population variance (÷6) ≈ 59.22  |  Sample variance (÷5) ≈ 71.07
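NumPy exposes both denominators through the ddof argument (ddof=0 → divide by n, ddof=1 → divide by n−1), which reproduces the worked example:

```python
import numpy as np

s = np.array([4, 7, 9, 12, 15, -9])
print(np.var(s, ddof=0))   # ≈ 59.22, population variance (divide by n)
print(np.var(s, ddof=1))   # ≈ 71.07, sample variance with Bessel's correction
```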

Z-scores, Φ (Phi) & Erf

Computing probabilities under any Gaussian by standardizing to N(0,1).

Z-Score

Z-Score (Standardized Score)
Converts a raw value x from X~N(μ,σ) to a standard normal score: how many standard deviations x is above or below the mean.
\[Z = \frac{x - \mu}{\sigma}\]
Test: μ=500, σ=100, x=620 → Z = (620−500)/100 = 1.2. John is 1.2 standard deviations above the mean.

Φ (Phi) — CDF of N(0,1)

Φ(z) — Phi Function
\(\Phi(z) = P[Z \leq z]\) where Z~N(0,1). It's the cumulative area under the standard normal bell curve from −∞ to z. To use for any Gaussian: compute Z first, then look up Φ(Z).
Φ(0) = 0.500, Φ(1) ≈ 0.841, Φ(1.2) ≈ 0.885, Φ(2) ≈ 0.977.

Error Function (erf)

\[\text{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt\]
\[\Phi(z) = \frac{1}{2}\!\left[1 + \text{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right]\]

Computing Probabilities

P(X < x) — One-tailed left
Compute Z = (x−μ)/σ, then P = Φ(Z).
P(score < 620): Z=1.2, Φ(1.2) ≈ 88.5%.
P(X > x) — One-tailed right
P(X > x) = 1 − Φ(Z).
P(a < X < b) — Interval probability
P(a < X < b) = Φ(Z_b) − Φ(Z_a), where Z_a=(a−μ)/σ and Z_b=(b−μ)/σ.
P(|X−μ| > k·σ) — Two-tailed outside ±kσ
P = 1 − erf(k/√2).
Steel rods: μ=100, σ=2. P(X<96 or X>104): k=2. P = 1−erf(√2) ≈ 1−0.9545 = 4.55%.
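Both worked examples follow from the Φ/erf identity above, using only Python's math module:

```python
from math import erf, sqrt

def phi(z):
    # CDF of N(0,1) expressed through the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

# One-tailed: P(score < 620) with mu=500, sigma=100
z = (620 - 500) / 100
print(phi(z))                  # ≈ 0.885

# Two-tailed: steel rods, P(|X - 100| > 2*sigma) with k=2
print(1 - erf(2 / sqrt(2)))    # ≈ 0.0455
```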

Key Φ Values to Memorize

Z     | Φ(Z) ≈ | Meaning
0     | 0.500  | 50% below mean
1.0   | 0.841  | ±1σ covers 68.3% of data
1.2   | 0.885  | John's exam example
1.645 | 0.950  | 95th percentile of N(0,1); 90% CI boundary (two-sided)
1.96  | 0.975  | 95% CI boundary (two-sided)
2.0   | 0.977  | ±2σ covers 95.4% of data
3.0   | 0.9987 | ±3σ covers 99.7% of data

Covariance & Pearson Correlation

Measuring how two variables move together — direction and strength.

Covariance

\[\text{Cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})\]
Interpreting the Sign of Cov(X,Y)
Cov > 0: When X is above its mean, Y tends to be above its mean (move together).
Cov < 0: When X is high, Y tends to be low (move opposite).
Cov ≈ 0: No consistent linear relationship.
Covariance is scale-dependent — its magnitude depends on the units of X and Y. A covariance of 425 between height(cm) and weight(kg) is not comparable to 425 between two other variables. That is why we normalize to get Pearson r.

Pearson Correlation Coefficient (r)

\[r = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y}\]
Normalizes covariance by the product of standard deviations. Always in [−1, +1]. Scale-independent (unit-free).
r value  | Interpretation
+1       | Perfect positive linear relationship
0.7 to 1 | Strong positive correlation
0 to 0.3 | Weak positive correlation
≈ 0      | No linear relationship (may have nonlinear!)
−1 to 0  | Negative correlation
−1       | Perfect negative linear relationship
r = 0 does NOT mean no relationship! Y = X² gives r ≈ 0 (symmetric parabola — positive and negative halves cancel). Pearson r only detects linear relationships. Always plot the data with a scatter plot.
Y = constant → r is undefined. σ_Y = 0 means division by zero in the formula. A constant has no linear relationship with anything.
Worked — X=[1,3,5,7], Y=[8,6,5,3], μ_X=4, μ_Y=5.5
Deviations X: −3,−1,+1,+3  |  Deviations Y: +2.5,+0.5,−0.5,−2.5
Products: −7.5, −0.5, −0.5, −7.5 → sum = −16
Cov = −16/3 ≈ −5.33  |  σ_X=√(20/3), σ_Y=√(13/3)
r = −16/√260 ≈ −0.992 (strong negative).
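NumPy reproduces the worked covariance and correlation (np.cov uses the n−1 denominator by default):

```python
import numpy as np

x = np.array([1, 3, 5, 7])
y = np.array([8, 6, 5, 3])
cov = np.cov(x, y)[0, 1]        # off-diagonal entry of the covariance matrix
r = np.corrcoef(x, y)[0, 1]
print(cov)   # ≈ -5.333
print(r)     # ≈ -0.992
```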

Bias & Error Types

Understanding systematic and random sources of error in data collection.

Random Error
Unpredictable, unsystematic variation in measurements. Averages out over many trials. Not a flaw — every measurement has some random component.
A scale gives slightly different readings each time you weigh the same 100 g object due to vibrations.
Systematic Error (Bias)
Consistent, repeatable error in the same direction. Does not average out. Indicates a flaw in the study design, instrument, or data collection process.
A miscalibrated scale that always reads 5% too high.

Types of Bias

Measurement Bias
Systematic error in how data is measured — the instrument or procedure introduces consistent errors in one direction.
A laser rangefinder miscalibrated by ×1.2 → all reported distances are 20% too large.
Observer Bias
The researcher's expectations or prior beliefs influence how they collect, record, or interpret data. Also called experimenter bias or confirmation bias.
A police officer who expects a high-crime neighborhood interprets ambiguous behavior as suspicious.
Selection Bias
The sample is not representative of the target population because some groups are systematically over- or under-represented in the selection process.
A pollster calls only landlines during daytime → misses mobile-only users and working people → biased sample.
Non-Response Bias
Participants who do not respond differ systematically from those who do, making the final sample non-representative even if selection was random.
University survey with only 30% response rate — students with strong opinions (positive or negative) are more likely to respond.
Self-Reporting Bias
Participants inaccurately report their own behaviors or attitudes due to memory errors, misunderstanding, or discomfort with truthful answers.
Participants in a weight-loss study self-report monthly weight loss — imperfect memory leads to inaccuracies.
Social Desirability Bias
Participants respond in ways they believe are socially acceptable or will reflect well on them, rather than truthfully.
People over-report exercise frequency and charitable donations; under-report alcohol use and unhealthy eating.
Multiple biases can co-occur. Self-reporting bias and social desirability bias often appear together. Selection bias and non-response bias can affect the same study simultaneously.

Quick Reference

Bias Type           | Who is biased?       | Key Signal
Measurement         | The instrument       | Consistent systematic offset in all readings
Observer            | The researcher       | Prior expectations alter observations
Selection           | The sampling process | Some population groups excluded
Non-response        | The respondents      | Low response rate; non-responders differ
Self-reporting      | The participant      | Inaccurate recall or reporting
Social desirability | The participant      | Reporting what sounds good, not what is true

Sampling Distributions & CLT

Why sample means behave predictably — even when individual data doesn't.

Sampling Distribution
The distribution of a statistic (e.g., the sample mean x̄) across many repeated samples from the same population. Draw a sample of n observations and compute x̄; repeat this m times → the collection of m sample means follows the sampling distribution of the mean.

Central Limit Theorem (CLT)

Central Limit Theorem
For i.i.d. samples from any distribution with finite mean μ and variance σ², the distribution of the sample mean x̄ approaches Normal as n → ∞, regardless of the underlying distribution's shape.
\[\bar{X} \xrightarrow{d} N\!\!\left(\mu,\; \frac{\sigma^2}{n}\right) \quad \text{as } n \to \infty\]
Key CLT facts:
• Works for ANY underlying distribution (log-normal, exponential, uniform, …) as long as it has finite variance.
• As n (samples per mean) increases → distribution of x̄ gets narrower and more bell-shaped.
• The mean of the sampling distribution = population mean μ (unbiased).
• More trials m (more sample means computed) → smoother histogram. More samples per mean n → narrower bell.
CLT applies to sample MEANS, not individual samples. Individual samples always follow the underlying distribution. Only the distribution of x̄ becomes approximately Normal.

Standard Error of the Mean

\[\text{SE} = \frac{\sigma}{\sqrt{n}}\]
The standard deviation of the sampling distribution of x̄. Measures how much sample means vary across repeated experiments. As n grows, SE shrinks → your estimate of μ becomes more precise.
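A small simulation illustrates both claims at once: sample means of a skewed (exponential) population are centered at μ with spread σ/√n. The seed and the sizes n and m are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, n, m = 10.0, 100, 5000   # exponential with lambda=0.1, so mu = sigma = 10

# m sample means, each computed from n i.i.d. draws of a skewed distribution
means = rng.exponential(scale=mu, size=(m, n)).mean(axis=1)

print(means.mean())          # ≈ mu = 10 (sampling distribution is centered at mu)
print(means.std(ddof=1))     # ≈ sigma/sqrt(n) = 10/10 = 1 (the standard error)
```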

Standard Deviation vs Standard Error

         | Standard Deviation (σ)                        | Standard Error (SE)
Measures | Spread of individual data points              | Spread of sample means across experiments
Formula  | \(\sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}}\)    | \(\sigma/\sqrt{n}\)
As n → ∞ | Stays the same (individual variation is real) | → 0 (means converge to μ)
Use when | Describing how variable individuals are       | Describing precision of the mean estimate

Spark & PySpark

Distributed computing for datasets too large for a single machine.

Core Concepts

SparkSession
The entry point to all Spark functionality. Wraps a SparkContext (which coordinates with the cluster). In PySpark: spark = SparkSession.builder.getOrCreate(). Use it to read data, execute SQL, and create DataFrames.
RDD — Resilient Distributed Dataset
Spark's fundamental data structure:
Immutable: cannot be changed after creation; transformations produce new RDDs.
Distributed: partitioned across cluster nodes for parallel processing.
Fault-tolerant: lost partitions can be recomputed from lineage (transformation history).
DataFrames are built on top of RDDs.
DataFrame
A distributed dataset organized into named, typed columns (like a table). The preferred API in modern PySpark. Enables the Catalyst optimizer to work most effectively.

Lazy Evaluation

Lazy Evaluation
Transformations are not executed when called — they add nodes to a logical query plan. Execution is deferred until an Action is called. This allows Spark to optimize the entire pipeline before running any computation.

Transformations (Lazy — do not trigger execution)

Method                                 | Purpose
df.select("col1", "col2")              | Choose specific columns
df.filter("condition") / df.where(...) | Filter rows by condition
df.groupBy("col").agg(...)             | Group rows and aggregate
df.withColumn("new", expr)             | Add or replace a column
df.orderBy("col")                      | Sort rows
df.join(other, on, how)                | Join two DataFrames

Actions (Trigger full execution)

Method       | Purpose
df.show(n)   | Print first n rows to screen
df.count()   | Count total rows
df.collect() | Return all rows to the driver as a Python list
df.first()   | Return first row
df.write.*() | Write results to storage

Catalyst Optimizer

Catalyst Optimizer
Spark's query optimizer that processes the logical plan before execution:
Predicate pushdown: moves filter conditions close to the data source — reads less data.
Projection pruning: drops unused columns early.
Join reordering: reorders joins for efficiency.
Generates multiple physical plans, picks the cheapest (cost estimation), and compiles to JVM bytecode.

NumPy vs Pandas vs Spark

Library | Best For                                           | Limitation
NumPy   | Fast math on arrays; SIMD-vectorized; C backend    | Single machine only
Pandas  | Tabular data manipulation on one machine; rich API | Single machine; limited by RAM
Spark   | Datasets too big for one machine; cluster-scale    | Overhead: scheduling, network, serialization — overkill for small data
Python is slow for raw computation (heap allocation per object, GIL, no SIMD). But NumPy, Spark, and PyTorch do the heavy math in compiled C/Scala/CUDA — Python is just the glue/API layer.

Common PySpark Patterns

from pyspark.sql.functions import col  # needed for col() below

# Select specific columns
df.select("PassengerId", "Survived", "Pclass").show(5)

# Filter rows (two equivalent ways)
df.filter((col("Survived")==1) & (col("Pclass")==1)).show()
df.filter("Survived == 1").filter("Pclass == 1").show()

# Add a new column
df.withColumn("FamilySize", col("SibSp") + col("Parch")).show()

# Group and aggregate
df.groupBy("Pclass").count().orderBy("count", ascending=False).show()

# Read a CSV file
df = spark.read.csv("/path/file.csv", header=True, inferSchema=True)

Object Classes

A regular object class in Java has exactly 4 types of methods.

4 types: Constructors · Accessors (getters) · Mutators (setters) · toString

Constructors

Constructor
Initializes a new object. No return type (not even void). Same name as the class. A class can have multiple constructors (overloading). A no-arg constructor takes no parameters.
public Person(String name, int age){ this.name = name; this.age = age; }

Accessors (Getters)

Accessor / Getter
Returns the value of a private field. Return type matches the field type. No parameters. Name convention: getFieldName().
public String getName(){ return name; }
public int getAge(){ return age; }

Mutators (Setters)

Mutator / Setter
Sets/updates the value of a private field. Return type is always void. Takes one parameter matching the field type. Name convention: setFieldName(type param).
public void setName(String name){ this.name = name; }
public void setAge(int age){ this.age = age; }

toString

toString()
Returns a String representation of the object. Return type is always String. No parameters. Called automatically when you print an object.
public String toString(){
  return "Person: " + name + ", Age: " + age;
}

Quick Reference

Method Type | Return Type        | Parameters         | Purpose
Constructor | none (not void)    | 0 or more          | Initialize object
Accessor    | matches field type | none               | Get field value
Mutator     | void               | 1 (matching field) | Set field value
toString    | String             | none               | Text representation

The this Keyword & Shadowing

Shadowing is one of the most common bugs in object classes.

Shadowing
Occurs when a local variable or parameter has the same name as an instance field. The local name "shadows" (hides) the field — assignments go to the local variable, not the field.
// BUG — shadowing! name = name assigns param to itself
public void setName(String name){ name = name; }

// FIX — use this.name
public void setName(String name){ this.name = name; }
Two ways to prevent shadowing:
1. Use the this. keyword to explicitly refer to the instance field.
2. Use different names for the parameter and the field (e.g., field: name, param: n).
The this keyword
Refers to the current instance of the class. this.fieldName always means the instance field, not a local variable. Also used to call another constructor: this(args).

Method Overloading & Return Types

Same name, different parameters — Java picks the right version automatically.

Method Overloading
Defining multiple methods with the same name but different parameter lists (different number, type, or order of parameters). Return type alone cannot distinguish overloads.
public void print(int x){ ... }       // version 1
public void print(String s){ ... }    // version 2 — different param type
public void print(int x, int y){ ... } // version 3 — different param count
Return type alone cannot overload. Two methods with the same name and same parameters but different return types will cause a compile error.
Return Types & return keyword
Every non-void method must return a value with the return statement. The return type in the method signature must match exactly. void methods do not return values (but can use bare return; to exit early).
public int add(int a, int b){ return a + b; }   // returns int
public String greet(){ return "Hello"; }         // returns String
public void print(){ System.out.println("Hi"); } // void, no return value

The static Keyword

Static members belong to the class, not any particular instance.

static field / variable
A single shared memory location for all instances of the class. Regardless of how many objects are created, there is only one copy of a static field. Accessed via the class name: ClassName.fieldName.
public class Counter {
  public static int count = 0;  // one copy, shared
  public Counter(){ count++; }  // every new object increments the same count
}
static method
Called on the class (not an object). Cannot access instance fields (non-static). Good for utility functions that don't depend on object state.
public static int square(int n){ return n * n; }
// called as: ClassName.square(5);
When to use static: constants (static final), counters shared across all objects, utility/helper methods like Math.max().
A static method cannot use this or directly call a non-static method of its class, because there is no object instance in scope (it would need an explicit object reference).

Aggregation (Has-A Relationship)

One class uses another class as a field — modeling real-world ownership.

Aggregation
When a class contains an object of another class as a field. This models a "has-a" relationship. Example: a University has a Person array.
public class University {
  private String name;
  private Person[] faculty;  // aggregation: University HAS-A Person[]
  private int count;
  ...
}
Has-A relationship: The containing class "has" objects of another class. Distinguish from "Is-A" (inheritance) — e.g., "a Dog is an Animal."
Add Method in Intermediate (Aggregation) Class
To add an object to the internal array, you instantiate the object inside the method and store it at the logical index (counter), then increment the counter. The getter for the array is not always needed — but an add method is.
public void addPerson(String name, int age){
  faculty[count] = new Person(name, age);
  count++;
}

Arrays

Fixed-size, ordered collection of elements of the same type.

Declaration & Instantiation
type[] name = new type[size]; — allocates a fixed block of memory. Size cannot change after creation.
int[] scores = new int[5];       // size 5, all zeros
String[] names = new String[3];  // size 3, all null
Default Values
Numeric types default to 0, booleans to false, objects/Strings to null.
Physical vs Logical Length
Physical length = arr.length — total slots allocated.
Logical length = a separate counter variable tracking how many elements have actually been added. Always ≤ physical length.
int[] data = new int[10];  // physical = 10
int count = 0;             // logical starts at 0
data[count] = 42; count++; // logical becomes 1
ArrayIndexOutOfBoundsException — thrown when you access index < 0 or ≥ arr.length. Always use logical length as the loop bound when the array isn't full.
Iterating
// Print all elements (full array):
for(int i = 0; i < arr.length; i++){
  System.out.println(arr[i]);
}
// Enhanced for-loop:
for(int val : arr){ System.out.println(val); }

ArrayList

Dynamic array that grows and shrinks automatically. From java.util.ArrayList.

Declaration & Instantiation
import java.util.ArrayList;
ArrayList<String> list = new ArrayList<String>();

Key Methods

Method | Purpose
list.add(element) | Append element to end
list.add(index, element) | Insert at position
list.get(index) | Get element at index
list.set(index, element) | Replace element at index
list.remove(index) | Remove element at index
list.size() | Number of elements (logical size)
list.contains(obj) | Check if element exists
ArrayList vs Array: ArrayList is dynamic (resizes automatically), can only hold objects (not primitives — use wrappers). Arrays have fixed size, can hold primitives, use .length. ArrayList uses .size().
ArrayList cannot hold primitives directly. Use ArrayList<Integer> not ArrayList<int>. Autoboxing converts automatically.
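The key methods can be exercised in one short, self-contained demo (the values are arbitrary):

```java
import java.util.ArrayList;

public class ArrayListDemo {
    public static void main(String[] args) {
        ArrayList<String> list = new ArrayList<String>();
        list.add("apple");             // append to end
        list.add("cherry");
        list.add(1, "banana");         // insert at index 1, shifting later elements
        System.out.println(list);      // [apple, banana, cherry]
        System.out.println(list.get(0));  // apple
        list.set(2, "grape");          // replace element at index 2
        list.remove(0);                // remove element at index 0
        System.out.println(list);      // [banana, grape]
        System.out.println(list.size());            // 2
        System.out.println(list.contains("banana")); // true
    }
}
```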

Wrapper Classes

Object versions of Java primitives. Required for collections like ArrayList.

Primitive | Wrapper Class
int | Integer
double | Double
char | Character
boolean | Boolean
long | Long
float | Float

Useful Methods

Integer.parseInt("42")     // String → int
Integer.toString(42)       // int → String
Double.parseDouble("3.14") // String → double
Character.isLetter('a')    // true
Character.toUpperCase('a') // 'A'
Autoboxing & Unboxing
Autoboxing: Java automatically converts a primitive to its wrapper when needed (e.g., adding int to ArrayList<Integer>).
Unboxing: automatic conversion from wrapper back to primitive.
ArrayList<Integer> nums = new ArrayList<>();
nums.add(5);      // autoboxing: int 5 → Integer(5)
int x = nums.get(0); // unboxing: Integer(5) → int 5

String Methods

Strings are objects in Java — immutable and full of useful methods.

Method | Returns | Description
str.length() | int | Number of characters
str.charAt(i) | char | Character at index i
str.substring(a,b) | String | Chars from index a up to (not including) b
str.toUpperCase() | String | All uppercase
str.toLowerCase() | String | All lowercase
str.trim() | String | Remove leading/trailing whitespace
str.equals(other) | boolean | Content equality (use instead of ==)
str.equalsIgnoreCase(other) | boolean | Case-insensitive equality
str.contains(s) | boolean | True if s is a substring
str.indexOf(s) | int | First index of s, or -1
str.replace(old,new) | String | Replace all occurrences
str.split(regex) | String[] | Split into array by delimiter
String.valueOf(x) | String | Convert any type to String
Concatenation: Use + to join strings. "Hello" + " " + "World""Hello World". Any type concatenated with a String is automatically converted.
Use .equals() not == to compare String content. == compares references (memory addresses), not values.
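A quick self-contained tour of the methods above (the sample string is arbitrary):

```java
public class StringDemo {
    public static void main(String[] args) {
        String str = "  Hello World  ";
        String s = str.trim();                   // "Hello World"
        System.out.println(s.length());          // 11
        System.out.println(s.charAt(0));         // H
        System.out.println(s.substring(0, 5));   // Hello (index 5 excluded)
        System.out.println(s.toUpperCase());     // HELLO WORLD
        System.out.println(s.contains("World")); // true
        System.out.println(s.indexOf("World"));  // 6
        System.out.println(s.replace("l", "L")); // HeLLo WorLd
        String[] parts = s.split(" ");           // {"Hello", "World"}
        System.out.println(parts[1]);            // World
        System.out.println(s.equals("hello world"));           // false
        System.out.println(s.equalsIgnoreCase("hello world")); // true
    }
}
```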

Formatting Output

Three ways to control decimal precision and field width.

printf / String.format Format Specifiers

Specifier | Meaning | Example
%d | Integer | printf("%d", 42) → 42
%f | Floating point | printf("%f", 3.14) → 3.140000
%.2f | 2 decimal places | printf("%.2f", 3.14159) → 3.14
%8d | Width 8, right-align | pads with spaces on left
%-8d | Width 8, left-align | pads with spaces on right
%s | String | printf("%s", "hi") → hi
%n | Newline | platform-independent newline
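The specifiers above can be tried directly with System.out.printf (values are arbitrary):

```java
public class PrintfDemo {
    public static void main(String[] args) {
        System.out.printf("%d%n", 42);        // 42
        System.out.printf("%.2f%n", 3.14159); // 3.14
        System.out.printf("%8d%n", 42);       // "      42" (right-aligned in width 8)
        System.out.printf("%-8d|%n", 42);     // "42      |" (left-aligned in width 8)
        System.out.printf("%s is %d%n", "age", 20); // age is 20
    }
}
```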
DecimalFormat
From java.text.DecimalFormat. Formats numbers using a pattern string.
import java.text.DecimalFormat;
DecimalFormat df = new DecimalFormat("0.00");
System.out.println(df.format(3.14159)); // → 3.14

DecimalFormat df2 = new DecimalFormat("#,##0.00");
System.out.println(df2.format(1234567.8)); // → 1,234,567.80
String.format
Same specifiers as printf but returns a String instead of printing.
String s = String.format("%.2f", 3.14159); // s = "3.14"
String t = String.format("%-10s|", "Hi");  // t = "Hi        |"

Primitives vs Objects & Pass by Value/Reference

Understanding how Java passes data to methods.

The 8 Primitive Types

byte · short · int · long · float · double · boolean · char
Primitive vs Object
Primitive variables store the value directly in memory. Object variables store a reference (address) to an object on the heap. Strings are objects, not primitives.

Pass by Value (Primitives)

Pass by Value
A copy of the value is passed to the method. Changes inside the method do NOT affect the original variable.
void addTen(int x){ x += 10; } // only changes the copy
int n = 5;
addTen(n);
System.out.println(n); // still 5!

Pass by Reference (Objects)

Pass by Reference
A copy of the reference (address) is passed, so technically this is still pass by value. The method accesses the same object through that copy, and changes to the object's fields ARE visible outside the method.
void rename(Person p){ p.setName("Bob"); }
Person person = new Person("Alice", 20);
rename(person);
System.out.println(person.getName()); // "Bob" — object was changed!
Java always passes by value. For objects, the value passed is the reference (address). Reassigning the parameter inside the method doesn't affect the caller's variable — but modifying the object via that reference does.
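This sketch contrasts reassigning the parameter with mutating the object it refers to. The Person class here is a stripped-down stand-in for the one used in the examples above:

```java
// Minimal stand-in for the Person class used earlier (assumed fields/methods).
class Person {
    private String name;
    private int age;
    Person(String name, int age) { this.name = name; this.age = age; }
    String getName() { return name; }
    void setName(String name) { this.name = name; }
}

public class PassDemo {
    static void reassign(Person p) {
        p = new Person("Carol", 30); // only the local copy of the reference changes
    }

    static void mutate(Person p) {
        p.setName("Bob");            // follows the shared reference to the one object
    }

    public static void main(String[] args) {
        Person person = new Person("Alice", 20);
        reassign(person);
        System.out.println(person.getName()); // Alice (reassignment invisible to caller)
        mutate(person);
        System.out.println(person.getName()); // Bob (mutation is visible)
    }
}
```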

Random Class & Nested Loops

Generating random numbers and using nested iteration.

Random Class
import java.util.Random;
Random rng = new Random();

rng.nextInt(n)         // random int: 0 to n-1 (inclusive)
rng.nextInt(10)        // 0–9
rng.nextInt(10) + 1    // 1–10 (shift range up)
rng.nextInt(6) + 1     // 1–6 (simulate die roll)

// General range [min, max] inclusive:
rng.nextInt(max - min + 1) + min
Formula for range [a, b] inclusive: rng.nextInt(b - a + 1) + a
Nested For Loops
An inner loop runs completely for each iteration of the outer loop. Total iterations = outer count × inner count.
for(int i = 1; i <= 3; i++){
  for(int j = 1; j <= 3; j++){
    System.out.print(i * j + "\t");
  }
  System.out.println();
}
// Output: multiplication table 1..3

JUnit Testing

Testing individual methods in isolation.

Test Structure
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;

public class PersonTest {
  @Test
  public void testGetName(){
    Person p = new Person("Alice", 20);
    assertEquals("Alice", p.getName());
  }
}

Common Assert Methods

Assert | What it checks
assertEquals(expected, actual) | Two values are equal
assertNotEquals(a, b) | Two values are not equal
assertTrue(condition) | Condition is true
assertFalse(condition) | Condition is false
assertNull(obj) | Object is null
assertNotNull(obj) | Object is not null
Always test getters right after construction. Test setters by setting a value then getting it back. Test toString for expected string output.
Each test method must be annotated with @Test. If an assertion fails, the test immediately stops and reports failure. One assert per test is a good practice.