ZERPLEX
AI IQ: How Smart Are You About AI?
What does the 'attention mechanism' in transformers fundamentally compute?
The gradient of the loss function with respect to each individual token embedding in the sequence
A weighted sum of values, where weights come from query-key similarity
The probability distribution over the entire vocabulary for predicting the next token in the sequence
A depthwise separable convolution applied across the input token embeddings
In backpropagation, what problem arises in very deep networks that makes training difficult?
The weight matrices become too large for available GPU VRAM to store during the forward pass
Every activation function becomes fully saturated at each layer, blocking all signal propagation
Gradients can vanish or explode as they propagate through many layers
The composite loss function becomes piecewise non-differentiable at most operating points
What is the key innovation of the 'Attention Is All You Need' paper (Vaswani et al., 2017)?
Replacing recurrence entirely with self-attention for sequence modeling
Introducing the concept of attention mechanisms to the field for the first time ever
Demonstrating that deep convolutional networks outperform RNNs on machine translation tasks
Inventing distributed word embeddings that capture rich semantic relationships between tokens
Why is the scaling factor 1/sqrt(d_k) used in scaled dot-product attention?
It normalizes each output vector to have unit L2 length, stabilizing all downstream layer computations in the network
It enables more efficient memory-aligned matrix multiplication on modern GPU tensor cores
It compresses the attention matrix to reduce peak memory usage during the forward pass
It keeps dot products from growing too large, preventing softmax saturation
In reinforcement learning, what is the 'credit assignment problem'?
Fairly distributing reward signal among multiple competing agents in a cooperative-competitive environment
Determining which past actions were responsible for a delayed reward
Allocating fixed computational resources optimally across different components of the policy network
Properly attributing intellectual ownership when training on data from many different sources
What is 'Goodhart's Law' and why does it matter for AI alignment?
Given sufficient data and compute, any model will eventually converge to the globally optimal solution for any task
Scaling model parameters reliably produces proportional capability gains across all downstream tasks and benchmarks
When a measure becomes a target, it ceases to be a good measure — models optimize proxies in unintended ways
AI system performance follows predictable power-law scaling curves with respect to total training compute budget
What does the Lottery Ticket Hypothesis (Frankle & Carbin, 2018) claim?
Dense networks contain sparse subnetworks that, when trained in isolation from their original initialization, match the full network's accuracy
Neural network training is fundamentally stochastic, and final performance depends primarily on lucky random seed selection
Model performance is determined by a small critical subset of training examples rather than the full training dataset
Random architecture search consistently discovers near-optimal network topologies for any given downstream task
In the context of LLMs, what does 'in-context learning' refer to?
Performing gradient-based fine-tuning on a small task-specific dataset immediately before running inference
Pre-training the model with bidirectional contextual embeddings using a masked language modeling objective
Leveraging surrounding repository files and documentation to improve code generation accuracy
Adapting behavior from examples in the prompt with no parameter updates
What is the fundamental trade-off that the bias-variance decomposition describes?
Wall-clock training speed versus final held-out test accuracy for a fixed compute budget
Overfitting (low bias, high variance) versus underfitting (high bias, low variance)
Total model parameter count versus tokens-per-second inference throughput on target hardware
Using systematically biased training data versus using high-variance noisy data
Why is RLHF used instead of just supervised fine-tuning for aligning LLMs?
RLHF requires significantly less compute and fewer human annotations than supervised fine-tuning
Supervised fine-tuning fundamentally cannot modify the behavioral patterns of a pre-trained model
Humans judge between outputs more easily than they write ideal ones
RLHF is a fully automated process that eliminates all need for human annotator involvement
What is the key insight behind the KV cache optimization in transformer inference?
Past tokens' key/value vectors don't change during autoregressive generation, so cache and reuse them
Pinning the full model weight tensors in GPU HBM eliminates PCIe transfer overhead on each forward pass
Caching the post-softmax attention probability matrices removes redundant recomputation across decoding steps
Applying vocabulary compression via byte-pair encoding merges reduces the output projection dimensionality
A model achieves 95% accuracy on a test set but fails catastrophically on slightly perturbed inputs. What is this vulnerability called?
Catastrophic forgetting during sequential multi-task continual learning
Adversarial fragility — lacking robustness to adversarial perturbations
Severe distributional overfitting to the specific statistical properties of the test set
Mode collapse resulting from insufficient diversity in the training distribution
Research on 'Directional Goodhart' conditions shows that when optimizing a reward model for one stakeholder, the impact on a hidden stakeholder depends on what geometric property?
The L2 distance between the stakeholder preference vectors in the reward model's latent embedding space
The determinant of the joint preference matrix formed by stacking all stakeholder reward vectors
The cosine similarity between the optimization target and the hidden stakeholder's preference vector
The tensor rank of the combined multi-stakeholder preference decomposition in reward feature space
The 'Bitter Lesson' (Rich Sutton) argues that AI research repeatedly shows what pattern?
Carefully hand-engineered features consistently outperform features learned automatically from raw data
Smaller, parameter-efficient models reliably surpass larger models when evaluated on out-of-distribution tasks
Symbolic reasoning systems with explicit knowledge bases are fundamentally superior to neural network approaches
General methods leveraging computation scale better than methods leveraging human domain knowledge
Recent research on 'A Bitter Lesson for Data Filtering' (Mohri et al., Stanford) found that at sufficient scale, what happens to the benefit of data filtering?
Data filtering becomes exponentially more critical as both model and dataset scale increase together
Large models benefit even from low-quality data, so the advantages of filtering diminish
Only surface-level syntactic filtering retains value; deeper semantic curation always provides consistent gains
The primary value of data filtering shifts entirely from the pre-training phase to the fine-tuning phase
In the context of model quantization, what is the key challenge that mixed-precision approaches address?
Uniformly converting every weight tensor to the same low-precision integer format across the entire model
Enabling transformer-based language models to execute inference on CPU-only hardware without GPU acceleration
Layers differ in sensitivity to precision loss, so per-layer bit allocation balances accuracy and compression
Quantization-aware training techniques are fundamentally incompatible with non-convolutional network architectures
What is 'activation steering' in the context of LLM safety?
Adding direction vectors to internal activations at inference to shift behavior without retraining
Systematically pruning specific activation functions from targeted layers to reduce total model parameter count
Replacing sigmoid activations with ReLU variants throughout the network to improve gradient flow stability
Training an auxiliary classifier network to predict which individual neurons should be activated per input
The PID Steering framework applies what engineering concept to LLM behavior control?
Store-and-forward packet switching protocols originally developed for telecommunications network engineering
Principal component analysis for dimensionality reduction of high-dimensional statistical feature spaces
Instruction-level pipelining techniques from superscalar CPU microarchitecture design
Proportional-Integral-Derivative feedback controllers from control theory
What fundamental problem do Recursive Language Models (RLMs) solve that standard transformers cannot?
They jointly generate images and text within a single unified autoregressive decoding framework
They process arbitrarily long inputs via recursive self-invocation, breaking the context window limit
They eliminate GPU dependency entirely by using sparse CPU-optimized computation throughout the network
They achieve reliable one-shot learning from a single demonstration example without any fine-tuning
Research on 'Thinking with Visual Primitives' identifies a 'Reference Gap' in multimodal models. What is this gap?
The inability to provide precise spatial pointers when reasoning about visual content
The systematic mismatch in spatial resolution between images used during training and those encountered at inference time
The persistent accuracy gap observed when comparing unimodal text-only models against unimodal vision-only models on shared benchmarks
The failure to include proper bibliographic reference citations within generated analytical text passages
The LeWorldModel uses a 'Sketched-Isotropic-Gaussian Regularizer' (SIGReg). What problem does this solve in joint-embedding architectures?
It throttles the decoder to prevent generating excessively long output sequences during autoregressive inference
It dynamically adjusts the optimizer's learning rate schedule based on gradient norm statistics at each step
It prevents representation collapse where the encoder maps all inputs to the same point
It compresses the gradient accumulation buffers to reduce peak memory consumption during distributed training
What did research on 'Alignment Whack-a-Mole' reveal about finetuning and copyrighted content in LLMs?
Safety finetuning provably and permanently erases all memorized copyrighted content from model weights
Large language models never memorize verbatim copyrighted passages during the pre-training data ingestion phase
Post-training alignment procedures cause models to completely forget all knowledge acquired during pre-training
Finetuning reactivates latent recall of copyrighted books (85-90%), despite the model claiming otherwise
The Fast-Slow Training (FST) framework separates model parameters into 'fast' and 'slow' components. What does each represent?
Fast parameters reside on GPU HBM for rapid access; slow parameters are offloaded to host CPU DRAM
Slow parameters hold persistent long-term knowledge; fast parameters hold task-specific adaptations
Fast components are the multi-head attention layers; slow components are the feed-forward network layers
Slow refers to the extended pre-training phase; fast refers to the rapid inference-time decoding phase
You're designing a reward model for an AI assistant. Optimizing for user satisfaction scores leads the model to give confidently wrong answers that users rate highly. This is an example of:
Reward hacking — exploiting the gap between the proxy metric and the true objective
Systematic underfitting caused by insufficient diversity and volume of human preference training data
Mode collapse where the generator network produces only a single high-scoring output distribution
Catastrophic forgetting of factual pre-training knowledge during the reinforcement learning phase
A researcher proposes a new attention variant that reduces complexity from O(n^2) to O(n log n) but performs worse on tasks requiring long-range dependencies. A critical thinker would ask:
Why wasn't this variant evaluated on a more comprehensive and diverse suite of standardized benchmarks?
Has the CUDA kernel implementation been profiled and optimized for the latest NVIDIA GPU microarchitectures?
What information-theoretic properties are lost, and for which tasks does this trade-off make sense?
Could the accuracy gap be closed by simply scaling the model parameters and training data proportionally?
The Generative Recursive Reasoning Model (GRAM) introduces stochastic latent transitions for reasoning. Why is stochasticity valuable in a reasoning framework?
Stochastic computation paths allow the model to skip unnecessary layers, reducing overall inference latency
It enables exploring multiple reasoning trajectories, avoiding commitment to a single wrong chain
Injecting random noise at each layer acts as an implicit regularizer that prevents training-time overfitting
It ensures the model never produces identical outputs for the same prompt, increasing response diversity
The MeMo (Memory as a Model) framework separates knowledge storage from reasoning. Why is this architecturally significant?
Decoupling memory from reasoning dramatically reduces the total parameter count required for equivalent performance
Isolating the retrieval component allows the reasoning model to achieve substantially faster token generation speed
It completely removes the pre-training stage, allowing models to be built entirely from structured knowledge bases
Knowledge can be updated in the memory module without catastrophic forgetting in the reasoning model
You have a 70B parameter model and need to deploy it on a single consumer GPU with 24GB VRAM. Which approach would be most effective?
Distill into a 7B student model and run that in standard fp16, accepting the capability loss from compression
Apply unstructured magnitude pruning to remove 90% of weights and run the sparse model in full fp32 precision
4-bit mixed-precision quantization for weights plus KV cache compression
Naively split layers between CPU DRAM and GPU VRAM with no quantization, accepting high PCIe transfer latency
A team reports a new model that beats GPT-4 on 12 benchmarks. As a critical researcher, which question is MOST important to ask first?
What specific GPU cluster hardware configuration and total compute budget were used for training this model?
Were benchmark datasets or similar data in the training set, and how was contamination tested?
What is the exact total parameter count and how does it compare to GPT-4's estimated architecture size?
Which programming language and deep learning framework were used to implement the training pipeline?
If you could pursue one research direction to accelerate progress toward more capable and aligned AI, which choice demonstrates the deepest systems-level thinking?
Continuing to scale existing transformer architectures with proportionally larger datasets and increased compute budgets across clusters
Designing more rigorous and comprehensive evaluation benchmarks that better measure true model capabilities across diverse domains
Developing next-generation AI accelerator hardware that delivers substantially higher FLOPS per watt for large-scale distributed training
Training methods where the objective inherently captures human values, not proxy metrics