Amazon Machine Learning Engineer Interview Cheatsheet 2026

If you are preparing for an Amazon Machine Learning Engineer interview, expect more than "I know the model." You need to explain how the model works, how you would implement it, what breaks in production, and how you would decide whether it is good enough to ship.

The longer PracHub Amazon Machine Learning Engineer interview prep guide breaks this down by interview stage. This article condenses the highest-signal areas into a study guide you can use before a technical screen or onsite.

What Amazon is likely testing

For an MLE role, Amazon interviewers usually care about four things:

Can you reason from ML theory to working code?
Can you design systems that train, evaluate, and serve models reliably?
Can you debug models using metrics, data, and experiments?
Can you explain tradeoffs around latency, cost, memory, and quality?

A good answer does not stop at "use a Transformer" or "train XGBoost." You should be able to talk through tensor shapes, masks, evaluation gaps, distributed training, sparse data, online metrics, and deployment risk.

Transformers: know the internals, not just the vocabulary

Transformers are one of the highest-yield topics for an Amazon MLE interview. Be ready to explain scaled dot-product attention:

Attention(Q, K, V) = softmax((QK^T / sqrt(d_k)) + M)V

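Here, M is an additive mask: allowed positions get 0 and blocked positions get -inf, so blocked positions receive zero weight after the softmax. The sqrt(d_k) scaling keeps the logits from growing with the head dimension; without it, large logits saturate the softmax and gradients shrink.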
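
A minimal sketch of the formula above in plain Python, assuming small list-of-lists matrices with no batch or head dimensions. The helper names (softmax, causal_mask, attention) are illustrative and do not come from any library.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability; -inf entries become exp(-inf) = 0.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(T):
    # Additive mask M: 0 where attention is allowed (j <= i), -inf where blocked.
    return [[0.0 if j <= i else -math.inf for j in range(T)] for i in range(T)]

def attention(Q, K, V, M):
    d_k = len(K[0])
    out, weights = [], []
    for i, q in enumerate(Q):
        # One row of (Q K^T / sqrt(d_k)) + M.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + M[i][j]
                  for j, k in enumerate(K)]
        w = softmax(logits)
        weights.append(w)
        # Weighted sum of value rows.
        out.append([sum(w[j] * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out, weights

Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(Q, K, V, causal_mask(3))
```

With the causal mask, the first row of weights puts all of its mass on position 0, which is a quick sanity check to mention in an interview.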

For implementation questions, shape reasoning matters. Given input X with shape B x T x d_model, multi-head attention projects it into Q, K, and V, then reshapes them into something like:

B x num_heads x T x head_dim

The attention score tensor then has shape:

B x num_heads x T x T

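A common bug is calling .view() on the result of a transpose. In PyTorch, transpose returns a non-contiguous tensor, and .view() raises a RuntimeError when the requested shape is incompatible with the tensor's strides. Call .contiguous() first, or use .reshape(), which copies only when it has to.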
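
A small sketch of the head-merge step where this bug usually appears, assuming PyTorch is installed. The tensor `out` is a random stand-in for per-head attention output, and the sizes are arbitrary.

```python
import torch

B, H, T, D = 2, 3, 4, 5
out = torch.randn(B, H, T, D)      # stand-in for per-head attention output
y = out.transpose(1, 2)            # (B, T, H, D): a strided view, not a copy

try:
    y.view(B, T, H * D)            # strides no longer allow merging H and D
    view_failed = False
except RuntimeError:
    view_failed = True

merged_a = y.contiguous().view(B, T, H * D)   # explicit copy, then view
merged_b = y.reshape(B, T, H * D)             # copies only if needed
```

Both merged tensors hold the same values; the difference is only whether you make the copy explicit.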

For decoder-only models, causal masking is mandatory: token t can attend only to positions <= t. If you forget the mask, each position can see the tokens it is supposed to predict. Training loss will look great, but generation will fail because those future tokens do not exist at inference time.

You should also know the standard GPT-style block:

x = x + attention(LayerNorm(x))
x = x + MLP(LayerNorm(x))

This pre-norm layout is common because it helps gradient flow in deeper models. Post-norm matches the original Transformer pattern, but can be harder to train at scale.
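
One way to write the pre-norm block in PyTorch is sketched below. It leans on nn.MultiheadAttention instead of hand-written attention, and the 4x MLP expansion, GELU activation, and absence of dropout are assumptions chosen for brevity, not requirements. The class name PreNormBlock is illustrative.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # assumed 4x expansion
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):                      # x: (B, T, d_model)
        T = x.size(1)
        # Boolean mask: True marks positions that may NOT be attended to.
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                       # x = x + attention(LayerNorm(x))
        x = x + self.mlp(self.ln2(x))          # x = x + MLP(LayerNorm(x))
        return x
```

Note that the residual path never passes through a LayerNorm, which is the property usually credited with easier optimization in deep stacks.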

LayerNorm is another frequent follow-up. It normalizes across the hidden dimension for each token independently:

LN(x) = gamma * (x - mean) / sqrt(variance + epsilon) + beta

Unlike BatchNorm, LayerNorm does not depend on batch statistics. That helps with variable batch sizes, sequence models, and autoregressive inference.
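
The batch independence is easy to demonstrate with a plain-Python sketch of the LN formula above. The function name layer_norm and the eps default are illustrative choices.

```python
import math

def layer_norm(token, gamma, beta, eps=1e-5):
    # Statistics come from the hidden dimension of ONE token only.
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)   # biased variance
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(token, gamma, beta)]

gamma, beta = [1.0] * 4, [0.0] * 4
batch_a = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
batch_b = [[1.0, 2.0, 3.0, 4.0], [-5.0, 0.0, 5.0, 100.0]]

norm_a = [layer_norm(t, gamma, beta) for t in batch_a]
norm_b = [layer_norm(t, gamma, beta) for t in batch_b]
# The first token normalizes identically no matter what else is in the batch.
same_first_token = norm_a[0] == norm_b[0]
```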

LLMs: connect architecture to operations

For LLM questions, you need to move between model internals and production behavior.

A strong answer covers:

Decoder-only Transformer architecture
Tokenization with BPE, WordPiece, or SentencePiece
Pretraining with next-token prediction
Instruction tuning with prompt-response data
Preference alignment methods such as RLHF or DPO
Fine-tuning choices such as full fine-tuning, LoRA, QLoRA, prefix tuning, and prompt tuning
Evaluation beyond perplexity
Serving constraints such as KV cache memory, throughput, and p99 latency
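
For the KV cache item, it helps to be able to do the memory arithmetic on the spot. The sketch below assumes a standard cache that stores one key tensor and one value tensor per layer; the model configuration is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    # Factor of 2: one cached tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical config: 32 layers, 32 KV heads of size 128, fp16 (2 bytes per element).
total = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1, bytes_per_elem=2)
gib = total / 1024**3
```

For this hypothetical config the cache comes to 2 GiB for a single 4096-token sequence, and it grows linearly with both batch size and sequence length. That is why serving discussions quickly turn to batching limits and cache-reduction techniques.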

Perplexity is useful, but it is not enough. It measures next-token likelihood, not whether the model follows instructions, refuses unsafe requests correctly, produces grounded answers, or gives useful task outputs.
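
As a reminder of what perplexity actually computes, here is a minimal sketch assuming you already have per-token natural-log probabilities from the model. The function name is illustrative.

```python
import math

def perplexity(token_log_probs):
    # exp of the average negative log-likelihood per token.
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that is uniformly unsure among 4 tokens at every step.
ppl = perplexity([math.log(0.25)] * 10)
```

Nothing in this computation looks at whether the output is correct, safe, or grounded, which is the gap the other metrics have to cover.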

For a validation-system design question, structure your answer around:

Evaluation data: use golden prompts, task-specific benchmarks, adversarial sets, regression cases from past failures, and production-like prompts sampled in a privacy-safe way.
Metrics: include exact match where it fits, rubric scores, human preference win rate, hallucination or groundedness for RAG, toxicity or safety rates, refusal correctness, latency p50/p95/p99, tokens per second, and cost per request.
System components: mention a model registry, prompt/version registry, evaluation runner, deterministic inference harness, result store, dashboard, and deployment gates.
Online validation: use shadow tests, canary rollout, alerts for regressions, drift checks, and rollback criteria.
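
A deployment gate can be as simple as a per-metric regression budget. The sketch below is one possible shape, not a standard API: the function name, metric names, and thresholds are all hypothetical.

```python
def gate(baseline, candidate, max_regression, higher_is_better):
    """Return the metrics on which the candidate regresses beyond its budget."""
    failures = []
    for name, limit in max_regression.items():
        delta = candidate[name] - baseline[name]
        # Express the change as "how much worse", whatever the metric direction.
        worse_by = -delta if higher_is_better[name] else delta
        if worse_by > limit:
            failures.append(name)
    return failures

baseline = {"win_rate": 0.52, "p99_latency_ms": 900, "unsafe_rate": 0.004}
candidate = {"win_rate": 0.55, "p99_latency_ms": 1100, "unsafe_rate": 0.004}
budget = {"win_rate": 0.01, "p99_latency_ms": 100, "unsafe_rate": 0.0}
direction = {"win_rate": True, "p99_latency_ms": False, "unsafe_rate": False}

failures = gate(baseline, candidate, budget, direction)
```

Here the candidate wins on quality but is blocked on p99 latency, which is exactly the kind of tradeoff an interviewer may ask you to reason about.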

If the system is RAG-based, model quality depends on more than weights. Retrieval, chunking, embedding quality, ranking, prompt assembly, citation grounding, and index freshness all matter. Good evaluation should include retrieval recall@k, answer faithfulness, source attribution, and latency budget split across retrieval and generation.
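
Retrieval recall@k is the simplest of these to compute. A minimal sketch, assuming you have labeled relevant document IDs per query (the IDs below are made up):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the relevant documents that appear in the top-k results.
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

ranked = ["d3", "d7", "d1", "d9"]
r_at_3 = recall_at_k(ranked, {"d1", "d2"}, k=3)
r_at_2 = recall_at_k(ranked, {"d1", "d2"}, k=2)
```

This only scores the retriever. Faithfulness and attribution need separate checks on the generated answer.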

MoE: sparse compute has systems costs

Mixture-of-Experts models often replace dense MLP layers with multiple expert networks and a learned router. A token may be sent to the top-1 or top-2 experts.

The benefit is that the model can have more total parameters without activating all of them for every token. The cost is systems complexity.

In an interview, avoid saying "MoE is more efficient" without explaining the tradeoff. Good answers mention:

Load-balancing losses
Expert collapse risk
Capacity factors
Token dropping during overload
Distributed all-to-all communication
Harder batching because routing is data-dependent
Higher risk around p99 latency

Dense models are simpler to serve. MoE models can scale parameter count better relative to FLOPs, but routing and communication make training and serving harder.
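
To make capacity factors and token dropping concrete, here is a simplified routing sketch in plain Python. It assumes tokens are processed in order and that capacity is capacity_factor * num_tokens * top_k / num_experts, which is one common convention; production implementations differ in details and do this in batched tensor ops.

```python
import math

def route(router_logits, top_k, capacity_factor):
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    capacity = math.ceil(capacity_factor * num_tokens * top_k / num_experts)
    load = [0] * num_experts
    assignments = []   # (token_index, expert_index) pairs that fit
    dropped = []       # pairs that overflowed an expert's capacity
    for t, logits in enumerate(router_logits):
        ranked = sorted(range(num_experts), key=lambda e: logits[e], reverse=True)
        for e in ranked[:top_k]:
            if load[e] < capacity:
                load[e] += 1
                assignments.append((t, e))
            else:
                dropped.append((t, e))
    return assignments, dropped, load

# Every token prefers expert 0: an extreme case of imbalance.
logits = [[2.0, 1.0], [3.0, 0.0], [1.0, 0.0], [5.0, 2.0]]
assignments, dropped, load = route(logits, top_k=1, capacity_factor=1.0)
```

With a capacity factor of 1.0, half the tokens are dropped and expert 1 sits idle. Raising the capacity factor avoids the drops but pays for it in padded compute and memory, and a load-balancing loss (not shown) is what pushes the router away from this state during training.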

XGBoost: understand why it is fast

Amazon MLE interviews may still test classic ML, especially for tabular problems. XGBoost is a common topic because it mixes algorithm knowledge with systems thinking.

Gradient boosting builds an additive model:

y_hat_i^(t) = y_hat_i^(t-1) + eta * f_t(x_i)

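Each new tree f_t is fit to the negative gradient of the loss with respect to the current predictions; for squared error, that is the residual. The learning rate eta shrinks each tree's contribution. Because tree t depends on the predictions of all earlier trees, boosting rounds are inherently sequential.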

The parallelism is inside each tree:

across candidate splits
across features
across data partitions
across histogram bins
across workers in distributed training
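
The round-to-round dependency shows up clearly in a toy boosting loop. This sketch assumes squared-error loss and one-feature decision stumps, and it is plain gradient boosting, not XGBoost's actual algorithm. The function names are illustrative.

```python
def fit_stump(xs, residuals):
    # Pick the threshold that minimizes squared error; needs >= 2 distinct xs.
    best = None
    for thr in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    _, thr, lv, rv = best
    return lambda x: lv if x <= thr else rv

def boost(xs, ys, rounds, eta):
    preds = [0.0] * len(xs)
    history = []
    for _ in range(rounds):
        # Negative gradient of 0.5 * (y - pred)^2 is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)          # needs the CURRENT predictions
        preds = [p + eta * tree(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return preds, history

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
preds, history = boost(xs, ys, rounds=5, eta=0.5)
```

The loop over rounds cannot be parallelized because each call to fit_stump consumes the residuals left by the previous round. The threshold scan inside fit_stump is where parallel work is available.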

XGBoost uses second-order information. Split scoring uses gradients and Hessians, with regularization terms such as lambda and gamma. You do not need to derive every line from memory, but you should be able to explain that XGBoost uses both first and second derivatives to score split quality.
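
The split score can be written compactly once you have the gradient sums G and Hessian sums H on each side of a candidate split. A minimal sketch of that gain formula, with lam and gamma standing for the lambda and gamma regularizers (the function names are illustrative):

```python
def leaf_score(G, H, lam):
    # Structure score of a leaf holding gradient sum G and Hessian sum H.
    return G * G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    parent = leaf_score(G_L + G_R, H_L + H_R, lam)
    children = leaf_score(G_L, H_L, lam) + leaf_score(G_R, H_R, lam)
    # gamma is a fixed penalty per added leaf; a split is kept only if gain > 0.
    return 0.5 * (children - parent) - gamma

gain_no_penalty = split_gain(-4.0, 2.0, 4.0, 2.0, lam=1.0, gamma=0.0)
gain_with_penalty = split_gain(-4.0, 2.0, 4.0, 2.0, lam=1.0, gamma=6.0)
```

Larger lambda shrinks every leaf score, and larger gamma raises the bar a split must clear, which is how both act as regularizers.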

For large datasets, exact split search is expensive because it scans every distinct feature value. Histogram-based split finding buckets continuous values into a fixed number of quantile bins, usually far fewer than the number of raw thresholds. Workers build local histograms of gradient and Hessian sums, then sum them across workers. This gives better cache behavior and lower memory use, at the cost of some split precision.
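
A simplified sketch of histogram split finding with two workers follows. It assumes the bin edges are already given (real systems derive them from quantiles of the feature), uses the regularized gain without the gamma term, and all function names are illustrative.

```python
import bisect

def build_histogram(values, grads, hess, bin_edges):
    # bin_edges are sorted upper edges; a value lands in the first bin whose edge >= value.
    G = [0.0] * len(bin_edges)
    H = [0.0] * len(bin_edges)
    for v, g, h in zip(values, grads, hess):
        b = min(bisect.bisect_left(bin_edges, v), len(bin_edges) - 1)
        G[b] += g
        H[b] += h
    return G, H

def merge(hist_a, hist_b):
    # The cross-worker reduce is an element-wise sum of per-bin statistics.
    (Ga, Ha), (Gb, Hb) = hist_a, hist_b
    return [a + b for a, b in zip(Ga, Gb)], [a + b for a, b in zip(Ha, Hb)]

def best_split(G, H, bin_edges, lam=1.0):
    G_tot, H_tot = sum(G), sum(H)
    parent = G_tot ** 2 / (H_tot + lam)
    best_gain, best_edge = 0.0, None
    G_L = H_L = 0.0
    for b in range(len(bin_edges) - 1):       # candidate split: x <= bin_edges[b]
        G_L += G[b]
        H_L += H[b]
        G_R, H_R = G_tot - G_L, H_tot - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam) - parent)
        if gain > best_gain:
            best_gain, best_edge = gain, bin_edges[b]
    return best_edge, best_gain

edges = [0.25, 0.5, 0.75, 1.0]
worker_a = build_histogram([0.1, 0.2, 0.3], [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0], edges)
worker_b = build_histogram([0.7, 0.8, 0.9], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], edges)
G, H = merge(worker_a, worker_b)
edge, gain = best_split(G, H, edges)
```

The scan touches only the bins, not the rows, so its cost depends on the number of bins. The workers exchange two small arrays per feature instead of raw data.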

Also know why sparse handling matters. XGBoost learns a default direction for missing values, which helps with sparse one-hot data and missing feature values.
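
The default direction can be pictured as scoring the same split twice, once with the missing rows sent left and once with them sent right. This is a sketch of that idea under the regularized gain, with illustrative names and made-up statistics; it is not XGBoost's internal code.

```python
def gain(G_L, H_L, G_R, H_R, lam):
    score = lambda G, H: G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R))

def default_direction(G_L, H_L, G_R, H_R, G_miss, H_miss, lam=1.0):
    # G_L/H_L and G_R/H_R cover rows with a present value;
    # G_miss/H_miss cover rows where the feature is missing.
    left = gain(G_L + G_miss, H_L + H_miss, G_R, H_R, lam)
    right = gain(G_L, H_L, G_R + G_miss, H_R + H_miss, lam)
    return ("left", left) if left >= right else ("right", right)

direction, best = default_direction(-3.0, 3.0, 3.0, 3.0, G_miss=-2.0, H_miss=2.0)
```

In this example the missing rows have gradients that resemble the left child, so sending them left gives the higher gain. Only rows with present values need to be enumerated when scanning thresholds, which is what makes sparse data cheap.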

PyTorch implementation questions: be concrete

For "Implement a decoder-only GPT-style Transformer," start by clarifying scope:

"Should I implement a minimal PyTorch module with embeddings, positional encoding, masked multi-head attention, MLP blocks, and logits, or should I include training and generation too?"

Then state assumptions:

Input token IDs have shape B x T
Vocabulary size is V
Embedding dimension is C
Number of heads divides C
Output logits have shape B x T x V

Talk through token embeddings, positional embeddings or RoPE, stacked pre-norm blocks, causal masking, output projection, and loss.

Call out edge cases:

T exceeds configured context length
mask broadcasting is wrong
train/eval dropout behavior differs
causal mask is missing
.view() is used on a non-contiguous tensor
generation lacks a KV cache

A good implementation answer includes unit tests for output shape and causal leakage, plus a tiny overfit test to verify the model can learn.
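
The causal leakage test follows one pattern regardless of framework: perturb position t and check that no output before t changes. The sketch below uses two toy stand-ins for a model (a running sum and a deliberately leaky function) so it runs without PyTorch; the names are illustrative.

```python
def leaks_future(model_fn, tokens, delta=1.0):
    """Return True if changing position t alters any output at a position < t."""
    base = model_fn(tokens)
    for t in range(1, len(tokens)):
        perturbed = list(tokens)
        perturbed[t] += delta
        out = model_fn(perturbed)
        # Exact comparison is fine for these toys; use a tolerance for real models.
        if any(out[i] != base[i] for i in range(t)):
            return True
    return False

def causal_toy(tokens):
    # Output at position i depends only on tokens[0..i].
    total, out = 0.0, []
    for v in tokens:
        total += v
        out.append(total)
    return out

def leaky_toy(tokens):
    # Every position sees the full-sequence mean, including future tokens.
    mean = sum(tokens) / len(tokens)
    return [v + mean for v in tokens]

causal_ok = not leaks_future(causal_toy, [1.0, 2.0, 3.0])
leak_found = leaks_future(leaky_toy, [1.0, 2.0, 3.0])
```

For a real Transformer you would run the same check on logits in eval mode, perturbing input token IDs or embeddings.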

Behavioral answers still need metrics

The PracHub guide groups behavioral preparation under leadership principles, ownership, and measurable impact. For Amazon, that framing matters because behavioral rounds are built around the company's Leadership Principles.

Do not give vague stories like "I improved model performance." Give the situation, your decision, the tradeoff, the result, and the metric. For an MLE, strong stories often include model quality, latency, cost, reliability, data quality, rollback decisions, or experiment design.

For example, a better answer sounds like:

"We had a relevance regression after a feature pipeline change. I traced the issue to an offline/online feature mismatch, added validation checks before promotion, and cut bad launches in that area from X to Y per quarter."

Replace X and Y with your own numbers. Whatever they are, the structure should make your ownership clear.

Final prep checklist

Before the interview, make sure you can answer these without notes:

Derive and explain scaled dot-product attention
Trace Transformer tensor shapes through multi-head attention
Explain causal masking and label leakage
Compare LayerNorm and BatchNorm
Discuss KV cache memory and autoregressive latency
Explain why perplexity is not enough for LLM evaluation
Design an LLM validation system with offline and online gates
Explain MoE routing and serving tradeoffs
Explain XGBoost histogram split finding and boosting-round dependency
Write a minimal PyTorch Transformer block
Tie every model choice to quality, latency, cost, or reliability

If you want to drill with targeted prompts, the PracHub interview questions library has practice questions across ML theory, system design, coding, and behavioral topics.

For the full role-specific breakdown, use the Amazon Machine Learning Engineer interview prep guide on PracHub as your main checklist.