Trajectory
People optimize for local objectives and lose sight of global ones. The shape is always the same: you win at the thing directly in front of you, and in doing so, you lose something larger.
Most tasks are very local, and in ML this has become extreme. We train on language and evaluate on language. It’s the same domain, so it must be fair. But “all human knowledge” turned out to be larger than anticipated, and performance on code doesn’t improve by training on math or translation.
Consider ImageNet, which has a proper train/test split. While foundational techniques like BatchNorm, residual connections, and dropout transferred, more specific tweaks did not. For example, depthwise separable convolutions aren’t a general backbone, as they don’t work in GANs. People say this is because GANs are a new domain with new requirements, but that’s too simplistic. The more aggressive formulation: it doesn’t transfer.
If I invent AdamR, tuned for ResNet-18 on ImageNet, and it doesn’t transfer to CIFAR or language models, is it truly useful? It’s not a general algorithm. That’s why Adam is still king. Not because Adam is optimal for any particular task, but because it doesn’t require extensive tuning to work with transformers, ResNets, CIFAR, or Common Crawl.
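To make the “no extensive tuning” point concrete, here is a minimal sketch, assuming PyTorch and torchvision (the essay names neither); the specific models and learning rate are illustrative, not prescriptive. The same default Adam call serves a vision ResNet and a small transformer unchanged.

```python
# Minimal sketch, assuming PyTorch and torchvision (not specified in the essay).
# The point: one identical Adam call covers two unrelated architectures.
import torch
import torch.nn as nn
from torchvision.models import resnet18

vision_model = resnet18(num_classes=10)  # e.g. CIFAR-scale image classification
language_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4,
)  # e.g. a small text encoder

# Same optimizer, same hyperparameters, no per-architecture tuning.
opt_vision = torch.optim.Adam(vision_model.parameters(), lr=1e-3)
opt_language = torch.optim.Adam(language_model.parameters(), lr=1e-3)
```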
The pattern is obvious in retrospect: CNNs “just didn’t work on some problems,” and LSTMs failed to generalize to vision. Those early failures were the signal that they wouldn’t keep working on new domains. They went down as hacks that help early convergence, not as methods that stood the test of time.
If your method doesn’t work across existing domains, it won’t work for the next one either.
Taking a new job optimizes for local reward. Salary, status, interesting problems. But each job reshapes you. You adopt its values, its rhythms, its definitions of success. A few years in, you realize you’ve been optimized by the job more than you’ve optimized your career.
You won each decision and lost the trajectory.