Life Lessons

Here is a running list of lessons that I have learned over the years from my mistakes. This list isn’t complete, as I am very capable of finding ever more spectacular ways to fail.

Work Principles

  1. Don’t reinvent the wheel—build the cart instead. Ask the best human (or AI) expert to identify existing solutions before building anything.
  2. Never micromanage, and never allow yourself to be micromanaged.
  3. Be aware of your blind spots and hire people who complement your weaknesses.
  4. People think and work differently. Create a space for them to explain their perspectives.

Life Principles

  1. Life is short; time is your most valuable currency.
  2. Accept that you will have shortcomings. Take full responsibility for your actions and own the outcomes.
  3. In the grand scheme of things, your existence is utterly meaningless. Only the pursuit of dreams gives life its meaning.
  4. Attachment is the cause of all suffering.
  5. Kindness costs little, yields much.

Machine Learning

  1. When in doubt, start with AdamW, a learning rate in $[10^{-4}, 10^{-3}]$, and a linear warm-up.
  2. Adam is a memory hog because it tracks EMA estimates of the first two moments of the gradient. If you are VRAM-bound, test your idea with SGD first.
  3. Covariance is the dimension-scaled dot product of two mean-centered vectors. Pearson correlation is the same quantity normalized by the centered vectors’ L2 norms, so it equals the cosine of the angle between them. This is why it always lies between −1 and 1.
  4. When training with large batches, for example during pre-training, scale your learning rate as $\mathrm{lr} \sim \sqrt{\mathrm{tokens}}$. For small batch sizes like 32 or 64, linear scaling may work better.
  5. Larger models require more data and compute, but not linearly more: your model might be undertrained for its size.
  6. Floating-point precision varies with magnitude, so the result of an arithmetic expression depends on the order of operations, i.e., $(a + b) + c \neq a + (b + c)$ in general. This makes many algorithms non-deterministic on GPUs, where reduction order can vary between runs. For example, as of 2025, there is no deterministic kernel for torch.cumsum.
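
The linear warm-up in item 1 can be sketched as a plain function; `base_lr` and `warmup_steps` here are illustrative placeholder values, not recommendations:

```python
def warmup_lr(step, base_lr=3e-4, warmup_steps=1000):
    """Ramp the learning rate linearly from ~0 up to base_lr over
    warmup_steps, then hold it constant (a decay schedule would
    normally follow in real training)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Halfway through warm-up we are at half the base rate.
print(warmup_lr(499))   # 1.5e-4
print(warmup_lr(5000))  # 3e-4
```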
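A back-of-the-envelope for item 2: AdamW keeps two extra fp32 buffers per parameter (EMAs of the gradient and of its square), so its state alone is roughly twice the fp32 weight memory, while plain SGD keeps none. A minimal sketch (the helper name is made up):

```python
def optimizer_state_bytes(n_params, bytes_per_param=4):
    """Rough extra memory for optimizer state, beyond the weights:
    plain SGD stores nothing, SGD with momentum one buffer per
    parameter, Adam(W) two buffers per parameter."""
    return {
        "sgd": 0,
        "sgd_momentum": 1 * n_params * bytes_per_param,
        "adamw": 2 * n_params * bytes_per_param,
    }

# A 7B-parameter model: AdamW state alone is ~56 GB of fp32 buffers.
state = optimizer_state_bytes(7_000_000_000)
print(state["adamw"] / 1e9)  # 56.0
```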
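Item 3 in one function: Pearson correlation computed literally as the cosine between the mean-centered vectors (pure Python, no dependencies):

```python
import math

def pearson(x, y):
    """Pearson correlation as the cosine of the angle between the
    mean-centered versions of x and y. Dividing the dot product by
    len(x) instead would give the covariance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    xc = [v - mx for v in x]  # mean-center x
    yc = [v - my for v in y]  # mean-center y
    dot = sum(a * b for a, b in zip(xc, yc))
    return dot / (math.hypot(*xc) * math.hypot(*yc))

print(round(pearson([1, 2, 3], [2, 4, 6]), 6))  # 1.0, perfectly correlated
print(round(pearson([1, 2, 3], [3, 2, 1]), 6))  # -1.0, perfectly anti-correlated
```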
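The two scaling rules in item 4, side by side; the base learning rate and batch sizes are illustrative, not tuned values:

```python
import math

def scaled_lr(base_lr, base_batch, new_batch, rule="sqrt"):
    """Scale a tuned learning rate to a new batch size.
    'sqrt' scaling is the common choice at large batches;
    'linear' often works better at small batch sizes."""
    ratio = new_batch / base_batch
    return base_lr * (math.sqrt(ratio) if rule == "sqrt" else ratio)

# Going from batch 256 to 1024 (4x): sqrt doubles lr, linear quadruples it.
print(scaled_lr(3e-4, 256, 1024, "sqrt"))    # 0.0006
print(scaled_lr(3e-4, 256, 1024, "linear"))  # 0.0012
```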
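Item 6, demonstrated in a few lines of plain Python; on a GPU the same effect surfaces whenever a reduction sums values in an order that varies between runs:

```python
# Non-associativity: the result depends on how the sum is grouped.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

# Precision varies with magnitude: next to 1e16, a 1.0 vanishes entirely,
# because the gap between adjacent doubles around 1e16 is 2.0.
print((1e16 + 1.0) - 1e16)  # 0.0
print((1e16 - 1e16) + 1.0)  # 1.0
```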