

Context engineering is the way

By Dan Sisco

We’ve…been working on this for over a year…and…he just…he tweeted it out.

Context engineering is the new hotness, and we’re so excited there’s a term for
this now!

Threading this needle, giving the model enough context without drowning it, is
the key to building a reliable, successful LLM application. And it happens to be
exactly what we've been working on at Bolt Foundry.

What is context?

Context is everything your model sees before it sends a response.

It’s your prompt, user messages, tool calls, conversation turns, samples, and
graders.

With too much context (like prompt stuffing) you divert the LLM’s attention and
reliability plummets. With too little context, LLMs are left guessing and
filling in gaps, which is similarly unreliable.
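Concretely, "everything your model sees" is just the message list sent on each call. A minimal sketch (the names and message shapes here are illustrative, not any specific provider's API):

```python
# Illustrative sketch: context is the full message payload the model receives.
context = [
    {"role": "system", "content": "You are a sports newsletter editor."},
    {"role": "user", "content": "Summarize last night's game."},
    {"role": "tool", "content": '{"winner": "Mets", "score": "4-2"}'},
]

def context_size(messages):
    """Rough proxy for how much material the model must attend to."""
    return sum(len(m["content"]) for m in messages)
```

Everything in that list competes for the model's attention, which is why both stuffing it and starving it hurt reliability.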

Why does this matter?

In human-to-human communication, context is key. We get context through dozens
of verbal and nonverbal cues, like body language, tone of voice, eye contact,
pitch, and more. Notably, remote work sucks because we lose the majority of
these contextual cues on Zoom.

LLMs also need context to perform reliably.

We’ve found the best way to provide this context is:

  1. Create data samples from examples of success and failure
  2. Build graders from those samples that reinforce what you want the LLM to do
  3. Structure your prompt with clear information hierarchy
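The first two steps can be sketched in a few lines. This is a deliberately tiny stand-in (real graders are usually LLM-judged against a rubric, not keyword checks), with hypothetical sample data:

```python
# A "sample" pairs an output with a human verdict on success or failure.
samples = [
    {"output": "Mets win 4-2 in extras", "label": "pass"},  # success example
    {"output": "A game happened", "label": "fail"},         # failure example
]

def make_grader(samples):
    """Build a trivial keyword grader from labeled samples.

    Illustrative only: it reinforces vocabulary seen in passing samples,
    standing in for a proper rubric-based or LLM-judged grader.
    """
    good_words = {
        word
        for s in samples if s["label"] == "pass"
        for word in s["output"].split()
    }

    def grader(output):
        return "pass" if any(w in good_words for w in output.split()) else "fail"

    return grader

grader = make_grader(samples)
```

The point of the structure, not the keyword trick, is what matters: the grader is derived from real examples of success and failure rather than written from intuition.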

Our work with samples, graders, and evals is context engineering at its core.
We’re structuring feedback and examples to optimize LLM performance, which is
exactly what Karpathy is describing.

What does this look like in practice?

We recently implemented this approach with Fastpitch, an AI-generated sports
newsletter. We wrote about it here (note: the
aibff CLI referenced there is deprecated), but the highlights are:

  1. We started by creating Ground Truth samples of stories collected by the LLM,
    scored by a human
  2. We created an additional collection of synthetic samples to reinforce the
    learning
  3. We used those data samples to build a Grader that evaluates story data
  4. We iterated on that Grader until it agreed with the Ground Truth samples
  5. We then used that Grader to adjust our prompt
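Step 4, calibrating the Grader against the Ground Truth samples, reduces to an agreement check you repeat until the score is acceptable. A minimal sketch, with hypothetical data and a toy rule standing in for an LLM-judged grader:

```python
# Human-scored Ground Truth samples (illustrative data).
ground_truth = [
    {"output": "Mets win 4-2 in extras", "human_score": "pass"},
    {"output": "A game happened", "human_score": "fail"},
]

def agreement(grader, ground_truth):
    """Fraction of Ground Truth samples where the grader matches the human."""
    matches = sum(
        1 for s in ground_truth if grader(s["output"]) == s["human_score"]
    )
    return matches / len(ground_truth)

def candidate_grader(output):
    # Toy rule standing in for a real grader: stories with concrete
    # details (here, any digit) pass; vague ones fail.
    return "pass" if any(ch.isdigit() for ch in output) else "fail"

score = agreement(candidate_grader, ground_truth)
```

If `score` falls below your target, you revise the grader and re-run the check; only once it agrees with the humans do you trust it to steer the prompt.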

Proper information hierarchy also helps LLMs perform better. We've seen this
over and over with customers.

We took one customer from 86% reliability on XML output to 100% in less than an
hour with some basic prompt tweaks.
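"Information hierarchy" here means the same instructions, reorganized so the model can tell role from constraints from output format. A hypothetical before-and-after (these prompts are invented for illustration, not the customer's actual prompt):

```python
# Before: everything in one undifferentiated run of instructions.
flat_prompt = (
    "Return XML. You summarize games. Use <game> tags. Be concise. "
    "Include winner and score. Don't add commentary."
)

# After: same instructions, grouped under a clear hierarchy.
structured_prompt = """\
# Role
You summarize baseball games.

# Constraints
- Be concise.
- Don't add commentary.

# Output format
Return XML: <game><winner>...</winner><score>...</score></game>
"""
```

Nothing was added or removed between the two; the gain comes purely from making the structure of the request explicit.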

This approach to giving the model "just the right information," with
human-graded samples, a calibrated Grader, and a correctly structured prompt, is
the heart of context engineering.

We're thrilled to see more people talking about this.

If you're interested in learning more about context engineering, and making LLM
development more science than art,
join our community on Discord.