OpenTelemetry tracing in 200 lines of code
Developers often treat tracing as something of a mystery, and OpenTelemetry is no exception. It can seem even more difficult because of the many new concepts you encounter in the basic examples.
The situation is made worse by the fact that as a tracing library matures into something stable and reliable, its code grows more complex: it has to handle edge cases, work in many different environments, and optimize performance while minimizing its impact on the application. This is especially true of automatic instrumentation, which can "magically" wrap or modify code that was never designed for it.
It's no wonder many developers treat tracing libraries as "black boxes": we add them to our apps, hope for the best, and rely on them at critical moments, like a middle-of-the-night incident.
In fact, tracing is much simpler than it seems. If you break it down, it is just a combination of "fancy logging" and "context propagation".
Logs
Developers are usually very comfortable working with logs. We start with Hello World! and logs stay with us for the rest of our careers. When a problem comes up, we often reach for the logs, adding something like console.log("potato")
to check whether a piece of code even runs (even with a debugger close at hand).
Logs are really helpful! However, I hope that one day someone will convince you to store them in key-value format for more structured and convenient analysis.
If you are thinking about a New Year's resolution for modernizing your infrastructure and preparing for the brave new world of distributed systems, I suggest this one:
Structure your logs.
You won’t be sorry. You will eventually be so, so glad you did.
Just do it.
It is good to have at least some consistency in your logs: make sure every log line is timestamped the same way and has a "name" field for easy searching. Most likely, you have already written functions like this in your projects:
log("user-authenticated", { userId, remaingingRateLimit });
// ...
function log(name, attributes = {}) {
console.log(
JSON.stringify(
name,
timestamp: new Date().getTime(),
...attributes,
})
);
}
{ "timestamp": 1725845013447, "name": "user-authenticated", "userId": "1234", "remaingingRateLimit": 100 }
You could also write the following code to track the execution time of a subtask:
logTiming("query-user-info", () => {
db.fetchUserInfo();
});
// ....
function logTiming(name, lambda, attributes = {}) {
let startTime = new Date().getTime();
// run some subtask
lambda();
let durationMs = new Date().getTime() - startTime;
log(name, { durationMs, ...attributes });
}
{ "timestamp": 1725845013447, "name": "query-user-info", "durationMs": 12 }
If so, congratulations! You are halfway to inventing trace spans.
Spans are ✨fancy✨ logs
A trace is a set of spans. The example below shows a trace with four different spans:
A span can really be thought of as a set of key/value pairs, much like a structured log line, with a few required fields:
- Name: describes the action or event the span represents
- Timestamp: when the span started
- Duration: how long the span lasted
- A set of IDs:
  - Trace ID: the identifier of the entire trace that connects all of its spans
  - Span ID: the unique identifier of this particular span
  - Parent Span ID: the identifier of the parent span (if any), which is what lets us build the hierarchy
Additionally, you can attach any attributes you like as key/value pairs, for example:
let span = new Span("api-req");
let resp = await api("get-user-limits");
span.setAttributes({ responseCode: resp.code });
span.end();
console.log(span);
// ...
class Span {
constructor(name, context = {}, attributes = new Map()) {
this.startTime = new Date().getTime();
this.traceID = context.traceID ?? crypto.randomBytes(16).toString("hex");
this.parentSpanID = context.spanID ?? undefined;
this.spanID = crypto.randomBytes(8).toString("hex");
this.name = name;
this.attributes = attributes;
}
setAttributes(keyValues) {
for (let [key, value] of Object.entries(keyValues)) {
this.attributes.set(key, value);
}
}
end() {
this.durationMs = new Date().getTime() - this.startTime;
}
// expose just the IDs a child span needs to inherit
getContext() {
return { traceID: this.traceID, spanID: this.spanID };
}
}
Logging a span in this format might look like this:
Span {
startTime: 1722436476271,
traceID: 'cfd3fd1ad40f008fea72e06901ff722b',
parentSpanID: undefined,
spanID: '6b65f0c5db08556d',
name: 'api-req',
attributes: Map(1) { 'responseCode' => 200 },
durationMs: 3903
}
The equivalent log entry for the span above might look like this:
{
"startTime": 1722436476271,
"traceID": "cfd3fd1ad40f008fea72e06901ff722b",
"spanID": "6b65f0c5db08556d",
"name": "api-req",
"responseCode": 200,
"durationMs": 3903
}
A trace is a set of spans
If you have ever wanted to see all the logs from a particular request, you have probably done something like this:
// generate a request ID or inherit one from your platform
let requestID = req.headers["X-REQUEST-ID"];
// ...
log("api-request-start", { requestID });
let resp = await apiRequest();
log("api-request-end", { requestID });
Which would allow you to search by request ID to see what was happening while it was being processed:
{ "timestamp": 1722436476271, "requestID": "1234", "name": "fetch-user-permissions" }
{ "timestamp": 1722436476321, "requestID": "1234", "name": "api-request-start" }
{ "timestamp": 1722436476345, "requestID": "1234", "name": "api-request-end" }
{ "timestamp": 1722436476431, "requestID": "1234", "name": "update-db-record" }
{ "timestamp": 1722436476462, "requestID": "1234", "name": "create-email-job" }
Doing tracing in this way can produce very good results! But we can do better.
A trace span contains three different IDs, which together make up the trace context. The first two are simple:
- Span ID: a random ID for each span
- Trace ID: a random ID that can be shared by multiple spans
The last one is a bit more complicated:
The Parent Span ID is what lets the system reconstruct a DAG of the whole trace once all the spans have arrived. Rendered as a tree, it gives us the waterfall view we know and love, but it's worth remembering that this is only one possible visualization of the data.
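To make that concrete, here is a small sketch (not part of our tracer, using only fields we have already defined) that groups collected spans by their parent and prints the hierarchy as an indented tree:
// Sketch: turn a flat list of finished spans into an indented tree.
// Assumes each span has spanID, parentSpanID, name, and durationMs.
function printTraceTree(spans) {
  let childrenOf = new Map();
  for (let span of spans) {
    let key = span.parentSpanID ?? "root";
    if (!childrenOf.has(key)) childrenOf.set(key, []);
    childrenOf.get(key).push(span);
  }
  function print(span, depth) {
    console.log(`${"  ".repeat(depth)}${span.name} (${span.durationMs}ms)`);
    for (let child of childrenOf.get(span.spanID) ?? []) {
      print(child, depth + 1);
    }
  }
  for (let root of childrenOf.get("root") ?? []) {
    print(root, 0);
  }
}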
Context
The context needs only two values: the trace ID and the ID of the current span. When creating a new span, we inherit the trace ID if there is one, use the context's span ID as the new span's parent span ID, generate a fresh span ID, and produce a new context.
let context = {
traceID: "cfd3fd1ad40f008fea72e06901ff722b",
spanID: "6b65f0c5db08556d",
};
We need a way to pass context in our application. We could do this manually:
let span = new Span("api-req", oldContext);
let newContext = span.getContext();
And indeed, that is what the official Go SDK does, but in most other languages this is handled implicitly by the library. In Ruby or Python you would use a thread-local variable; in Node.js you would use AsyncLocalStorage.
At this point, it helps to wrap the span creation in a helper function:
import { AsyncLocalStorage } from "node:async_hooks";
let asyncLocalStorage = new AsyncLocalStorage();
// start with an empty context
asyncLocalStorage.enterWith({ traceID: undefined, spanID: undefined });
async function startSpan(name, lambda) {
let ctx = asyncLocalStorage.getStore();
let span = new Span(name, ctx, new Map());
await asyncLocalStorage.run(span.getContext(), lambda, span);
span.end();
console.log(span);
}
startSpan("parent", async (parentSpan) => {
parentSpan.setAttributes({ outerSpan: true });
startSpan("child", async (childSpan) => {
childSpan.setAttributes({ outerSpan: false });
});
});
And now we have the basis of our tracing library! 🎉
Span {
startTime: 1725850276756,
traceID: 'b8d002c2f6ae1291e0bd29c9791c9756',
parentSpanID: '50f527cbf40230c3',
name: 'child',
attributes: Map(1) { 'outerSpan' => false },
spanID: '8037a93b6ed25c3a',
durationMs: 11.087375000000002
}
Span {
startTime: 1725850276744,
traceID: 'b8d002c2f6ae1291e0bd29c9791c9756',
parentSpanID: undefined,
name: 'parent',
attributes: Map(1) { 'outerSpan' => true },
spanID: '50f527cbf40230c3',
durationMs: 26.076625
}
Our tracing library is already quite capable at only about 60 lines of code, but it has two major drawbacks:
- We have to add it everywhere manually.
- It is limited to a single system: we have no mechanism for passing context from one service to another.
Let’s fix it!
Distributed tracing sounds scary, but at its core it just means passing trace context between systems, so that we can track which operations caused which others, and so that the data ends up wherever it is ultimately collected.
When we make an HTTP request to another system, we need to pass along the Trace ID and the current Span ID. We could add these two fields to the HTTP payload manually, but there is a W3C standard for encoding this information in an HTTP header sent as metadata with each request. The traceparent header
looks like this:
00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
The specification itself describes a large number of things that need to be understood and implemented, but for our purposes today we will ignore most of it and consider the following format:
00-{Trace ID}-{Parent Span ID}-01
This lets us parse and serialize the trace context with a couple of simple functions:
let getTraceParent = (ctx) => `00-${ctx.traceID}-${ctx.spanID}-01`;
let parseTraceParent = (header) => ({
traceID: header.split("-")[1],
spanID: header.split("-")[2],
});
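For example, round-tripping the header from the spec example above:
let header = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01";
parseTraceParent(header);
// { traceID: "0af7651916cd43dd8448eb211c80319c", spanID: "b7ad6b7169203331" }
getTraceParent(parseTraceParent(header));
// "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"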
We need to add this information to the outgoing requests and parse it for any incoming request. Instrumentation will help with this.
Wrapping code in other code
Instrumentation is a fancy term for "wrapping code in other code to track things". Real tracing libraries go to great lengths to wrap built-in and popular libraries behind the scenes when you set them up. We won't do that.
Instead, we will create a middleware for the Hono framework that can be manually added by the user.
async function honoMiddleware(c, next) {
// check the incoming request for the traceparent header
// if it exists parse and inherit the trace context
let context = EMPTY_CONTEXT;
if (c.req.header("traceparent")) {
context = parseTraceParent(c.req.header("traceparent"));
}
// set the context and wrap the whole req / res operation in a span
await setContext(context, async () => {
await startSpan(`${c.req.method} ${c.req.path}`, async (span) => {
// Before we let our app handle the request, let's pull some info about
// it off and add it to our trace.
span.setAttributes({
"http.request.method": c.req.method,
"http.request.path": c.req.path,
});
// let our app handle the request
await next();
// Pull information about how our app responded
span.setAttributes({
"http.response.status_code": c.res.status,
});
});
});
}
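The middleware above leans on a couple of helpers — EMPTY_CONTEXT, setContext, and a module-level getContext — that live in the full example code. A minimal sketch of them, built on the same AsyncLocalStorage instance we created earlier, might look like this:
// Sketch of the context helpers used by the middleware. The real versions
// live in the example repository; these just wrap the AsyncLocalStorage
// instance from earlier.
const EMPTY_CONTEXT = { traceID: undefined, spanID: undefined };

function getContext() {
  return asyncLocalStorage.getStore() ?? EMPTY_CONTEXT;
}

async function setContext(context, lambda) {
  return asyncLocalStorage.run(context, lambda);
}
Registering the middleware on a Hono app is then a one-liner:
import { Hono } from "hono";

let app = new Hono();
// apply our tracing middleware to every route
app.use("*", honoMiddleware);
app.get("/", (c) => c.text("Hello!"));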
We also need to handle outgoing HTTP requests and make sure we add the traceparent header
to them. The built-in fetch
function has no concept of middleware, so we will wrap it by hand, like a JavaScript shawarma. It's not as scary as it sounds, especially compared to how real libraries do it.
// pass the original function into our wrapping logic
function patchFetch(originalFetch) {
// return a function with the same signature, but that executes our logic too
return async function patchedFetch(resource, options = {}) {
// generate and add the traceparent header
let ctx = getContext();
if (!options.headers) {
options.headers = {};
}
options.headers["traceparent"] = getTraceParent(ctx);
// run the underlying fetch function, but wrap it in a span and
// pull out some info while we're at it
let resp;
await startSpan("fetch", async (span) => {
span.setAttributes({ "http.url": resource });
resp = await originalFetch(resource, options);
span.setAttributes({ "http.response.status_code": resp.status });
});
// pass along fetch's response. It's like we were never here
return resp;
};
}
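To put it into effect, we can swap the global fetch for the patched version:
// replace the global fetch so every outgoing request in the app is traced
globalThis.fetch = patchFetch(globalThis.fetch);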
Here you can find an example application and see our tracing in action.
Let’s send it all to Honeycomb
We can print our spans to the terminal, but that's not the most user-friendly way to look at them. Before diving into OpenTelemetry, it's worth taking a look at Honeycomb's Events API. Honeycomb used to take a simpler "just send us JSON" approach; they no longer recommend it, but it still works for our project.
The full exporter code can be found here, but the most interesting part is the payload generation logic.
// literally put all of the data together in one big json blob... like a log line!
// and then POST it to their API
function spanToHoneycombJSON(span) {
return {
...Object.fromEntries(globalAttributes),
...Object.fromEntries(span.attributes),
name: span.name,
trace_id: span.traceID,
span_id: span.spanID,
parent_span_id: span.parentSpanID,
start_time: span.startTime,
duration_ms: span.durationMs,
};
}
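Sending a span is then just a POST of that JSON blob. Here is a minimal sketch; the dataset name and API key are placeholders, and you should check Honeycomb's docs for the exact details of this legacy endpoint.
// Sketch: POST one span to Honeycomb's legacy events API.
// "my-dataset" and HONEYCOMB_API_KEY are placeholders.
async function sendToHoneycomb(span) {
  await fetch("https://api.honeycomb.io/1/events/my-dataset", {
    method: "POST",
    headers: {
      "X-Honeycomb-Team": process.env.HONEYCOMB_API_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(spanToHoneycombJSON(span)),
  });
}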
Since we are not using a standard format, we have to tell Honeycomb manually, in the dataset configuration, which fields hold the Trace ID, Span ID, and other key values. But that's all it takes to get a waterfall view of the trace!
So where is OpenTelemetry?
So we have fancy trace logs, and instrumentation and context propagation turn out to be pretty simple. Yet OpenTelemetry can still seem like a large and complex standard, especially if you have seen the specification.
It's true that OpenTelemetry is a big project, but for our purposes we only need a small part of it. Whatever language you install an OpenTelemetry SDK for, that SDK transmits data using the OpenTelemetry Protocol (OTLP), and the OpenTelemetry Collector is a set of tools for receiving, transforming, and forwarding data in OTLP format. So OTLP is the important piece here.
OTLP has its own protobuf specification that lets telemetry data be packed into a binary message that works across platforms, operating systems, and CPU architectures. In theory, we could generate a JavaScript module to parse and emit protobuf messages from the .proto files, but that feels like more work than we need.
Fortunately, the protobuf spec also provides a JSON mapping, and the spec repository has a good example to start from. So we won't overcomplicate things.
Generating the payload is a bit more involved than the old Honeycomb event format. New terms like "resource" and "scope" appear, and we have to do a little work on the attributes. But if you look closely, it's the same data. The full version is available in the repository.
function spanToOTLP(span) {
return {
resourceSpans: [
{
resource: {
attributes: toAttributes(Object.fromEntries(globalAttributes)),
},
scopeSpans: [
{
scope: {
name: "minimal-tracer",
version: "0.0.1",
attributes: [],
},
spans: [
{
traceId: span.traceID,
spanId: span.spanID,
parentSpanId: span.parentSpanID,
name: span.name,
startTimeUnixNano: span.startTime * Math.pow(10, 6),
endTimeUnixNano:
(span.startTime + span.durationMs) * Math.pow(10, 6),
kind: 2,
attributes: toAttributes(Object.fromEntries(span.attributes)),
},
],
},
],
},
],
};
}
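Two pieces used above are worth sketching out, since they don't appear in the snippet. The toAttributes helper converts a plain object into OTLP's list-of-tagged-values shape (the full version in the repository handles more types), and the finished payload gets POSTed to an OTLP/HTTP endpoint, which by convention listens on port 4318 with traces at /v1/traces. Both sketches below are illustrative, not the exact repository code.
// Sketch of toAttributes: OTLP represents attributes as a list of
// { key, value } pairs where each value is tagged with its type.
function toAttributes(obj) {
  return Object.entries(obj).map(([key, value]) => {
    if (typeof value === "boolean") return { key, value: { boolValue: value } };
    if (typeof value === "number" && Number.isInteger(value))
      return { key, value: { intValue: value } };
    if (typeof value === "number") return { key, value: { doubleValue: value } };
    return { key, value: { stringValue: String(value) } };
  });
}

// Sketch: send the payload to an OTLP/HTTP collector endpoint
// (4318 is the conventional port, /v1/traces is the traces path).
async function sendOTLP(span) {
  await fetch("http://localhost:4318/v1/traces", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(spanToOTLP(span)),
  });
}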
But thanks to the power of standards, we are not limited to one service provider. We can send data to any service that supports OpenTelemetry!
Honeycomb still works, of course.
And Baselime too!
We can also visualize telemetry locally with otel-desktop-viewer!
We can even draw data in the terminal using otel-tui.
That’s all!
And that's it. With 181 lines of code, we implemented tracing, context propagation, instrumentation, and export to a standard format that any vendor supporting it can handle. All thanks to the magic of 🌈 standards.
And is this… even… legal?
If you've been paying close attention, you may know that there is a large body of OpenTelemetry specification describing how an OpenTelemetry SDK should behave and how it should be built. Our little library does hardly any of that. It is for educational purposes only, but you can imagine a simplified SDK that doesn't conform to the standard, like this one for Cloudflare Workers: it emits OTLP but doesn't follow every part of the spec. How should we think about that?
A similar question was put to OpenTelemetry co-founder Ted Young at OTel Community Day in June. I wrote down his answer as best I could. Paraphrasing:
We don't care about the spec. Treat the SDK as a black box that takes in tracing and emits OTLP with the right semantic conventions. The true spec is the data.
The way I see it: the official SDKs are bound to follow the spec, but it's reasonable for other implementations to deviate from it as long as an outside observer can't tell the difference in behavior. Our library takes some shortcuts when parsing the traceparent header that wouldn't pass that test, but fixing them wouldn't take much code.
Okay… then why is opentelemetry-js so big?
If we can make a working tracer in less than 200 lines, why is there so much more code in the official JavaScript SDK?
Here it helps to list some of the things the JavaScript SDK handles that our library does not.
- Buffering and batching outgoing telemetry into a more efficient format. Don't send one HTTP request per span, or your telemetry vendor will want to have a word with you.
- Working in both the browser and Node environments.
- Graceful error handling, so the library can safely wrap your application's core functionality.
- Configurability. Need to do something beyond the simple example? The SDK can probably do it.
- Automatically wrapping commonly used libraries with proven instrumentation.
- Performance optimizations for use in hot code paths.
- Compliance with the semantic conventions.
- Support for two more types of telemetry: logs and metrics.
And so on! I hope this article has helped you better understand how much work goes into the robustness and extensibility layered onto libraries like this for real-world use. Building a library that works consistently everywhere we deploy JavaScript, and that supports everything a wide range of users need, is extremely non-trivial.
But tracing? We know it's just ✨fancy logging✨, right?
You can find all the code used for this article in the jmorrell/minimal-nodejs-otel-tracer repository on GitHub.