End-to-End Tracing from a Shadcn UI Search Component to SkyWalking with Manual OpenTelemetry Instrumentation

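A search box looks simple, but in production it can turn into a performance black hole. When users report that "search is slow", where is the problem? Is it the delay introduced by the frontend debounce logic, or a rendering bottleneck in the component? Is it network transfer time, or a waterfall of calls between backend microservices? Without a unified view that starts at the user's action in the browser and runs through the whole stack, tracking down this kind of problem is like looking for a needle in a haystack.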

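Our backend services are already fully instrumented with SkyWalking, but the frontend has remained a blind spot. The goal here is to close that "last mile": treat a search component built with Shadcn UI as the starting point of the trace, instrument it manually with the OpenTelemetry frontend SDK to generate and propagate a trace ID, and end up with a complete trace in SkyWalking that includes the frontend interaction details.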

Pain Points and Initial Idea

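Before starting, the core pain point was clear: "observability silos" in a distributed system. Backend tracing is mature, but it cannot be correlated with what the user actually experiences in the browser. A single search operation breaks down into these stages: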

1. User input and debounce (client-side logic)
2. Issuing the network request (network latency)
3. Backend service processing (backend processing)
4. Data return and frontend rendering (client-side rendering)

Traditional monitoring only sees stage 3; the latency, errors, and performance details of stages 1, 2, and 4 are missing entirely.

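The initial idea is to use the W3C Trace Context standard and make the browser the initiator of the distributed trace. That means generating traceparent and tracestate on the frontend and injecting them as HTTP headers into requests sent to the backend. The SkyWalking agent is then expected to recognize these headers and attach the backend spans to the same trace. SkyWalking's native propagation header is sw8, so it is worth confirming that the agent version and plugins in use accept the W3C headers.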
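To make the header format concrete, here is a minimal sketch of building and parsing a traceparent value. The helper names (randomHex, formatTraceParent, parseTraceParent) are illustrative and not part of any SDK; in the setup described in this article the OpenTelemetry propagator handles this. The IDs come from Math.random, which is acceptable for a demo only.

```typescript
// The four fields carried by a W3C `traceparent` header.
interface TraceParent {
  version: string; // 2 hex chars, "00" in the current spec
  traceId: string; // 32 hex chars, shared by every span in the trace
  spanId: string; // 16 hex chars, the caller's span (parent of the server span)
  sampled: boolean; // lowest bit of the 2-hex-char trace-flags field
}

// Illustrative only: Math.random is not a cryptographic source.
function randomHex(length: number): string {
  let out = '';
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Serialize the fields as version-traceId-spanId-flags.
function formatTraceParent(tp: TraceParent): string {
  const flags = tp.sampled ? '01' : '00';
  return `${tp.version}-${tp.traceId}-${tp.spanId}-${flags}`;
}

// Parse a header value; returns null for anything malformed.
function parseTraceParent(header: string): TraceParent | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) {
    return null;
  }
  const [, version, traceId, spanId, flags] = match;
  // Version "ff" and all-zero IDs are invalid according to the spec.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) {
    return null;
  }
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// A browser-originated root context: new trace ID, new span ID, sampled.
const rootContext: TraceParent = {
  version: '00',
  traceId: randomHex(32),
  spanId: randomHex(16),
  sampled: true,
};
console.log(formatTraceParent(rootContext));
```

The backend keeps the traceId and uses the incoming spanId as the parent of its own entry span, which is what joins the two halves of the trace.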

Technology Choices and Environment Setup

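Frontend component library: Shadcn UI. We use its Command component as the search box. It is capable and composable, and its asynchronous data-fetching flow is a good target for instrumentation.
Frontend framework: Next.js (App Router).
Observability backend: Apache SkyWalking. Already deployed in production and configured to receive OTLP (OpenTelemetry Protocol) data.
Frontend tracing SDK: @opentelemetry/sdk-trace-web. This is the official OpenTelemetry SDK for the web; it supports manual span creation and context propagation.
Mock backend service: Spring Boot. A simple search endpoint, auto-instrumented by the SkyWalking Java Agent.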

Mock Backend Service

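To make the trace complete, we first stand up a quick backend service. It has a single /api/search endpoint and adds artificial delays to simulate real-world business logic.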

Core dependencies in pom.xml:

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <!-- The SkyWalking Agent is attached at startup, so no extra dependency is needed here -->
</dependencies>

Controller code, SearchController.java:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

@RestController
@RequestMapping("/api")
public class SearchController {

    private static final Logger logger = LoggerFactory.getLogger(SearchController.class);
    private static final List<String> MOCK_DATA = Arrays.asList(
        "Apple", "Banana", "Cherry", "Date", "Elderberry",
        "Fig", "Grape", "Honeydew", "Kiwi", "Lemon"
    );

    private final Random random = new Random();

    @GetMapping("/search")
    public List<String> search(@RequestParam(name = "q", defaultValue = "") String query) throws InterruptedException {
        logger.info("Received search query: {}", query);

        // Simulate the latency of a database query or heavy computation
        Thread.sleep(100 + random.nextInt(200));

        if (query.isEmpty()) {
            return MOCK_DATA;
        }

        List<String> results = MOCK_DATA.stream()
            .filter(item -> item.toLowerCase().contains(query.toLowerCase()))
            .collect(Collectors.toList());
        
        logger.info("Found {} results for query '{}'", results.size(), query);

        // Simulate the latency of a call to another internal service
        callInternalService();

        return results;
    }

    private void callInternalService() throws InterruptedException {
        logger.info("Calling internal analytics service...");
        Thread.sleep(50 + random.nextInt(100));
        logger.info("Internal analytics service call complete.");
    }
}

The SkyWalking Agent must be attached when starting this Spring Boot application.

java -javaagent:/path/to/your/skywalking-agent/skywalking-agent.jar \
     -Dskywalking.agent.service_name=my-search-service \
     -Dskywalking.collector.backend_service=127.0.0.1:11800 \
     -jar your-app.jar

Step-by-Step Implementation: Instrumenting from Scratch

1. Initialize the OpenTelemetry Web SDK

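This is the most important step. The Tracer Provider has to be initialized at the top level of the application. In a Next.js App Router project, the browser SDK is best initialized in a top-level Provider implemented as a Client Component. Note that the Next.js instrumentation.ts hook runs on the server, so on its own it does not cover browser-side tracing.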

First, install the dependencies:

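npm install @opentelemetry/api \
            @opentelemetry/core \
            @opentelemetry/sdk-trace-web \
            @opentelemetry/exporter-trace-otlp-http \
            @opentelemetry/instrumentation \
            @opentelemetry/instrumentation-fetch \
            @opentelemetry/context-zone \
            @opentelemetry/resources \
            @opentelemetry/semantic-conventions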

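Then create lib/telemetry.ts to encapsulate the initialization logic. The pitfall here is that every OTel Web SDK operation must happen on the client, so make sure this code never executes during server-side rendering (SSR).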

lib/telemetry.ts:

// Important: This code should only run on the client side.
// We will ensure this by calling it within a 'use client' component's useEffect.

import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { WebTracerProvider, BatchSpanProcessor } from '@opentelemetry/sdk-trace-web';
import { ZoneContextManager } from '@opentelemetry/context-zone';
import { W3CTraceContextPropagator } from '@opentelemetry/core';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

// A singleton to ensure the provider is initialized only once.
let provider: WebTracerProvider | null = null;

export const initTracer = () => {
  // Avoid re-initializing in case of HMR or other re-renders.
  if (provider) {
    return;
  }
  
  // 1. Create a Resource to identify our application
  // This will be attached to all spans.
  const resource = new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'shadcn-search-frontend',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  });

  // 2. Configure the OTLP Exporter
  // This sends traces to our collector (e.g., SkyWalking OAP)
  const exporter = new OTLPTraceExporter({
    // URL of your OpenTelemetry Collector or SkyWalking OAP OTLP/HTTP endpoint.
    // Port 12800 is the OAP REST port in a default setup; adjust this to your deployment.
    url: 'http://localhost:12800/v1/traces', 
    headers: {}, // You can add auth headers here if needed
  });

  // 3. Create a WebTracerProvider
  provider = new WebTracerProvider({
    resource: resource,
  });

  // 4. Use BatchSpanProcessor for better performance
  // It batches spans before sending them to the exporter.
  provider.addSpanProcessor(new BatchSpanProcessor(exporter, {
    // The maximum queue size. After the size is reached spans are dropped.
    maxQueueSize: 100,
    // The maximum batch size of spans for a single export.
    maxExportBatchSize: 10,
    // The interval between two consecutive exports
    scheduledDelayMillis: 500,
  }));

  // 5. Use ZoneContextManager to automatically manage context across async operations.
  // This is crucial for web applications.
  provider.register({
    contextManager: new ZoneContextManager(),
    propagator: new W3CTraceContextPropagator(), // This is the key for cross-service context propagation
  });

  // 6. Register automatic instrumentations
  // We are instrumenting the 'fetch' API to automatically create spans for outgoing requests
  // and inject the W3C Trace Context headers.
  registerInstrumentations({
    instrumentations: [
      new FetchInstrumentation({
        // You can configure it to ignore certain requests, e.g., requests to your own assets.
        ignoreUrls: [/.*\/_next\/.*/],
        // Cross-origin URLs that should also receive the W3C trace context headers.
        propagateTraceHeaderCorsUrls: [
          new RegExp('^http://localhost:8080/.*'), // Your backend API URL
        ],
        clearTimingResources: true,
      }),
    ],
  });

  console.log("OpenTelemetry Web Tracer initialized.");
};

Next, create a TelemetryProvider component that calls this initialization function.

components/telemetry-provider.tsx:

'use client';

import { useEffect } from 'react';
import { initTracer } from '@/lib/telemetry';

export function TelemetryProvider({ children }: { children: React.ReactNode }) {
  useEffect(() => {
    // useEffect ensures this runs only on the client, and only once.
    initTracer();
  }, []);

  return <>{children}</>;
}

Finally, use it in your layout.tsx:

app/layout.tsx:

import { TelemetryProvider } from '@/components/telemetry-provider';

export default function RootLayout({ children }: { children: React.ReactNode }) {
  return (
    <html lang="en">
      <body>
        <TelemetryProvider>
          {children}
        </TelemetryProvider>
      </body>
    </html>
  );
}

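At this point the base setup is done. FetchInstrumentation automatically creates a span for every fetch request to http://localhost:8080 and injects the traceparent header. Because the page and the API are served from different origins, the backend also has to allow cross-origin requests that carry this header. This is already enough to link the frontend and backend traces, but we want finer-grained manual instrumentation as well.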
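The rule that decides whether a fetch request receives traceparent can be modeled roughly as follows: same-origin requests always get the header, and cross-origin requests get it only when the URL matches an entry in propagateTraceHeaderCorsUrls. The function below is a simplified stand-in written for this article, not the library's code; shouldInjectTraceHeaders is a hypothetical name, the origin check is string-based, and http://localhost:3000 is an assumed dev origin for the Next.js app.

```typescript
// Simplified model of the header-propagation decision for an outgoing request.
function shouldInjectTraceHeaders(
  requestUrl: string,
  pageOrigin: string,
  corsAllowList: RegExp[],
): boolean {
  // Relative URLs ("/api/...") resolve against the page, so they are same-origin.
  const isRelative = requestUrl.startsWith('/') && !requestUrl.startsWith('//');
  const sameOrigin =
    isRelative || requestUrl === pageOrigin || requestUrl.startsWith(pageOrigin + '/');
  if (sameOrigin) {
    return true;
  }
  // Cross-origin: inject only if the URL was explicitly allow-listed.
  return corsAllowList.some((pattern) => pattern.test(requestUrl));
}

// Assumed values for illustration.
const pageOrigin = 'http://localhost:3000';
const corsAllowList = [new RegExp('^http://localhost:8080/.*')];

const toBackend = shouldInjectTraceHeaders(
  'http://localhost:8080/api/search?q=apple',
  pageOrigin,
  corsAllowList,
);
const toThirdParty = shouldInjectTraceHeaders(
  'https://example.com/analytics',
  pageOrigin,
  corsAllowList,
);
console.log({ toBackend, toThirdParty });
```

Keeping the allow list narrow matters: an unexpected traceparent header on a third-party request can turn a simple request into one that needs a CORS preflight.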

2. Manually Instrument the Shadcn UI Command Component

Now let's rework the search component. The goal is a parent span that covers the whole "search session", with child spans inside it recording the key stages.

gantt
    title Search Operation Trace
    dateFormat  X
    axisFormat  %Lms

    section Frontend Interaction
    Search Session (Parent Span)    : 0, 1200
    UserInput Debounce               : 50, 450
    Data Fetching (Client)           : 450, 850
    Results Rendering                : 850, 950

    section Backend Processing
    Backend API Call (Child of Data Fetching) : 455, 845
    DB Query                                  : 505, 705
    Internal Service Call                     : 705, 805

This Mermaid chart shows the parent-child span relationships we want to trace.
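As a worked example of what such a trace lets you compute, the sketch below takes the illustrative offsets from the chart and derives per-stage durations. The stage names are shortened labels for the chart rows, and the numbers are the chart's sample values, not measurements.

```typescript
// One row of the chart: start and end offsets in milliseconds.
interface Stage {
  name: string;
  start: number;
  end: number;
}

const stages: Stage[] = [
  { name: 'search-session', start: 0, end: 1200 },
  { name: 'debounce', start: 50, end: 450 },
  { name: 'client-fetch', start: 450, end: 850 },
  { name: 'render', start: 850, end: 950 },
  { name: 'backend-api', start: 455, end: 845 },
  { name: 'db-query', start: 505, end: 705 },
  { name: 'internal-call', start: 705, end: 805 },
];

function durationOf(name: string): number {
  const stage = stages.find((s) => s.name === name);
  if (!stage) {
    throw new Error(`unknown stage: ${name}`);
  }
  return stage.end - stage.start;
}

// Time the request spent outside the backend: 400 - 390 = 10 ms.
const networkOverhead = durationOf('client-fetch') - durationOf('backend-api');

// Backend time not explained by its two children: 390 - 200 - 100 = 90 ms.
const backendSelfTime =
  durationOf('backend-api') - durationOf('db-query') - durationOf('internal-call');

// Session time with no stage running: 1200 - 400 - 400 - 100 = 300 ms.
const sessionIdle =
  durationOf('search-session') -
  durationOf('debounce') -
  durationOf('client-fetch') -
  durationOf('render');

console.log({ networkOverhead, backendSelfTime, sessionIdle });
```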

components/observable-search.tsx:

'use client';

import * as React from 'react';
import { Command as CommandPrimitive } from 'cmdk';
import { Search } from 'lucide-react';
import { context, trace, Span, SpanStatusCode } from '@opentelemetry/api';

import { cn } from '@/lib/utils';
import { CommandDialog, CommandEmpty, CommandGroup, CommandInput, CommandItem, CommandList } from '@/components/ui/command';

// Custom hook for debouncing
const useDebounce = <T,>(value: T, delay: number): T => {
  const [debouncedValue, setDebouncedValue] = React.useState<T>(value);
  React.useEffect(() => {
    const handler = setTimeout(() => {
      setDebouncedValue(value);
    }, delay);
    return () => {
      clearTimeout(handler);
    };
  }, [value, delay]);
  return debouncedValue;
};

// Define a type for our search results
interface SearchResult {
  id: string;
  value: string;
}

export function ObservableSearch() {
  const [open, setOpen] = React.useState(false);
  const [inputValue, setInputValue] = React.useState('');
  const [results, setResults] = React.useState<SearchResult[]>([]);
  const [loading, setLoading] = React.useState(false);

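  // Get a tracer instance once; the name should be descriptive. Memoizing keeps the
  // reference stable so the effects below do not re-run because of a new tracer object.
  const tracer = React.useMemo(() => trace.getTracer('shadcn-search-component-tracer'), []);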
  
  // Store the parent span in a ref to manage its lifecycle across re-renders.
  const parentSpanRef = React.useRef<Span | null>(null);

  const debouncedSearchTerm = useDebounce(inputValue, 400); // 400ms debounce delay

  React.useEffect(() => {
    // Keyboard shortcut to open the search dialog
    const down = (e: KeyboardEvent) => {
      if (e.key === 'k' && (e.metaKey || e.ctrlKey)) {
        e.preventDefault();
        setOpen((open) => !open);
      }
    };
    document.addEventListener('keydown', down);
    return () => document.removeEventListener('keydown', down);
  }, []);

  React.useEffect(() => {
    if (open) {
      // When the dialog opens, start a new parent span for the entire search session.
      // This span will encompass all actions until the dialog is closed.
      parentSpanRef.current = tracer.startSpan('search-session');
      console.log('Started search-session span:', parentSpanRef.current.spanContext().traceId);
    } else {
      // When the dialog closes, end the parent span.
      if (parentSpanRef.current) {
        parentSpanRef.current.end();
        parentSpanRef.current = null;
        console.log('Ended search-session span.');
      }
    }
  }, [open, tracer]);

  React.useEffect(() => {
    const performSearch = async () => {
      if (!debouncedSearchTerm) {
        setResults([]);
        return;
      }
      if (!parentSpanRef.current) return; // Should not happen if dialog is open

      // Use the parent span's context to create a child span.
      const ctx = trace.setSpan(context.active(), parentSpanRef.current);
      
      // Create a child span for the fetch operation.
      const fetchSpan = tracer.startSpan('perform-search-and-render', undefined, ctx);

      // We wrap the entire async operation in a context to ensure all subsequent
      // async tasks (like `fetch` and `.json()`) are associated with this span.
      await context.with(trace.setSpan(context.active(), fetchSpan), async () => {
        try {
          setLoading(true);
          fetchSpan.addEvent('Search initiated', { query: debouncedSearchTerm });

          // The FetchInstrumentation will create its own span. This `perform-search-and-render`
          // span acts as a business-logic wrapper around it.
          const response = await fetch(`http://localhost:8080/api/search?q=${encodeURIComponent(debouncedSearchTerm)}`);

          if (!response.ok) {
            throw new Error(`HTTP error! status: ${response.status}`);
          }
          
          const data: string[] = await response.json();
          fetchSpan.setAttribute('search.results.count', data.length);
          
          // Now, we'll measure the rendering time.
          const renderSpan = tracer.startSpan('render-results', undefined, trace.setSpan(context.active(), fetchSpan));
          
          setResults(data.map((item, index) => ({ id: `${item}-${index}`, value: item })));
          
          // End the span in requestAnimationFrame. The callback runs just before the next
          // browser paint, so this approximates when the user sees the results.
          requestAnimationFrame(() => {
            renderSpan.end();
            console.log('Ended render-results span.');
          });
          
          fetchSpan.setStatus({ code: SpanStatusCode.OK });
        } catch (error) {
          console.error('Search failed:', error);
          if (error instanceof Error) {
            fetchSpan.recordException(error);
            fetchSpan.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
          }
        } finally {
          setLoading(false);
          fetchSpan.end();
          console.log('Ended perform-search-and-render span.');
        }
      });
    };

    performSearch();
  }, [debouncedSearchTerm, tracer]);

  return (
    <>
      <button
        onClick={() => setOpen(true)}
        className="text-sm text-muted-foreground p-2 border rounded-md"
      >
        Search... <kbd className="ml-4 pointer-events-none inline-flex h-5 select-none items-center gap-1 rounded border bg-muted px-1.5 font-mono text-[10px] font-medium text-muted-foreground opacity-100">⌘K</kbd>
      </button>
      <CommandDialog open={open} onOpenChange={setOpen}>
        <CommandInput 
            placeholder="Type a command or search..." 
            value={inputValue} 
            onValueChange={setInputValue}
        />
        <CommandList>
          {loading && <div className="p-4 text-sm">Loading...</div>}
          {!loading && !results.length && debouncedSearchTerm && <CommandEmpty>No results found.</CommandEmpty>}
          <CommandGroup heading="Results">
            {results.map((result) => (
              <CommandItem key={result.id} onSelect={() => setOpen(false)}>
                {result.value}
              </CommandItem>
            ))}
          </CommandGroup>
        </CommandList>
      </CommandDialog>
    </>
  );
}

Key points in the code:

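Getting a tracer: trace.getTracer(...) returns a tracer instance, the entry point for all manual instrumentation.
Parent span management: we hold the search-session parent span in a useRef. It is created when the dialog opens and ended when the dialog closes, so its lifetime is tied to a single search interaction.
Context propagation: context.with(trace.setSpan(context.active(), fetchSpan), async () => { ... }) is the core pattern in OpenTelemetry JS. It ensures that any child span created inside the async block (including those created by auto-instrumentation such as FetchInstrumentation) gets fetchSpan as its parent.
Fine-grained child spans: perform-search-and-render wraps the whole asynchronous fetch-and-render flow, and inside it render-results measures the time from receiving the data to the browser's next frame. Ending the render span in requestAnimationFrame is a common trick for better accuracy; the callback fires just before the paint, so the value is an approximation.
Attributes and events: span.setAttribute() adds structured data (such as the result count), and span.addEvent() records points in time during the operation (such as the start of the search).
Error handling: in the catch block, span.recordException() records the error details and span.setStatus() marks the span as failed. The SkyWalking UI shows this prominently in red.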
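The status-and-end bookkeeping described above can be folded into one small helper. This is a sketch under assumptions: finishSpan, SpanLike, and RecordingSpan are hypothetical names, SpanLike is only a structural subset of the real Span interface, and the status constants mirror the numeric values of SpanStatusCode in @opentelemetry/api. Unlike the component code, it also records failures that are not Error instances.

```typescript
// Structural subset of the OpenTelemetry Span interface used by this helper.
interface SpanLike {
  setStatus(status: { code: number; message?: string }): unknown;
  recordException(exception: Error): void;
  end(): void;
}

// Numeric values mirror SpanStatusCode.OK and SpanStatusCode.ERROR.
const STATUS_OK = 1;
const STATUS_ERROR = 2;

// Set the final status, record the failure if there was one, and always end the span.
function finishSpan(span: SpanLike, error?: unknown): void {
  if (error === undefined) {
    span.setStatus({ code: STATUS_OK });
  } else {
    // Normalize thrown strings and other values so they are not silently dropped.
    const err = error instanceof Error ? error : new Error(String(error));
    span.recordException(err);
    span.setStatus({ code: STATUS_ERROR, message: err.message });
  }
  span.end();
}

// In-memory stand-in for a span, used here only to show the call order.
class RecordingSpan implements SpanLike {
  calls: string[] = [];
  setStatus(status: { code: number; message?: string }): this {
    this.calls.push(`status:${status.code}`);
    return this;
  }
  recordException(exception: Error): void {
    this.calls.push(`exception:${exception.message}`);
  }
  end(): void {
    this.calls.push('end');
  }
}

const demoSpan = new RecordingSpan();
finishSpan(demoSpan, new Error('HTTP error! status: 500'));
console.log(demoSpan.calls);
```

Funnelling every exit path through one function makes it harder to leave a span open or to end it without a status.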

Verifying the Result in SkyWalking

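With the code in place, run the frontend and backend. Open the search box, type a query, and wait for the results. Then open the SkyWalking UI; you should see a new trace named search-session initiated by the shadcn-search-frontend service.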

Open the trace and you will see the full call waterfall:

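The root span is search-session; its duration is the total time from opening to closing the search box.
Below it is the perform-search-and-render child span.
Under perform-search-and-render are the render-results span and an HTTP GET span generated automatically by FetchInstrumentation.
Click the HTTP GET span and you will find it connected to the /api/search endpoint of the backend service my-search-service.
Inside the backend span, the two simulated Thread.sleep phases (the database query and the internal service call) show up in the endpoint's duration. The agent does not create spans for plain methods on its own, so seeing them as separate child spans requires extra instrumentation, for example the @Trace annotation from SkyWalking's apm-toolkit-trace.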

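Now when a user says "search is slow", we are no longer guessing. From this view we can quantify the time spent in each stage: is the frontend debounce too long? Is network latency too high? Is the backend database query slow? Or does rendering take too long after the data arrives? The answers are all visible at a glance.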

Limitations and Future Work

This approach is powerful, but it has limitations and room for improvement.

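First, sampling. The current implementation traces every search session, which in production generates a large volume of data and noticeable overhead. The OpenTelemetry SDK supports configurable samplers, such as TraceIdRatioBasedSampler, to collect only a fraction of requests. In a real project the sampling strategy should reflect business importance and cost, for example full collection for high-value users or specific features and probabilistic sampling for ordinary traffic.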
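The idea behind ratio-based sampling can be sketched in a few lines: derive a number from the trace ID and keep the trace only if it falls below a threshold. The function below illustrates the principle and is not the SDK's implementation; shouldSampleTrace is a hypothetical name, and using the last 8 hex characters as the bucket is an assumption made for this example.

```typescript
// Decide deterministically from the trace ID whether a trace is kept.
// The same trace ID always yields the same decision for a given ratio.
function shouldSampleTrace(traceId: string, ratio: number): boolean {
  if (ratio >= 1) {
    return true;
  }
  if (ratio <= 0) {
    return false;
  }
  // Treat the last 8 hex chars (32 bits) as a number in [0, 0xffffffff].
  const bucket = parseInt(traceId.slice(-8), 16);
  if (Number.isNaN(bucket)) {
    return false;
  }
  // Keep the trace when the bucket falls in the lowest `ratio` share of the range.
  return bucket < Math.floor(ratio * 0x100000000);
}

// Rough check of the sampled share over evenly spaced buckets.
let kept = 0;
const total = 1000;
for (let i = 0; i < total; i++) {
  const suffix = Math.floor((i / total) * 0x100000000).toString(16);
  const traceId = '0af7651916cd43dd8448eb21' + ('00000000' + suffix).slice(-8);
  if (shouldSampleTrace(traceId, 0.1)) {
    kept++;
  }
}
console.log(`${kept} of ${total} traces kept`);
```

In the real setup the equivalent would be to pass a sampler to the provider, for example new WebTracerProvider({ resource, sampler: new TraceIdRatioBasedSampler(0.1) }), where 0.1 is only a placeholder ratio.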

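Second, the maintenance cost of manual instrumentation. Manual instrumentation offers great flexibility and precision, but it requires developers to understand the business logic in depth and to write extra code. As the business logic changes, the instrumentation has to be maintained alongside it. A future direction is to combine automatic and manual instrumentation: let automatic instrumentation cover generic scenarios (route changes, resource loading) and add fine-grained manual instrumentation on core business paths.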

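Finally, correlating observability data. So far we have looked only at traces. A complete observability platform should tie traces, metrics, and logs together. For example, we could put the user's anonymous ID on the span; when a user reports a problem, that ID quickly retrieves all of their traces and related logs from that time, which speeds up troubleshooting considerably. Adding frontend performance metrics (such as LCP and FID) as span attributes is another valuable direction.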
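A possible shape for that correlation data is a flat attribute map built once per session. This is only a sketch: buildCorrelationAttributes and the attribute keys (app.user.anonymous_id, app.web_vitals.lcp_ms, app.web_vitals.fid_ms) are made-up names for illustration, not established conventions, and the vitals are assumed to be collected elsewhere.

```typescript
type AttributeValue = string | number | boolean;

// Web vitals captured elsewhere on the page; both fields are optional.
interface WebVitalsSnapshot {
  lcpMs?: number;
  fidMs?: number;
}

// Build a flat attribute map suitable for attaching to a span.
function buildCorrelationAttributes(
  anonymousUserId: string,
  vitals: WebVitalsSnapshot,
): Record<string, AttributeValue> {
  const attributes: Record<string, AttributeValue> = {
    'app.user.anonymous_id': anonymousUserId,
  };
  // Only include metrics that were actually measured.
  if (vitals.lcpMs !== undefined) {
    attributes['app.web_vitals.lcp_ms'] = Math.round(vitals.lcpMs);
  }
  if (vitals.fidMs !== undefined) {
    attributes['app.web_vitals.fid_ms'] = Math.round(vitals.fidMs);
  }
  return attributes;
}

const sessionAttributes = buildCorrelationAttributes('anon-7f3c', { lcpMs: 2412.6 });
console.log(sessionAttributes);
```

The resulting object could be passed to span.setAttributes(...) on the search-session span so that every trace from the same visitor carries the same searchable ID.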
