Performance profiling

Under reproducible load, capture flame graphs and allocation traces; separate profiler overhead from real hotspots and follow with targeted micro-benchmarks.

The SKILL names the profiler (async-profiler, pprof, Chrome Performance, etc.), sampling duration, and production safety switches; avoid rewriting hot paths before measuring.

For GC, lock, and I/O waits, cross-check metrics (latency percentiles, saturation) with stacks so you don’t mistake symptoms for root causes.

Hypothesis → repro → sample → flame graph → verify

  [ Performance hypothesis / user-visible symptom ]
        │
        ▼
  ┌─────────────┐     Pin: version, data size, hardware, concurrency, cache state
  │ Repro load  │──── Cold vs steady-state runs—don’t mix JIT warmup conclusions
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Periodic stack samples (CPU) or events (alloc, lock, I/O)
  │  Profiler   │──── Long enough for multiple requests/iterations; throttle in prod
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Fold identical stacks; width = share of samples on that path
  │ Flame graph │──── Self time vs children; watch inlining / missing native frames
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Micro-benchmark or A/B; keep numbers + regression tests
  │   Verify    │──── Document why this path is hot and the rollback plan
  └─────────────┘

Sampling is statistical: too few samples make bar widths noisy. Before raising the sampling frequency, assess the impact on production traffic.

Sample rate, duration, and overhead

CPU sampling: periodic stack grabs (e.g. every 10–100 ms). Shorter intervals stabilize the statistics but increase interrupt and buffer cost. Prefer a long enough window at moderate frequency over ultra-short, high-frequency bursts.
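The trade-off between interval and duration can be put into rough numbers. The sketch below is illustrative only: `samplePlan` and its parameters are hypothetical names, and it assumes independent samples taken from a single busy thread, which real profilers only approximate.

```javascript
// Sketch: estimate how noisy a flame-graph bar is for a given capture plan.
// Assumption: independent samples from one busy thread (a simplification).
function samplePlan({ intervalMs, durationSec, share }) {
  // Number of stack samples collected over the capture window.
  const samples = Math.floor((durationSec * 1000) / intervalMs);
  // Binomial standard error of an observed share over that many samples.
  const stdErr = Math.sqrt((share * (1 - share)) / samples);
  return { samples, stdErr };
}

// A 1-second burst at 1 ms vs a 60-second window at 10 ms,
// for a path that truly accounts for 10% of samples.
const burst = samplePlan({ intervalMs: 1, durationSec: 1, share: 0.1 });
const steady = samplePlan({ intervalMs: 10, durationSec: 60, share: 0.1 });
console.log(burst, steady);
```

Under these assumptions the burst yields 1,000 samples and the longer window 6,000, so the longer, lower-frequency capture gives the tighter estimate while interrupting the process less often.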

Alloc / locks / wall-clock: often event- or timer-driven. Document whether allocation tracing is enabled, whether both user and kernel frames are captured, and whether CI uses smaller datasets for smoke profiles.

  • Separate cold-start vs steady-state captures to avoid JIT confusion.
  • Label dataset size and hardware for large inputs.
  • Redact profiles before sharing or archiving.

Node.js CPU profiling: capture V8 tick files with --prof and generate flame graphs:

# 1. Start Node.js with --prof (generates isolate-*.log)
node --prof app.js &
APP_PID=$!

# Apply load (use autocannon / wrk or similar)
npx autocannon -c 50 -d 30 http://localhost:3000/api/heavy

kill $APP_PID

# 2. Convert the isolate-*.log tick log into a human-readable summary
node --prof-process isolate-*.log > processed.txt

# 3. Use 0x to generate interactive flame graph HTML in one step
npx 0x -- node app.js
# Automatically generates flame graph in ./{pid}.0x/; open in browser to interact

# 4. Use clinic.js suite (friendlier diagnostic tooling)
npm install -g clinic
clinic doctor -- node app.js        # comprehensive diagnosis
clinic flame  -- node app.js        # CPU flame graph
clinic bubbleprof -- node app.js    # async call bubble chart

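// 5. Micro-benchmark a specific code path (benchmark.js; this part is JavaScript, run it with node)
// slowMethod and fastMethod are placeholders for the two implementations under test
const Benchmark = require('benchmark');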
const suite = new Benchmark.Suite();
suite
  .add('method A', () => { slowMethod(); })
  .add('method B', () => { fastMethod(); })
  .on('cycle', (e) => console.log(String(e.target)))
  .on('complete', function() {
    console.log('Fastest: ' + this.filter('fastest').map('name'));
  })
  .run({ async: true });
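
To turn micro-benchmark numbers into a regression check, a comparison against a stored baseline might look like the sketch below. The function name `checkRegression` and the 5% tolerance are assumptions for illustration, not part of benchmark.js; pick a tolerance from the run-to-run noise you actually observe.

```javascript
// Sketch: compare a benchmark result against a stored baseline.
// Assumption: both values are ops/sec measured on the same hardware and inputs.
function checkRegression(baselineOpsPerSec, currentOpsPerSec, tolerance = 0.05) {
  // Relative change; negative means the current build is slower.
  const change = (currentOpsPerSec - baselineOpsPerSec) / baselineOpsPerSec;
  return { change, regressed: change < -tolerance };
}

const slower = checkRegression(1000, 930); // 7% slower than baseline
const noise = checkRegression(1000, 980);  // 2% slower, inside the tolerance
console.log(slower.regressed, noise.regressed);
```

A CI job could fail when `regressed` is true and store the new number as the baseline only after a deliberate review.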

Reading flame graphs

Classic CPU flame graphs: horizontal width shows relative share of samples on that path (not chronological order); vertical depth is call depth, usually entry → hot leaves.

Look here first

  • Widest “plateaus”: among sibling callees, width shows the dominant contributor.
  • Sudden widening vs baseline: pin regressions to the new path.
  • Deep recursion or repeating motifs: N+1 queries, blocking, or leaky abstractions.
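
Spotting "sudden widening vs baseline" can be partly automated by diffing two collapsed-stack exports. This is a minimal sketch, assuming the common folded format of one `frameA;frameB;frameC count` entry per line; the helper names and sample data are hypothetical.

```javascript
// Parse collapsed-stack text into a Map of stack -> sample count.
function parseFolded(text) {
  const counts = new Map();
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    const idx = trimmed.lastIndexOf(' ');
    const stack = trimmed.slice(0, idx);
    const n = Number(trimmed.slice(idx + 1));
    counts.set(stack, (counts.get(stack) || 0) + n);
  }
  return counts;
}

// Convert counts to shares so runs with different sample totals compare fairly.
function shares(counts) {
  let total = 0;
  for (const n of counts.values()) total += n;
  const out = new Map();
  for (const [stack, n] of counts) out.set(stack, n / total);
  return out;
}

// Stacks in the candidate run, sorted by how much their share grew.
function widened(baselineText, candidateText) {
  const base = shares(parseFolded(baselineText));
  const cand = shares(parseFolded(candidateText));
  const rows = [];
  for (const [stack, share] of cand) {
    rows.push({ stack, delta: share - (base.get(stack) || 0) });
  }
  return rows.sort((a, b) => b.delta - a.delta);
}

const baseline = 'main;handle;parse 60\nmain;handle;render 40';
const candidate =
  'main;handle;parse 50\nmain;handle;render 30\nmain;handle;validate 20';
const rows = widened(baseline, candidate);
console.log(rows[0]);
```

In this made-up data the new `validate` path tops the list because it went from 0% to 20% of samples. Comparing shares rather than raw counts keeps the diff meaningful when the two captures differ in length.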

Common misreads

  • “On stack” ≠ high self time—use folded views or self metrics.
  • Inlining and tail calls: symbols may appear under caller names.
  • Heavy GC with CPU-only views—add allocation or heap profiles.
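
The difference between being on the stack and having high self time can be computed directly from collapsed stacks. The sketch below again assumes the `frame;frame;frame count` folded format; `selfAndTotal` and the sample data are hypothetical.

```javascript
// Compute self and total sample counts per frame from collapsed stacks.
function selfAndTotal(foldedText) {
  const stats = new Map(); // frame -> { self, total }
  const get = (frame) => {
    if (!stats.has(frame)) stats.set(frame, { self: 0, total: 0 });
    return stats.get(frame);
  };
  for (const line of foldedText.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    const idx = trimmed.lastIndexOf(' ');
    const frames = trimmed.slice(0, idx).split(';');
    const n = Number(trimmed.slice(idx + 1));
    // Count each frame once per stack so recursion does not inflate totals.
    for (const frame of new Set(frames)) get(frame).total += n;
    // Only the leaf frame was executing when the sample was taken.
    get(frames[frames.length - 1]).self += n;
  }
  return stats;
}

const folded = [
  'main;handle 5',
  'main;handle;parse 70',
  'main;handle;render 25',
].join('\n');
const stats = selfAndTotal(folded);
console.log(stats.get('handle'), stats.get('parse'));
```

Here `handle` appears in all 100 samples but is the executing frame in only 5 of them, while `parse` has 70 self samples, so `parse` is the better optimization target even though `handle` draws the wider bar.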

Cross-checking metrics and stacks

After optimizations, keep before/after numbers and regression tests; use feature flags when rolling out; document why the hotspot mattered for future maintainers.

  • Latency percentiles vs widest flame paths—same subsystem?
  • Saturation (CPU, disk, network) vs off-CPU / wall profiles—consistent story?
  • Lock waits: thread dumps + lock profiles aligned with traces in time.
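
When lining up latency percentiles with flame paths, it helps to compute the percentiles the same way every time. The following is a small sketch using the nearest-rank method; the latency values are invented for illustration, and production metrics systems often use histograms or sketches instead.

```javascript
// Nearest-rank percentile: the smallest value with at least p% of
// samples at or below it.
function percentile(values, p) {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Hypothetical request latencies in milliseconds, with one slow outlier.
const latenciesMs = [12, 15, 11, 14, 13, 16, 12, 250, 13, 14];
const summary = {
  p50: percentile(latenciesMs, 50),
  p90: percentile(latenciesMs, 90),
  p99: percentile(latenciesMs, 99),
};
console.log(summary);
```

The median stays near the typical request while the p99 picks up the outlier. A CPU flame graph dominated by the common path may therefore say little about the tail, which is where off-CPU or lock profiles tend to be more informative.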

Web Vitals measurement in the browser (LCP, INP, CLS, plus the deprecated FID):

// Web Vitals measurement (web-vitals v3)
import { onLCP, onFID, onCLS, onINP, onFCP, onTTFB } from 'web-vitals';

function sendToAnalytics(metric) {
  const body = JSON.stringify({
    name: metric.name,
    value: metric.value,       // raw value (ms or score)
    rating: metric.rating,     // 'good' | 'needs-improvement' | 'poor'
    delta: metric.delta,
    id: metric.id,
    navigationType: metric.navigationType,
  });
  navigator.sendBeacon('/analytics', body);
}

onLCP(sendToAnalytics);   // Largest Contentful Paint: target < 2500ms
onFID(sendToAnalytics);   // First Input Delay: target < 100ms (deprecated; removed in web-vitals v5, prefer INP)
onINP(sendToAnalytics);   // Interaction to Next Paint: target < 200ms
onCLS(sendToAnalytics);   // Cumulative Layout Shift: target < 0.1
onFCP(sendToAnalytics);
onTTFB(sendToAnalytics);
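
On the receiving side, the beacons can be aggregated and rated against the published Core Web Vitals boundaries, which are assessed at the 75th percentile of page loads. The sketch below is one possible shape for that step; `rate`, `p75`, and the sample values are hypothetical, and the `/analytics` endpoint's storage is left out.

```javascript
// "Good" / "poor" boundaries for the Core Web Vitals.
const THRESHOLDS = {
  LCP: { good: 2500, poor: 4000 }, // milliseconds
  INP: { good: 200, poor: 500 },   // milliseconds
  CLS: { good: 0.1, poor: 0.25 },  // unitless score
};

// Classify a single value the same way the metric's rating field does.
function rate(name, value) {
  const t = THRESHOLDS[name];
  if (!t) throw new Error(`unknown metric: ${name}`);
  if (value <= t.good) return 'good';
  if (value <= t.poor) return 'needs-improvement';
  return 'poor';
}

// 75th percentile (nearest rank) of collected values for one metric.
function p75(values) {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.max(1, Math.ceil(0.75 * sorted.length)) - 1];
}

// Hypothetical LCP values (ms) collected from four page loads.
const lcpValues = [1800, 2100, 2600, 4200];
const lcpP75 = p75(lcpValues);
console.log(lcpP75, rate('LCP', lcpP75));
```

With these invented values the 75th percentile is 2600 ms, which rates as needs-improvement even though half of the loads were individually good.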

Lighthouse CI configuration and GitHub Actions integration:

// lighthouserc.js — Lighthouse CI config
module.exports = {
  ci: {
    collect: {
      url: ['http://localhost:3000/', 'http://localhost:3000/about'],
      numberOfRuns: 3,          // multiple runs for median
      startServerCommand: 'npm run start',
      startServerReadyPattern: 'listening on',
    },
    assert: {
      preset: 'lighthouse:recommended',
      assertions: {
        'categories:performance': ['error', { minScore: 0.9 }],
        'categories:accessibility': ['warn', { minScore: 0.95 }],
        'largest-contentful-paint': ['error', { maxNumericValue: 2500 }],
        'cumulative-layout-shift': ['error', { maxNumericValue: 0.1 }],
        'interactive': ['warn', { maxNumericValue: 3800 }],
      },
    },
    upload: {
      target: 'temporary-public-storage',  // or configure LHCI server
    },
  },
};

// Integrate Lighthouse CI in GitHub Actions
// - name: Run Lighthouse CI
//   run: |
//     npm install -g @lhci/cli
//     lhci autorun
//   env:
//     LHCI_GITHUB_APP_TOKEN: ${{ secrets.LHCI_GITHUB_APP_TOKEN }}

Tooling cheat sheet

  • JVM: async-profiler (CPU/alloc), JFR; outputs as collapsed stacks or JFR convertible to flame HTML.
  • Go / native: pprof (cpu, heap, mutex); go tool pprof -http= for interactive flames.
  • Front end: Chrome Performance, Lighthouse—long tasks, layout, script self time.

Profiling session draft

The draft below is a "sampling + flame graph" checklist to paste into tickets or SKILL appendices (illustrative; align it with your runbooks).

Prefer your team’s collapsed stack format for exports; when comparing runs, pin binaries and input datasets so bar width changes aren’t environmental drift.

---
name: performance-profiling
description: Use profilers to find CPU/memory hotspots with verifiable follow-ups
tags: [performance, profiling, web-vitals, lighthouse]
---
# Profiling Methodology
- Start with a hypothesis: user-visible symptom + concrete path (URL/endpoint/code path)
- Reproducible load: pin version, data size, hardware, concurrency, cache warm/cold state
- Tool selection: Node.js uses --prof / 0x / clinic; frontend uses Chrome DevTools + Lighthouse

# Node.js CPU Profiling
- --prof + --prof-process generates tick file; 0x generates flame graph HTML in one step
- clinic doctor for comprehensive diagnosis; clinic flame for CPU; clinic bubbleprof for async
- Micro-benchmarks with benchmark.js; keep before/after numbers and regression test cases

# Frontend Performance
- LCP target < 2500ms; INP target < 200ms; CLS target < 0.1
- web-vitals library collects real-user data in production and reports to analytics
- Lighthouse CI: lighthouserc.js configures assertions; blocks non-compliant PRs in CI

# Flame Graph Interpretation
- Wide bars = that path accounts for a large share of samples (not absolute time)
- Look first at the widest "plateaus", then drill down to leaf functions
- Overlay CPU and allocation graphs: distinguish CPU hotspots from GC/memory pressure

# Regression Prevention
- Performance baselines in CI: run benchmarks and compare against main branch
- Lighthouse CI minScore/maxNumericValue gates block merges on regression
- Feature flags to control rollout of optimizations; maintain rollback capability
- Document hotspot causes and optimization rationale for future maintainers
