Engineering

How Our Local Semantic Router Achieves 20ms Classification

Gilad Gabay · January 10, 2026 · 1 min read

Deep dive into our new local semantic router powered by sentence-transformers. Learn how we cut classification latency from 200-300ms to just 20ms.

When we first built SharkRouter's intelligent routing, we used cloud APIs for query classification. It worked, but the 200-300ms of added latency per request made it a poor fit for real-time applications.

Today, we are sharing how we rebuilt our semantic router to run entirely locally with just 20ms latency.

The Problem with Cloud-Based Classification

Our original approach:

  1. Send query to embedding API (~100ms)
  2. Compare against tier examples (~50ms)
  3. Return classification (~50ms network overhead)

Total: 200-300ms added to every request.
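The three steps above can be sketched as follows. This is a minimal illustration, not our production code: `embed_api` is a hypothetical stand-in for whichever remote embedding endpoint you call, and the point is that every single classification pays at least one network round trip.

```python
import numpy as np

def classify_via_cloud(query, tier_examples, embed_api):
    """Rough sketch of the cloud-based flow. `embed_api` is a hypothetical
    stand-in for a remote embedding endpoint; each call incurs a network
    round trip, which is where most of the 200-300ms went."""
    q = np.asarray(embed_api([query]), dtype=float)[0]   # step 1: remote embed (~100ms)
    q = q / np.linalg.norm(q)
    best_tier, best_sim = None, -1.0
    for tier, examples in tier_examples.items():         # step 2: compare tiers (~50ms)
        embs = np.asarray(embed_api(examples), dtype=float)
        embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        sim = float((embs @ q).max())
        if sim > best_sim:
            best_tier, best_sim = tier, sim
    return best_tier, best_sim                           # step 3: response travels back (~50ms)
```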

Our Solution: Local Sentence Transformers

We switched to paraphrase-multilingual-MiniLM-L12-v2, a 420MB model that runs entirely in your infrastructure.
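A minimal sketch of how a local router like this can work, assuming the standard sentence-transformers API: embed a handful of example queries per tier once at startup, then each incoming query needs only one local `encode` call plus a dot product. The class and tier names here are illustrative, not our actual implementation; the router takes any `encode` callable, such as `SentenceTransformer(...).encode`.

```python
import numpy as np

def _unit(v):
    """L2-normalize rows so dot products become cosine similarities."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

class LocalSemanticRouter:
    """Sketch of a local semantic router: tier examples are embedded once
    at startup; classification is a single local encode + matrix product."""

    def __init__(self, encode, tier_examples):
        # encode: callable mapping list[str] -> array of embeddings, e.g.
        #   SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2").encode
        self.encode = encode
        self.labels = []
        chunks = []
        for tier, examples in tier_examples.items():
            self.labels.extend([tier] * len(examples))
            chunks.append(_unit(encode(examples)))
        self.example_embs = np.vstack(chunks)  # precomputed once, reused forever

    def classify(self, query):
        q = _unit(self.encode([query]))[0]     # the only per-request model call
        sims = self.example_embs @ q
        best = int(np.argmax(sims))
        return self.labels[best], float(sims[best])
```

With the real model it would be wired up roughly like `router = LocalSemanticRouter(SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2").encode, tier_examples)`; because the example embeddings are precomputed, per-request cost is dominated by one local encode.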

Results

  • 10-20x faster than cloud APIs
  • $0 cost per classification
  • Same accuracy as before
  • 50+ languages supported
#semantic-router #performance

Gilad Gabay

Co-Founder & Chief Architect
