Momory

Tech Insights

The engineering philosophy and technical challenges behind Momory.

Zero-Server Storage & Stateless Proxy Architecture

Momory is built as a 'stateless' service. While translation requests pass through our API route to communicate with the Gemini API, your data is never persisted. We function as a transparent proxy that exists only for the lifetime of a single request. This design allows for highly flexible horizontal scaling; even with a massive influx of users, we can maintain high stability and low latency by simply adding more server instances.
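The pattern can be sketched as a handler that captures no mutable state: everything it touches lives only for the duration of one request. (The names below are illustrative assumptions, not Momory's actual code; in production the `translate` function would call the Gemini API.)

```typescript
// A stateless proxy handler: no caches, no logs, no DB handles.
// Any number of identical instances can serve any request, which is
// what makes horizontal scaling trivial.

type TranslateFn = (text: string) => Promise<string>;

function makeProxyHandler(translate: TranslateFn) {
  // The returned handler closes over nothing mutable.
  return async (body: { text: string }): Promise<{ translation: string }> => {
    const translation = await translate(body.text); // e.g. a Gemini API call
    return { translation }; // returned to the client, then forgotten
  };
}
```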

We strictly avoid any server-side logging or database storage of user content. However, because Momory relies on Google's APIs (Web Speech API & Gemini API), data is transmitted to Google's infrastructure. While we ensure your data doesn't stay with us, the privacy of the data once it reaches Google depends on the specific API usage (e.g., Free vs. Paid Gemini tier).

Direct Media Transcription via Web Speech API

Momory leverages the latest Web Speech API specification (Chrome 144+) to achieve high-precision transcription for media audio.

By capturing a browser tab's audio with getDisplayMedia, we obtain a MediaStreamTrack. This track is passed directly as an argument to SpeechRecognition.start(audioTrack), allowing the recognition engine to transcribe the captured tab's audio without routing it through a microphone.
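In outline, the capture-and-transcribe wiring looks like the following browser-only sketch (the `start(audioTrack)` overload is the Chrome 144+ behavior described above; the `webkitSpeechRecognition` constructor and handler shape are assumptions):

```typescript
// Browser-only sketch: feed a captured tab's audio track straight
// into SpeechRecognition instead of the microphone.
async function transcribeTabAudio(onResult: (text: string) => void) {
  // Ask the user to pick a tab; Chrome requires video for tab capture,
  // but we only consume the audio track.
  const stream = await navigator.mediaDevices.getDisplayMedia({
    video: true,
    audio: true,
  });
  const audioTrack = stream.getAudioTracks()[0];

  const recognition = new (window as any).webkitSpeechRecognition();
  recognition.continuous = true;
  recognition.interimResults = true;
  recognition.onresult = (event: any) => {
    const last = event.results[event.results.length - 1];
    onResult(last[0].transcript);
  };

  // Pass the captured track as the recognition source.
  recognition.start(audioTrack);
}
```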

Note that Chrome's implementation of the Web Speech API sends audio data to Google's servers for processing. This enables high-accuracy recognition without heavy local processing, but it means audio is transmitted to Google (much as it is when using Google Assistant).

Service Worker for Privacy & Background Updates

To solve the challenge of background overlay updates (e.g., for single-monitor streaming), Momory employs a Service Worker to act as a private, in-browser 'mini-server'.

The dashboard sends translation data to this service worker, which then serves it to the overlay window. This entire process happens within the user's browser, ensuring conversation data never leaves their machine, upholding our 'privacy-first' promise.
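The 'mini-server' pattern can be sketched as a service worker that stores the latest translation in memory and answers a virtual endpoint locally (the path, message shape, and file name below are illustrative assumptions):

```typescript
// sw.ts — browser-only sketch of the in-browser 'mini-server'.
// The dashboard postMessage()s each translation to the worker; the
// overlay window fetches a virtual endpoint that the worker answers
// from memory, so nothing leaves the browser.

let latest: { original: string; translation: string } | null = null;

self.addEventListener("message", (event: any) => {
  latest = event.data; // sent from the dashboard page
});

self.addEventListener("fetch", (event: any) => {
  const url = new URL(event.request.url);
  if (url.pathname === "/overlay/latest") {
    // Respond locally; no network request is ever made.
    event.respondWith(
      new Response(JSON.stringify(latest ?? {}), {
        headers: { "Content-Type": "application/json" },
      })
    );
  }
});
```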

LLM-Centric Context Engineering

Unlike traditional machine translation, which handles each sentence in isolation, Momory leverages the 'In-context Learning' capabilities of LLMs. By providing a sliding window of recent transcripts, we enable the model to understand the nuances of live conversation, such as omitted subjects and ongoing topics.

This approach allows the AI to generate more coherent and contextually relevant subtitles compared to isolated sentence translation.
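The sliding-window assembly can be sketched as follows (the window size and prompt wording are assumptions for illustration):

```typescript
// Build a translation prompt from the last N transcript lines, so the
// model can resolve omitted subjects and running topics from context.
function buildPrompt(history: string[], current: string, window = 10): string {
  const recent = history.slice(-window); // keep only the last N lines
  return [
    "Recent conversation:",
    ...recent,
    "Translate the next line, resolving omitted subjects from context:",
    current,
  ].join("\n");
}
```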

Advanced Cache Optimization with Implicit Caching

Momory is designed to maximize the benefits of Gemini's 'Implicit Caching' (automatic prefix caching). Instead of sliding the history line by line, we let it accumulate up to a certain threshold (approx. 50 lines).

By keeping the beginning of the prompt static, we achieve a higher cache hit rate, significantly reducing both latency and cost. When the threshold is reached, the oldest half of the history is purged at once, creating a new, stable base prompt. This strategy ensures high-speed performance while maintaining context.
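The accumulate-then-halve strategy is a few lines of logic (the 50-line threshold is the figure from the text):

```typescript
// Grow history up to MAX_HISTORY, then purge the oldest half in one
// step. Between purges the prompt prefix stays byte-identical, which
// is what keeps the implicit (prefix) cache hit rate high.
const MAX_HISTORY = 50;

function pushHistory(history: string[], line: string): string[] {
  const next = [...history, line];
  return next.length > MAX_HISTORY
    ? next.slice(Math.floor(next.length / 2)) // drop the oldest half at once
    : next;
}
```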

STT Error Correction and Contextual Completion via Prompt Engineering

Real-time speech-to-text (STT) is inherently imperfect. Our system prompt explicitly instructs the LLM to act as a 'repair and translation' layer.

The model is tasked with correcting fillers ('uhh'), repetitions, and misrecognized words by referencing the conversation history. It also supplies the subjects that Japanese frequently omits, ensuring the translation reads naturally in the target language. This transforms raw STT output into polished, comprehensible subtitles.
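As an illustration, a repair-and-translation system prompt might look like the following (the wording is an assumption, not Momory's actual prompt):

```typescript
// Illustrative system prompt framing the LLM as a repair layer
// in front of the translator.
const SYSTEM_PROMPT = `
You are a repair-and-translation layer for live Japanese speech.
Input is raw STT output and may contain fillers, repetitions, and
misrecognized words. Using the conversation history:
1. Remove fillers (e.g. "uhh") and repeated fragments.
2. Fix words the recognizer likely misheard.
3. Supply subjects that Japanese omits.
4. Translate into natural, easily understood English.
Output only the translated subtitle line.
`.trim();
```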

Adaptive Debounce: Balancing 'Burst' Speed and 'Flow' Context

While raw speed matters, the best user experience comes from a 'sweet spot' of 0.8s to 2.0s of delay before translation. Momory uses an 'Adaptive Debounce' algorithm that switches between modes based on your speech patterns and API tier.

When you start talking after a silence (8s+), 'Burst Mode' triggers with a short delay (0.8s for Paid, 1.2s for Free) for instant feedback. During continuous speech, 'Flow Mode' adjusts the delay (1.2s for Paid, 2.0s for Free). This intentional pause allows more words to be gathered into a single request, preventing fragmented output while maintaining a live feeling in the paid tier.
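The mode switch reduces to a small function (the delays and the 8s silence threshold are the figures from the text):

```typescript
// Adaptive debounce: short delay after silence (Burst Mode), longer
// delay during continuous speech (Flow Mode), scaled by API tier.
type Tier = "free" | "paid";

function debounceDelayMs(silenceMs: number, tier: Tier): number {
  const burst = silenceMs >= 8000; // fresh utterance after a pause
  if (burst) return tier === "paid" ? 800 : 1200; // Burst Mode
  return tier === "paid" ? 1200 : 2000; // Flow Mode
}
```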

Stability Layer & Rate Limit Strategy

A product's stability depends on a deep understanding of its API's rate limits. For the free tier, the most critical limit is TPM (Tokens Per Minute), set at 15,000.

To ensure stability, Momory employs an 'Adaptive Token Management' strategy. This system dynamically adjusts the amount of data sent to the AI based on the selected model and current usage patterns.

  • Safety Floor (Rate Limit Protection): Strictly enforces minimum intervals (4.5s for Free, 1.5s for Paid) to stay under RPM limits, preventing 429 errors even during very fast-paced or continuous speech.
  • Dynamic Token Trimming: The system automatically prunes history, examples, and dictionary entries to fit within the API limits.
  • High-Context Performance (Paid Tier): Paid-tier models benefit from relaxed limits, allowing for deeper context (50+ lines of history) and more complex instructions for superior translation quality.
  • 18-Hour Continuity: The 14,400 RPD limit allows for about 18 hours of continuous streaming per day, ensuring reliability for even the most dedicated creators.
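The safety floor and the 18-hour figure both follow from simple arithmetic over the limits above (intervals are the figures from the text; the helper name is illustrative):

```typescript
// Per-tier minimum interval between requests, enforced as a hard floor.
const MIN_INTERVAL_MS = { free: 4500, paid: 1500 } as const;

function nextAllowedAt(
  lastRequestAt: number,
  tier: keyof typeof MIN_INTERVAL_MS
): number {
  return lastRequestAt + MIN_INTERVAL_MS[tier];
}

// Sanity check on the continuity claim: 14,400 requests/day at the
// free tier's 4.5s floor spans 14,400 * 4.5s = 64,800s = 18 hours.
const hoursOfStreaming = (14_400 * 4.5) / 3600;
```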

Real-time Stability Layer

Web Speech API results can be 'shaky' with frequent intermediate updates. Momory implements a stability layer that waits for a confidence threshold or a logical pause before triggering a translation, ensuring the overlay remains readable.

This reduces visual noise and keeps the audience focused on the content, not the flickering text.
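The gate can be sketched as a single predicate (the confidence and pause thresholds below are assumptions for illustration):

```typescript
// Commit an interim STT result to translation only when it is final,
// confident enough, or followed by a logical pause; otherwise keep
// waiting, so the overlay does not flicker with every revision.
interface InterimResult {
  transcript: string;
  confidence: number;
  isFinal: boolean;
}

function shouldCommit(r: InterimResult, msSinceLastUpdate: number): boolean {
  if (r.isFinal) return true;
  if (r.confidence >= 0.85) return true; // assumed confidence threshold
  return msSinceLastUpdate >= 1000; // assumed 'logical pause'
}
```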

Vibe-coding: UI/UX with Soul

Performance is a feature, but 'vibe' is an experience. We use modern tools like Tailwind CSS and Framer Motion to create a fluid, responsive UI that feels alive.

Key UI/UX considerations include high-performance feedback loops (like the 60fps volume meter), subtle 'glow' effects for active states, and standardized micro-interactions across all pages.

Dialogue with LLMs: Evolution via Static and Dynamic Instructions

We designed the AI not as a mere tool, but as a 'partner that grows through dialogue.' This dialogue takes two forms. First, 'static, pre-emptive instructions' via prompts and terminology dictionaries, which teach the AI the streamer's personality and specialized terms.

What makes Momory truly special is the 'dynamic feedback learning' during the stream. By just saying "Great translation!", the AI instantly learns from successful examples. To bridge the real-time lag between human speech and AI processing, we implemented 'Bulk Learning,' where multiple recent translations are analyzed together. This ensures the AI captures the full context of the desired style, evolving into a one-of-a-kind 'personal interpreter' for the world.
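The 'Bulk Learning' step amounts to promoting several recent pairs to few-shot examples at once (the batch size and data shape are assumptions):

```typescript
// On positive feedback ("Great translation!"), promote the last few
// source/translation pairs together, not just the single most recent
// one — compensating for the lag between speech and AI output.
interface Pair { source: string; translation: string }

function bulkLearn(recent: Pair[], examples: Pair[], batch = 3): Pair[] {
  return [...examples, ...recent.slice(-batch)];
}
```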