
Voice Is the Primary Modality
for Agentic AI

Not a novelty layer. Not a demo feature. The actual interface.

R. P. Ruiz — Senior AI Architect at Google. Founder, Deepily.ai.


20 years of R&D. Linguist by training, hacker by nature.

The Voice-First Thesis

The Observation

I started my career as a linguist, studying how humans actually communicate as we move through the world. The answer was obvious and overwhelming: we evolved the capacity to walk and talk because our lives depended on it. Written language is a brilliant technology for transmitting ideas through time, but it is often overused and ill-suited for real-time communication. When we force AI interactions through keyboards and screens connected to text consoles and chat windows, we over-index on a modality that is cognitively constraining and fundamentally unnatural as the channel for all of our real-time communication.

There's a reason why we can easily walk and talk at the same time. Contrast this with texting and driving.

The Problem

Screens demand visual attention. Consoles demand hands that can type. You can't drive, cook, or walk the dog while staring at a terminal. Current "voice AI" — the kind that sets timers and plays music — is a toy demo, not a serious interface. The real challenge is communicating with groups of agents, synchronously and asynchronously, using voice as a salience signal: if an agent reaches out to you in speech, it must be important. Voice-first communication requires rethinking everything: notification architectures, trust models, decision routing, and real-time streaming.

If your agentic processes can only be operated through a screen, you've just increased the cognitive load on the user by constraining the form their interactions can take.

The Proof

It's possible right now to manage groups of agents — both synchronously and asynchronously — using your voice and ears for priority communications, and text and screens for lower-priority status signaling. I know because I built it. Lupin, my open-source agentic framework, lets you drive Claude Code with the CoSA Voice MCP, as well as manage non-Claude agentic processes, with speech-to-text, intent routing, agent orchestration, bidirectional notifications, and text-to-speech streaming behind the scenes. No screen required. This isn't a vision statement — it's a working system.

It's possible RIGHT NOW to manage high-priority communication with agents using nothing but your voice and ears.

Evidence from the Field

10 years of projects that led to one conclusion.

Real-time bidirectional voice

Lupin + CoSA Voice MCP

The reference platform that proves the thesis. A full voice-first agentic pipeline — speech-to-text, intent routing across 15+ specialized agents, bidirectional notifications, and progressive TTS streaming — all operable without a screen. Built from scratch as a solo R&D effort, from the WebSocket architecture to LoRA-fine-tuned routing models.

Python FastAPI WebSocket MCP Claude Agent SDK LoRA TTS/STT
10x dev velocity

Bidirectional Notification Architecture

Voice-driven agents need to talk back. I designed a notification system that lets agents request human decisions mid-workflow, stream progress updates, and route trust-calibrated approvals — all through voice. This became the communication primitive that made voice-first agentic workflows practical, boosting development velocity by an order of magnitude.

WebSocket SSE Decision Proxy Trust Models
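To make the decision-proxy idea concrete, here is a minimal sketch in Python — not Lupin's actual code. All names (DecisionRequest, route, decision_proxy) are hypothetical, and asyncio queues stand in for the real WebSocket/SSE transport; the point is how a request's priority and a learned trust score decide whether it interrupts the human by voice, queues quietly as text, or auto-approves.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class DecisionRequest:
    agent_id: str
    question: str
    priority: str       # "high" or "low"
    trust_score: float  # learned trust in this agent's autonomy


def route(req: DecisionRequest) -> str:
    """Pick the channel: voice interrupts the human, text queues quietly.
    Highly trusted agents asking low-stakes questions auto-approve."""
    if req.trust_score >= 0.9 and req.priority == "low":
        return "auto-approve"
    return "voice" if req.priority == "high" else "text"


async def decision_proxy(inbox: asyncio.Queue, answers: dict) -> None:
    """Drain agent requests, route each one, and hand back an answer."""
    while not inbox.empty():
        req, reply = await inbox.get()
        channel = route(req)
        if channel == "auto-approve":
            reply.set_result("approved")
        else:
            # The real system would stream this over WebSocket/SSE;
            # here we just look up a canned human answer.
            reply.set_result(f"{channel}:{answers[req.agent_id]}")


async def main() -> list:
    inbox: asyncio.Queue = asyncio.Queue()
    loop = asyncio.get_running_loop()
    replies = []
    for req in [
        DecisionRequest("builder", "Delete staging DB?", "high", 0.4),
        DecisionRequest("linter", "Auto-fix imports?", "low", 0.95),
    ]:
        fut = loop.create_future()
        replies.append(fut)
        inbox.put_nowait((req, fut))
    await decision_proxy(inbox, {"builder": "yes", "linter": "n/a"})
    return [r.result() for r in replies]


print(asyncio.run(main()))
```

The design choice the sketch illustrates: the channel is a routing decision, not a property of the agent — the same agent can interrupt you by voice for a destructive action and silently auto-approve a lint fix.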
99% accuracy

Voice Intent Routing

Discerning the intent encoded in a user's speech requires fine-tuned LoRA adapters running on small open-source LLMs. I've achieved 99% intent classification accuracy across 15+ agent categories and 15+ browser management commands. The routing layer, running on an edge server, decides in milliseconds which agent should handle a spoken request, making the voice interface feel instant and reliable. I presented the Easy PEFT methodology at Google.

LoRA PEFT Mistral 8B
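As a rough illustration of the routing layer's shape — not the production classifier — the sketch below swaps the LoRA-fine-tuned model for a trivial keyword scorer so the dispatch logic stays visible. INTENT_KEYWORDS, AGENTS, and the category names are invented for the example.

```python
from typing import Callable

# Hypothetical category set; the production router covers 15+ agents.
INTENT_KEYWORDS = {
    "calendar": ["meeting", "schedule", "appointment"],
    "browser": ["open", "tab", "search"],
    "todo": ["remind", "task", "list"],
}


def classify_intent(transcript: str) -> str:
    """Stand-in for the fine-tuned classifier: score each category
    by keyword hits on the ASR transcript."""
    words = transcript.lower().split()
    scores = {
        cat: sum(w in words for w in kws)
        for cat, kws in INTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "fallback"


# Dispatch table mapping intent labels to agent entry points.
AGENTS: dict[str, Callable[[str], str]] = {
    "calendar": lambda t: f"calendar-agent handling: {t}",
    "browser": lambda t: f"browser-agent handling: {t}",
    "todo": lambda t: f"todo-agent handling: {t}",
    "fallback": lambda t: f"no route for: {t}",
}


def route(transcript: str) -> str:
    """One classify + one table lookup — the whole hot path."""
    return AGENTS[classify_intent(transcript)](transcript)


print(route("schedule a meeting with Sam"))
```

The structure is the point: classification and dispatch are decoupled, so swapping the keyword scorer for a LoRA adapter changes one function and leaves the routing table untouched.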
Voice-compatible workflows

Planning is Prompting

When your development environment is voice-driven, your project management workflows need to be too. I created a structured planning framework for Claude Code that works entirely through voice — session management, task tracking, and implementation planning, all designed for spoken interaction with AI coding agents.

Claude Code Workflow Design cosa-voice MCP
15s → 0.25s

Semantic Caching & Code as Memory

In 2024, I presented at Google how my early work on CoSA used an optimized architecture that improves AI agent response times by ~50-100x by teaching agents to recognize when they'd already solved a computationally analogous problem. Semantic caching underpins Lupin's solution-snapshot system and leverages code as memory, yielding not just faster responses but also significantly better accuracy on GSM8K.

Performance Tuning Embeddings Vector Search GSM8K
Abstracted prediction engine

AMPE — Interface as Insight

At HelioCampus, I architected AMPE, a prediction engine that abstracted machine learning complexity behind clean interfaces — proving that the right modality layer transforms how humans interact with AI predictions. The lesson: interface design isn't a skin on top of intelligence; it is the intelligence delivery mechanism.

scikit-learn Prediction Engine Higher Ed Analytics
The thesis seed

Voice-Controlled Sports Highlights

At Comcast, I created a demo of voice-driven search for sports content — "Show me all shots on goal." This was 2015, before smart speakers were mainstream, and the demo later became a product. This project planted the seed: voice wasn't just a way to search; for many contexts it was the only natural way. Everything since has been building on that realization.

NLP Voice Search Content Discovery

Research Domains & Technical Competencies

Where I spend my cycles — and what I use to get there.

Voice-Driven Agentic AI

Primary research focus — the interface layer between humans and agent swarms.

Voice-driven Human-in-the-Loop Agentic Processes · Bidirectional Real-Time Voice I/O · Interrupt Handling & Barge-in · Streaming ASR · Streaming TTS · Time-to-First-Audio Optimization · Voice Activity Detection · Prosody Modeling · Whisper · Distil-Whisper · Google Chirp · Google Speech-to-Text API · ElevenLabs

Agentic Architectures & Orchestration

Coordinating groups of agents across synchronous and asynchronous workflows.

Multi-Agent Orchestration · ReAct · Plan-and-Execute · Tool-Augmented LLMs · Task Decomposition · Chain-of-Thought · Tree-of-Thought · Human-in-the-Loop · Model Context Protocol · LangGraph · LangChain · OpenAI Agents SDK · Google Agents ADK · Claude SDK · Open Interpreter · AutoGPT

Preference Learning & Trust Systems

Active R&D — teaching agents when to ask and when to act.

Online Preference Learning · Trust Proxies · Reward Modeling · DPO · RLHF · Bayesian Online Learning · Gaussian Process Preference Learning · Inverse Reinforcement Learning · Active Learning · Preference Elicitation · Case-Based Reasoning

LLM Architecture, Efficiency & Serving

Making models faster, smaller, and smarter at retrieval.

RAG · Semantic Caching · Code & Text Embeddings · Semantic Similarity · Fine-Tuning · PEFT · QLoRA · LoRA · AWQ · AutoRound · Quantization · Speculative Decoding · KV Cache Optimization · Mixture of Experts · Flash Attention · Hugging Face Transformers

ML Foundations & Classical NLP

The statistical and linguistic bedrock underneath everything else.

Deep Learning · Neural Networks · NLP · Sentiment Analysis · Document Classification · TF-IDF · Classification & Clustering · Logistic & Linear Regression · RNNs · LSTMs · CNNs · SMOTE · SHAP · spaCy · NLTK · Gensim · scikit-learn · XGBoost · LightGBM · Pandas · WandB

Infrastructure & Systems

The production stack that keeps agents running.

FastAPI · Docker · CUDA · GPUs (dual RTX 4090) · PyTorch · JAX · TensorFlow · Keras · Apache Spark · Parallel & Distributed Computing · Server-Sent Events · Linux · SQL · MySQL · PostgreSQL · GitHub · Vertex AI · Google Cloud

Models

The models I’ve shipped with, fine-tuned, or benchmarked.

Claude Opus/Sonnet · Gemini · Mistral-7/8B · Phi4 8B · nomic-embed · CodeRank · Mixtral-8x7B · Phind-CodeLlama-34B-v2 · Llama 3.x · Qwen 2.5 · GPT · Whisper · Distil-Whisper · GloVe · Word2Vec

Programming Languages

Polyglot by necessity, Pythonista by choice.

Python · JavaScript · Scala · Java · R · SQL

Research Communication

Writing that bridges research and practice.

Medium October 2025

Faster, Better, Morer: How to 5–10x Your Code Generation with Claude Code

Planning-as-prompting methodology; basis for internal Google presentation

Medium April 2025

How I Got Promoted to AI Project Manager in One Short Weekend

Interleaving attention across parallel agentic processes in AI-assisted development using Claude Code

Medium December 2024

How to Give Your LLM's GSM8K Scores a HUGE Bump

An unexpected upside to code-as-memory

LinkedIn September 2023

Adventures in Agentic Behaviors, Parts 1, 2, 3 and 4

Early analysis of emergent agentic AI patterns

LinkedIn May 2023

Chastened AI Admits It Doesn't Know All The Answers

Early critical analysis of LLM epistemic limitations

Credentials & Formation

Education

MA & BA Applied Linguistics — West Virginia University

The linguistics training wasn't incidental — it was foundational. Understanding how humans produce, parse, and negotiate meaning through speech is what made me see voice as the primary modality long before it was fashionable. Every architecture decision I make is informed by how language actually works.

Languages

English — Native   |   Spanish — Native

Certifications

  • Google Cloud: Professional Cloud Architect · 2025
  • Google Cloud: Professional Machine Learning Engineer · 2024
  • Generative AI with Large Language Models · Coursera · 2023
  • Natural Language Processing Specialization · Coursera · 2023
  • Statistics with Python Specialization · Coursera · 2020–2021
  • Deep Learning Specialization · Deeplearning.ai, Coursera · 2017–2018
  • Functional Programming in Scala · ÉPFL, Coursera · 2016–2017
  • Data Science and Engineering with Spark · Berkeley, edX · 2016
  • Machine Learning · University of Washington, Coursera · 2016
  • Data Science Specialization · Johns Hopkins, Coursera · 2015–2016