Technical Projects

From research prototypes to production systems - bridging the gap between academic innovation and practical implementation

Research Projects

🚀 ORBIT: Dataset Curation for Domain Adaptation

A methodology for curating high-quality domain-specific datasets from noisy web sources, designed for training specialist large language models.

**Key Innovation**: Transforms the expensive, manual process of dataset curation into an automated, cost-effective pipeline that maintains high quality while scaling to billions of tokens.

**Technical Approach**:

- **Keyword-based Filtering**: Intelligent domain-specific keyword identification
- **Quality Scoring**: Multi-layered quality assessment algorithms
- **Scalable Pipeline**: Processes terabyte-scale web datasets efficiently
- **Cross-domain Validation**: Proven effectiveness across astronomy, law, and medicine

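
To make the filtering and scoring stages concrete, here is a minimal sketch of a two-stage curation pass. It is an illustration under stated assumptions, not ORBIT's actual implementation: the keyword list, the `keyword_density` and `quality_score` heuristics, and both thresholds are hypothetical and chosen only for the example.

```python
from __future__ import annotations

import re

# Hypothetical domain keywords; a real pipeline would derive these per domain.
ASTRONOMY_KEYWORDS = {"galaxy", "telescope", "redshift", "supernova", "exoplanet"}

TOKEN_RE = re.compile(r"[a-z]+")


def keyword_density(text: str, keywords: set[str]) -> float:
    """Return the fraction of tokens in `text` that are domain keywords."""
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in keywords)
    return hits / len(tokens)


def quality_score(text: str) -> float:
    """Toy quality heuristic in [0, 1].

    Averages a length score (saturating at 50 tokens) with vocabulary
    diversity (unique tokens / total tokens). This is an assumed stand-in
    for a real quality model.
    """
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return 0.0
    length_score = min(len(tokens) / 50, 1.0)
    diversity = len(set(tokens)) / len(tokens)
    return 0.5 * length_score + 0.5 * diversity


def curate(
    docs: list[str],
    keywords: set[str],
    min_density: float = 0.05,  # assumed threshold
    min_quality: float = 0.4,   # assumed threshold
) -> list[str]:
    """Keep documents that pass the keyword filter and then the quality filter."""
    kept = []
    for doc in docs:
        if keyword_density(doc, keywords) < min_density:
            continue  # off-domain: too few keyword hits
        if quality_score(doc) < min_quality:
            continue  # on-domain but low quality
        kept.append(doc)
    return kept


docs = [
    "The telescope observed a distant galaxy with high redshift near a supernova remnant.",
    "Buy cheap shoes now buy buy buy",
]
curated = curate(docs, ASTRONOMY_KEYWORDS)
```

Running the cheap keyword filter before the quality check keeps the more expensive stage off documents that are clearly out of domain, which matters at web scale.
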

ORBIT Pipeline Architecture

Technical Stack:

  • Languages: Python
  • ML Frameworks: JAX, PyTorch, Hugging Face Transformers
  • Data Processing: Apache Spark, Pandas
  • Infrastructure: Google Cloud Platform, Docker

🔍 LiveRAG: Advanced Retrieval-Augmented Generation

A next-generation retrieval-augmented generation (RAG) system focused on real-time information access and integration for large language models.

**Core Innovation**: Develops efficient methods for AI systems to access, evaluate, and integrate real-time information while maintaining response quality and speed.

**Research Areas**:

- **Real-time Retrieval**: Optimized information access with minimal latency
- **Quality Filtering**: Intelligent source credibility and relevance scoring
- **Integration Efficiency**: Seamless incorporation of retrieved information
- **Scalability**: Handling high-throughput query processing

**Technical Challenges Addressed**:

- Balancing retrieval accuracy with response speed
- Managing information freshness vs. reliability trade-offs
- Optimizing retrieval-generation pipeline efficiency
- Ensuring consistent performance across diverse query types

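
One way to picture the freshness vs. reliability trade-off is a ranking score that blends relevance, recency, and source credibility. The sketch below is only an illustration and not LiveRAG's actual scoring method: the `Passage` fields, the exponential half-life decay, and the weights are all assumptions, and the two-dimensional embeddings are placeholders for real sentence embeddings.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Passage:
    name: str
    embedding: list[float]  # placeholder for a real sentence embedding
    age_hours: float        # time since the source was published
    credibility: float      # assumed source-credibility score in [0, 1]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def freshness(age_hours: float, half_life_hours: float = 24.0) -> float:
    """Exponential decay: the score halves every `half_life_hours`."""
    return 0.5 ** (age_hours / half_life_hours)


def score(
    query: list[float],
    passage: Passage,
    w_rel: float = 0.6,    # assumed weights, summing to 1
    w_fresh: float = 0.2,
    w_cred: float = 0.2,
) -> float:
    """Weighted blend of relevance, freshness, and credibility."""
    return (
        w_rel * cosine(query, passage.embedding)
        + w_fresh * freshness(passage.age_hours)
        + w_cred * passage.credibility
    )


def rank(query: list[float], passages: list[Passage], top_k: int = 3) -> list[Passage]:
    """Return the `top_k` passages by descending blended score."""
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:top_k]


query = [1.0, 0.0]
passages = [
    Passage("stale_relevant", [1.0, 0.0], age_hours=240.0, credibility=0.2),
    Passage("fresh_offtopic", [0.0, 1.0], age_hours=0.0, credibility=1.0),
    Passage("fresh_relevant", [1.0, 0.0], age_hours=0.0, credibility=1.0),
]
ranked = [p.name for p in rank(query, passages)]
```

With these weights, a relevant but stale passage still outranks a fresh off-topic one; shifting weight from `w_rel` to `w_fresh` would favor recency instead.
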

LiveRAG System Architecture

Technical Stack:

  • Languages: Python, TypeScript
  • Frameworks: FastAPI, React
  • Search: Elasticsearch, Faiss
  • ML: Sentence Transformers, OpenAI APIs
📋 Status: Research paper under review. Code and detailed documentation will be released upon publication.

Interested in Collaboration?

I'm always excited to work on projects that combine cutting-edge research with practical impact, including research collaborations and industry partnerships.