Technical Projects
From research prototypes to production systems - bridging the gap between academic innovation and practical implementation
Research Projects
🚀 ORBIT: Dataset Curation for Domain Adaptation
A methodology for curating high-quality domain-specific datasets from noisy web sources, designed for training specialist large language models.
**Key Innovation**: Transforms the expensive, manual process of dataset curation into an automated, cost-effective pipeline that maintains high quality while scaling to billions of tokens.

**Technical Approach**:
- **Keyword-based Filtering**: Intelligent domain-specific keyword identification
- **Quality Scoring**: Multi-layered quality assessment algorithms
- **Scalable Pipeline**: Processes terabyte-scale web datasets efficiently
- **Cross-domain Validation**: Proven effectiveness across astronomy, law, and medicine
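To make the filtering idea concrete, here is a minimal sketch of a keyword-density filter followed by a quality gate. It is not the ORBIT implementation: the keyword list, the toy `quality_score` heuristic, and both thresholds are assumptions chosen purely for illustration.

```python
import re

# Hypothetical astronomy keyword list; a real pipeline would derive this per domain.
ASTRO_KEYWORDS = {"galaxy", "redshift", "supernova", "exoplanet", "telescope"}

TOKEN_RE = re.compile(r"[a-z]+")


def keyword_density(text, keywords):
    """Fraction of tokens in `text` that appear in the domain keyword set."""
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in keywords)
    return hits / len(tokens)


def quality_score(text):
    """Toy quality heuristic (an assumption, not ORBIT's scorer).

    Multiplies the share of alphabetic/whitespace characters by a length
    factor that saturates at 20 tokens, so symbol-heavy or very short
    documents score low.
    """
    if not text:
        return 0.0
    clean_chars = sum(1 for ch in text if ch.isalpha() or ch.isspace())
    clean_ratio = clean_chars / len(text)
    n_tokens = len(TOKEN_RE.findall(text.lower()))
    length_factor = min(1.0, n_tokens / 20)
    return clean_ratio * length_factor


def curate(docs, keywords, min_density=0.05, min_quality=0.5):
    """Keep documents that pass both the keyword and the quality threshold."""
    return [
        doc
        for doc in docs
        if keyword_density(doc, keywords) >= min_density
        and quality_score(doc) >= min_quality
    ]


if __name__ == "__main__":
    docs = [
        "The telescope observed a distant galaxy whose redshift suggests the "
        "supernova exploded billions of years ago, long before the exoplanet "
        "survey began its work.",
        "Buy cheap watches now!!! $$$ click here >>> http://spam.example",
        "galaxy galaxy",
    ]
    for doc in curate(docs, ASTRO_KEYWORDS):
        print(doc)
```

The two stages are kept as separate functions so the cheap keyword check can run first and discard most off-domain text before any more expensive scoring is applied.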
ORBIT Pipeline Architecture
Technical Stack:
- Languages: Python
- ML Frameworks: JAX, PyTorch, Hugging Face Transformers
- Data Processing: Apache Spark, Pandas
- Infrastructure: Google Cloud Platform, Docker
🔍 LiveRAG: Advanced Retrieval-Augmented Generation
Next-generation retrieval-augmented generation system focusing on real-time information access and integration for large language models.
**Core Innovation**: Develops efficient methods for AI systems to access, evaluate, and integrate real-time information while maintaining response quality and speed.

**Research Areas**:
- **Real-time Retrieval**: Optimized information access with minimal latency
- **Quality Filtering**: Intelligent source credibility and relevance scoring
- **Integration Efficiency**: Seamless incorporation of retrieved information
- **Scalability**: Handling high-throughput query processing

**Technical Challenges Addressed**:
- Balancing retrieval accuracy with response speed
- Managing information freshness vs. reliability trade-offs
- Optimizing retrieval-generation pipeline efficiency
- Ensuring consistent performance across diverse query types
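The freshness vs. reliability trade-off can be illustrated with a small reranking sketch. This is not LiveRAG's scoring method: the `Candidate` fields, the weights, and the 24-hour half-life are all hypothetical, picked only to show how relevance, source credibility, and recency might be blended into one score.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    relevance: float    # assumed similarity score in [0, 1]
    credibility: float  # assumed source trust score in [0, 1]
    age_hours: float    # time since the document was published


def freshness(age_hours, half_life_hours=24.0):
    """Exponential decay: the value halves every `half_life_hours`."""
    return 0.5 ** (age_hours / half_life_hours)


def combined_score(cand, w_rel=0.6, w_cred=0.25, w_fresh=0.15,
                   half_life_hours=24.0):
    """Weighted sum of relevance, credibility, and freshness (illustrative weights)."""
    return (
        w_rel * cand.relevance
        + w_cred * cand.credibility
        + w_fresh * freshness(cand.age_hours, half_life_hours)
    )


def rerank(candidates, top_k=3, **weights):
    """Return the `top_k` candidates ordered by combined score, best first."""
    ranked = sorted(
        candidates,
        key=lambda cand: combined_score(cand, **weights),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    candidates = [
        Candidate("old_reliable", relevance=0.8, credibility=0.9, age_hours=240),
        Candidate("fresh_rumor", relevance=0.8, credibility=0.2, age_hours=1),
        Candidate("off_topic", relevance=0.2, credibility=0.9, age_hours=1),
    ]
    for cand in rerank(candidates):
        print(cand.doc_id, round(combined_score(cand), 3))
```

Raising `w_fresh` relative to `w_cred` pushes recent but less trusted sources up the ranking, which is one simple way to expose the trade-off as a tunable parameter rather than a fixed policy.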
LiveRAG System Architecture
Technical Stack:
- Languages: Python, TypeScript
- Frameworks: FastAPI, React
- Search: Elasticsearch, Faiss
- ML: Sentence Transformers, OpenAI APIs
📋 Status: Research paper under review. Code and detailed documentation will be released upon publication.
Interested in Collaboration?
I'm always excited to work on projects that combine cutting-edge research with practical impact, including research collaborations and industry partnerships.