Publications
My research focuses on learning efficiency for AI systems, particularly in domain adaptation and dataset curation. Here are my published works and ongoing research:
Peer-Reviewed Publications
2024
ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study
Authors: Eric Modesitt, Ke Yang, Spencer Hulsey, Chengxiang Zhai, Volodymyr Kindratenko
Venue: ACL 2024 Findings
Publication Date: December 19, 2024
arXiv: 2412.14436
Abstract
Recent advances in language modeling demonstrate the need for high-quality domain-specific training data, especially for tasks that require specialized knowledge. General-purpose models, while versatile, often lack the depth needed for expert-level tasks because of limited domain-specific information. Domain adaptation training can enhance these models, but it demands substantial, high-quality data. To address this, we propose ORBIT, a cost-efficient methodology for curating massive, high-quality domain-specific datasets from noisy web sources, tailored for training specialist large language models.
Key Contributions:
- ✨ Methodology Innovation: Developed ORBIT framework for automated, cost-effective dataset curation
- 📊 Empirical Validation: Refined 1.3T-token FineWeb-Edu dataset into high-quality 10B-token astronomy subset
- 🎯 Performance Gains: Improved accuracy on the MMLU astronomy benchmark from 69% to 76%
- 🏆 Benchmark Leadership: Demonstrated top performance on AstroBench astronomy-specific benchmark
- 🌐 Cross-domain Generalization: Validated methodology across astronomy, law, and medicine domains
- 🤖 Model Release: Open-sourced the Orbit-LLaMA model, whose responses GPT-4o evaluations preferred in 73% of cases
Technical Innovations:
- Scalable Filtering Pipeline: Processes terabyte-scale datasets efficiently
- Domain-Specific Keyword Engineering: Intelligent keyword identification and weighting
- Quality Assessment Framework: Multi-layered quality scoring algorithms
- Cost-Efficiency Metrics: Quantified dramatic reduction in curation expenses
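As a rough illustration of the keyword weighting and filtering ideas listed above, the sketch below scores each document by its weighted keyword density and keeps those above a threshold. It is not the ORBIT implementation: the keyword table, the `keyword_score` and `filter_documents` names, and the threshold value are all hypothetical and chosen only for the example.

```python
import re
from typing import Dict, Iterable, Iterator

# Hypothetical keyword weights; illustrative only, not taken from ORBIT.
ASTRO_KEYWORDS: Dict[str, float] = {
    "galaxy": 1.0,
    "telescope": 0.8,
    "redshift": 1.5,
    "supernova": 1.5,
    "star": 0.3,
}

TOKEN_RE = re.compile(r"[a-z]+")


def keyword_score(text: str, weights: Dict[str, float]) -> float:
    """Return the summed keyword weight per token (0.0 for empty text)."""
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return 0.0
    total = sum(weights.get(tok, 0.0) for tok in tokens)
    return total / len(tokens)


def filter_documents(
    docs: Iterable[str],
    weights: Dict[str, float],
    threshold: float = 0.05,  # assumed cutoff, would need tuning in practice
) -> Iterator[str]:
    """Lazily yield documents whose keyword score meets the threshold."""
    for doc in docs:
        if keyword_score(doc, weights) >= threshold:
            yield doc


if __name__ == "__main__":
    sample = [
        "The telescope measured the redshift of a distant galaxy.",
        "My favorite recipe uses fresh basil and tomatoes.",
    ]
    for kept in filter_documents(sample, ASTRO_KEYWORDS):
        print(kept)
```

Because `filter_documents` is a generator, it can stream over a large corpus one document at a time rather than loading everything into memory, which is the property a terabyte-scale pipeline would need from any filter of this kind.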
Impact & Reception:
- 📈 Industry Adoption: Multiple organizations implementing ORBIT for specialized AI systems
- 🔬 Research Community: High citation rate and follow-up work by other researchers
- 💰 Economic Impact: Enables smaller organizations to create domain-specific AI without massive budgets
- 🌍 Open Science: All methodologies, datasets, and models publicly available
LiveRAG: Advanced Retrieval-Augmented Generation Systems
Authors: Eric Modesitt, [Co-authors TBA]
Venue: SIGIR 2024 LiveRAG Workshop
Publication Date: July 2024
Status: Workshop Paper
Research Focus
This work explores next-generation techniques in retrieval-augmented generation (RAG) systems, focusing on real-time information access, quality filtering, and efficient integration for large language models. The research addresses critical challenges in maintaining information freshness while ensuring reliability and response speed.
Key Research Areas:
- 🚀 Real-time Retrieval: Sub-second information access with maintained accuracy
- 🎯 Quality Filtering: Advanced source credibility and relevance scoring
- ⚡ Integration Efficiency: Seamless incorporation of retrieved information into responses
- 📈 Scalability Solutions: High-throughput query processing architectures
- 🔄 Adaptive Systems: Dynamic adjustment based on query complexity and domain
Technical Contributions:
- Latency-Accuracy Trade-offs: Novel algorithms balancing speed with information quality
- Multi-source Integration: Techniques for combining information from diverse sources
- Reliability Metrics: Framework for assessing information freshness vs. credibility
- Real-world Validation: Testing across diverse domains and query types
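One simple way to picture the freshness-versus-credibility trade-off mentioned above is a blended score with exponential time decay. The sketch below is an assumption-laden illustration, not the method from the workshop paper: the `Passage` type, the 24-hour half-life, and the mixing weight `alpha` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Passage:
    text: str
    age_hours: float    # time since the source was published or updated
    credibility: float  # assumed source-credibility score in [0, 1]


def freshness(age_hours: float, half_life_hours: float = 24.0) -> float:
    """Exponential decay: 1.0 for brand-new content, 0.5 after one half-life."""
    return 0.5 ** (age_hours / half_life_hours)


def reliability(p: Passage, alpha: float = 0.5) -> float:
    """Blend freshness and credibility; alpha=1.0 uses freshness only."""
    return alpha * freshness(p.age_hours) + (1.0 - alpha) * p.credibility


def rank(passages: List[Passage], alpha: float = 0.5) -> List[Passage]:
    """Return passages sorted from highest to lowest blended score."""
    return sorted(passages, key=lambda p: reliability(p, alpha), reverse=True)


if __name__ == "__main__":
    old_trusted = Passage("old but trusted", age_hours=48.0, credibility=0.9)
    new_dubious = Passage("new but dubious", age_hours=0.0, credibility=0.2)
    # A high alpha favors recency; a low alpha favors credibility.
    print(rank([old_trusted, new_dubious], alpha=0.9)[0].text)
    print(rank([old_trusted, new_dubious], alpha=0.1)[0].text)
```

Exposing `alpha` as a parameter lets a system shift the balance per query, for example weighting freshness more heavily for news-like questions and credibility more heavily for reference-style ones.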
Works in Progress
🚧 Future Publications Pipeline
ORBIT 2.0: Multi-Modal Dataset Curation
Target Venue: ICLR 2025
Status: Experiments in progress
Focus: Extending ORBIT methodology to handle text, images, and code simultaneously for more comprehensive domain adaptation.
Efficiency Metrics for Human-AI Learning
Target Venue: CHI 2025
Status: Data collection phase
Focus: Quantitative framework for measuring and optimizing learning efficiency in human-AI collaborative systems.
Industrial Applications of Domain Adaptation
Target Venue: Industry Track - KDD 2025
Status: Collaboration with Capital One
Focus: Real-world deployment lessons from applying academic research in financial services.
Timeline: Expecting 2-3 additional publications in 2025, building on the foundation established by ORBIT and LiveRAG research.
Research Timeline & Evolution
📅 Publication Journey
December 2024 - ACL Findings
ORBIT Paper Published
Breakthrough methodology for domain-specific dataset curation, establishing foundation for specialized AI systems.
July 2024 - SIGIR Workshop
LiveRAG Research Presented
Advanced retrieval-augmented generation techniques, focusing on real-time information access and integration.
2025 Pipeline - Multiple Venues
Next-Generation Research
Building on established foundation with multi-modal approaches, human-AI collaboration, and industrial applications.
Research Impact & Metrics
- 73% GPT-4o Preference: for ORBIT-trained models
- 10B+ Tokens Curated: from 1.3T source dataset
- 3 Domains Validated: astronomy, law, medicine
- 100% Open Source: code, data, and models
Interested in Collaboration?
I'm always looking for opportunities to collaborate on research related to learning efficiency, domain adaptation, and practical AI applications. Whether you're interested in extending existing work or exploring new directions, I'd be glad to hear from you.
Start a Research Conversation