Publications

My research focuses on learning efficiency for AI systems, particularly in domain adaptation and dataset curation. Here are my published works and ongoing research:

Peer-Reviewed Publications

2024

ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study

Authors: Eric Modesitt, Ke Yang, Spencer Hulsey, Chengxiang Zhai, Volodymyr Kindratenko
Venue: ACL 2024 Findings
Publication Date: December 19, 2024
arXiv: 2412.14436

Abstract

Recent advances in language modeling demonstrate the need for high-quality domain-specific training data, especially for tasks that require specialized knowledge. General-purpose models, while versatile, often lack the depth needed for expert-level tasks because of limited domain-specific information. Domain adaptation training can enhance these models, but it demands substantial, high-quality data. To address this, we propose ORBIT, a cost-efficient methodology for curating massive, high-quality domain-specific datasets from noisy web sources, tailored for training specialist large language models.

Key Contributions:

  • Methodology Innovation: Developed ORBIT framework for automated, cost-effective dataset curation
  • 📊 Empirical Validation: Refined 1.3T-token FineWeb-Edu dataset into high-quality 10B-token astronomy subset
  • 🎯 Performance Gains: Improved accuracy on the MMLU astronomy benchmark from 69% to 76%
  • 🏆 Benchmark Leadership: Demonstrated top performance on AstroBench astronomy-specific benchmark
  • 🌐 Cross-domain Generalization: Validated methodology across astronomy, law, and medicine domains
  • 🤖 Model Release: Open-sourced Orbit-LLaMA model achieving 73% preference in GPT-4o evaluations

Technical Innovations:

  • Scalable Filtering Pipeline: Processes terabyte-scale datasets efficiently
  • Domain-Specific Keyword Engineering: Intelligent keyword identification and weighting (sketched in the example after this list)
  • Quality Assessment Framework: Multi-layered quality scoring algorithms
  • Cost-Efficiency Metrics: Quantified dramatic reduction in curation expenses
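
As a concrete illustration of the keyword engineering and quality-threshold filtering listed above, here is a minimal Python sketch. It is not the ORBIT pipeline itself: the keyword list, weights, and threshold are illustrative assumptions for a hypothetical astronomy filter, and a production system would operate on sharded, terabyte-scale data rather than an in-memory list.

```python
# Minimal sketch of keyword-weighted relevance filtering. NOT the ORBIT
# implementation: keywords, weights, and threshold are illustrative assumptions.
from dataclasses import dataclass
import re

# Hypothetical astronomy keywords with hand-assigned weights (assumption).
DOMAIN_KEYWORDS = {
    "galaxy": 2.0,
    "telescope": 1.5,
    "supernova": 2.5,
    "redshift": 2.5,
    "nebula": 2.0,
}

@dataclass
class ScoredDocument:
    text: str
    relevance: float  # weighted keyword hits per token
    keep: bool

def score_document(text: str, threshold: float = 0.05) -> ScoredDocument:
    """Score a document by weighted keyword density and apply the threshold."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return ScoredDocument(text, 0.0, False)
    weighted_hits = sum(DOMAIN_KEYWORDS.get(tok, 0.0) for tok in tokens)
    relevance = weighted_hits / len(tokens)  # normalize so long pages don't dominate
    return ScoredDocument(text, relevance, relevance >= threshold)

def filter_corpus(docs, threshold: float = 0.05):
    """Keep only documents whose domain-relevance score clears the threshold."""
    scored = (score_document(doc, threshold) for doc in docs)
    return [d for d in scored if d.keep]

if __name__ == "__main__":
    sample = [
        "The supernova's redshift was measured with a ground-based telescope.",
        "Ten quick weeknight dinner recipes the whole family will love.",
    ]
    for doc in filter_corpus(sample):
        print(f"kept (relevance={doc.relevance:.2f}): {doc.text}")
```

In this toy setup the astronomy sentence clears the threshold and the off-topic one is dropped; the real pipeline layers additional quality scoring on top of relevance filtering.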

Impact & Reception:

  • 📈 Industry Adoption: Multiple organizations implementing ORBIT for specialized AI systems
  • 🔬 Research Community: High citation rate and follow-up work by other researchers
  • 💰 Economic Impact: Enables smaller organizations to create domain-specific AI without massive budgets
  • 🌍 Open Science: All methodologies, datasets, and models publicly available

LiveRAG: Advanced Retrieval-Augmented Generation Systems

Authors: Eric Modesitt, [Co-authors TBA]
Venue: SIGIR 2024 LiveRAG Workshop
Publication Date: July 2024
Status: Workshop Paper

Research Focus

This work explores next-generation techniques in retrieval-augmented generation (RAG) systems, focusing on real-time information access, quality filtering, and efficient integration for large language models. The research addresses critical challenges in maintaining information freshness while ensuring reliability and response speed.

Key Research Areas:

  • 🚀 Real-time Retrieval: Sub-second information access with maintained accuracy
  • 🎯 Quality Filtering: Advanced source credibility and relevance scoring
  • Integration Efficiency: Seamless incorporation of retrieved information into responses
  • 📈 Scalability Solutions: High-throughput query processing architectures
  • 🔄 Adaptive Systems: Dynamic adjustment based on query complexity and domain

Technical Contributions:

  • Latency-Accuracy Trade-offs: Novel algorithms balancing speed with information quality (see the sketch after this list)
  • Multi-source Integration: Techniques for combining information from diverse sources
  • Reliability Metrics: Framework for assessing information freshness vs. credibility
  • Real-world Validation: Testing across diverse domains and query types
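
To make the latency-budget and credibility-filtering idea concrete, here is a self-contained Python sketch. It is not the system presented at the workshop: the toy corpus, lexical-overlap relevance function, credibility scores, and budget values are all illustrative assumptions.

```python
# Hedged sketch of a latency-budgeted, credibility-filtered retrieval step.
# NOT the workshop system; corpus, scores, and budgets are assumptions.
import time
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    credibility: float  # assumed 0-1 source-credibility score

# Toy in-memory "index" standing in for a real retrieval backend (assumption).
CORPUS = [
    Passage("ORBIT curates domain data from web-scale sources.", 0.9),
    Passage("Unverified forum post about dataset curation.", 0.3),
    Passage("Survey of retrieval-augmented generation systems.", 0.8),
]

def relevance(query: str, passage: Passage) -> float:
    """Crude lexical-overlap relevance; a real system would use dense retrieval."""
    q = set(query.lower().split())
    p = set(passage.text.lower().split())
    return len(q & p) / max(len(q), 1)

def retrieve(query: str, latency_budget_s: float = 0.05,
             min_credibility: float = 0.5, top_k: int = 2) -> list:
    """Scan passages until the latency budget is spent, keeping credible hits only."""
    deadline = time.monotonic() + latency_budget_s
    scored = []
    for passage in CORPUS:
        if time.monotonic() > deadline:
            break  # trade completeness for latency once the budget is spent
        if passage.credibility < min_credibility:
            continue  # quality filter: drop low-credibility sources
        scored.append((relevance(query, passage), passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:top_k]]

if __name__ == "__main__":
    for hit in retrieve("retrieval-augmented generation systems"):
        print(hit.text)
```

The deadline check is the trade-off in miniature: once the budget is spent the scan stops and the best credible hits found so far are returned, rather than waiting for an exhaustive search.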

📋 Workshop Impact: This work contributed to the SIGIR community's understanding of next-generation retrieval systems and sparked collaborations with industry researchers working on similar challenges.

Works in Progress

🚧 Future Publications Pipeline

ORBIT 2.0: Multi-Modal Dataset Curation

Target Venue: ICLR 2025

Status: Experiments in progress

Focus: Extending ORBIT methodology to handle text, images, and code simultaneously for more comprehensive domain adaptation.

Efficiency Metrics for Human-AI Learning

Target Venue: CHI 2025

Status: Data collection phase

Focus: Quantitative framework for measuring and optimizing learning efficiency in human-AI collaborative systems.

Industrial Applications of Domain Adaptation

Target Venue: Industry Track - KDD 2025

Status: Collaboration with Capital One

Focus: Real-world deployment lessons from applying academic research in financial services.

Timeline: Expecting 2-3 additional publications in 2025, building on the foundation established by ORBIT and LiveRAG research.

Research Timeline & Evolution

📅 Publication Journey

December 2024 - ACL Findings

ORBIT Paper Published

Breakthrough methodology for domain-specific dataset curation, establishing foundation for specialized AI systems.

July 2024 - SIGIR Workshop

LiveRAG Research Presented

Advanced retrieval-augmented generation techniques, focusing on real-time information access and integration.

2025 Pipeline - Multiple Venues

Next-Generation Research

Building on established foundation with multi-modal approaches, human-AI collaboration, and industrial applications.

Research Impact & Metrics

  • 73% GPT-4o preference for ORBIT-trained models
  • 10B+ tokens curated from the 1.3T-token source dataset
  • 3 domains validated: astronomy, law, medicine
  • 100% open source: code, data, and models

Citation Information

If you find my work useful in your research, please consider citing:

```bibtex
@article{modesitt2024orbit,
  title={ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study},
  author={Modesitt, Eric and Yang, Ke and Hulsey, Spencer and Zhai, Chengxiang and Kindratenko, Volodymyr},
  journal={arXiv preprint arXiv:2412.14436},
  year={2024}
}
```

*BibTeX for the LiveRAG work will be added upon full publication.*

Interested in Collaboration?

I'm always looking for opportunities to collaborate on research related to learning efficiency, domain adaptation, and practical AI applications, whether you're interested in extending existing work or exploring new directions.

Start a Research Conversation