    Big Tee Tech Hub
    Artificial Intelligence

    Achieving 10,000x training data reduction with high-fidelity labels

By big tee tech hub | August 11, 2025


    Experiments

    We wanted to understand which models and tasks would benefit most from our curation process. As baselines for our experiments, we fine-tuned two LLMs of different sizes (Gemini Nano-1 with 1.8B parameters and Nano-2 with 3.25B parameters) on two tasks of different complexity (lower and higher, based on expert alignment) using crowdsourced labels. Each crowdsourced data set has ~100K annotations and a strong class imbalance, with around 95% benign labels on average.
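To make the baseline data shape concrete, here is a toy sketch of the class imbalance described above. The counts are only the article's approximations (~100K annotations, ~95% benign), not the actual dataset:

```python
from collections import Counter

# Toy illustration of the crowdsourced baseline data: ~100K annotations,
# roughly 95% of them labeled benign. Numbers are the article's
# approximations, not the real dataset.
labels = ["benign"] * 95_000 + ["positive"] * 5_000

counts = Counter(labels)
benign_frac = counts["benign"] / sum(counts.values())
print(f"{benign_frac:.0%} benign; {counts['positive']:,} positives "
      f"out of {sum(counts.values()):,} annotations")
```

With an imbalance this strong, the scarce positive class is exactly where curation has the most leverage.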

We compared each of these four baseline conditions against the corresponding curated condition, in which each model (Nano-1 and Nano-2) was fine-tuned over multiple rounds using the curation process described above. At each iteration, we selected our curated set of examples and used them for model evaluation and fine-tuning. All models plateaued before reaching parity with the experts’ internal alignment, so we stopped at 6 iterations (~400 fine-tuning and ~250 evaluation samples) for the lower-complexity task and 5 iterations (~250 fine-tuning and ~150 evaluation samples) for the higher-complexity task. (The lower-complexity task had a larger variety of examples, which may account for the longer time needed to converge.) Both data sets had a final class balance of ~40% positive examples.
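The round-based loop above can be sketched as follows. This is a hypothetical skeleton: `curate_batch` and the alignment update are placeholders (the real system selects informative examples with expert review and measures Kappa alignment after each fine-tuning round, neither of which is public):

```python
import random

def curate_batch(pool, k, rng):
    """Stand-in for curation: pick k examples (here, at random)."""
    return rng.sample(pool, k)

def run_curation(pool, max_rounds, n_finetune, n_eval, kappa_ceiling, seed=0):
    rng = random.Random(seed)
    finetune_set, eval_set = [], []
    alignment = 0.0
    for _ in range(max_rounds):
        batch = curate_batch(pool, n_finetune + n_eval, rng)
        finetune_set.extend(batch[:n_finetune])
        eval_set.extend(batch[n_finetune:])
        # In the real pipeline: fine-tune on finetune_set, then measure
        # model-expert Kappa on eval_set. We fake a plateauing curve here.
        alignment = min(kappa_ceiling, alignment + 0.15)
        if alignment >= kappa_ceiling:  # stop at parity with expert agreement
            break
    return finetune_set, eval_set, alignment

# ~6 rounds of ~67 fine-tuning and ~42 eval samples per round reproduces
# the article's ~400 / ~250 totals for the lower-complexity task.
ft, ev, kappa = run_curation(list(range(100_000)), 6, 67, 42, 0.81)
```

The per-round sample sizes are back-calculated from the article's totals and are illustrative only.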

The table below provides an overview of the scale and quality of the data used in each condition. Through the curation process, experts reached an average pairwise Cohen’s Kappa of 0.81 (lower-complexity task) and 0.78 (higher-complexity task); we consider these values the ceiling for model performance. To assess the quality of our crowdsourced data, we calculated Kappa alignment between crowdsourced annotations and expert labels on our full curated set: 0.59 (lower complexity) and 0.41 (higher complexity).
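For reference, the agreement statistic used above can be computed directly. Cohen's Kappa corrects observed agreement for chance: kappa = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. A minimal pure-Python sketch, with pairwise averaging over raters as in the expert-agreement figures:

```python
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's Kappa between two equal-length label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    classes = set(a) | set(b)
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in classes)
    return (p_o - p_e) / (1 - p_e)

def mean_pairwise_kappa(ratings):
    """Average Kappa over all pairs of raters (dict: rater -> label list)."""
    pairs = list(combinations(ratings.values(), 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

Note the sketch assumes at least some disagreement is possible (it divides by 1 − p_e); production code such as scikit-learn's `cohen_kappa_score` handles the degenerate cases.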


