    Scaling the Knowledge Graph Behind Wikipedia

    By big tee tech hub | July 11, 2025 | 6 Mins Read

    (Image courtesy Wikipedia)

    Keeping Wikipedia, the fifth most popular website on the Internet, running smoothly is no small feat. The free encyclopedia hosts more than 65 million articles in 340 different languages and serves 1.5 billion unique device visits per month. Behind the site’s front-end Web servers are a host of databases serving up data, including a massive knowledge graph hosted by Wikipedia’s sister project, Wikidata.

    As an open encyclopedia, Wikipedia relies on teams of editors to keep it accurate and up to date. The organization, which was founded in 2001 by Jimmy Wales and Larry Sanger, has established processes to ensure that changes are checked and that the data is accurate. (Even with those processes, some people complain about the accuracy of Wikipedia information.)

    If Wikipedia editors strive to maintain the accuracy of facts in Wikipedia articles, then the goal of the Wikidata knowledge graph is to document where those facts came from and to make those facts easy to share and consume outside of Wikipedia. That sharing includes allowing developers to access Wikipedia facts as machine-readable data that can be used in outside applications, says Lydia Pintscher, the portfolio lead for Wikidata.

    “It’s this basic stock of information that a lot of developers need for their applications,” Pintscher says. “We want to make that available to Wikipedia, but also really to anyone else out there. There are a large number of applications that people build with that data that are not Wikipedia.”

    For instance, data from Wikidata is piped directly into the digital travel assistant KDE Itinerary, which is developed by the free software community KDE (where Pintscher sits on the board). If a user is travelling to a certain country, KDE Itinerary can tell them which side of the road people drive on there, or what type of electrical adapter they will need.

    (Image courtesy Wikidata)

    “You can also say ‘Give me an image of the current mayor of Berlin’ and you will be able to get that, or ‘Give me the Facebook profile of this famous person,’” Pintscher tells BigDATAwire. “You will be able to get that with a simple API call.”
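As a rough illustration of what such a call could look like, the sketch below builds a request URL for the public Wikidata Query Service SPARQL endpoint. The article does not show a query, so everything here is an assumption made for illustration: the identifiers (Q64 for Berlin, P6 for head of government, P18 for image), the query shape, and the helper name `build_request_url`. The code only builds the URL; it does not send the request.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed identifiers: Q64 = Berlin, P6 = head of government, P18 = image.
MAYOR_IMAGE_QUERY = """
SELECT ?mayor ?mayorLabel ?image WHERE {
  wd:Q64 wdt:P6 ?mayor .
  ?mayor wdt:P18 ?image .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
""".strip()


def build_request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Return a GET URL asking the endpoint for JSON results (hypothetical helper)."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


url = build_request_url(MAYOR_IMAGE_QUERY)
print(url)
```

Fetching that URL with any HTTP client should return the matching rows as JSON, although whether the result is the current office holder depends on how the statements are ranked in Wikidata.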

    It is certainly a noble goal to gather the facts of the world into one place and then make them available via API. However, actually building such a system requires more than good intentions. It also requires infrastructure and software that can scale to meet the sizable digital demand.

    When Wikidata started in 2012, the organization selected a semantic graph database called Blazegraph to house the Wikipedia knowledge base. Blazegraph stores data as sets of Resource Description Framework (RDF) statements called triples, each of which expresses a subject-predicate-object relationship. Blazegraph allows users to query these RDF statements using the SPARQL query language.
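To make the subject-predicate-object idea concrete, here is a minimal in-memory sketch. It is not how Blazegraph stores or indexes data; the `Triple` type, the `match` helper, and the plain-text names are all hypothetical, and the wildcard matching only mimics the simplest kind of SPARQL triple pattern.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Toy facts; the names are illustrative, not real Wikidata identifiers.
TRIPLES = [
    Triple("Berlin", "instance_of", "city"),
    Triple("Berlin", "country", "Germany"),
    Triple("Germany", "driving_side", "right"),
    Triple("Japan", "driving_side", "left"),
]


def match(
    triples: Iterable[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that agree with every field that is not None (None is a wildcard)."""
    for t in triples:
        if subject is not None and t.subject != subject:
            continue
        if predicate is not None and t.predicate != predicate:
            continue
        if obj is not None and t.obj != obj:
            continue
        yield t


# Roughly: SELECT ?s WHERE { ?s driving_side "right" }
right_side = [t.subject for t in match(TRIPLES, predicate="driving_side", obj="right")]
print(right_side)  # ['Germany']
```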

    The Wikidata database started out small, but it has grown by leaps and bounds over the years. The size of the database increased substantially in the late 2010s when the team imported large amounts of data related to articles in scientific journals. For the past six years or so, it has grown more modestly. Today, the database encompasses about 116 million items, which corresponds to about 16 billion triples.

    That data growth is putting stress on the underlying data store. “It’s beyond what it was built for,” Pintscher says. “We’re stretching the limits there.”

    Semantic knowledge graphs store data in RDF triples

    Blazegraph is not a natively distributed database, but Wikidata’s dataset has grown so big that the team has had to shard the data manually so that it fits across multiple servers. The organization runs its own computing infrastructure with about 20 to 30 paid employees of the Wikimedia Foundation.

    Recently, the Wikidata team split the knowledge graph in two: one graph for the data from the scientific journals and another holding everything else. That doubles the maintenance effort for the Wikidata team, and it also creates more work for developers who want to use data from both databases.
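The extra work for developers can be sketched with two toy graphs. This is only an illustration under stated assumptions: the data, the predicate names, and the helpers `find` and `employers_of_authors` are hypothetical, and a real client would more likely send queries to two separate endpoints than hold both graphs in memory.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Two hypothetical graphs mirroring the split described above.
SCHOLARLY: List[Triple] = [
    ("paper:1", "author", "person:A"),
    ("paper:1", "published_in", "journal:X"),
]
MAIN: List[Triple] = [
    ("person:A", "employer", "university:U"),
    ("journal:X", "country", "Germany"),
]


def find(graph: List[Triple], predicate: str) -> List[Tuple[str, str]]:
    """Return (subject, object) pairs for one predicate in one graph."""
    return [(s, o) for s, p, o in graph if p == predicate]


def employers_of_authors(
    scholarly: List[Triple], main: List[Triple]
) -> List[Tuple[str, str, str]]:
    """Join across both graphs: paper -> author (scholarly), author -> employer (main)."""
    # Simplification: keeps a single employer per person.
    employer_by_person = dict(find(main, "employer"))
    return [
        (paper, author, employer_by_person[author])
        for paper, author in find(scholarly, "author")
        if author in employer_by_person
    ]


rows = employers_of_authors(SCHOLARLY, MAIN)
print(rows)  # [('paper:1', 'person:A', 'university:U')]
```

With a single graph the same question would be one lookup chain; once the data is split, the join has to happen in the client, as here, or through a federated query.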

    “What we’re struggling with is really the combination of the size of the data and the pace of change of that data,” Pintscher says. “So there are a lot of edits happening every day on Wikidata, and the amount of queries that people are sending, since it’s a public resource with people building applications on top of it.”

    But the biggest issue facing Wikidata is that Blazegraph has reached its end of life (EOL). In 2017, Amazon launched its own graph database, called Neptune, atop the open source Blazegraph database, and a year later, it acquired the company behind it. The database has not been updated since then.

    Pintscher and the Wikidata team are looking at alternatives to Blazegraph. The software must be open source and actively maintained. The organization would prefer to have a semantic graph database, and it has looked closely at QLever and MillenniumDB, among others. It is also considering property graph databases, such as Neo4j.

    “We haven’t made the final decision,” Pintscher says. “But so much of what Wikidata is about is related to RDF and being able to access it in SPARQL, so that is definitely a big factor.”

    Lydia Pintscher is the Portfolio Lead for Wikidata

    In the meantime, development work continues. The organization is looking at ways it can provide companies with access to Wikimedia content with certain service level guarantees. It’s also working on building a vector embedding of Wikidata data that can be used in retrieval-augmented generation (RAG) workflows for AI applications.
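The retrieval step of such a RAG workflow can be sketched as follows. This is not the Wikidata team's method, which the article does not describe: the bag-of-words `embed` function is a stand-in for a learned embedding model, and the documents and helper names are hypothetical.

```python
import math
from collections import Counter
from typing import List

# Toy statements standing in for text derived from knowledge-graph items.
DOCUMENTS = [
    "Berlin is the capital of Germany",
    "Blazegraph is a graph database queried with SPARQL",
    "Japan drives on the left side of the road",
]


def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector, not a learned model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: List[str], k: int = 1) -> List[str]:
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


context = retrieve("which side of the road does Japan drive on", DOCUMENTS)
print(context)  # ['Japan drives on the left side of the road']
```

In a full workflow, the retrieved text would be added to the prompt sent to a language model so that its answer is grounded in the stored facts.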

    Building a free and open knowledge base that encompasses a sizable swath of human knowledge is a noble endeavor. Developers are building interesting and useful applications with that data, and in some cases, such as the Organized Crime and Corruption Reporting Project, the data is going to help bring people to justice. That keeps Pintscher and her team motivated to continue pushing to find a new home for what might be the biggest repository of open data on the planet.

    “As someone who spent the last 13 years of her life working on open data, I truly do believe in open data and what it enables, especially because opening up that data allows other people to do things with it that you have not thought of,” Pintscher says. “There’s a ton of stuff that people are using the data for. That’s always great to see, because the work our community is putting into that every single day is paying off.”

    Related Items:

    Groups Step Up to Rescue At-Risk Public Data

    NSF-Funded Data Fabric Takes Flight

    Prolific Puts People, Ethics at Center of Data Curation Platform


