IST academic search engine named “Best Open Source Project” by BCS

By Virginia D. Bannon

Apr 19, 2022

UNIVERSITY PARK, Pennsylvania — CiteSeerX, one of the world’s premier open-source academic search engines, based in the Penn State College of Information Sciences and Technology (IST), has been recognized by the British Computer Society’s (BCS) Information Retrieval Specialist Group as the Best Open Source Project as part of its Search Industry Awards 2021.

“It’s an honor for Penn State and IST to have this recognition from such an important company,” said C. Lee Giles, David Reese Professor of Information Sciences and Technology and co-creator of the search engine.

Originally launched as CiteSeer in 1998 and renamed CiteSeerX in 2008, the search engine was one of the first platforms to implement automated citation indexing, which connects articles and researchers in a network. It actively crawls and harvests academic and scientific papers online and indexes their citations automatically, allowing users to find related articles through citation graphs. To perform this indexing and information extraction at scale, CiteSeerX uses several machine learning methods. It is often considered a predecessor of academic search tools such as Google Scholar and Microsoft Academic Search.
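The idea behind citation indexing can be illustrated with a small sketch. This is not CiteSeerX's code: the toy corpus, the paper IDs and the function names below are all hypothetical, and the reference lists are given by hand, whereas a real system has to extract them from the papers themselves. The sketch inverts each paper's reference list into a "cited by" index, counts citations, and ranks related papers by co-citation (how often two papers are cited together), which is one common way to use a citation graph.

```python
from collections import defaultdict

# Hypothetical toy corpus: each paper ID maps to the IDs of the papers it cites.
references = {
    "A": ["C", "D"],
    "B": ["C", "D", "E"],
    "C": ["E"],
    "D": [],
    "E": [],
}


def build_citation_index(references):
    """Invert the reference lists: map each paper to the set of papers citing it."""
    cited_by = defaultdict(set)
    for paper, cited in references.items():
        for target in cited:
            cited_by[target].add(paper)
    return cited_by


def citation_counts(references):
    """Return the number of incoming citations for every paper in the corpus."""
    cited_by = build_citation_index(references)
    return {paper: len(cited_by.get(paper, ())) for paper in references}


def related_by_cocitation(paper, references):
    """Rank other papers by how many citing papers they share with `paper`."""
    cited_by = build_citation_index(references)
    scores = defaultdict(int)
    for citer in cited_by.get(paper, ()):
        for other in references[citer]:
            if other != paper:
                scores[other] += 1
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    print(citation_counts(references))
    print(related_by_cocitation("C", references))
```

In this toy corpus, papers C and D are both cited by A and B, so D ranks as the paper most closely related to C.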

“Automatically, we were able to display the number of citations an article had gotten,” Giles said. “Indexing based on importance was revolutionary at the time.”

“Lee’s expertise in innovation and machine learning, as well as his mastery of developing new specialized search engines, including CiteSeerX, have elevated him to a world-renowned leader in his field,” added Andrew Sears, dean of the College of IST. “We are proud to join BCS in celebrating Lee and recognizing CiteSeerX as a leading platform more than a decade after its launch.”

CiteSeerX has grown to host over 10 million full-text documents in English, along with metadata covering 32 million authors and 240 million citations. It has three million individual users worldwide and receives one billion visits and 180 million downloads per year. The code and data supporting CiteSeerX have been open source since its inception, meaning anyone can adapt them to meet user needs.

“We don’t keep it to ourselves,” Giles said. “We shared it with others so they could build similar systems. Because it’s modular, it can be modified to suit their needs.”

CiteSeerX was funded by the National Science Foundation, Microsoft, NASA, and the Penn State College of Information Sciences and Technology. The original search engine, CiteSeer, was created by Giles and his colleagues Kurt Bollacker and Steve Lawrence when they were at the NEC Research Institute (now NEC Laboratories). Its second generation, CiteSeerX, was developed by Giles and Isaac G. Councill, who earned a doctorate from the College of IST in 2006 and continued with the college as a postdoctoral fellow until 2008. The next generation of CiteSeerX is currently in development at Penn State in collaboration with Jian Wu, assistant professor of computer science at Old Dominion University. According to Wu, the team is “refactoring CiteSeerX from Solr Lucene and mySQL to Elasticsearch, which are all open source.”

The BCS Search Industry Awards recognize individuals, projects and organizations that have excelled in the design of information search and retrieval products and services. A charity with a royal charter, BCS aims to lead the IT industry through its ethical challenges, support those who work in the industry and make computing good for society. BCS currently has over 60,000 members in 150 countries.

CiteSeerX can be found at citeseerx.ist.psu.edu.