
ScholarLens: Building an AI Agent for Research Trend Discovery

How I designed an agentic system that crawls, labels, and analyzes thousands of academic papers to find blue-ocean opportunities

Xiaohan · 10 min read

Every year, top-tier venues like CHI and EMNLP publish thousands of papers. As a PhD student working at the intersection of NLP, HCI, and Computational Social Science, I found myself drowning in tabs, bookmarks, and half-read PDFs. I needed a better way to survey research landscapes, spot emerging trends, and discover the "blue oceans" where my skills could create the most impact. So I designed ScholarLens — an agentic AI system that does exactly that.

The Problem: Information Overload in Academia

CHI 2024 alone accepted over 1,000 papers. EMNLP is similar. Multiply that by a handful of venues across HCI, NLP, and ML, toss in three to four years of history, and you are staring at 5,000+ papers. No human can process that.

Existing tools like Google Scholar and Semantic Scholar are great for searching when you already know what you want. But they fail at the meta questions researchers actually need answered:

  • What topics are growing the fastest in my field?
  • Which cross-domain intersections are still underexplored?
  • Given my specific background, what research directions could I uniquely pursue?

These are not search queries — they are analytical tasks that require ingesting thousands of papers, classifying them, and reasoning over the results. That is exactly what an AI agent can do.

Key Insight: The hardest part of literature review is not finding papers — it is understanding the landscape. ScholarLens automates the landscape mapping so you can focus on the creative leap of identifying your next project.

What Is ScholarLens?

ScholarLens is a six-module pipeline. Raw papers go in; personalized research insights come out. Every step is orchestrated by LLM-powered agents through LangChain, and the entire system is designed so that you bring your own API key — nothing is stored, nothing is tracked.

ScholarLens: End-to-End Pipeline

[Pipeline diagram: 1 Crawl Papers → 2 Label & Classify → 3 Profile User → 4 Recommend Top Papers → 5 Deep Read → 6 Analyze Trends & Blue Oceans]

The pipeline works in stages. First, papers are crawled from venues like ACL Anthology, ACM Digital Library, and OpenReview. Then an AI labeler classifies every paper using a hybrid taxonomy of preset tags and emergent labels that the model discovers on its own. Your personal research profile is parsed and matched against the labeled corpus. The recommendation engine surfaces the top papers for you. A deep reader can analyze any paper in full. And finally, the trend analyzer identifies growth trajectories and blue-ocean opportunities personalized to your skills.

System Architecture

The stack is intentionally pragmatic. The frontend is a Next.js 14 app with Tailwind and shadcn/ui for a clean, responsive interface. The backend is Python FastAPI — chosen for its async capabilities and natural integration with LangChain. The database is plain SQLite, because for a single-researcher tool, it is fast, zero-config, and portable.

System Architecture: Next.js + FastAPI + LangChain

[Architecture diagram: Frontend — Next.js 14 / React / Tailwind (Dashboard, Explorer, Deep Reader, Trend & Blue Ocean views) ⇄ REST API ⇄ Backend — Python FastAPI + LangChain (Crawler, Labeler, Trend Analyzer). User-owned API keys are passed via headers, never stored; SQLite holds papers, labels, profiles, and cache.]

The key architectural decision is user-owned API keys. Every LLM call — labeling, recommendation, deep reading, trend analysis — is powered by keys the user provides. The backend receives them via HTTP headers, uses them for that request, and discards them. This makes the system privacy-first by design and allows users to pick any model they prefer: GPT-4o, Claude, DeepSeek, Qwen, or any OpenAI-compatible endpoint.
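In practice this amounts to reading the key from a request header and using it only within that request's scope. A minimal, framework-free sketch of that resolution step (the header name `X-LLM-API-Key` is hypothetical, not the project's actual header):

```python
# Hypothetical sketch: resolve the user's LLM API key from request
# headers per call. The key is returned to the caller for this one
# request and never written to disk or any database.
def resolve_api_key(headers: dict) -> str:
    key = headers.get("X-LLM-API-Key")
    if not key:
        raise PermissionError("missing user-provided API key")
    return key  # used for this request only, then discarded

key = resolve_api_key({"X-LLM-API-Key": "sk-demo"})
```

In a FastAPI app the same idea would live in a dependency that extracts the header per request, keeping key handling out of the endpoint logic entirely.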

Module 1: Paper Crawler

The crawler supports three families of data sources:

  • ACL Anthology (EMNLP, ACL, NAACL) — clean HTML or API, freely available PDFs. The easiest source to crawl.
  • ACM Digital Library (CHI, CSCW, UIST) — requires careful rate limiting and Selenium as a fallback for anti-bot measures.
  • OpenReview (NeurIPS, ICML) — an official API that returns structured JSON. The cleanest source.

Users select which venues and year ranges to crawl from a dashboard. The system deduplicates incrementally, so re-running a crawl only fetches new papers. Progress is streamed back to the UI via WebSocket updates.

Design Choice: Why three separate crawling strategies instead of a generic one? Because each data source has radically different structures, anti-scraping measures, and access policies. A strategy-per-source approach is more resilient than a one-size-fits-all scraper.
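The strategy-per-source idea can be sketched as a registry of per-family fetchers plus incremental deduplication, so a re-crawl only returns papers not yet seen. All names and data here are illustrative, not the project's real API:

```python
# Hypothetical sketch: one fetcher per source family, registered in a
# dict, with incremental dedup by paper ID across crawl runs.
def fetch_acl(venue: str, year: int) -> list[dict]:
    # Real system: parse ACL Anthology HTML or call its API.
    return [{"id": f"{venue}-{year}-001", "title": "Paper A"}]

def fetch_openreview(venue: str, year: int) -> list[dict]:
    # Real system: call the OpenReview API for structured JSON.
    return [{"id": f"{venue}-{year}-001", "title": "Paper B"}]

CRAWLERS = {"acl": fetch_acl, "openreview": fetch_openreview}

def crawl(source: str, venue: str, year: int, seen_ids: set) -> list[dict]:
    new = [p for p in CRAWLERS[source](venue, year) if p["id"] not in seen_ids]
    seen_ids.update(p["id"] for p in new)
    return new

seen: set = set()
first = crawl("acl", "emnlp", 2024, seen)   # fetches one new paper
again = crawl("acl", "emnlp", 2024, seen)   # re-run: nothing new
```

Adding a new venue family then means writing one fetcher and registering it, without touching the shared dedup logic.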

Module 2: Intelligent Labeler

This is where things get interesting. Every paper needs to be classified — but not just into fixed buckets. The labeler uses a hybrid strategy: a curated preset taxonomy (organized by task type, methodology, application domain, and research type) combined with AI-discovered labels that emerge from the data.

For example, the NLP taxonomy includes preset labels like llm_agent, rag, health_nlp, and social_media_analysis. But the LLM might discover that a cluster of 2025 papers is about "synthetic data generation" — a topic not in the original pool. It creates that label automatically, and the system adds it to the taxonomy for future papers.

Intelligent Labeling: Preset + AI-Discovered Tags

NLP labels: LLM Agent · RAG · Health NLP · Social Media · Misinformation

HCI labels: AI-HCI · Accessibility · Social Computing · Health & Wellbeing · Privacy

Labeling happens in batches of ten papers per API call to keep token costs low. For a corpus of 3,000 papers, the initial labeling run consumes roughly one million tokens — about $2–3 with a mid-range model. Each paper gets 2–6 tags with confidence scores, and users can manually override any label to refine the system over time.
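The batching logic itself is simple; a sketch with a stand-in for the real LLM call (`label_batch` here is a hypothetical placeholder that returns fixed tags):

```python
# Hypothetical sketch: group papers ten per LLM call to amortize prompt
# overhead, collecting (tag, confidence) pairs per paper ID.
def batches(items: list, size: int = 10):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def label_batch(papers: list[dict], taxonomy: list[str]) -> dict:
    # Real system: one LLM call that returns tags + confidences
    # for all ten papers at once.
    return {p["id"]: [(taxonomy[0], 0.9)] for p in papers}

papers = [{"id": i, "abstract": "..."} for i in range(25)]
labels: dict = {}
for batch in batches(papers):
    labels.update(label_batch(batch, ["llm_agent", "rag"]))
# 25 papers -> 3 API calls instead of 25
```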

Module 3: AI Deep Reader

When you find a paper worth studying, the Deep Reader goes beyond a summary. It runs a structured LLM analysis that produces a multi-faceted report: one-sentence overview, research gap and motivation, methodology breakdown, key findings with evidence, critical assessment of strengths and weaknesses, and — most valuably — ideas for your own research based on your profile.

AI Deep Read: Structured Paper Analysis

[Diagram: PDF → LLM Analysis (+ User Profile) → Structured Insights, presented in tabs: Overview · Motivation · Method · Findings · Assessment · Ideas for You]

PDFs are acquired automatically from open sources (ACL Anthology, arXiv, OpenReview) or uploaded manually. The system converts them to Markdown using one of several methods — PyMuPDF for basic extraction, Marker for higher quality, or Mathpix / Gemini for best results. Long papers that exceed the model's context window are split into logical sections, analyzed independently, and synthesized in a final pass.
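A heading-based splitter is one plausible way to implement that sectioning step; a sketch assuming the converted Markdown marks sections with `##` headings (the real system may split differently):

```python
# Hypothetical sketch: split a Markdown paper at level-2 headings so
# each logical section can be analyzed independently, then synthesized.
def split_sections(markdown: str) -> list[str]:
    sections: list[list[str]] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if line.startswith("## ") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

doc = "## Intro\ntext\n## Method\ntext\n## Results\ntext"
parts = split_sections(doc)  # three sections, analyzed independently
```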

The "Ideas for You" Tab: This is the feature I am most excited about. The system knows your research background, your methods, your publication venues. When it reads a paper, it generates concrete, actionable ideas for how you could extend that work. It turns passive reading into active ideation.

Module 4: Trend & Blue Ocean Discovery

The trend analyzer is the crown jewel. It operates on three dimensions:

  1. Temporal trends — compute paper counts per label per year, identify growth rates and acceleration. Which topics are surging? Which are plateauing?
  2. Cross-domain mapping — build a co-occurrence matrix between NLP and HCI labels. A topic that is hot in NLP but barely touched in HCI is a potential cross-pollination opportunity.
  3. Personalized blue-ocean discovery — given your unique combination of skills (e.g., NLP + HCI + public health + LLMs), the system identifies intersections where few papers exist but the building blocks are in place.
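The first dimension needs nothing more than counters: paper counts per (label, year) pair, then a year-over-year growth rate. A sketch with illustrative data:

```python
# Hypothetical sketch of the temporal-trend computation: count papers
# per (label, year), then compute a simple year-over-year growth rate.
from collections import Counter

papers = [
    {"year": 2023, "labels": ["rag"]},
    {"year": 2024, "labels": ["rag", "llm_agent"]},
    {"year": 2024, "labels": ["rag"]},
]

counts: Counter = Counter()
for p in papers:
    for label in p["labels"]:
        counts[(label, p["year"])] += 1

def growth(label: str, y0: int, y1: int) -> float:
    a, b = counts[(label, y0)], counts[(label, y1)]
    return (b - a) / a if a else float("inf")

rag_growth = growth("rag", 2023, 2024)  # 1 -> 2 papers: 100% growth
```

The cross-domain dimension is the same idea with pairs: count co-occurrences of an NLP label and an HCI label on the same paper, and look for cells that are near-empty while both marginals are large.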

Blue Ocean Discovery: Finding Underexplored Intersections

[Diagram: NLP labels (LLM Agents, RAG, Health NLP) and HCI labels (AI-HCI, Social Computing, Accessibility) overlap in a blue-ocean zone of cross-domain opportunities — topics hot in NLP but rare in HCI, and vice versa]

The LLM synthesizes all three dimensions into a structured report with specific, actionable research questions you could pursue. For a researcher at the intersection of Computational Social Science and NLP, the system might surface ideas like: "Apply LLM agent frameworks to analyze media polarization at scale" — a direction with strong momentum in NLP but few HCI studies connecting it to real-world social outcomes.

Privacy-First Design

A deliberate constraint shaped the entire architecture: no user data should leave the browser without explicit action. API keys live in localStorage. Research profiles are parsed on-demand and never persisted on the backend. The SQLite database is local. This is a personal research tool, not a SaaS platform.

This design also makes the system model-agnostic. You can use official OpenAI, Anthropic, or Google endpoints, or route through any OpenAI-compatible proxy. Switch models between tasks — use a cheap one for batch labeling, an expensive one for deep reading. The choice is always yours.
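Per-task model choice can be as simple as a routing table keyed by task; a hypothetical sketch (model names and the endpoint URL are placeholders, not real services):

```python
# Hypothetical sketch: route each task to a different model/endpoint,
# e.g. a cheap model for batch labeling, a stronger one for deep reads.
MODEL_ROUTES = {
    "labeling":  {"base_url": "https://api.example.com/v1", "model": "cheap-mini"},
    "deep_read": {"base_url": "https://api.example.com/v1", "model": "strong-pro"},
}

def route(task: str) -> dict:
    # Fall back to the cheapest route for unknown tasks.
    return MODEL_ROUTES.get(task, MODEL_ROUTES["labeling"])

cfg = route("deep_read")
```

Because every route speaks the same OpenAI-compatible protocol, swapping providers is a config change, not a code change.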

What's Next

ScholarLens is currently in active development. The MVP targets two venues — EMNLP and CHI — with Phase 2 expanding to ACL, NAACL, CSCW, UIST, NeurIPS, and ICML. The longer-term goal is a system where users can add any venue and the crawler adapts automatically.

If you are a researcher feeling the weight of thousands of unread papers, the vision is simple: let the agent handle the landscape, so you can focus on the science.


ScholarLens is an open project. If you are interested in contributing or want to try the system when it launches, feel free to reach out.