ScholarLens: Building an AI Agent for Research Trend Discovery
How I designed an agentic system that crawls, labels, and analyzes thousands of academic papers to find blue-ocean opportunities
Every year, top-tier venues like CHI and EMNLP publish thousands of papers. As a PhD student working at the intersection of NLP, HCI, and Computational Social Science, I found myself drowning in tabs, bookmarks, and half-read PDFs. I needed a better way to survey research landscapes, spot emerging trends, and discover the "blue oceans" where my skills could create the most impact. So I designed ScholarLens — an agentic AI system that does exactly that.
The Problem: Information Overload in Academia
CHI 2024 alone accepted over 1,000 papers. EMNLP is similar. Multiply that by a handful of venues across HCI, NLP, and ML, toss in three to four years of history, and you are staring at 5,000+ papers. No human can process that.
Existing tools like Google Scholar and Semantic Scholar are great for searching when you already know what you want. But they fail at the meta-questions researchers actually need answered:
- What topics are growing the fastest in my field?
- Which cross-domain intersections are still underexplored?
- Given my specific background, what research directions could I uniquely pursue?
These are not search queries — they are analytical tasks that require ingesting thousands of papers, classifying them, and reasoning over the results. That is exactly what an AI agent can do.
What Is ScholarLens?
ScholarLens is a six-module pipeline. Raw papers go in; personalized research insights come out. Every step is orchestrated by LLM-powered agents through LangChain, and the entire system is designed so that you bring your own API key — nothing is stored, nothing is tracked.
ScholarLens: End-to-End Pipeline
The pipeline works in stages. First, papers are crawled from venues like ACL Anthology, ACM Digital Library, and OpenReview. Then an AI labeler classifies every paper using a hybrid taxonomy of preset tags and emergent labels that the model discovers on its own. Your personal research profile is parsed and matched against the labeled corpus. The recommendation engine surfaces the top papers for you. A deep reader can analyze any paper in full. And finally, the trend analyzer identifies growth trajectories and blue-ocean opportunities personalized to your skills.
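The staged flow above can be sketched as a simple orchestrator. All function names and data shapes here are illustrative stand-ins, not the actual ScholarLens API:

```python
# Illustrative pipeline skeleton: each stage is a plain function so any
# stage can be re-run independently (names are hypothetical).

def crawl(venues, years):
    """Fetch paper metadata from the selected venues and year ranges."""
    return [{"title": f"Paper {i}", "venue": venues[0], "year": years[0]}
            for i in range(3)]  # stand-in for real crawling

def label(papers):
    """Attach preset + discovered tags to each paper."""
    for p in papers:
        p["tags"] = ["llm_agent"]  # stand-in for an LLM labeling call
    return papers

def recommend(papers, profile):
    """Rank papers by tag overlap with the user's research profile."""
    return sorted(papers, key=lambda p: len(set(p["tags"]) & profile),
                  reverse=True)

def run_pipeline(venues, years, profile):
    return recommend(label(crawl(venues, years)), profile)

top = run_pipeline(["EMNLP"], [2024], {"llm_agent", "rag"})
print(top[0]["tags"])  # ['llm_agent']
```

Keeping stages as independent functions is what makes incremental re-runs (re-crawl without re-labeling, re-rank after a profile change) cheap.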
System Architecture
The stack is intentionally pragmatic. The frontend is a Next.js 14 app with Tailwind and shadcn/ui for a clean, responsive interface. The backend is Python FastAPI — chosen for its async capabilities and natural integration with LangChain. The database is plain SQLite, because for a single-researcher tool, it is fast, zero-config, and portable.
System Architecture: Next.js + FastAPI + LangChain
The key architectural decision is user-owned API keys. Every LLM call — labeling, recommendation, deep reading, trend analysis — is powered by keys the user provides. The backend receives them via HTTP headers, uses them for that request, and discards them. This makes the system privacy-first by design and allows users to pick any model they prefer: GPT-4o, Claude, DeepSeek, Qwen, or any OpenAI-compatible endpoint.
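The per-request key handling can be sketched framework-free. In the real FastAPI app this would live in a dependency; the header names and default base URL below are assumptions for illustration:

```python
# Sketch of per-request, user-owned key handling (header names are
# assumptions). The key is read from the request headers, used to build a
# one-off client config, and never persisted: no logging, no database.

API_KEY_HEADER = "x-llm-api-key"
BASE_URL_HEADER = "x-llm-base-url"

def client_config_for_request(headers: dict) -> dict:
    key = headers.get(API_KEY_HEADER)
    if not key:
        raise PermissionError("missing user-supplied LLM API key")
    # The config exists only for the lifetime of this request.
    return {
        "api_key": key,
        "base_url": headers.get(BASE_URL_HEADER, "https://api.openai.com/v1"),
    }

cfg = client_config_for_request({"x-llm-api-key": "sk-demo"})
print(cfg["base_url"])  # https://api.openai.com/v1
```

Accepting an optional base-URL header is what makes any OpenAI-compatible endpoint a drop-in choice.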
Module 1: Paper Crawler
The crawler supports three families of data sources:
- ACL Anthology (EMNLP, ACL, NAACL) — clean HTML or API, freely available PDFs. The easiest source to crawl.
- ACM Digital Library (CHI, CSCW, UIST) — requires careful rate limiting and Selenium as a fallback for anti-bot measures.
- OpenReview (NeurIPS, ICML) — an official API that returns structured JSON. The cleanest source.
Users select which venues and year ranges to crawl from a dashboard. The system deduplicates incrementally, so re-running a crawl only fetches new papers. Progress is streamed back to the UI via WebSocket updates.
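Incremental deduplication falls out naturally from SQLite's conflict handling. A minimal sketch (the table layout and dedup key are assumptions), keyed on normalized title plus venue:

```python
import sqlite3

# Minimal incremental dedup: re-running a crawl inserts only unseen papers.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE papers (
    key   TEXT PRIMARY KEY,   -- normalized title + venue
    title TEXT, venue TEXT, year INTEGER)""")

def upsert(paper: dict) -> bool:
    """Insert the paper if new; return True only if it was actually inserted."""
    key = f'{paper["title"].lower().strip()}|{paper["venue"]}'
    cur = conn.execute(
        "INSERT OR IGNORE INTO papers (key, title, venue, year) VALUES (?,?,?,?)",
        (key, paper["title"], paper["venue"], paper["year"]))
    return cur.rowcount == 1

p = {"title": "Agents for Science", "venue": "EMNLP", "year": 2024}
print(upsert(p), upsert(p))  # True False  (second crawl is a no-op)
```

The boolean return doubles as the progress signal: the crawler only streams a WebSocket update for papers that were actually new.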
Module 2: Intelligent Labeler
This is where things get interesting. Every paper needs to be classified — but not just into fixed buckets. The labeler uses a hybrid strategy: a curated preset taxonomy (organized by task type, methodology, application domain, and research type) combined with AI-discovered labels that emerge from the data.
For example, the NLP taxonomy includes preset labels like llm_agent, rag, health_nlp, and social_media_analysis. But the LLM might discover that a cluster of 2025 papers is about "synthetic data generation" — a topic not in the original pool. It creates that label automatically, and the system adds it to the taxonomy for future papers.
Intelligent Labeling: Preset + AI-Discovered Tags (NLP and HCI label sets)
Labeling happens in batches of ten papers per API call to keep token costs low. For a corpus of 3,000 papers, the initial labeling run consumes roughly one million tokens — about $2–3 with a mid-range model. Each paper gets 2–6 tags with confidence scores, and users can manually override any label to refine the system over time.
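The batching itself is simple. Here is a sketch that groups papers into batches of ten and parses a hypothetical JSON response shape from the labeling call (the response format is an assumption, not the actual prompt contract):

```python
import json

BATCH_SIZE = 10  # ten papers per API call keeps token costs low

def batched(items, size=BATCH_SIZE):
    """Yield fixed-size batches of papers for one labeling call each."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def parse_labels(response_text: str) -> dict:
    """Parse an assumed response shape: {"paper_id": [["tag", conf], ...]}.
    Keeps at most 6 tags per paper, highest confidence first."""
    labels = json.loads(response_text)
    return {pid: sorted(tags, key=lambda t: -t[1])[:6]
            for pid, tags in labels.items()}

papers = [{"id": str(i)} for i in range(25)]
print(sum(1 for _ in batched(papers)))  # 3 batches: 10 + 10 + 5

fake = '{"0": [["rag", 0.7], ["llm_agent", 0.9]]}'
print(parse_labels(fake)["0"][0][0])  # llm_agent
```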
Module 3: AI Deep Reader
When you find a paper worth studying, the Deep Reader goes beyond a summary. It runs a structured LLM analysis that produces a multi-faceted report: one-sentence overview, research gap and motivation, methodology breakdown, key findings with evidence, critical assessment of strengths and weaknesses, and — most valuably — ideas for your own research based on your profile.
AI Deep Read: Structured Paper Analysis
PDFs are acquired automatically from open sources (ACL Anthology, arXiv, OpenReview) or uploaded manually. The system converts them to Markdown using one of several methods — PyMuPDF for basic extraction, Marker for higher quality, or Mathpix / Gemini for best results. Long papers that exceed the model's context window are split into logical sections, analyzed independently, and synthesized in a final pass.
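The split-then-synthesize step can be sketched as heading-based chunking of the converted Markdown. This is a simplification of the real logic, with an arbitrary character budget standing in for the model's context window:

```python
import re

def split_sections(markdown: str, max_chars: int = 12_000) -> list[str]:
    """Split a paper's Markdown at top-level headings, then pack sections
    into chunks that stay under a rough context-window budget."""
    sections = re.split(r"(?m)^(?=# )", markdown)  # split *before* each '# '
    chunks, current = [], ""
    for sec in sections:
        if current and len(current) + len(sec) > max_chars:
            chunks.append(current)
            current = ""
        current += sec
    if current:
        chunks.append(current)
    return chunks

doc = "# Intro\n...\n# Method\n...\n# Results\n..."
print(len(split_sections(doc)))  # 1  (a short doc fits in one chunk)
```

Each chunk is analyzed independently, and a final pass over the per-chunk results produces the synthesized report.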
Module 4: Trend & Blue Ocean Discovery
The trend analyzer is the crown jewel. It operates on three dimensions:
- Temporal trends — compute paper counts per label per year, identify growth rates and acceleration. Which topics are surging? Which are plateauing?
- Cross-domain mapping — build a co-occurrence matrix between NLP and HCI labels. A topic that is hot in NLP but barely touched in HCI is a potential cross-pollination opportunity.
- Personalized blue-ocean discovery — given your unique combination of skills (e.g., NLP + HCI + public health + LLMs), the system identifies intersections where few papers exist but the building blocks are in place.
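The first two dimensions reduce to simple counting before any LLM gets involved. A sketch with toy data:

```python
from collections import Counter
from itertools import product

# Toy labeled corpus: (year, nlp_tags, hci_tags) per paper.
papers = [
    (2023, {"llm_agent"}, {"accessibility"}),
    (2024, {"llm_agent", "rag"}, set()),
    (2024, {"llm_agent"}, {"accessibility"}),
]

# Temporal trend: papers per (label, year), then year-over-year growth.
per_year = Counter((tag, year) for year, nlp, hci in papers for tag in nlp | hci)
growth = per_year[("llm_agent", 2024)] / per_year[("llm_agent", 2023)]
print(growth)  # 2.0

# Cross-domain co-occurrence: NLP tag x HCI tag counts. Sparse cells where
# one side alone is hot are the blue-ocean candidates.
cooc = Counter(pair for _, nlp, hci in papers for pair in product(nlp, hci))
print(cooc[("llm_agent", "accessibility")])  # 2
```

Only the third dimension, matching these statistics against a personal profile, needs LLM reasoning; the counts above are just its input.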
Blue Ocean Discovery: Finding Underexplored Intersections
The LLM synthesizes all three dimensions into a structured report with specific, actionable research questions you could pursue. For a researcher at the intersection of Computational Social Science and NLP, the system might surface ideas like "Apply LLM agent frameworks to analyze media polarization at scale," a direction with strong tooling on the NLP side but few studies connecting it to real-world social dynamics.
Privacy-First Design
A deliberate constraint shaped the entire architecture: no user data should leave the browser without explicit action. API keys live in localStorage. Research profiles are parsed on-demand and never persisted on the backend. The SQLite database is local. This is a personal research tool, not a SaaS platform.
This design also makes the system model-agnostic. You can use official OpenAI, Anthropic, or Google endpoints, or route through any OpenAI-compatible proxy. Switch models between tasks — use a cheap one for batch labeling, an expensive one for deep reading. The choice is always yours.
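Task-level model routing can be as simple as a mapping; the model names below are examples of the cheap-vs-capable split, not recommendations baked into ScholarLens:

```python
# Illustrative per-task routing: a cheap model for bulk labeling, a
# stronger one for deep reading. Model names are examples only.
TASK_MODELS = {
    "labeling":  {"model": "gpt-4o-mini", "temperature": 0.0},
    "deep_read": {"model": "gpt-4o",      "temperature": 0.2},
    "trends":    {"model": "gpt-4o-mini", "temperature": 0.3},
}

def model_for(task: str) -> dict:
    """Fall back to the cheapest configuration for unknown tasks."""
    return TASK_MODELS.get(task, TASK_MODELS["labeling"])

print(model_for("deep_read")["model"])  # gpt-4o
```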
What's Next
ScholarLens is currently in active development. The MVP targets two venues — EMNLP and CHI — with Phase 2 expanding to ACL, NAACL, CSCW, UIST, NeurIPS, and ICML. The longer-term goal is a system where users can add any venue and the crawler adapts automatically.
If you are a researcher feeling the weight of thousands of unread papers, the vision is simple: let the agent handle the landscape, so you can focus on the science.
ScholarLens is an open project. If you are interested in contributing or want to try the system when it launches, feel free to reach out.