20241015172327

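Introduction

The review phase of this year's ICLR is now open, and I am particularly interested in the AI4Math topic. The official site does offer paper search, but it is restrictive in practice, and frequent queries can trigger usage limits.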

Figure: example of OpenReview's search limits

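To get at the information more flexibly, a crawler is the natural idea. Fortunately, OpenReview exposes a convenient REST API, so the data can be handled programmatically.

This tutorial shows how to use the OpenReview API to pull paper information automatically, which makes indexing, searching, filtering, and classifying papers more flexible. Every operation is wrapped in a function so it can be called from your own project.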

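About OpenReview

OpenReview is a platform that promotes openness in scientific communication, with a particular focus on making peer review more transparent. Its main features are:

  • Open peer review: OpenReview offers flexible review policies, so conferences and journals can configure the review process to fit their needs.

  • Open publishing: the platform supports the full manuscript workflow, including submission, editing, and review coordination, and keeps papers freely available, which advances open access.

  • Open discussion: accepted papers are shown together with their reviews and comments, and users can keep discussing them; how commenting works and who manages it is decided by the publishing venue.

  • Open API: OpenReview provides an easy-to-use REST API for accessing and managing user information, documents, review invitations, and more, which enables automation and efficient permission management.

  • Open source: much of the platform's code is open source and can be viewed and modified on GitHub, which improves transparency and extensibility.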

OpenReview API Guide

OpenReview offers two APIs: the legacy API and API 2. New venue requests generally default to API 2. Most operations work the same way in both, but some details, such as the JSON format, differ.
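
As one concrete illustration of the JSON difference: API 2 notes wrap each content field as {'value': ...} (as in the submission record shown later in this post), whereas legacy API notes store the value directly. Below is a minimal sketch of a helper that reads either shape. `get_content_value` is a hypothetical name for this example, not part of openreview-py.

```python
def get_content_value(content, key, default=None):
    """Read a content field from either an API 2 or a legacy-API content dict."""
    if key not in content:
        return default
    field = content[key]
    # API 2 wraps the payload: {'title': {'value': '...'}}
    if isinstance(field, dict) and 'value' in field:
        return field['value']
    # Legacy API stores it directly: {'title': '...'}
    return field


v2_content = {'title': {'value': 'Process-Driven Autoformalization in Lean 4'}}
v1_content = {'title': 'Process-Driven Autoformalization in Lean 4'}
print(get_content_value(v2_content, 'title') == get_content_value(v1_content, 'title'))
```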

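In this section we write functions on top of OpenReview's Python package to fetch and process paper data. Everything here can also be done on the OpenReview website, but the API allows more flexible and efficient processing.

For more information, see the OpenReview Documentation.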

Installation and Initialization

First, install the Python package openreview-py:

pip install openreview-py

Next, create a client object for the API calls. It is initialized with your OpenReview account credentials:

import openreview

# Client instance for API 2
client = openreview.api.OpenReviewClient(
    baseurl='https://api2.openreview.net',
    username='<your-username>',
    password='<your-password>'
)

# If you need the legacy API
old_client = openreview.Client(
    baseurl='https://api.openreview.net',
    username='<your-username>',
    password='<your-password>'
)
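
Hard-coding a password in a script makes it easy to leak. Below is a minimal sketch that reads the credentials from environment variables instead. The variable names OPENREVIEW_USERNAME and OPENREVIEW_PASSWORD and the helper `load_credentials` are choices made for this example.

```python
import os


def load_credentials(env=None):
    """Return (username, password) read from environment variables.

    The variable names are an assumption of this sketch.
    """
    env = os.environ if env is None else env
    username = env.get('OPENREVIEW_USERNAME')
    password = env.get('OPENREVIEW_PASSWORD')
    if not username or not password:
        raise RuntimeError('Set OPENREVIEW_USERNAME and OPENREVIEW_PASSWORD first')
    return username, password


# Demo with a stand-in mapping so the snippet runs without real credentials
demo_env = {'OPENREVIEW_USERNAME': 'user@example.com', 'OPENREVIEW_PASSWORD': 'secret'}
username, password = load_credentials(demo_env)
print(username)
```

The returned pair can then be passed as username= and password= when constructing the client.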

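Since this tutorial targets the most recent papers, we mainly use API 2.

A note on groups: every new user is automatically added to a few groups when the account is created, and those groups determine the user's permissions. In fact, a username is itself a group, and groups can be members of one another.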

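Getting the List of Venues

Fetch the existing venue IDs; there are currently 1841:

def get_venues(client):
    # The 'venues' group lists every venue ID as a member
    return client.get_group(id='venues').members

venues = get_venues(client)
print(len(venues))  # 1841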

Get the venue IDs related to this year's ICLR:

iclr_venues = [ven for ven in venues if 'ICLR.cc/2025' in ven]
print(iclr_venues)
# ['ICLR.cc/2025/Conference',
#  'ICLR.cc/2025/Workshop_Proposals',
#  'ICLR.cc/2025/BlogPosts']
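
In keeping with the one-function-per-operation approach of this post, the filter can be wrapped as well. This is only a sketch: `filter_venues` is a hypothetical helper, and the sample list stands in for the result of get_venues(client).

```python
def filter_venues(venues, keyword):
    """Return the venue IDs that contain `keyword`, sorted for stable output."""
    return sorted(v for v in venues if keyword in v)


# Stand-in for the list returned by get_venues(client)
sample_venues = [
    'ICLR.cc/2025/Conference',
    'ICLR.cc/2024/Workshop/AGI',
    'ICLR.cc/2025/BlogPosts',
    'ICLR.cc/2025/Workshop_Proposals',
]
print(filter_venues(sample_venues, 'ICLR.cc/2025'))
```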

Getting the List of Submissions

A submission in review is in one of three states: active, withdrawn, or desk rejected. The function below selects submissions through its status parameter (active submissions correspond to 'under_review'); it also accepts 'all' and 'accepted':

def get_submissions(client, venue_id, status='all'):
    # Retrieve the venue group information
    venue_group = client.get_group(venue_id)

    # Define the mapping of status to the respective content field
    status_mapping = {
        "all": venue_group.content['submission_name']['value'],
        "accepted": venue_group.id,  # Assuming 'accepted' status doesn't have a direct field
        "under_review": venue_group.content['submission_venue_id']['value'],
        "withdrawn": venue_group.content['withdrawn_venue_id']['value'],
        "desk_rejected": venue_group.content['desk_rejected_venue_id']['value']
    }

    # Fetch the corresponding submission invitation or venue ID
    if status in status_mapping:
        if status == "all":
            # Return all submissions regardless of their status
            return client.get_all_notes(invitation=f'{venue_id}/-/{status_mapping[status]}')

        # For all other statuses, use the content field 'venueid'
        return client.get_all_notes(content={'venueid': status_mapping[status]})

    raise ValueError(f"Invalid status: {status}. Valid options are: {list(status_mapping.keys())}")

For example, to fetch the papers of the ICLR 2024 AGI workshop:

get_submissions(client, 'ICLR.cc/2024/Workshop/AGI')  # 40 papers in total, 34 accepted

Fetch this year's ICLR submissions, 11604 in total:

venue_id = 'ICLR.cc/2025/Conference'
submissions = get_submissions(client, venue_id, 'under_review')
print(len(submissions))
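
Each note carries its own venueid in its content, so after fetching with status='all' the statuses can be tallied locally rather than with one request per status. Below is a sketch under that assumption. `count_by_venueid` is a hypothetical helper, SimpleNamespace objects stand in for the note objects returned by the client, and the withdrawn ID in the sample data is made up for illustration.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_venueid(notes):
    """Count notes per content['venueid']['value'] (API 2 layout)."""
    return Counter(
        note.content.get('venueid', {}).get('value', 'unknown')
        for note in notes
    )


def make_note(venueid):
    # Stand-in for an API 2 note; only the field used above is filled in
    return SimpleNamespace(content={'venueid': {'value': venueid}})


notes = [
    make_note('ICLR.cc/2025/Conference/Submission'),
    make_note('ICLR.cc/2025/Conference/Submission'),
    make_note('ICLR.cc/2025/Conference/Withdrawn_Submission'),  # illustrative ID
]
for venueid, n in count_by_venueid(notes).most_common():
    print(venueid, n)
```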

Extracting Information

Each entry in the submissions list from the previous step looks like this:

{'id': 'k8KsI84Ds7',
'forum': 'k8KsI84Ds7',
'content': {'title': {'value': 'Process-Driven Autoformalization in Lean 4'},
'keywords': {'value': ['Large Language Models',
'Autoformalization',
'Lean 4',
'Formal Math',
'Process Supervision',
'Formal Reasoning',
'Mathematical Reasoning',
'AI for Math',
'Automated Theorem Proving']},
'abstract': {'value': 'Autoformalization, the conversion of natural language mathematics into formal languages, offers significant potential for advancing mathematical reasoning. However, existing efforts are limited to formal languages with substantial online corpora and struggle to keep pace with rapidly evolving languages like Lean 4. To bridge this gap, we propose a large-scale dataset \\textbf{Form}alization for \\textbf{L}ean~\\textbf{4} (\\textbf{\\dataset}) designed to comprehensively evaluate the autoformalization capabilities of large language models (LLMs), encompassing both statements and proofs in natural and formal languages. Additionally, we introduce the\n\\textbf{P}rocess-\\textbf{D}riven \\textbf{A}utoformalization (\\textbf{\\method}) framework\nthat leverages the precise feedback from Lean 4 compilers to enhance autoformalization. \nExtensive experiments demonstrate that \\method improves autoformalization, enabling higher compiler accuracy and human-evaluation scores using less filtered training data. \nMoreover, when fine-tuned with data containing detailed process information, \\method exhibits enhanced data utilization, resulting in more substantial improvements in autoformalization for Lean 4.'},
'primary_area': {'value': 'neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)'},
'code_of_ethics': {'value': 'I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.'},
'submission_guidelines': {'value': 'I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.'},
'reciprocal_reviewing': {'value': 'I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.'},
'anonymous_url': {'value': 'I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.'},
'no_acknowledgement_section': {'value': 'I certify that there is no acknowledgement section in this submission for double blind review.'},
'venue': {'value': 'ICLR 2025 Conference Submission'},
'venueid': {'value': 'ICLR.cc/2025/Conference/Submission'},
'TLDR': {'value': 'We introduces the FormL4 benchmark to evaluate autoformalization in Lean 4, along with a process-supervised verifier that enhances the accuracy of LLMs in converting informal statements and proofs into formal ones.'},
'supplementary_material': {'value': '/attachment/68692f2a42584afe8acaad5c9d6e8dbce0c04945.zip'},
'pdf': {'value': '/pdf/0f8b696ac8eb6b5ad91e51aa8b3a61cc42b660d7.pdf'},
'_bibtex': {'value': '@inproceedings{\nanonymous2024processdriven,\ntitle={Process-Driven Autoformalization in Lean 4},\nauthor={Anonymous},\nbooktitle={Submitted to The Thirteenth International Conference on Learning Representations},\nyear={2024},\nurl={https://openreview.net/forum?id=k8KsI84Ds7},\nnote={under review}\n}'}},
'invitations': ['ICLR.cc/2025/Conference/-/Submission',
'ICLR.cc/2025/Conference/-/Post_Submission',
'ICLR.cc/2025/Conference/Submission2128/-/Full_Submission'],
'cdate': 1726825565513,
'odate': 1728008565725,
'mdate': 1728790320313,
'signatures': ['ICLR.cc/2025/Conference/Submission2128/Authors'],
'writers': ['ICLR.cc/2025/Conference',
'ICLR.cc/2025/Conference/Submission2128/Authors'],
'readers': ['everyone'],
'license': 'CC BY 4.0'}
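
The submission number (2128 in this record) is not a field of its own here, but it is embedded in the signatures and invitations paths. Below is a small sketch that recovers it with a regular expression. `submission_number` is a hypothetical helper and assumes the .../Submission<N>/... path layout seen above.

```python
import re


def submission_number(paths):
    """Return the first Submission<N> number found in the given paths, or None."""
    for path in paths:
        match = re.search(r'/Submission(\d+)/', path)
        if match:
            return int(match.group(1))
    return None


print(submission_number(['ICLR.cc/2025/Conference/Submission2128/Authors']))  # 2128
```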

Only part of this is needed. The function below extracts the title, abstract, keywords, primary area, TL;DR, the relevant dates, the forum link, and the PDF link:

from datetime import datetime

def extract_submission_info(submission):
    # Helper function to convert millisecond timestamps to a date string
    def convert_timestamp_to_date(timestamp):
        return datetime.fromtimestamp(timestamp / 1000).strftime('%Y-%m-%d') if timestamp else None

    # Extract the required information
    submission_info = {
        'id': submission.id,
        'title': submission.content['title']['value'],
        'abstract': submission.content['abstract']['value'],
        'keywords': submission.content['keywords']['value'],
        'primary_area': submission.content['primary_area']['value'],
        'TLDR': submission.content['TLDR']['value'] if 'TLDR' in submission.content else "",
        'creation_date': convert_timestamp_to_date(submission.cdate),
        'original_date': convert_timestamp_to_date(submission.odate),
        'modification_date': convert_timestamp_to_date(submission.mdate),
        'forum_link': f"https://openreview.net/forum?id={submission.id}",
        'pdf_link': f"https://openreview.net/pdf?id={submission.id}"
    }
    return submission_info
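
One thing to be aware of: datetime.fromtimestamp without a timezone uses the machine's local time, so the same record can yield different dates on different machines. Below is a sketch of a UTC-based variant. `ms_to_utc_date` is a hypothetical name for this example.

```python
from datetime import datetime, timezone


def ms_to_utc_date(timestamp_ms):
    """Convert a millisecond epoch timestamp to a 'YYYY-MM-DD' date in UTC."""
    if not timestamp_ms:
        return None
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.strftime('%Y-%m-%d')


# cdate of the record shown earlier
print(ms_to_utc_date(1726825565513))  # 2024-09-20
```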

Matching

On top of the extracted information, we write a more flexible matching function that supports both plain-text and regular-expression matching and can be restricted to specific fields:

import re
from typing import Union, List

def contains_text(submission: dict, target_text: str, fields: Union[str, List[str]] = ['title', 'abstract'], is_regex: bool = False) -> bool:
    # If 'all', consider all available keys in the submission for matching
    if fields == 'all':
        fields = ['title', 'abstract', 'keywords', 'primary_area', 'TLDR']

    # Convert string input for fields into a list
    if isinstance(fields, str):
        fields = [fields]

    # Iterate over the specified fields
    for field in fields:
        content = submission.get(field, "")

        # Join lists into a single string (e.g., keywords)
        if isinstance(content, list):
            content = " ".join(content)

        # Check if the target_text is found in the content of the field
        if is_regex:
            if re.search(target_text, content):
                return True
        else:
            if target_text in content:
                return True

    # If no matches were found, return False
    return False
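
Both branches above are case-sensitive, so searching for 'lean' would miss 'Lean'. Below is a sketch of a case-insensitive variant built on re.IGNORECASE and re.escape. `contains_text_ci` is a hypothetical name, and the sample record is abridged from the data shown earlier.

```python
import re


def contains_text_ci(submission, target_text, fields=('title', 'abstract'), is_regex=False):
    """Case-insensitive counterpart of contains_text."""
    # Escape plain text so characters like '+' or '.' are matched literally
    pattern = target_text if is_regex else re.escape(target_text)
    compiled = re.compile(pattern, re.IGNORECASE)
    for field in fields:
        content = submission.get(field) or ''
        if isinstance(content, list):
            content = ' '.join(content)
        if compiled.search(content):
            return True
    return False


paper = {
    'title': 'Process-Driven Autoformalization in Lean 4',
    'keywords': ['Autoformalization', 'Lean 4', 'Formal Math'],
}
print(contains_text_ci(paper, 'lean 4'))                            # True
print(contains_text_ci(paper, 'formal math', fields=['keywords']))  # True
print(contains_text_ci(paper, 'isabelle'))                          # False
```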

Searching

With the matching function in place, we can retrieve the papers that meet a condition:

from typing import Dict, List, Union

def search_submissions(submissions: List[Dict], target_text: str, fields: Union[str, List[str]] = ['title', 'abstract'], is_regex: bool = False) -> List[Dict]:
    """
    Search through the list of submissions and return those that match the target text.

    :param submissions: List of submission dictionaries to search through.
    :param target_text: The text to search for in each submission.
    :param fields: The fields to search within for matching. Default is ['title', 'abstract'].
    :param is_regex: Boolean flag indicating whether to use regex for matching. Default is False.
    :return: List of submissions matching the target text.
    """
    # List to hold matching submissions
    matching_submissions = []

    for submission in submissions:
        if contains_text(submission, target_text, fields, is_regex):
            matching_submissions.append(submission)

    return matching_submissions

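Summary

We now have a small set of utility functions for working with OpenReview papers:

  • get_venues: fetch the list of venues.
  • get_submissions: fetch the submissions of a venue, with filtering by status: under_review (active), withdrawn, desk_rejected.
  • extract_submission_info: extract the key information from a submission, such as title, abstract, keywords, primary area, TL;DR, forum link, and PDF link.
  • contains_text: a matching function that supports plain text and regular expressions and can search within specified fields.
  • search_submissions: a search function that returns the papers meeting the condition.

These functions can be saved to a file (for example paper_tool.py) for reuse across projects. It is also common to add post-processing functions that hand the filtered papers to a model for deeper analysis.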
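
As one example of such post-processing, the filtered records can be written out as CSV, either for manual review or for feeding to a model later. Below is a sketch using only the standard library. `to_csv` and the chosen column list are assumptions of this example.

```python
import csv
import io


def to_csv(infos, columns=('id', 'title', 'primary_area', 'forum_link')):
    """Serialize a list of extracted submission dicts to CSV text."""
    buffer = io.StringIO()
    # Keys outside `columns` are dropped rather than raising an error
    writer = csv.DictWriter(buffer, fieldnames=list(columns), extrasaction='ignore')
    writer.writeheader()
    for info in infos:
        # List-valued fields (e.g. keywords) are flattened into one cell
        row = {k: '; '.join(v) if isinstance(v, list) else v for k, v in info.items()}
        writer.writerow(row)
    return buffer.getvalue()


sample = [{
    'id': 'k8KsI84Ds7',
    'title': 'Process-Driven Autoformalization in Lean 4',
    'primary_area': 'neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)',
    'forum_link': 'https://openreview.net/forum?id=k8KsI84Ds7',
    'keywords': ['Autoformalization', 'Lean 4'],
}]
print(to_csv(sample))
```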

In Practice

Using the functions above, we extract the ICLR 2025 papers related to formalization:

import openreview
from paper_tool import *

# Initialize the client
client = openreview.api.OpenReviewClient(
    baseurl='https://api2.openreview.net',
    username='<your-username>',  # your OpenReview username
    password='<your-password>'   # your OpenReview password
)

# Fetch the submissions
venue_id = 'ICLR.cc/2025/Conference'
all_submissions = get_submissions(client, venue_id)
submissions = get_submissions(client, venue_id, 'under_review')

# Extract the paper data
submission_infos = [extract_submission_info(sub) for sub in submissions]
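
With more than eleven thousand submissions, there is little reason to repeat the download on every run. Below is a sketch that caches the extracted records in a local JSON file. `load_or_fetch` and the cache file name are assumptions of this example, and `fetch` stands for any zero-argument function that returns the list of records.

```python
import json
import os
import tempfile


def load_or_fetch(cache_path, fetch):
    """Return cached records if the file exists; otherwise call fetch() and cache them."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding='utf-8') as f:
            return json.load(f)
    records = fetch()
    with open(cache_path, 'w', encoding='utf-8') as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records


# Demo with a stand-in fetch function and a temporary directory
calls = []


def fake_fetch():
    calls.append(1)
    return [{'id': 'k8KsI84Ds7', 'title': 'Process-Driven Autoformalization in Lean 4'}]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'iclr2025_submissions.json')
    first = load_or_fetch(path, fake_fetch)   # calls fake_fetch and writes the cache
    second = load_or_fetch(path, fake_fetch)  # served from the cache file
print(first == second, len(calls))
```

In the script above, `fetch` could be a function that wraps the get_submissions and extract_submission_info calls.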

Next, we can use search_submissions to retrieve the papers that meet a condition. For example:

# Search keywords
langs = ['Lean', 'Isabelle', 'Coq', 'Lean4']
lang_regex = '|'.join(langs)
matching_submissions = search_submissions(submission_infos, lang_regex, is_regex=True, fields='all')
for mat in matching_submissions:
    print(mat['title'])
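
One caveat with the pattern above: a bare Lean also matches inside longer words such as 'Leaning'. Below is a sketch that escapes each keyword and adds word boundaries. `build_keyword_regex` is a hypothetical helper. With word boundaries in place, 'Lean4' has to stay in the list, because \bLean\b does not match 'Lean4'.

```python
import re


def build_keyword_regex(keywords):
    """Build a regex that matches any of the keywords as a whole word."""
    # Longest first so that 'Lean4' is tried before 'Lean'
    ordered = sorted(keywords, key=len, reverse=True)
    return r'\b(?:' + '|'.join(re.escape(k) for k in ordered) + r')\b'


pattern = build_keyword_regex(['Lean', 'Isabelle', 'Coq', 'Lean4'])
for text in ['Autoformalization in Lean 4', 'Proofs in Lean4', 'Leaning on priors']:
    print(text, '->', bool(re.search(pattern, text)))
```

The resulting pattern can be passed to search_submissions as target_text with is_regex=True.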

That's all. If you run into problems, feel free to leave a comment.