Examples¶
Getting Papers¶
To scrape a paper from the Semantic Scholar (S2) API you first need an S2 paper identifier. This can be found at the end of the URL of a paper on Semantic Scholar.
For example, this paper
has the S2 identifier 8d8844106e7bc83d49ea3544ab2dfc74cd8f258a
The S2 identifier can also be specified based on a paper’s identifier from
other platforms. For example, the same paper on arXiv also has the S2 identifier
arXiv:1407.5648. The convention for different platforms is described on
the S2 API page as well as in the
documentation for api.get_paper(). In fact, we will use this method
to scrape the paper above with the two different identifiers and show that
they indeed give the same paper.
import s2
pid = "8d8844106e7bc83d49ea3544ab2dfc74cd8f258a"
pid2 = "arXiv:1407.5648"
paper = s2.api.get_paper(paperId=pid)
paper2 = s2.api.get_paper(paperId=pid2)
assert paper == paper2
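The two identifier forms can also be built programmatically. The sketch below is not part of the library: `s2_id_from_url` and `arxiv_to_s2_id` are hypothetical helpers, the URL is made up for illustration, and it assumes that a bare S2 identifier is a 40-character hexadecimal string sitting in the last path segment of the paper URL.

```python
import re
from urllib.parse import urlparse

# assumption: a bare S2 paper identifier is 40 lowercase hexadecimal characters
_S2_ID = re.compile(r"[0-9a-f]{40}")


def s2_id_from_url(url):
    """Return the last path segment of `url` if it looks like an S2 id."""
    last_segment = urlparse(url).path.rstrip("/").split("/")[-1]
    if not _S2_ID.fullmatch(last_segment):
        raise ValueError(f"no S2 identifier at the end of {url!r}")
    return last_segment


def arxiv_to_s2_id(arxiv_id):
    """Prefix a bare arXiv id with 'arXiv:' as in the example above."""
    return f"arXiv:{arxiv_id}"


# hypothetical URL: the title slug is invented, only the final segment matters
url = (
    "https://www.semanticscholar.org/paper/Example-Title/"
    "8d8844106e7bc83d49ea3544ab2dfc74cd8f258a"
)
pid = s2_id_from_url(url)
pid2 = arxiv_to_s2_id("1407.5648")
```

Either string can then be passed as `paperId` to `s2.api.get_paper()`.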
Using an API Key¶
Warning
Be aware of the rate limit for the public API (100 requests per 5-minute window). Depending on the nature of your use case (e.g. research), you may apply for the Data Partners API Access to obtain an API key that lets you scrape papers at a much faster rate. If you share your code, be careful to keep the API key separate.
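One way to stay under the public limit is to throttle calls on the client side. The following is a minimal sketch and not something the library provides: `SlidingWindowLimiter` is a hypothetical helper that allows at most `max_calls` calls in any `period` seconds, with defaults taken from the limit quoted above.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Hypothetical helper: allow at most `max_calls` per `period` seconds."""

    def __init__(self, max_calls=100, period=300.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of calls still inside the window

    def wait(self):
        """Sleep if the window is full, record one call, return seconds slept."""
        now = self._clock()
        # forget calls that have already left the window
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        slept = 0.0
        if len(self._calls) >= self.max_calls:
            # wait until the oldest recorded call leaves the window
            slept = self.period - (now - self._calls[0])
            self._sleep(slept)
            now = self._clock()
            self._calls.popleft()
        self._calls.append(now)
        return slept


limiter = SlidingWindowLimiter()  # 100 calls per 300 seconds
# call limiter.wait() immediately before each s2.api.get_paper(...) request
```

The clock and sleep functions are injectable only so that the behaviour can be checked without real waiting.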
If you have an API key, you can supply it in one of two ways.
Using the api_key argument¶
paper = s2.api.get_paper(paperId=pid, api_key=API_KEY)
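To keep the key out of shared code, one option is to read it from the environment. This is only a sketch: `load_api_key` is a hypothetical helper and the variable name `S2_API_KEY` is an assumption, not a convention defined by the library.

```python
import os


def load_api_key(var_name="S2_API_KEY"):
    """Return the key stored in the environment, or None if it is unset or blank."""
    key = os.environ.get(var_name, "").strip()
    return key or None


API_KEY = load_api_key()
# when API_KEY is not None, pass it as shown above:
# paper = s2.api.get_paper(paperId=pid, api_key=API_KEY)
```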
Using a custom requests.Session¶
from requests import Session
session = Session()
# update() keeps the session's default headers instead of replacing them
session.headers.update({'x-api-key': API_KEY})
paper = s2.api.get_paper(paperId=pid, session=session)
Note
Passing an API key through the api_key argument will
temporarily overwrite a key stored in session for that request.
However, the session object itself will remain unchanged.
The same approaches can be used for the api.get_author() function
covered below.
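The precedence rule in the note can be pictured with plain dictionaries. This is an illustration of the described behaviour only, not the library's code: `effective_headers` is a hypothetical function, and the header name is taken from the session example above.

```python
def effective_headers(session_headers, api_key=None):
    """Return the headers used for one request without touching the session's."""
    headers = dict(session_headers)  # work on a copy
    if api_key is not None:
        headers["x-api-key"] = api_key  # per-request key wins
    return headers


session_headers = {"x-api-key": "session-key"}
request_headers = effective_headers(session_headers, api_key="request-key")
```

After the call, `request_headers` carries the per-request key while `session_headers` still holds the original one.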
Working Locally with s2.store¶
The s2.store API makes it easy to save and retrieve your S2Paper
and S2Author objects through a dict-like interface.
from s2.store import JsonDS
# path of directory where S2Papers will be saved as jsons
s2paper_json_dir = "pds"
# if the directory does not exist, it is created
# otherwise, previously saved S2Papers become accessible
pds = JsonDS.load_papers(s2paper_json_dir)
# let's save Bill's papers from the previous example
for p in papers:
    pds[p.paperId] = p
# now let's delete pds and recover Bill's papers
del pds
pds = JsonDS.load_papers(s2paper_json_dir)
for p in papers:
    p2 = pds[p.paperId]
    assert p2 == p
# we can do the same for S2Author objects
ads = JsonDS.load_authors("ads")
ads[author.authorId] = author
# note that setting a value requires the key to be equal to the
# S2 identifier of the object, but this behaviour can be disabled
ads = JsonDS.load_authors("ads", enforce_id=False)
ads["billy"] = author
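To make the idea of a dict-like datastore backed by a directory concrete, here is a toy stand-in. It is not how `JsonDS` is implemented: `TinyJsonStore` is hypothetical, stores plain dictionaries rather than S2Paper objects, and uses an `id_field` argument where the library uses `enforce_id`.

```python
import json
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


class TinyJsonStore(MutableMapping):
    """Hypothetical sketch: one JSON file per key inside `directory`."""

    def __init__(self, directory, id_field=None):
        self.dir = Path(directory)
        self.dir.mkdir(parents=True, exist_ok=True)  # created if missing
        self.id_field = id_field  # when set, keys must equal value[id_field]

    def _path(self, key):
        return self.dir / f"{key}.json"

    def __setitem__(self, key, value):
        if self.id_field is not None and value.get(self.id_field) != key:
            raise ValueError(f"key {key!r} does not match {self.id_field!r}")
        self._path(key).write_text(json.dumps(value), encoding="utf-8")

    def __getitem__(self, key):
        path = self._path(key)
        if not path.exists():
            raise KeyError(key)
        return json.loads(path.read_text(encoding="utf-8"))

    def __delitem__(self, key):
        path = self._path(key)
        if not path.exists():
            raise KeyError(key)
        path.unlink()

    def __iter__(self):
        return (p.stem for p in self.dir.glob("*.json"))

    def __len__(self):
        return sum(1 for _ in self.dir.glob("*.json"))


with tempfile.TemporaryDirectory() as tmp:
    store = TinyJsonStore(tmp, id_field="paperId")
    store["abc"] = {"paperId": "abc", "title": "Example"}
    del store  # drop the in-memory object; the file stays on disk
    store = TinyJsonStore(tmp, id_field="paperId")
    recovered = store["abc"]
```

Because every value lives in its own file, reopening the directory makes previously saved entries accessible again, which mirrors the save, delete and reload steps shown above.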
Saving Objects without S2 Identifiers¶
Sometimes, an S2Reference object may not have a paperId value if
you are using include_unknown_references=True.
In this case, you may still want to save it (e.g. to attempt recovering it
via different methods at a later date). To do this, you can cast it to
S2Paper and create a unique placeholder ID:
import hashlib
import s2
from s2.models import S2Paper
from s2.store import JsonDS
# note that enforce_id=False is not necessary
pds = JsonDS.load_papers("pds")
# let's hunt ourselves an unknown reference from Bill's paper
paper = s2.api.get_paper(
    "bdfa1a62c964f19b5ce000d7812ba9f66456a4a4",
    params=dict(include_unknown_references=True),
)
for r in paper.references:
    if not r.paperId:
        break
# create a 40-char key from the hashed content and a signpost prefix
# named "digest" to avoid shadowing the built-in hash()
digest = hashlib.md5(r.json().encode("utf-8")).hexdigest()
placeholder_id = f"unknown_{digest}"
pds[placeholder_id] = S2Paper(**r.dict())
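The length of the placeholder key can be checked without touching the API: an MD5 hex digest is 32 characters, and the `unknown_` prefix adds 8, giving 40. The sketch below uses a made-up dictionary in place of a real S2Reference, and `make_placeholder_id` is a hypothetical helper; serialising with `sort_keys=True` is an added choice so that the same content always yields the same key.

```python
import hashlib
import json


def make_placeholder_id(record, prefix="unknown_"):
    """Build a deterministic key from the record's JSON content."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return prefix + hashlib.md5(payload).hexdigest()


# invented record standing in for a reference that has no paperId
reference = {"paperId": None, "title": "An untracked technical report", "year": 1999}
key = make_placeholder_id(reference)
```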
Citation Graphs with s2.graph¶
The s2.graph API makes it easy to construct citation graphs.
from s2.store import JsonDS
from s2.graph import S2Graph, S2GraphBuilder, MaxPaperHopper
# define the root paper id from which you will construct the graph
paper_id = 'bdfa1a62c964f19b5ce000d7812ba9f66456a4a4'
# create an empty graph with a new JsonDS datastore
# note that by default, S2Graph will use dictionaries which are faster
# but have a larger memory footprint and need to be saved periodically.
graph = S2Graph(papers=JsonDS.load_papers('graph_papers'))
# create a GraphHopper to obtain the neighborhood of a paper;
# MaxPaperHopper(10) will get 10 papers in a breadth-first search
# of the paper's citation network (including the root paper itself).
hopper = MaxPaperHopper(10)
# create the GraphBuilder object and build your graph
# if it is interrupted (e.g. from an error or keyboard interrupt)
# then progress is automatically saved to ``save_path``
builder = S2GraphBuilder(graph=graph, hopper=hopper, save_path='graph.pkl')
builder.from_paper_id(paper_id)
builder.save()
# Note: when a paper is added, all of its neighbours are also added
# to the graph, but not their outgoing edges. This means that you are
# actually scraping thousands of papers, so feel free to keyboard interrupt
# after a few papers, as the builder will automatically save.
builder.load('graph.pkl')
builder.graph.edges[paper_id]
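The breadth-first behaviour described for `MaxPaperHopper(10)` can be pictured on a toy graph. The sketch below is a generic capped breadth-first traversal over an invented adjacency dictionary; `bfs_limited` is hypothetical and is not the library's hopper implementation.

```python
from collections import deque


def bfs_limited(neighbours, root, max_papers):
    """Return up to `max_papers` node ids in breadth-first order, root first."""
    visited = [root]
    seen = {root}
    queue = deque([root])
    while queue and len(visited) < max_papers:
        current = queue.popleft()
        for nxt in neighbours.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                visited.append(nxt)
                queue.append(nxt)
                if len(visited) >= max_papers:
                    break  # cap reached, stop expanding
    return visited


# invented citation network: each key lists the papers it links to
toy_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": [],
}
first_four = bfs_limited(toy_graph, "A", 4)  # ['A', 'B', 'C', 'D']
```

The root counts towards the cap, matching the comment above that the ten papers include the root paper itself.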