Appendices
Appendix 1 – Description of the data collection process
PIPELINE 1 Flow Diagram
This pipeline uses the PUBMED search engine, which is part of NCBI’s Entrez System, to retrieve all relevant citations for a given keyword.
Step 1: Retrieving all PubMed Ids from PubMed for each keyword by using the
The program used is PubmedIDs.java; it takes a .txt file as a parameter (the .txt file should contain the keywords for each disease) and returns a directory, which contains one .txt file for each keyword and a summary.txt. Each of these .txt files contains a keyword and its related PubMed Ids.
The call used to get the PubMed Ids is given below:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db="+dbName+"&term="+keyWord+"&datetype=edat&retmax="+retMax+"&usehistory=y
This call takes three parameters: the database name, the keyword, and the maximum number of records to be retrieved. The database name here is pubmed, and the keyword values are taken from the input file. The value of retMax is obtained by first executing the above URL for each keyword and storing the count value from the XML output.
The parameters datetype and usehistory are fixed at “edat” and “y”.
The Total Records obtained and total time taken to retrieve data for each disease
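The esearch call quoted in Step 1 can be illustrated with a minimal Java sketch. It is not the code of PubmedIDs.java, which is not reproduced in this appendix; the class and method names are hypothetical, and it assumes that the XML output reports the total in a Count element. The keyword is concatenated as given, as in the quoted call, so any URL encoding is left to the caller.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative and not taken from PubmedIDs.java.
final class EsearchUrlSketch {
    static final String BASE =
            "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";

    // Builds the esearch call; datetype and usehistory are fixed values.
    static String buildUrl(String dbName, String keyWord, int retMax) {
        return BASE + "?db=" + dbName + "&term=" + keyWord
                + "&datetype=edat&retmax=" + retMax + "&usehistory=y";
    }

    // Reads the first <Count> element of an esearch XML response.
    // Returns -1 when no such element is present (assumed error convention).
    static int parseCount(String xml) {
        Matcher m = Pattern.compile("<Count>(\\d+)</Count>").matcher(xml);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        // Stand-in for the XML returned by a first call made for one keyword.
        String xml = "<eSearchResult><Count>42</Count></eSearchResult>";
        int retMax = parseCount(xml);
        // The second call then asks for all records found for that keyword.
        System.out.println(buildUrl("pubmed", "diabetes", retMax));
    }
}
```

The two-pass design mirrors the description above: the count from a first response becomes the retMax of the call that retrieves the full set of Ids.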
Step 2: Removing duplicates from the collected PubMed Ids.
Implementation of Step 2:
The program used is CreateUnique.java; it takes a .txt file as a parameter (the .txt file contains the PubMed Ids for all the keywords) and returns a .txt file that contains the unique set of PubMed Ids. The Total Records obtained and total time taken to create
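The duplicate-removal step can be sketched as follows. This is an illustration only, not the code of CreateUnique.java; the class name is hypothetical, and it assumes that the Ids have already been read into a list of strings, one Id per element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not the actual CreateUnique.java.
final class UniqueIdsSketch {
    // Returns the distinct, non-blank Ids in the order they were first seen.
    static List<String> unique(List<String> ids) {
        Set<String> seen = new LinkedHashSet<>();
        for (String id : ids) {
            String trimmed = id.trim();
            if (!trimmed.isEmpty()) {
                seen.add(trimmed); // a set silently ignores repeated Ids
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Ids collected for several keywords, with repeats across keywords.
        List<String> collected = Arrays.asList("101", "102", "101", "103", "102");
        System.out.println(unique(collected));
    }
}
```

A LinkedHashSet keeps the first-seen order, which makes the output file easier to compare with the per-keyword files.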
PIPELINE 2
This pipeline uses the PUBMED search engine, which is part of NCBI’s Entrez System, to retrieve all relevant citations for a given keyword.
Step 1: Retrieving all PubMed Ids from PubMed for all keywords at once. This gives a unique set of PubMed Ids for all the keywords.
Implementation of Step 1:
The program used is PubmedIDs1.java; it takes a .txt file as a parameter (the .txt file should be modified to contain the keywords for each disease, with all white spaces replaced with “%20” and all paragraph marks replaced with “+or+”) and returns a directory, which contains UniquePubmed.txt and summary.txt.
The call used to get the PubMed Ids is given below:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db="+dbName+"&term="+keyWord+"&datetype=edat&retmax="+retMax+"&usehistory=y
This call takes three parameters: the database name, the keyword, and the maximum number of records to be retrieved. The database name here is pubmed, and the keyword value is taken from the input file. The value of retMax is obtained by first executing the above URL for each keyword and storing the count value from the XML output.
The parameters datetype and usehistory are fixed at “edat” and “y”.
The Total Records obtained and total time taken to retrieve data for each disease
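The input-file modification required by PubmedIDs1.java was done by editing the .txt file. The same transformation could be sketched in Java as below; the class and method names are hypothetical, and the sketch assumes one keyword per line of the original file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the appendix describes this edit as a manual step.
final class CombinedKeywordSketch {
    // Replaces spaces inside each keyword with "%20" and joins the
    // keywords with "+or+", giving one combined search term.
    static String combine(List<String> keywords) {
        List<String> encoded = new ArrayList<>();
        for (String keyword : keywords) {
            String trimmed = keyword.trim();
            if (!trimmed.isEmpty()) {
                encoded.add(trimmed.replace(" ", "%20"));
            }
        }
        return String.join("+or+", encoded);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Type II diabetes", "Tumor");
        System.out.println(combine(lines));
    }
}
```

Because the combined term is sent as a single OR query, the search engine itself returns each PubMed Id once, so no separate duplicate-removal step is needed in this pipeline.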
PIPELINE 3
This pipeline uses the OMIM and PUBMED search engines, which are part of NCBI’s Entrez System, to retrieve all relevant citations for a given keyword.
Step 1: Use the OMIM search engine to query a given keyword. We obtain a set of OMIM Ids for that keyword. We can obtain OMIM Ids in two ways: one is by querying OMIM one keyword at a time (this process involves duplicates), and the other is by sending all keywords as a single keyword to OMIM.
Implementation of Step 1:
First Method
The program used is OmimIds.java (this program removes duplicates by using CreateUnique.java); it takes a .txt file as a parameter (the .txt file should contain the keywords for each disease) and returns three files: two .txt files and one summary.txt file. The first .txt file contains the keywords and their related OMIM Ids. The second .txt file contains the unique set of OMIM Ids.
Second Method
The program used is OmPubPipeline.java; it takes a unique_omim_***.txt file as input (this file should contain the unique set of OMIM Ids) and returns an omim_***_link.txt along with other .txt files. This omim_***_link.txt file contains each OMIM Id and its related PubMed Ids. The call used for this purpose is
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom="+toDb+"&db=pubme
This call takes two parameters, dbfrom and id. The dbfrom is set to “omim”, db is “pubmed”, and id takes its values from the input file.
First Method (Before and After Removing Duplicates)
Step 2: Retrieve linked PUBMED citations for each OMIM entry obtained in Step 1 by
The program used is OmPubPipeline.java; it takes a unique_omim_***.txt file as input (this file should contain the unique set of OMIM Ids) and returns an omim_***_link.txt along with other .txt files. This omim_***_link.txt file contains each OMIM Id and its related PubMed Ids. The call used for this purpose is
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom="+toDb+"&db=pubme
This call takes two parameters, dbfrom and id. The dbfrom is set to “omim”, db is “pubmed”, and id takes its values from the input file.
The Total Records obtained and total time taken to retrieve data for each disease
PIPELINE 4
This pipeline uses the OMIM and PUBMED search engines, which are part of NCBI’s Entrez System, to retrieve all relevant citations for a given keyword.
Step 1: Use the OMIM search engine to query a given keyword. We obtain a set of OMIM Ids for that keyword. We already have the OMIM Ids from Step 1 of Pipeline 3. Those OMIM Ids are used.
Step 2: Extract linked PUBMED citations for each OMIM entry obtained in Step 1 by
Implementation of Step 2:
The program used is OmPubPipeline.java; it takes a unique_omim_***.txt file as input (this file should contain the unique set of OMIM Ids) and returns an omim_***_parse.txt along with other .txt files. This omim_***_parse.txt file contains each OMIM Id and its related PubMed Ids. The call used for this purpose is
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=omim&dopt=xml&tmpl=dispomimTemplate&list_uids="+keyWord
This call takes one parameter, id, which takes its values from the input file. The default values are db, dopt, tmpl, and cmd, which have the values of “omim”, “xml” and
The Total Records obtained and total time taken to retrieve data for each disease
PIPELINE 5
This pipeline uses the OMIM and PUBMED search engines, which are part of NCBI’s Entrez System, to retrieve all relevant citations for a given keyword.
Step 1: Use the OMIM search engine to query a given keyword. We obtain a set of OMIM Ids for that keyword. We already have the OMIM Ids from Step 1 of Pipeline 3. Those OMIM Ids are used.
Step 2: Retrieve linked PUBMED citations for all OMIM entries obtained in Step 1 by sending all entries as a single keyword to OMIM.
Implementation of Step 2:
The program used is OmPubPipeline.java; it takes a *.txt file as input (this file should contain the unique set of OMIM Ids. It has to be modified such that there are no more than 1,161 records in each line, because OMIM would not accept more than 1,161 records at one time. These records should be separated by “,”) and returns an omim_***_all.txt. This omim_***_all.txt file contains the PubMed Ids. The call used for this purpose is
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom="+toDb+"&db=pubmed&id="+queryStr
This call takes three parameters: dbfrom, db and id. The dbfrom is set to “omim”, db to “pubmed”, and id takes its values from the input file. The Total Records obtained and total time taken to retrieve data for each disease condition using this call is
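The batching of the input file for Pipeline 5 can be sketched in Java as below. This is an illustration under stated assumptions rather than the code of OmPubPipeline.java: the class and method names are hypothetical, the Ids are assumed to be already unique, and the limit of 1,161 records per request is the one reported above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not the actual OmPubPipeline.java.
final class IdBatchSketch {
    // Limit per request reported in the text for OMIM.
    static final int OMIM_LIMIT = 1161;

    // Splits the Ids into comma-separated lines of at most maxPerLine Ids.
    static List<String> toBatches(List<String> ids, int maxPerLine) {
        if (maxPerLine <= 0) {
            throw new IllegalArgumentException("maxPerLine must be positive");
        }
        List<String> lines = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += maxPerLine) {
            int end = Math.min(start + maxPerLine, ids.size());
            lines.add(String.join(",", ids.subList(start, end)));
        }
        return lines;
    }

    // Builds the elink call for one batch, following the URL quoted above.
    static String buildUrl(String dbFrom, String queryStr) {
        return "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom="
                + dbFrom + "&db=pubmed&id=" + queryStr;
    }

    public static void main(String[] args) {
        // Placeholder Ids, for illustration only.
        List<String> omimIds = Arrays.asList("1", "2", "3", "4", "5");
        // A small limit is used here so that the split is visible.
        for (String line : toBatches(omimIds, 2)) {
            System.out.println(buildUrl("omim", line));
        }
    }
}
```

In practice the limit passed in would be OMIM_LIMIT, so one request is issued for each line of up to 1,161 Ids.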
PIPELINE 6 Flow Diagram
This pipeline uses the OMIM, NUCLEOTIDE and PUBMED search engines, which are part of NCBI’s Entrez System, to retrieve all relevant citations for a given keyword.
Step 1: Use the OMIM search engine to query a given keyword. We obtain a set of OMIM entries for that keyword. We already have the OMIM Ids from Step 1 of Pipeline 3. Those OMIM Ids are used.
Step 2: Extract NUCLEOTIDE links for each OMIM entry obtained above by using the Nucleotide links option.
Implementation of Step 2:
The program used is OmNuPipeline.java; it takes a unique_omim_***.txt file as input (this file should contain the unique set of OMIM Ids) and returns an omim_nucleotide.txt along with an omim_nucleotide_unique.txt file (this is created by using CreateUnique.java). This omim_nucleotide_unique.txt file contains the unique NUCLEOTIDE Ids for the OMIM Ids collected in Step 1. The call used for this purpose is
http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=omim&db="+toDb+"&i
This call takes two parameters, db and id. The dbfrom is set to “omim”, db to “nucleotide”, and id takes its values from the input file. The Total Records obtained and total time taken to retrieve data for each disease
Step 3: Retrieve linked PUBMED citations for each NUCLEOTIDE entry obtained in
Implementation of Step 3:
The unique records obtained from Step 2 are now taken as input, and the program returns a file diabetes_nucleotide_link.txt. This ***_nucleotide_link.txt file contains the OMIM ID and its related PubMed Ids. The call used here is
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom="+toDb+"&db=pubme
This call takes two parameters, dbfrom and id. The dbfrom is set to “nucleotide”, db is “pubmed”, and id takes its values from diabetes_nucleotide_unique.txt.
The Total Records obtained and total time taken to retrieve data for each disease
PIPELINE 7
This pipeline uses the OMIM, NUCLEOTIDE and PUBMED search engines, which are part of NCBI’s Entrez System, to retrieve all relevant citations for a given keyword.
Step 1: Use the OMIM search engine to query a given keyword. We obtain a set of OMIM entries for that keyword. We already have the OMIM Ids from Step 1 of Pipeline 3. Those OMIM Ids are used.
Step 2: Extract NUCLEOTIDE links for each OMIM entry obtained above by using the Nucleotide links option. We have the Nucleotide Ids from Step 2 of Pipeline 6.
Step 3: Extract linked PUBMED citations for each NUCLEOTIDE entry obtained in Step 2 by parsing each Nucleotide record.
Implementation of Step 3:
The unique records obtained from Step 2 are now taken as input, and the program returns a file diabetes_nucleotide_parse.txt. This ***_nucleotide_parse.txt file contains the Keyword, the Nucleotide Id and its related PubMed Ids.
"http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db="+toDb+"&list_uids="+keyWord.substring(1)+"&dopt=xml&term=&qty=1"
This call takes two parameters, “db” and “list_uids”. The list_uids takes its values from the input file, and db is nucleotide. The parameters dopt and cmd are fixed at “xml” and “Retrieve”. The Total Records obtained and total time taken to retrieve data for each disease
PIPELINE 8
This pipeline uses the OMIM, NUCLEOTIDE and PUBMED search engines, which are part of NCBI’s Entrez System, to retrieve all relevant citations for a given keyword.
Step 1: Use the OMIM search engine to query a given keyword. We obtain a set of OMIM entries for that keyword. We already have the OMIM Ids from Step 1 of Pipeline 3. Those OMIM Ids are used.
Step 2: Extract NUCLEOTIDE links for each OMIM entry obtained above by using the Nucleotide links option. We have the Nucleotide Ids from Step 2 of Pipeline 6.
Step 3: Extract linked PUBMED citations for all NUCLEOTIDE entries obtained in Step
The program used is OmNuPipeline.java; it takes a *.txt file as input (this file should contain the unique set of OMIM ids. It has to be modified such that there are no more than 600 records in each line, because NUCLEOTIDE would not accept more than 600 records at one time. These records should be separated by “,”) and returns an omim_***_all.txt. This omim_***_all.txt file contains the PubMed Ids. The call used for this purpose is
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom="+toDb+"&db=pubme
This call takes three parameters: dbfrom, db and id. The dbfrom is set to “nucleotide”, db to “pubmed”, and id takes its values from the input file. The Total Records obtained and total time taken to retrieve data for each disease
PIPELINE 9 Flow Diagram
This pipeline uses the OMIM, PROTEIN and PUBMED search engines, which are part of NCBI’s Entrez System, to retrieve all relevant citations for a given keyword.
Step 1: Use the OMIM search engine to query a given keyword. We obtain a set of OMIM entries for that keyword. We already have the OMIM Ids from Step 1 of Pipeline 3. Those OMIM Ids are used.
Step 2: Extract PROTEIN links for each OMIM entry obtained above by using the Protein links option.
Implementation of Step 2:
The program used is OmPrPipeline.java; it takes a unique_omim_***.txt file as input (this file should contain the unique set of OMIM Ids) and returns an omim_protein.txt along with an omim_protein_unique.txt file (this is created by using CreateUnique.java). This omim_protein_unique.txt file contains the unique PROTEIN Ids for the OMIM Ids collected in Step 1. The call used for this purpose is
http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=omim&db="+toDb+"&id="+keyWord
This call takes two parameters, db and id. The dbfrom is set to “omim”, db to “protein”, and id takes its values from the input file. The Total Records obtained and total time taken to retrieve data for each disease condition using this call is
Step 3: Retrieve linked PUBMED citations for each PROTEIN entry obtained in Step 2
The unique records obtained from Step 2 are now taken as input, and the program returns a file diabetes_protein_link.txt. This ***_protein_link.txt file contains the OMIM ID and its related PubMed Ids. The call used here is
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom="+toDb+"&db=pubme
This call takes two parameters, dbfrom and id. The dbfrom is set to “protein”, db is “pubmed”, and id takes its values from diabetes_protein_unique.txt.
The Total Records obtained and total time taken to retrieve data for each disease condition using this call is
PIPELINE 10 Flow Diagram
This pipeline uses the OMIM, PROTEIN and PUBMED search engines, which are part of NCBI’s Entrez System, to retrieve all relevant citations for a given keyword.
Step 1: Use the OMIM search engine to query a given keyword. We obtain a set of OMIM entries for that keyword. We already have the OMIM Ids from Step 1 of Pipeline 3. Those OMIM Ids are used.
Step 2: Extract PROTEIN links for each OMIM entry obtained above by using the Protein links option. We have the Protein Ids from Step 2 of Pipeline 9.
Step 3: Extract linked PUBMED citations for each PROTEIN entry obtained in Step 2 by parsing each Protein record.
Implementation of Step 3:
The unique records obtained from Step 2 are now taken as input, and the program returns a file diabetes_protein_parse.txt. This ***_protein_parse.txt file contains the Keyword, the Protein Id and its related PubMed Ids.
"http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db="+toDb+"&list_uids="+keyWord.substring(1)+"&dopt=xml&term=&qty=1"
This call takes two parameters, “db” and “list_uids”. The list_uids takes its values from the input file, and db is protein. The parameters dopt and cmd are fixed at “xml” and “Retrieve”.
The Total Records obtained and total time taken to retrieve data for each disease condition using this call is
PIPELINE 11
This pipeline uses the OMIM, PROTEIN and PUBMED search engines, which are part of NCBI’s Entrez System, to retrieve all relevant citations for a given keyword.
Step 1: Use the OMIM search engine to query a given keyword. We obtain a set of OMIM entries for that keyword. We already have the OMIM Ids from Step 1 of Pipeline 3. Those OMIM Ids are used.
Step 2: Extract PROTEIN links for each OMIM entry obtained above by using the Protein links option. We have the Protein Ids from Step 2 of Pipeline 9.
Step 3: Extract linked PUBMED citations for all PROTEIN entries obtained in Step 2.
Implementation of Step 3:
The program used is OmPrPipeline.java; it takes a *.txt file as input (this file should contain the unique set of OMIM ids. It has to be modified such that there are no more than 600 records in each line, because PROTEIN would not accept more than 600 records at one time. These records should be separated by “,”) and returns an omim_***_all.txt. This omim_***_all.txt file contains the PubMed Ids. The call used for this purpose is
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom="+toDb+"&db=pubme
This call takes three parameters: dbfrom, db and id. The dbfrom is set to “protein”, db to “pubmed”, and id takes its values from the input file. The Total Records obtained and total time taken to retrieve data for each disease condition using this call is
Appendix 2 – Keywords
Keywords for Aging
Keywords for Cancer
Tumor
Haemangioendothelioblastoma Phaeochromocytoma
Interstitial radiation therapy Rhabdosarcoma
Keywords for Diabetes
Type II diabetes
Appendix 3 – Comparing similar links
Total Overlap Results for Diabetes
Nucleotide Protein PUBMED (number and % of) C1
Total Overlap Results for Aging
Nucleotide Protein PUBMED (Number and % of) M1
Total Overlap Results for Cancer
Nucleotide Protein PUBMED (Number and % of) M1
Appendix 4 – Comparing evaluation paths
Total overlap for Diabetes
Appendix 5 - Time (in seconds) used to collect the datasets
Appendix 6 – How to obtain data source cardinality at NCBI (from http://www.ncbi.nlm.nih.gov/Sitemap/Summary/statistics.html#Entrez DatabaseStats)
Other Entrez databases do not have explicit statistics web pages, but you can see the number of records in each Entrez database by viewing the index of the Filter field. Each database has the term "all" in its Filter field. The number in parentheses beside that term is the number of records currently present in the database. For example, to see the number of records in the NCBI Structure database, follow these steps (the links will open in a separate window):
• From the Entrez home page, follow the link for the Structure
• On the Entrez Structure database page, select Preview/Index from the grey area under the search box.
There are two search boxes on the Preview/Index page: (a) the search box near the top of the page shows the active query; (b) the search box near the bottom of the page is like a "worksheet" that allows you to browse the index of a search field of interest and/or to select one or more terms from the index for addition to your active query.
• Select the Filter field from the pop-up menu of searchable fields that is shown beside the lower search box.
• Enter "all" (without quotes) as the search term and press the Index button.
A window will appear at the bottom of the page that allows you to see your term in the index of the search field, and to browse up and down the index. (Tip: If no term is entered in the search box before pressing the "Index" button, the system will automatically take you to the first term in the index. Entering a search term simply forces the system to jump to a specific part of the index.)
• The number in parentheses beside the term "all" is the number of records currently in the structure database.