
PEPR Design Goals


The PEPR re-design and implementation goals

Second generation release, April 2004

From the laboratory's initial experience with about 4000 Affymetrix profiles, spanning both independent and referral projects (approximately 100 projects to date), we learned a great deal about the workflow and databasing of expression profiling data. We then set new goals: improving the existing workflow, enforcing data consistency, and enhancing the functionality of the current Microarray website. A major objective is to provide better research sharing, data mining and visualization (a major aim of, for example, the emerging NIH RoadMap).

Funding for the re-design and implementation of the second generation PEPR database was provided by an NIH NINDS contract for spinal cord damage models, a Department of Defense grant for public access resources, and an NIH NHGRI R21/R33 grant.

The second generation of the PEPR resource was designed to provide a streamlined and integrated workflow covering proposal submission, data generation/interpretation, expression profile publishing and data visualization. A web-based proposal submission/approval workflow engine was incorporated to replace the previous email exchange process. The new PEPR workflow engine stores all submitted and approved proposals, along with their revision history, in a central Oracle repository. This integrated design makes it considerably easier and more efficient to develop dynamic query tools aimed at specific audiences/users (see below).

Figure. Second generation PEPR workflow diagram.

The re-design described below enables rich meta-data search functions (e.g. search by experiment design type or by an animal model's age or sex); a web-interface data input system is used to capture experiment information. Unlike other profiling packages currently in use, our web-based data submission process offers great flexibility in capturing the desired experiment meta-data (e.g. the addition of an experiment design type) for analysis and visualization. It provides a mechanism to enforce data input consistency and validation, and eliminates the accessory tables and batch process currently used to filter data. The resulting data consistency in turn expands the search and visualization capabilities.
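
As a rough illustration of how a web input form could enforce consistent meta-data before a record reaches the repository, the sketch below checks a submitted record against controlled vocabularies. The class name, the field names and the allowed values are assumptions made for this example; they are not taken from the PEPR schema.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical validator; names and vocabularies are illustrative only.
class ExperimentMetadataValidator {
    // Assumed controlled vocabularies (not the actual PEPR lists).
    static final Set<String> DESIGN_TYPES =
            new HashSet<String>(Arrays.asList("time series", "dose response", "case control"));
    static final Set<String> SEXES =
            new HashSet<String>(Arrays.asList("male", "female", "unknown"));

    // Returns a list of problems; an empty list means the record is acceptable.
    static List<String> validate(String designType, String sex, double ageInWeeks) {
        List<String> errors = new ArrayList<String>();
        if (designType == null || !DESIGN_TYPES.contains(designType.trim().toLowerCase())) {
            errors.add("unknown experiment design type: " + designType);
        }
        if (sex == null || !SEXES.contains(sex.trim().toLowerCase())) {
            errors.add("unknown sex: " + sex);
        }
        if (Double.isNaN(ageInWeeks) || ageInWeeks < 0) {
            errors.add("age must be a non-negative number of weeks");
        }
        return errors;
    }

    public static void main(String[] args) {
        // A well-formed record yields no errors; a malformed one lists every problem.
        System.out.println(validate("Time Series", "female", 8.0));
        System.out.println(validate("timecourse", "F", -1.0));
    }
}
```

Rejecting free-text variants at input time is what would make later searches by design type, age or sex dependable without a separate clean-up pass.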

The Affymetrix GCOS (GeneChip Operating Software) and AADM database are provided with all Affymetrix packages. However, rather than accessing the AADM database directly, our application uses the Affymetrix GCOS and GDAC SDK (software development kit) to retrieve and parse experiment-related data (e.g. .chp and .cel files). The application preprocesses all published chip files to improve data download performance. This eliminates the existing process of transferring large sets of experiment data from the lab database to the public database: with the GCOS and GDAC SDK, only a small subset of the data is extracted and placed in the public database for analysis at any point in time. It also eliminates the AADM dependency (the application does not need to change if the AADM schema changes). Indeed, the often-changing AADM schema resulted in chronic compatibility problems with the first generation PEPR resource.

PEPR also uses our newly implemented GEO submission and update APIs to submit new experiments or to revise previously published experiment data. PEPR incorporates a custom-designed Probe Profiler API (funded by a Department of Defense grant for PEPR to Dr. Hoffman) that offers four additional data algorithms (DCHP Diff, DCHP PMOnly, RMA, and PCA) alongside the built-in MAS algorithm values for data analysis and visualization. Finally, PEPR provides off-line batch data exportation, which allows the researcher to download/export a series of large data sets while continuing to navigate the site. The .chp, .dat and .cel data files are generated during off-peak hours.

Our previous design and implementation of PEPR was supported by an NHLBI Programs in Genomic Applications grant and an NINDS Spinal Cord Trauma grant (the latter being the single NIH award for this contract). While we have only very recently reported our initial implementation of PEPR (Almon et al. 2003; Chen et al. 2004), we feel our new re-design (funded by the Department of Defense and an R21/R33 NHGRI grant) makes substantial improvements over our previous version, and over any other dynamic query resource for massively parallel and multi-dimensional biological datasets available elsewhere.

The major improvements of PEPR over the previous application include:

  • proposal submission/approval workflow
  • expanded search
  • expanded data visualization
  • data retrieval preprocess through GCOS and GDAC SDK
  • GEO publishing addition and update
  • off-line batch data exportation

The major benefits of PEPR over the previous application:

  • The workflow and central repository improve collaboration between researchers and investigators.
  • Enhanced search features offer better data sharing and navigation.
  • Enhanced visualization offers better assistance to researchers.
  • Use of the GCOS and GDAC SDK eliminates the AADM dependency.
  • GEO publishing update completes the existing GEO publishing process (experiment addition and modification) through a browser-based interface, empowering scientists to manage their own experiment data.
  • Off-line batch data exportation provides faster system response to researchers.
  • Data validation and consistency make database maintenance and operation easier.
  • The OOD technology implementation makes maintenance and future enhancement easier.

The PEPR process architecture design and implementation

PEPR is a three-tier Java enterprise application, composed of a Web Tier, a Middle Tier and a Back-End Tier. A schematic of the overall design is provided on the next page of this text. Note that the current version of PEPR (http://microarray.cnmcresearch.org) (Chen et al. 2004) will be replaced with the version described below, at http://pepr.cnmcresearch.org, over the next few weeks (prior to the meeting of the study section).

Web Tier

The Web Tier includes a web server, a Tomcat application server and various web components that provide front-end functionality such as navigation, data browsing, data searching, project submission, project publishing, the gene query tool and user notification. Most of the web components interface transparently with the PEPR back-end databases. This tier's interface also allows users to trigger the Middle Tier applications.

Middle Tier

The Middle Tier integrates several third-party services: some are enterprise versions of pre-existing software that we purchased, and others were written or contracted specifically for PEPR (Popchart, Lucene, Affymetrix SDK and Corimbia Probe Profiler SDK). It is designed to handle time-consuming processes, such as Affymetrix data extraction and offline data downloading, while allowing the user to navigate the site without waiting for the process to complete. The Middle Tier applications require intensive computing resources and are responsible for chart visualization generation, offline data download, metadata indexing for keyword search, NCBI GEO data submission, Affymetrix data file extraction and transformation, and generation of data from the Probe Profiler mixture of algorithms.
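
To make the idea of metadata indexing for keyword search concrete, here is a toy inverted index in plain Java. It is only a stand-in for what a search library such as Lucene does at much larger scale; it does not use the Lucene API, and the class name, method names and sample descriptions are invented for this sketch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index; hypothetical names, not the PEPR or Lucene implementation.
class MetadataKeywordIndex {
    // Maps each lower-cased token to the identifiers of experiments that mention it.
    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

    // Index the free-text description of one experiment under its identifier.
    void add(String experimentId, String description) {
        for (String token : description.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            Set<String> ids = postings.get(token);
            if (ids == null) {
                ids = new TreeSet<String>();
                postings.put(token, ids);
            }
            ids.add(experimentId);
        }
    }

    // Case-insensitive lookup of a single keyword.
    Set<String> search(String keyword) {
        Set<String> ids = postings.get(keyword.toLowerCase());
        return ids == null ? Collections.<String>emptySet() : ids;
    }

    public static void main(String[] args) {
        MetadataKeywordIndex index = new MetadataKeywordIndex();
        index.add("exp-1", "Mouse spinal cord injury, time series");
        index.add("exp-2", "Human muscle biopsy time series");
        System.out.println(index.search("series"));
        System.out.println(index.search("spinal"));
    }
}
```

Building the index ahead of time moves the cost of scanning descriptions out of the request path, which fits the role the text assigns to the Middle Tier.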

Most of the processes in this tier do not require a synchronous response to the PEPR front-end. In addition to conventional click-and-wait web application features, PEPR allows a user to submit a request without waiting for the process to complete, while still guaranteeing that the process will be completed. To achieve this asynchronous operation reliably, an Open JMS queue server was introduced in the PEPR implementation. JMS is designed to handle message delivery between components. When a user submits a request to download a large set of data in PEPR, a web component in the Tomcat application server packages the user's request into a message and drops the message into the JMS Queue. The JMS Queue is responsible for receiving and delivering the message: it acts as a specialized router that looks at the message's address and delivers it to the appropriate party (e.g. the Offline Data Download process in the chart). The Offline Data Download process then parses and handles the download request. It searches for and compresses the requested data, and then sends the download URL notification to the user. During this process, the user does not have to wait for the lengthy file compression to complete. The JMS Queue is what makes the batch download possible.
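
The request flow described above can be sketched with an in-process queue. This is a simplified stand-in that uses java.util.concurrent rather than a real JMS provider, so it has no persistence or message addressing; the class, the field names, the e-mail address and the URL pattern are all hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// In-process stand-in for the JMS queue; all names here are hypothetical.
class OfflineDownloadSketch {
    // Message that the web component would drop into the queue.
    static final class DownloadRequest {
        final String userEmail;
        final String projectId;
        DownloadRequest(String userEmail, String projectId) {
            this.userEmail = userEmail;
            this.projectId = projectId;
        }
    }

    final BlockingQueue<DownloadRequest> requests = new LinkedBlockingQueue<DownloadRequest>();
    final BlockingQueue<String> notifications = new LinkedBlockingQueue<String>();

    // Web tier side: enqueue the request and return immediately.
    void submit(String userEmail, String projectId) {
        requests.add(new DownloadRequest(userEmail, projectId));
    }

    // Middle tier side: a background worker that handles requests one by one.
    Thread startWorker() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    DownloadRequest r = requests.take(); // blocks until a message arrives
                    // Placeholder for the search-and-compress step.
                    String url = "/downloads/" + r.projectId + ".zip";
                    notifications.add(r.userEmail + " -> " + url);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop quietly on shutdown
            }
        });
        worker.setDaemon(true);
        worker.start();
        return worker;
    }

    // Waits up to the given number of seconds for the next notification, or returns null.
    String awaitNotification(long seconds) {
        try {
            return notifications.poll(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        OfflineDownloadSketch sketch = new OfflineDownloadSketch();
        sketch.startWorker();
        sketch.submit("user@example.org", "project-42"); // returns at once
        System.out.println(sketch.awaitNotification(5));
    }
}
```

The point of the design is that `submit` never waits on the worker: the caller is free as soon as the message is queued, and the notification arrives later by a separate path.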

The importance of the PEPR JMS Queue service:

  • Asynchronous communication: The JMS Queue serves as an asynchronous communication channel between the Web Tier and the Middle Tier components. When a PEPR administrator issues a GDAC data export command, the interface drops the message into the JMS Queue, which triggers the Affymetrix GDAC process; the process then loads data into the PEPR database while the administrator continues to perform other tasks.
  • Reliable messaging communication: The JMS Queue stores all messages persistently in the Oracle database. If the Middle Tier processes shut down because of an unexpected software failure, the JMS Queue continues to store and buffer the messages delivered from the Tomcat application server, and then delivers the stored messages to the appropriate process when the Middle Tier applications restart. This persistence gives PEPR high availability.
  • Distributed computing: The Probe Profiler API process requires intensive computing resources. PEPR uses the JMS Queue to distribute the computing load to a different server: the queue communicates remotely with the Probe Profiler API process (residing on CRI7), which receives the messages and starts its own calculation.
  • Sequence process control: The Probe Profiler API is designed on a single-thread model and can process only one request at a time. If more than one Probe Profiler process were triggered at the same time, the second request would be dropped. The JMS Queue guarantees the arrival of each message and delivers messages sequentially to the Probe Profiler API process on a first-come, first-served basis.
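
The sequence-control point can be illustrated with a single-threaded executor, which queues concurrent requests and runs them strictly in submission order instead of dropping them. This is a sketch of the general pattern, not of the Probe Profiler API or a JMS provider; the class name and the experiment identifiers are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: serialize requests to a worker that handles one job at a time.
class SequentialProfilerQueue {
    private final ExecutorService single = Executors.newSingleThreadExecutor();
    private final List<String> completed =
            Collections.synchronizedList(new ArrayList<String>());

    // Requests are queued rather than dropped, and run in submission order.
    void request(final String experimentId) {
        single.submit(new Runnable() {
            public void run() {
                // Placeholder for the single-threaded algorithm run.
                completed.add(experimentId);
            }
        });
    }

    // Waits for queued jobs to finish and returns them in completion order.
    List<String> drain() {
        single.shutdown(); // already-queued jobs still run to completion
        try {
            single.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<String>(completed);
    }

    public static void main(String[] args) {
        SequentialProfilerQueue queue = new SequentialProfilerQueue();
        queue.request("exp-1");
        queue.request("exp-2");
        queue.request("exp-3");
        System.out.println(queue.drain());
    }
}
```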

Figure. PEPR architecture.

 

Back-End Tier

The Back-End Tier is composed of two databases: the PEPR DB and the Affymetrix LIMS DB. The PEPR DB stores the metadata of projects and experiments, along with the associated analysis values, for real-time data mining. The Affymetrix LIMS DB stores all Affymetrix expression profiling physical data and chip processing information.

We do not have adequate space to describe all the interfaces of PEPR; however, we provide a screen snapshot of one interface (see following page). In this instance, we show a dynamic query of a 27-time-point time series project (see Zhao et al. 2002, 2003, 2004; Almon et al. 2003; Chen et al. 2004) (note that only 16 time points are shown in this example).

As can be seen, there are different specialized interfaces for the different types of users (left menu bars). Here, a web-based user has used a genome-browser type function to identify genes in the genome matching a query (e.g. “myogenic”), then used drop-down menus to select the specific gene and probe set to visualize. Multiple genes can be sent to be co-graphed; here, two myogenic factor genes were selected. The user can then define the probe set algorithms to be visualized; here the user selected four of the available probe set algorithms. This dynamic query tool then extracts all data from the profiles, visualizes replicates, derives the averages of the replicates on the fly, graphs the genes relative to each other, provides mouse-overs showing all data behind each data point (including fold-change relative to time 0), and offers downloadable spreadsheets containing all data in the selected graphs. As can also be seen, different probe set algorithms provide quite different interpretations, as we have previously reported (Seo et al. 2003).
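
The on-the-fly arithmetic mentioned above (replicate averages and fold-change relative to time 0) might look like the following minimal sketch. It assumes that fold-change is the ratio of a time point's mean to the time-0 mean, which the text does not specify; the class name and the signal values are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the fold-change definition used here is an assumption.
class ReplicateSummary {
    // Mean of the replicate signal values at one time point.
    static double mean(double[] replicates) {
        double sum = 0.0;
        for (double v : replicates) {
            sum += v;
        }
        return sum / replicates.length;
    }

    // Fold change of each time point's mean relative to the time-0 mean.
    static Map<Integer, Double> foldChangeVsTimeZero(Map<Integer, double[]> byTime) {
        double baseline = mean(byTime.get(0));
        Map<Integer, Double> out = new LinkedHashMap<Integer, Double>();
        for (Map.Entry<Integer, double[]> e : byTime.entrySet()) {
            out.put(e.getKey(), mean(e.getValue()) / baseline);
        }
        return out;
    }

    public static void main(String[] args) {
        // Time point (hours) -> replicate signals; made-up numbers.
        Map<Integer, double[]> series = new LinkedHashMap<Integer, double[]>();
        series.put(0, new double[] {100.0, 120.0, 80.0});
        series.put(12, new double[] {210.0, 190.0});
        series.put(24, new double[] {50.0, 50.0});
        System.out.println(foldChangeVsTimeZero(series)); // prints {0=1.0, 12=2.0, 24=0.5}
    }
}
```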

The goal of the P20 application is to extend our web-based Oracle PEPR database to include clinical, SNP, and possibly plasma proteomic data for both Aims 1 and 2, and the two pilot projects (Aims 4a, 4b). As noted above, this may sound quite ambitious; however, we believe we have already demonstrated our expertise in this area and can readily extend the current design and infrastructure to accommodate extended meta-data fields.

Figure. Dynamic query web interface.

Copyright © 2006 PEPR
spacer image
  Version 2.0.0  
Site designed and built by Eric Hoffman & Josephine Chen