Batch Processing PDB Files

These steps are intended for use with my VR simulation system, which relies on CONECT records for all bonds and benefits from having hydrogen atoms added to the PDB files.

Step 1: Overview of accessing large numbers of PDB files

The following link gives instructions for automated downloading of PDB files, either the entire archive or selected major directories.

https://www.rcsb.org/docs/programmatic-access/file-download-services

Note: The legacy PDB format has several limitations, which is why “The primary data format for PDB data is the PDBx/mmCIF format.” See the following page for more information.

https://www.rcsb.org/docs/general-help/structures-without-legacy-pdb-format-files

Browsing the files via the web: https://files.wwpdb.org
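
For a handful of entries, a short script may be simpler than a bulk transfer. The sketch below builds the HTTPS URL for one entry and downloads it. It assumes the "divided" directory layout that is visible when browsing files.wwpdb.org, where folders are named after the two middle characters of the entry ID. The base path and both function names are assumptions for illustration, so check them against the download-services page above before relying on them.

```python
import urllib.request
from pathlib import Path

# Assumed base path of the "divided" PDB-format archive; verify before use.
BASE_URL = "https://files.wwpdb.org/pub/pdb/data/structures/divided/pdb"


def divided_pdb_url(entry_id: str) -> str:
    """Build the URL of the gzipped PDB-format file for a 4-character entry ID."""
    entry = entry_id.strip().lower()
    if len(entry) != 4 or not entry.isalnum():
        raise ValueError(f"Expected a 4-character entry ID, got {entry_id!r}")
    # Entries are grouped by their two middle characters, e.g. 20gs -> 0g
    return f"{BASE_URL}/{entry[1:3]}/pdb{entry}.ent.gz"


def download_entry(entry_id: str, out_dir: str) -> Path:
    """Download one entry into out_dir and return the path of the saved file."""
    url = divided_pdb_url(entry_id)
    target = Path(out_dir) / url.rsplit("/", 1)[-1]
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(target))
    return target
```

Calling download_entry("20gs", "downloads") would be expected to save pdb20gs.ent.gz into a folder named downloads (a hypothetical location), which can then go through the same unzip step as the snapshot files.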

Step 2: Downloading

Option 1: Normal Use

Manually download the files you wish to use from the website, or collect them from collaborators. Then place all of the files in one folder so the scripts can read them later. If the files are not compressed, you can skip the PowerShell script.

Option 2: Automated Downloading of Many Files

I chose to use the application WinSCP; however, “Please note that the FTP protocol will be phased out on November 1st 2024.”

Within WinSCP I used the host name ftp://snapshots.rcsb.org to download a snapshot.

Once connected you will see the list of snapshot folders, and you can choose to download a snapshot folder for a given file type. However, this means downloading tens of GB over a limited connection, so another method such as wget might be a better choice.

I chose to download a small number of the subfolders in the folder all_pdb_files_20210105. You may prefer to pick the most recent snapshot folder instead.

Step 3: Required Software

Software used on Windows: PowerShell, PyMOL, and Python.

Setup of PowerShell

In an administrator PowerShell window, check the execution policy:

Get-ExecutionPolicy

If it says Restricted, you will need to change it so that PowerShell scripts can run. The recommended setting is RemoteSigned. It allows scripts created on your local machine to run without a digital signature, but requires any script downloaded from the internet to be signed by a trusted publisher, so some protection is retained.

Set-ExecutionPolicy RemoteSigned

Setup of Other Software

For PyMOL, you can download the installer from their site and install it easily.

For Python, I leave the installation to you, as the process may change over time. I installed Python directly on Windows, not in a Conda environment.

Step 4: Processing, depending on your needs

The snapshots are organized into a file structure with compressed files.

In the case of the PDB snapshot, the files will look something like this:

pdb20gs.ent.gz

They need to be unzipped, and the .ent extension can be changed to .pdb.
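
On systems without PowerShell, the same unzip-and-rename step can be sketched in Python with the standard library. This is a minimal illustration and not the script used in this post; the function name gunzip_to_pdb is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_to_pdb(gz_path: str, out_dir: str) -> Path:
    """Decompress one .gz file into out_dir, renaming a .ent result to .pdb."""
    source = Path(gz_path)
    # Drop the .gz suffix to get the name of the decompressed file
    name = source.name[:-3] if source.name.endswith(".gz") else source.name
    if name.endswith(".ent"):
        name = name[:-4] + ".pdb"
    target = Path(out_dir) / name
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the data so large files are not held in memory all at once
    with gzip.open(source, "rb") as compressed, open(target, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return target
```

Unlike the PowerShell script in Step 4.1, this sketch writes every file into a single output folder instead of mirroring the snapshot's subfolders.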

Archive information: “The archival PDB files will be distributed with the reserved conventional names, in the form pdbentry_id.ent, where entry_id is a PDB 4-letter code, e.g. pdb1abc.ent, for PDB format entries; rentry_idsf.ent, e.g. r1abcsf.ent, for X-ray experimental data; entry_id.mr, e.g. 1abc.mr, for NMR experimental/constraints; entry_id.cif, e.g. 1abc.cif, for mmCIF format entries; and entry_id.xml, e.g. 1abc.xml, for canonical XML format entries.” https://www.wwpdb.org/about/faq
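
The naming conventions quoted above can be turned into a small filename classifier, which helps when a folder mixes several file types. The sketch below assumes 4-character entry IDs, as in the quote; the kind labels and the function name are made up for illustration.

```python
import re
from typing import Optional, Tuple

# One pattern per convention in the quoted FAQ text; labels are illustrative.
_PATTERNS = [
    ("pdb", re.compile(r"^pdb([0-9a-z]{4})\.ent$")),
    ("xray_data", re.compile(r"^r([0-9a-z]{4})sf\.ent$")),
    ("nmr_data", re.compile(r"^([0-9a-z]{4})\.mr$")),
    ("mmcif", re.compile(r"^([0-9a-z]{4})\.cif$")),
    ("xml", re.compile(r"^([0-9a-z]{4})\.xml$")),
]


def classify_archive_name(filename: str) -> Optional[Tuple[str, str]]:
    """Return (kind, entry_id) for a recognized archive filename, else None."""
    name = filename.lower()
    # Snapshot files are usually gzipped, so ignore a trailing .gz
    if name.endswith(".gz"):
        name = name[:-3]
    for kind, pattern in _PATTERNS:
        match = pattern.match(name)
        if match:
            return kind, match.group(1)
    return None
```

For the snapshot example above, classify_archive_name("pdb20gs.ent.gz") yields the kind "pdb" with entry ID "20gs", while an unrelated filename yields None.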

Step 4.1: Processing the PDB Format Snapshot

First we will need to create some files. Some of these may not be needed for your use case.

** WARNING: YOU WILL LIKELY NEED TO ADJUST ALL OF THE PATHS BELOW. **

You can use Ctrl + F to find them by searching for “C:\”

File 1: Unzipping PDB Snapshot

File name: unzipRename.ps1

# Add this function at the beginning of your script to handle .gz extraction
function Expand-GZipFile {
    param(
        [string]$inputPath,
        [string]$outputPath
    )
    
    $buffer = New-Object byte[](1024 * 1024) # Buffer size
    try {
        $gzipStream = New-Object System.IO.Compression.GzipStream([System.IO.File]::OpenRead($inputPath), [System.IO.Compression.CompressionMode]::Decompress)
        $outputFileStream = New-Object System.IO.FileStream($outputPath, [System.IO.FileMode]::Create)
        
        while ($true) {
            $read = $gzipStream.Read($buffer, 0, 1024 * 1024)
            if ($read -le 0) { break }
            $outputFileStream.Write($buffer, 0, $read)
        }
    } catch {
        Write-Error "Failed to process file '$inputPath': $_"
    } finally {
        if ($null -ne $gzipStream) {
            $gzipStream.Close()
        }
        if ($null -ne $outputFileStream) {
            $outputFileStream.Close()
        }
    }
}

# Set the root directory where the gz files are stored
$rootDirectory = "C:\PDB\all_pdb_files_20210106"

# Set the output directory where the unzipped files will be stored
$outDirectory = "C:\PDB\all_pdb_files_20210106_unzipped"

# Create the output directory if it does not exist
if (-not (Test-Path -Path $outDirectory)) {
    New-Item -ItemType Directory -Path $outDirectory
}

# Find all .gz files recursively in the root directory
$gzFiles = Get-ChildItem -Path $rootDirectory -Filter *.gz -Recurse

# Loop through each file found
foreach ($file in $gzFiles) {
    # Construct full paths for input and output
    $currentFile = $file.FullName
    $relativePath = $file.FullName.Substring($rootDirectory.Length)
    $destinationPath = Join-Path -Path $outDirectory -ChildPath $relativePath
    $destinationDirectory = [System.IO.Path]::GetDirectoryName($destinationPath)

    # Ensure the destination directory exists
    if (-not (Test-Path -Path $destinationDirectory)) {
        New-Item -ItemType Directory -Path $destinationDirectory
    }

    # Generate the correct destination file path by replacing .gz with an empty string
    $destinationFile = $destinationPath -replace '\.gz$', ''
    
    # Call the function to expand .gz files
    # Write-Output "Expanding $currentFile to $destinationFile..."

    Expand-GZipFile -inputPath $currentFile -outputPath $destinationFile

    # If the unzipped file has an .ent extension, rename it to .pdb
    if ($destinationFile -like '*.ent') {
        $pdbFile = $destinationFile -replace '\.ent$', '.pdb'
        # Move-Item -Force replaces a .pdb left over from an earlier run,
        # whereas Rename-Item would stop with an error
        Move-Item -LiteralPath $destinationFile -Destination $pdbFile -Force

        # Write-Output "Renamed to $pdbFile"
    }
}

Write-Output "All files extracted."

File 2: Processing the PDB Files, Adding Hydrogens and CONECT Records

File name: addHydrogenAndConnect.py

import pymol
from pymol import cmd
import os
import sys

# Open the output file in write mode
log_file = open('C:\\PDB\\pymol_output.txt', 'w')
sys.stdout = log_file
sys.stderr = log_file  # Also redirecting stderr to capture any errors

# Example of printing something immediately to test
print("\nStart of PyMOL script log")

# Initialize global counters
filesFound = 0
filesOutput = 0
filesFailed = 0

# Initialize PyMOL
pymol.finish_launching()

# Directory containing the PDB files
directory = 'C:\\PDB\\all_pdb_files_20210106_unzipped'

# Destination directory for output files
destination_directory = 'C:\\PDB\\all_pdb_files_20210106_full'

# Check and create destination directory if it does not exist
if not os.path.exists(destination_directory):
    os.makedirs(destination_directory)

# Set PDB output to include CONECT records for all bonds
cmd.set('pdb_conect_all', 'on')

# Function to recursively walk through the directory tree
def process_directory(current_directory):
    global filesFound, filesOutput, filesFailed  # Declare global to modify the counters
    for root, dirs, files in os.walk(current_directory):
        for filename in files:
            if filename.endswith(".pdb"):  # Check for PDB files
                file_path = os.path.join(root, filename)
                filesFound += 1

                try:
                    # Load inside the try block so one unreadable file does not stop the batch
                    cmd.load(file_path, 'molecule')
                    # print(f"Loaded {filename}")

                    # Add hydrogens
                    cmd.h_add('molecule')
                    # print(f"Added hydrogens to {filename}")

                    # Save the modified file with CONECT records
                    new_file_path = os.path.join(destination_directory, filename)
                    cmd.save(new_file_path, 'molecule', format='pdb')
                    # print(f"Saved {filename}")
                    filesOutput += 1
                except Exception as e:
                    print(f"Failed to process {filename}: {e}")
                    filesFailed += 1
                finally:
                    # Remove the molecule from PyMOL to clear memory for the next one
                    cmd.delete('molecule')

# Process all files in the directory tree
process_directory(directory)

# Output the results
if filesFound > 0:
    successRate = (filesOutput / filesFound) * 100
else:
    successRate = 0

print(f"Files Found: {filesFound}, Files Processed: {filesOutput}, Files Failed: {filesFailed}, Success Rate: {successRate:.1f}%")

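# Restore the standard streams and close the log file before quitting,
# because cmd.quit() can end the process before any later line runs
sys.stdout = sys.__stdout__
sys.stderr = sys.__stderr__
log_file.close()

# Quit PyMOL
cmd.quit()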

File 3: A Batch File to Run the Files Above via a PowerShell Terminal

File name: processPDB.bat

@echo off
echo Running PowerShell script...
powershell.exe -NoProfile -File "C:\PDB\unzipRename.ps1"
echo PowerShell script completed.

echo Running PyMOL script...
"C:\ProgramData\pymol\PyMOLWin.exe" -c -d "run C:\PDB\addHydrogenAndConnect.py"
echo PyMOL script completed.

echo Output from PyMOL script:
type C:\PDB\pymol_output.txt

echo ---------------------------------
echo Both scripts have been executed!
echo ---------------------------------

Step 4.2: Processing the mmCIF Format Snapshot

To be continued… It would be similar though.

Conclusion

Now the files are better suited to use in my VR simulation system.
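
A quick sanity check on the output can confirm that hydrogens and CONECT records were actually written. The sketch below counts them in a PDB file using the fixed-column layout of the PDB format (record name in columns 1 to 6, element symbol in columns 77 to 78). The function names are hypothetical, and it assumes the element columns are filled in, so treat a hydrogen count of zero as a prompt to inspect the file, not as proof that something failed.

```python
from typing import Dict, Iterable


def summarize_pdb_lines(lines: Iterable[str]) -> Dict[str, int]:
    """Count atoms, hydrogen atoms, and CONECT records in PDB-format lines."""
    counts = {"atoms": 0, "hydrogens": 0, "conect": 0}
    for line in lines:
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            counts["atoms"] += 1
            # Columns 77-78 hold the right-justified element symbol
            if line[76:78].strip().upper() == "H":
                counts["hydrogens"] += 1
        elif record == "CONECT":
            counts["conect"] += 1
    return counts


def summarize_pdb_file(path: str) -> Dict[str, int]:
    """Read one PDB file from disk and summarize its records."""
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        return summarize_pdb_lines(handle)
```

Running summarize_pdb_file on a file from the output folder and on its unprocessed counterpart should show higher hydrogen and CONECT counts for the processed one.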

Non-Automated Method

This processing can also be done in a non-automated manner if you only need one file.

  1. Download the PDB file needed
  2. Open it with PyMOL
  3. Add hydrogens
  4. File > Export Molecule…
  5. Compute atomic charges with https://acc2.ncbr.muni.cz/ [Not covered here]

Contact me if you would like to use the contents of this post. Thanks 🙂
Copyright © 2024 by Gregory Gutmann
