[IGNORE THE FOLLOWING, it was all due to my incorrect IBM float conversion]
Going over the Pfizer data dump at https://phmpt.org/pfizers-documents/, some .xpt files caught my attention. It’s not a format I’m familiar with, so I googled around and found some information on it, but no easily-runnable program that would convert them into, say, a .csv (comma-separated values) file. So I started my own, at https://github.com/jcomeauictx/xport9.
It’s still buggy; I’ve since found that https://pypi.org/project/xport/, which I had seen in my first search but hadn’t looked deeply enough into, will do the job with python3 -m xport /tmp/long.xpt /tmp/long.csv
, and the numbers come out more obviously correct than mine do, but the dates and times are all wrong. They show up as numbers rather than something recognizable as a date or time.
But in the process, I’ve discovered what seems to me to be deliberate obfuscation, possibly via running a customized copy of the SAS software. Why else would there be 3 different encodings for TIME
when one would suffice, and all 3 different from either an integer or a float, the only two “number” types I’ve been able to discover in SAS via web search? I was able to figure it out thanks to years of reverse-engineering experience, looking at the raw hexadecimal and knowing that the results have to be somewhere between the years 2019 and 2022. It also was extremely fortunate that there are also DATETIME
fields which combine the date and time, giving me a reference (once I de-obfuscated that encoding) for the time fields.
The first, and most common, is in hex 44c8dc0000000000. The 44, “D” in ASCII, is simply a “flag” byte, and the c8dc is the number of seconds in hexadecimal to add to midnight. It comes to 14:17:00. But the largest number obtainable with this encoding is 0xffff, 65535, which only brings us to 18:12:15 in the evening. So, there’s the form 45101580000000, which turns out to require shifting the 2nd through 4th bytes to the right by 4, same as dividing by 16: 0x101580 » 4 = 0x10158
, decimal 65880, which gives 18:18:00. This one form would be sufficient, but there’s even a third: 433fc00000000000, which means take the 0x3fc0 and shift to the right 4 places (or divide by 16), and add that number of seconds to midnight, giving in this case 17 minutes exactly.
As soon as I figure out what I botched in my floating point code, I should have a usable script. But I won’t be surprised if I find other attempts at obfuscation, or at minimum, unnecessarily complex encoding schemas.
To the two who "liked" this, sorry. I was wrong.
What's documented in that file?
I recently created timeseries from all the trial doses by state. Ofc the tables in the investigators pdf were all printed, despite character-based pdf's requiring less space.
There's probably a lot of interesting stuff in those files. Alas I am busy with a large number of very large datasets that need all my attention, so please do let us know what you find.