Reverse Engineer Apache Jackrabbit Setup

anthonyh

We have a system that uses Apache Jackrabbit as an image (document) storage repository. We would really like to be able to pull documents for use with applications outside said system. The vendor of the system is, of course, not willing to volunteer how we can do this. So, I've been asked to reverse engineer it. I've looked at the database (MS-SQL) that's being used as storage and, yeah, I need to get into it from the Jackrabbit side...

Anyone have any pointers on resources to help me with this? At least a pointer on where to start?

It goes without saying, I have no clue what I'm doing.

MattSpeller

Upvoted for ambition + visibility

gjacobse

Well, the good thing is that it's open source and runs on Apache.

Sounds like a @scottalanmiller question.

JaredBusch

Jackrabbit has an API. Why go into the DB when you can use the API?

Ambarishrh

I am not sure if this is helpful, but a search got me this: http://blog.mooregreatsoftware.com/

Part of that blog:
Sadly, the metadata files for AEM Package Manager are very, very poorly documented. To make matters worse, there is a lot of duplication and inconsistencies between them. There is a little bit of information at the Apache Jackrabbit FileVault Documentation site, but it is focussed at the Vault filesystem and the like, not specifically how to use packages. The Adobe 6.1 Package Manager documentation discusses creating a package through the UI, but doesn’t discuss any of the mechanics. The Maven VLT plugin talks a little about how to set up Maven, but has huge holes in what is actually done and what the values really mean.

In an effort to get some better understanding, I’ve done a lot of reading, testing, and reverse engineering to come up with the following information. If anyone knows where I can learn more, I’d love to know and pass that along!

Not sure if it completely talks about Apache Jackrabbit, but thought this might help.

And another one, which talks about exporting data as XML:
https://wiki.apache.org/jackrabbit/BackupAndMigration

scottalanmiller

@gjacobse said in Reverse Engineer Apache Jackrabbit Setup:

Well, the good thing is that it's open source and runs on Apache.

Sounds like a @scottalanmiller question.

LOL, yes. Jackrabbit itself is fully open. No reverse engineering needed. You can look right at the code or docs.

scottalanmiller

So the MS SQL Server database is overly complex? Hard to believe that the image data is not relatively easy to find in there.

anthonyh

@scottalanmiller said in Reverse Engineer Apache Jackrabbit Setup:

So the MS SQL Server database is overly complex? Hard to believe that the image data is not relatively easy to find in there.

The SQL database appears to be fairly simple. However, it's not in any easy-for-a-human-to-decipher structure (at least this human).

For what it's worth, we used to have a system that used IBM's FileNet for document storage...and I easily reverse engineered the Oracle back-end of that and was able to pull docs from that with no issues.

This is nothing like FileNet, unfortunately.

tiagom

Of course, it's so you pay them to do whatever customization you are after.

Sadly, I have no experience with Apache Jackrabbit. Hope you figure this out!

anthonyh

I think I may go with a less elegant method, but one I can put together more quickly.

I discovered that once I'm logged into the system (it's web based), I can simply browse to the document retrieval URL and stick the appropriate document ID in said URL. This will spit out said document.

I can script this via Lynx on a Linux VM relatively easily.

All we need to do is dump the desired document IDs to a list that I can then read on the Lynx side and, boom, we'll have the docs to do with as we please.
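
For anyone who would rather not use Lynx, here is a minimal Python sketch of the same loop. It is only an illustration: the URL template, the `docId` parameter name, and the idea of reusing a session cookie copied from a logged-in browser are all assumptions, since the real retrieval URL and login mechanism are not shown in this thread.

```python
import urllib.request
from pathlib import Path
from urllib.parse import quote

# Hypothetical retrieval URL; the real path and parameter name are not given here.
URL_TEMPLATE = "https://docs.example.internal/retrieve?docId={doc_id}"


def read_ids(text):
    """Return the non-empty, de-duplicated document IDs from a one-ID-per-line dump."""
    seen = []
    for line in text.splitlines():
        doc_id = line.strip()
        if doc_id and doc_id not in seen:
            seen.append(doc_id)
    return seen


def build_url(doc_id, template=URL_TEMPLATE):
    """Insert a URL-escaped document ID into the retrieval URL."""
    return template.format(doc_id=quote(doc_id, safe=""))


def fetch_all(ids, out_dir, session_cookie):
    """Download each document, sending the (assumed) session cookie with every request."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for doc_id in ids:
        req = urllib.request.Request(
            build_url(doc_id), headers={"Cookie": session_cookie}
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            # Escape the ID again so it is always a safe file name.
            (out / (quote(doc_id, safe="") + ".bin")).write_bytes(resp.read())
```

Usage would be something like `fetch_all(read_ids(open("ids.txt").read()), "docs", "JSESSIONID=...")`, where the cookie name is again a guess.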

dafyre

You could also browse the database tables and figure out where those document IDs live. That way you can simply pull straight from the DB.

anthonyh

If I could do that, I would. The DB is in no way/shape/form readable by anything other than Jackrabbit. This was just confirmed by the vendor of the system. They actually just suggested exactly what I'm working on doing (after my boss had what he calls a "come to Jesus" moment with them).

travisdh1

Hrm, let me guess, they're storing entire tables of values from PHP in single database columns? That is so very highly annoying, and goes against everything relational databases are supposed to be. I've had bad experiences with this in Drupal myself.

anthonyh

No, it's not doing that. What it's doing kinda makes sense (at least from the limited sleuthing knowledge I have), it's just organized for Jackrabbit and not for a human. There are 6 tables:

GLOBAL_REVISION - Not sure what this is, we only have one record here. I believe it has to do with clustering (there are 4 app servers and Jackrabbit runs on each app).
JOURNAL - I believe this is something to do with clustering as well.
BINVAL - Where the documents are stored, I believe. There are two columns, BINVAL_ID and BINVAL_DATA.
BUNDLE - Not sure what this is.
NAMES - A reference table for various object names.
REFS - Empty in our implementation.

From what I've researched, the docs are stored in hexadecimal format. However, when I pull the BINVAL_DATA field for a given record and convert from hex to binary, the file is unreadable. Even if I could successfully convert the doc, the IDs for these records do not correspond to the IDs that we see on the front-end. I have not found any sort of relationship table/list in the front-end database, I suspect it's all done via Jackrabbit.
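
A quick way to sanity-check the hex-to-binary step is to decode the value and look at its first few bytes. The sketch below assumes the field was copied out of the query tool as a `0x`-prefixed hex string; `hex_to_bytes` and `sniff` are helper names made up for this example. The signatures listed are the standard leading bytes of those file formats.

```python
# Leading "magic" bytes of a few common document and image formats.
MAGIC = [
    (b"%PDF", "pdf"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF8", "gif"),
    (b"II*\x00", "tiff"),
    (b"MM\x00*", "tiff"),
]


def hex_to_bytes(hex_text):
    """Decode a hex dump such as '0x25504446...' into raw bytes."""
    cleaned = "".join(hex_text.split())  # drop spaces and line breaks
    if cleaned[:2].lower() == "0x":
        cleaned = cleaned[2:]
    return bytes.fromhex(cleaned)


def sniff(data):
    """Return a format name if the data starts with a known signature, else None."""
    for signature, name in MAGIC:
        if data.startswith(signature):
            return name
    return None
```

If `sniff` returns `None` for every record, that would suggest the stored value is not a bare file (it may be wrapped, compressed, or split up), but that is a guess rather than something confirmed in this thread.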

travisdh1

BINVAL_DATA is probably the raw jpg/gif/whatever, I'd be surprised if you needed to convert it.

Overall, Jackrabbit sounds like it was designed horribly, and you've found the best option out of the bad choices you have.

JaredBusch

@anthonyh said in Reverse Engineer Apache Jackrabbit Setup:

I have not found any sort of relationship table/list in the front-end database, I suspect it's all done via Jackrabbit.

This is obviously not true. There will be a record someplace that contains all of the cross references or there would be no way for anything to be pulled out after it was stored. This is just silly reasoning. Just because you do not know where to find it does not mean it does not exist.

That said, I told you all the way at the beginning of this thread to use the native API to pull documents instead of trying to kludge some hack together. That is the entire point of having an API.

dafyre

Compare ID fields in the NAMES and BINVAL tables... A system like this is not likely to have the correct information in one place.

anthonyh

@JaredBusch said in Reverse Engineer Apache Jackrabbit Setup:

@anthonyh said in Reverse Engineer Apache Jackrabbit Setup:

I have not found any sort of relationship table/list in the front-end database, I suspect it's all done via Jackrabbit.

This is obviously not true. There will be a record someplace that contains all of the cross references or there would be no way for anything to be pulled out after it was stored. This is just silly reasoning. Just because you do not know where to find it does not mean it does not exist.

That said, I told you all the way at the beginning of this thread to use the native API to pull documents instead of trying to kludge some hack together. That is the entire point of having an API.

I am pretty knowledgeable about the non-Jackrabbit side of this application, and I am going to say you're wrong. I'm confident the relationship is stored on the Jackrabbit side and NOT the front-end side.

Yes, Jackrabbit has an API (I am fully aware of this). I looked at their "First Hops" exercise (making a connection to Jackrabbit), and you need to know about the JCR specification and how to program in Java. I do not have these skill sets (yet).

http://jackrabbit.apache.org/jcr/first-hops.html

anthonyh

@dafyre said in Reverse Engineer Apache Jackrabbit Setup:

Compare ID fields in the NAMES and BINVAL tables... A system like this is not likely to have the correct information in one place.

Unfortunately the NAMES table has a total of 10 records. It's not document names (good guess, though!).

anthonyh

@travisdh1 said in Reverse Engineer Apache Jackrabbit Setup:

BINVAL_DATA is probably the raw jpg/gif/whatever, I'd be surprised if you needed to convert it.

Overall, Jackrabbit sounds like it was designed horribly, and you've found the best option out of the bad choices you have.

Looks like BINVAL_DATA is a byte array type. The link below, though not Jackrabbit specific, shows how to convert between a file and a byte array.

http://www.programcreek.com/2009/02/java-convert-a-file-to-byte-array-then-convert-byte-array-to-a-file/
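
The linked article does this in Java. For comparison, here is the same round trip as a small Python sketch, with made-up helper names. It shows only the generic file/byte-array conversion and assumes nothing about how Jackrabbit lays out BINVAL_DATA.

```python
from pathlib import Path


def bytes_to_file(data, path):
    """Write a byte array to disk unchanged and return the number of bytes written."""
    return Path(path).write_bytes(data)


def file_to_bytes(path):
    """Read a whole file back into an immutable bytes object."""
    return Path(path).read_bytes()
```

Writing a value out with `bytes_to_file` and opening the result in a viewer is a cheap test of whether the bytes pulled from the column are a complete file.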