Need a DB option for archiving ecommerce data



  • We are doing a huge migration of two stores on one platform into one store on a different platform.

    We're starting relatively "fresh" as far as importing anything old. Only the catalog and customers were copied, no order history and so forth.

    However, over the years I've been doing weekly backups downloading the CSV data that the previous stores made available. They had separate files for products, categories, customers, and orders. They also provide XML exports which actually include additional data not found in the CSVs.

    I need to take these hundreds of CSV files which are in 4 categories, across two stores, and convert them all into a type of single archive application for holding about 10 years worth of data. Sadly, they never wanted individual sales to be recording in accounting software (Quickbooks) so this is also acting like an accounting archive as well.

    My skills are in PHP and MySQL as far as building DB apps, but there are some nuances to this project where I might want to explore alternate databases.

    One thing is that as I did weekly backups, various bits of data might change, for example the price going up or down, or text in the description changing, or the URL being changed, etc. These are reflected in the history of my CSV files, but would be lost if I simply recorded the latest version of the product. Even an order may have changed from one week to the next if, for example, they added or removed an item from an order. Or the order was live and good but later cancelled. I don't want to lose the history of the order as recorded through my backups.

    As I think about this from the RDBS perspective I don't know that it's the best choice. It would be difficult to have "one row" for a product, but then ALSO record every time-stamped variation of any given field.
    I could save every backup as individual rows, but then every query would have to group things properly and filter out by date and so on. I think these queries would get very expensive always having to process dozens of rows of the same product and then parse out sorted alternative fields that might exist.

    In my brain this seems better suited to an object datastore where an "array" called "Description" could contain every dated alternate text. Then each product or order could still be a single master record and all the alternate fields would be within it.

    Normally storing customers, orders, and products would be pretty "relational" with a table for each one. But since I'd like to store the dated history of various fields within those, according to my backups, it complicates things a bit.

    I'm either trying to figure out a good RDBMS strategy for this without creating spaghetti queries, or considering more of an object store or graph-type DB. Maybe Mongo, etc.

    I also have to consider a method whereby all my CSVs can be processed and stored automatically since I don't want to enter anything manually, or even one file at a time.

    Then I'll have to build a basic GUI for finding and viewing the records.



  • Regardless of which type of DB you use, you could have the most recent information in a table... say... products.

    The history / old information could be stored in products_history or something like that.

    That way your most recent information is a simple query, and if you need to review a product's history, then you query the history tables?



  • What we use in finance is called a "tick" system. A column (this could be its own table in an RDBMS, or it could be stored in something like Redis) that stores changes over time. It would only need to record the product, datetime and price.



  • @scottalanmiller said in Need a DB option for archiving ecommerce data:

    What we use in finance is called a "tick" system. A column (this could be its own table in an RDBMS, or it could be stored in something like Redis) that stores changes over time. It would only need to record the product, datetime and price.

    The billing application that we wrote for a client works like that more or less.

    User wants to change prices, they select the products, set the effective time of the price change, set the new price, and the application save that to the pricing table in the DB.

    Prices generally change daily and are effective from 6pm to 5:59pm, but can change at anytime with new effective times.

    When invoices are generated, it queries the time of purchase from the transaction and then compares to the pricing table to find out which price to bill.



  • @scottalanmiller said in Need a DB option for archiving ecommerce data:

    What we use in finance is called a "tick" system. A column (this could be its own table in an RDBMS, or it could be stored in something like Redis) that stores changes over time. It would only need to record the product, datetime and price.

    So would I store the entire product "row" but with the added time stamp? Or would this just be the product ID with the single field and date stamp? Essentially just 3 or 4 columns for every single historical change?

    If I store the entire row with all it's columns, there would be a lot of columns of empty data. If I store just ONE field that has a change, then the table as a whole would increase a lot in row count.

    Either way I would have to pull every record matching an ID and parse out the history of each field. Not sure which method would be most performant or efficient.



  • @guyinpv said in Need a DB option for archiving ecommerce data:

    So would I store the entire product "row" but with the added time stamp? Or would this just be the product ID with the single field and date stamp? Essentially just 3 or 4 columns for every single historical change?

    Store the product "row" but do not store the price with it. Store that in a separate column. Then do a lookup against the product ID and time to coordinate the two.



  • @guyinpv said in Need a DB option for archiving ecommerce data:

    If I store the entire row with all it's columns, there would be a lot of columns of empty data. If I store just ONE field that has a change, then the table as a whole would increase a lot in row count.

    Anytime that that will or might happen, it tells you that you should have another table, in the relational database world.



  • @guyinpv said in Need a DB option for archiving ecommerce data:

    Either way I would have to pull every record matching an ID and parse out the history of each field. Not sure which method would be most performant or efficient.

    Only one method is even considered valid 🙂 Never let nulls exist in a relational database.



  • I still think something more like an object store makes more sense than relational.

    product = {
      pid = 12345,
      description = {
        [20160403, "Some long description"],
        [20160513, "Some long description2"],
        [20160821, "Some long description3"],
      },
      otherfield = {
        [20150414, "field value"],
        [20150414, "field value"],
      }  
    
    ...
    }
    

    This way, the entire product is still one record, but all those dated changes are just in sub objects and arrays. Seems more logical somehow than relational. I guess I'll have to play around with both ideas.



  • Nothing illogical about the relational, in fact, it's a "relationship" and exactly the kind of thing that relational databases are designed for. And since it is ecommerce data, it fits very, very much into the wheelhouse of relational. I'd argue that you are actually demonstrating the very use case that is where relational actually makes sense, rather than one where it doesn't.



  • @scottalanmiller said in Need a DB option for archiving ecommerce data:

    Nothing illogical about the relational, in fact, it's a "relationship" and exactly the kind of thing that relational databases are designed for. And since it is ecommerce data, it fits very, very much into the wheelhouse of relational. I'd argue that you are actually demonstrating the very use case that is where relational actually makes sense, rather than one where it doesn't.

    Yes, this is very much a relational design.


Log in to reply