Mongo Charts Without Swarm

MongoDB Charts is a nice visualizer for MongoDB.  It's young and I'd like much more functionality, but it's a nice quick-and-dirty option.  I suspect it's easier than building a Pandas or Matplotlib visualizer, but I also suspect there are things that do this much better, like Grafana.  Anyway, here is how to set the darn thing up without a swarm (the documented way from Mongo uses one).  This took some tinkering to get right.  Note that this setup persists volumes to host directories rather than using the Docker volume facility.

I have exposed this instance with Traefik but obfuscated that part as well as internal details.

 

I also recommend persisting your Mongo data to host folders if you're running it in containers.  Either method below works:

docker run -blah -blah -blah -v host:container banana_hammock:latest

or 

volumes:  
  - host:container

 

version: "3.3"

services:
  charts:
    image: quay.io/mongodb/charts:v0.12.0
    container_name: charts
    # hostname: charts
    ports:
      # host:container port mapping. If you want MongoDB Charts to be
      # reachable on a different port on the docker host, change this
      # to :80, e.g. 8888:80.
      - 80:80
    volumes:
      - /yourstuff/keys:/mongodb-charts/volumes/keys
      - /yourstuff/mongodb-charts/data/logs:/mongodb-charts/volumes/logs
      - /yourstuff/mongodb-charts/data/db-certs:/mongodb-charts/volumes/db-certs
      - /yourstuff/mongodb-charts/data/web-certs:/mongodb-charts/volumes/web-certs
      - /yourstuff/mongodb-charts/charts-mongodb-uri:/run/secrets/charts-mongodb-uri
    environment:
      # The presence of following 2 environment variables will enable HTTPS on Charts server.
      # All HTTP requests will be redirected to HTTPS as well.
      # To enable HTTPS, upload your certificate and key file to the web-certs volume,
      # uncomment the following lines and replace with the names of your certificate and key file.
      # CHARTS_HTTPS_CERTIFICATE_FILE: charts-https.crt
      # CHARTS_HTTPS_CERTIFICATE_KEY_FILE: charts-https.key

      # This environment variable controls the built-in support widget and
      # metrics collection in MongoDB Charts. To disable both, set the value
      # to "off". The default is "on".
      CHARTS_SUPPORT_WIDGET_AND_METRICS: "off"
      # Directory where you can upload SSL certificates (.pem format) which
      # should be considered trusted self-signed or root certificates when
      # Charts is accessing MongoDB servers with ?ssl=true
      # SSL_CERT_DIR: /mongodb-charts/volumes/db-certs
      CHARTS_MONGODB_URI: "mongodb://:@/admin"
    networks:
      - net1
      - net2
    labels:
      - "traefik.enable=true"
      - "traefik.frontend.rule=Host:banana.hammock.com"
      - "traefik.backend=banana"
      - "traefik.port=80"
      - "traefik.docker.network=net1"
      - "traefik.frontend.headers.SSLRedirect=true"
      - "traefik.frontend.headers.STSSeconds=315360000"
      - "traefik.frontend.headers.browserXSSFilter=true"
      - "traefik.frontend.headers.contentTypeNosniff=true"
      - "traefik.frontend.headers.forceSTSHeader=true"
      - "traefik.frontend.headers.SSLHost=lamkerad.com"
      - "traefik.frontend.headers.STSIncludeSubdomains=true"
      - "traefik.frontend.headers.STSPreload=true"
      - "traefik.frontend.headers.frameDeny=true"
    # links: 
      # - mongo_mongo_1


networks:
  net2:
    external:
      name: net2
  net1:
    external: true
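
Before bringing the stack up, it can save a debugging round to confirm that the connection string you will put in CHARTS_MONGODB_URI (or the charts-mongodb-uri secret file) actually works.  A minimal PyMongo check; the URI below is a placeholder, not my real one:

# quick connectivity check for the metadata MongoDB that Charts will use
# the URI is a placeholder -- substitute your real user, password, and host
from pymongo import MongoClient
from pymongo.errors import PyMongoError

uri = "mongodb://user:password@mongo-host:27017/admin"

client = MongoClient(uri, serverSelectionTimeoutMS=5000)
try:
    client.admin.command("ping")   # raises if the server is unreachable or auth fails
    print("Connection OK")
except PyMongoError as exc:
    print("Connection failed:", exc)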

Wacky Price Shifts: Part 5 of scraping data

One cool thing about dwelling on the data is that you get to see wack-a-doody fluctuations and drill down on them.  Here is one shoe that I would never wear, but someone must want it.  I think this is the greatest price fluctuation that I have seen.

High $87.99

Low $33.00

[Chart: price history for the shoe]

Artificial Discounts: Part 3 of scraping data

This is a follow up to part 2.

TL;DR: Sites use a deceptive marketing scheme of raising the price and then discounting the raised price to make it appear to be a better deal.


Now that prices have been scraped for a week, it is becoming a little easier to recognize trends in the data.  It also helps that there was a “$10 Off Selected Shoes” sale, which makes the data more sensible in my opinion.  There was a sale and most shoes were truly $10 off.  But at the lower end, where margins are thin, the price went UP at the time of the discount, leaving a factitious $10 off.

As I mentioned in the previous post, I think this is deceptive, but that can be argued ad infinitum.  The bottom line is that most of the shoes were cheaper while on sale.  There were outlier fluctuations that were independent of the sale.  I suspect that some were based on stock and some were discounted to promote the shoe, and then the price was raised.

[Chart: prices around the end of the sale]

What is a bummer is that the price after the sale is higher than before.

 

There are small pockets of opportunity to get discounts before prices go up.  For example, the 1260 dropped to a low in the mid-$90 range before shooting back up.

 

[Chart: price history for the 1260]

MongoDB Recipes

Clone one collection to another (Mongo Shell):

First, drop the target collection (scrapy_dev in this case):
db.scrapy_dev.drop()

Then clone the source collection (joesmens here) into the now-empty target:
db.joesmens.find().forEach(function(o) { db.scrapy_dev.insert(o); });

The target must be an empty collection.  This approach circumvents authorization issues with the copyTo() function.
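
The same clone can also be scripted from PyMongo; a rough equivalent, reusing the connection details from the shell login in the next recipe:

# clone joesmens into scrapy_dev from PyMongo (names match the shell example above)
from pymongo import MongoClient

client = MongoClient("localhost", 27017, username="glamke",
                     password="glamke", authSource="admin")
db = client["joesmens"]

db.scrapy_dev.drop()                    # start from an empty target collection
docs = list(db.joesmens.find())
if docs:
    db.scrapy_dev.insert_many(docs)     # bulk insert instead of one insert per document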


Log into Mongo Shell with Authentication (local login from docker container to shell):

mongo --username glamke --password glamke --authenticationDatabase admin joesmens
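
The PyMongo equivalent of that login, for when you are scripting rather than poking around in the shell, looks roughly like this (host and port are assumptions):

from pymongo import MongoClient

# authenticate against the admin database, then work in the joesmens database
client = MongoClient("localhost", 27017,
                     username="glamke",
                     password="glamke",
                     authSource="admin")
db = client["joesmens"]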

 


PyMongo: Collection level operations

# collection-level operation on a single document (the empty filter matches the first doc)
# $pop with -1 removes the first (oldest) element of the 'observation' array;
# use 1 instead to remove the last (newest) element
items.update_one({}, {"$pop": {'observation': -1}})

PyMongo: Document level operations (uses dot notation to select from array)

# remove the 'price' field from the first array element (index 0) in every document
items.update_many({}, {"$unset": {'observation.0.price': 1}})

# the same pattern works for any zero-based index, e.g. the second element;
# dot notation does not accept negative indexes
items.update_many({}, {"$unset": {'observation.1.price': 1}})
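
For completeness, the items variable in these snippets is just a PyMongo collection handle.  A minimal setup plus a quick before/after check; the database and collection names are assumptions:

from pymongo import MongoClient

client = MongoClient("localhost", 27017, username="glamke",
                     password="glamke", authSource="admin")
items = client["joesmens"]["joesmens"]   # database and collection names are assumptions

# inspect one document's observation array before and after an update
# (assumes at least one document with an 'observation' array exists)
doc = items.find_one({}, {"observation": 1})
items.update_one({"_id": doc["_id"]}, {"$pop": {"observation": -1}})   # drop the oldest observation
after = items.find_one({"_id": doc["_id"]}, {"observation": 1})
print(len(doc["observation"]), "->", len(after["observation"]))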

Artificial Discounts: Part 2 of scraping data

This is a follow up to part 1.

TL;DR: Sites use a deceptive marketing scheme of raising the price and then discounting the raised price to make it appear to be a better deal.


Data is starting to flow in now that I have my schema established with Mongo.  The to-do list is getting longer as this matures, but for now I am running the scraper manually to avoid inundating the database.  My next to-do is to set up a scrapyd instance on my server and cron the scraping to every 6 hours, which is a subjective balance between information awareness and information overload.

Immediate to-dos (by priority):

  1. scrapy cron job set to 4 hours
  2. start looking into map-reduce to eliminate unchanged data points and compact the data (first time with a database, so it’s all new, and speaking Python to a JavaScript database can be confusing); a rough sketch of the idea follows this list
  3. fix mongo charts axis titles
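
For item 2, the eventual goal is probably an aggregation or map-reduce job on the server, but the idea can be sketched on the application side first.  A rough Python pass that compacts each document's observation array by dropping consecutive entries whose price, discount, and savings did not change; the connection and collection names are the same assumptions as in the PyMongo recipes above:

from pymongo import MongoClient

# same assumed connection and collection handle as in the MongoDB recipes
client = MongoClient("localhost", 27017, username="glamke",
                     password="glamke", authSource="admin")
items = client["joesmens"]["joesmens"]

def compact(observations):
    """Keep an observation only when price/disc/savings changed from the previous kept one."""
    kept, last_key = [], None
    for obs in observations:
        key = (obs.get("price"), obs.get("disc"), obs.get("savings"))
        if key != last_key:
            kept.append(obs)
            last_key = key
    return kept

for doc in items.find({}, {"observation": 1}):
    compacted = compact(doc.get("observation", []))
    if len(compacted) < len(doc.get("observation", [])):
        items.update_one({"_id": doc["_id"]},
                         {"$set": {"observation": compacted}})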

The first interesting thing is that changes in both price and % off discount have already been tracked.

Price on Y axis with the lines each representing a product. X axis is time. Some products have sold out leading to discontiguous lines.

Price over time


So the overall trend is changing prices, with higher-priced items going down and lower-priced items going up.  The discount % below puts this in context.  (A good graph would be the change in price plotted against discount, viz. the derivative plotted versus the discount.)


Product Discount over time


Things get more interesting here, with discounts converging at the 40% mark.


I suspect this is because of a promo on the front page, shown below.  This allows them to run a promo without hurting the bottom line too much, but it makes you realize the discount % is an illusion.

[Screenshot: front-page promo]

Not everything got better pricing, but certain items did.  Empathizing with the retailer, the shoes that went up in “base cost” are still a better bargain with the sale; this is based on the following time series data:

{
    "price": 41.99,
    "disc": 30,
    "savings": 18,
    "time": ISODate("2019-04-16T01:19:49.585Z")
},
{
    "price": 35.99,
    "disc": 40,
    "savings": 24,
    "time": ISODate("2019-04-16T08:41:03.342Z")
}

So this is a typical deceptive marketing scheme of raising the price and then discounting the raised price to make it appear to be a better deal.

To clarify the interpretation above: the product was $41.99 @ 30% off.  The selling price was then raised to $45.99 (this step can’t be seen in the data because of the $10 off promo), and with the $10 off promo the price comes down to $35.99 @ 40% off the unchanged list price.
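
A quick sanity check of that interpretation, redoing the arithmetic from the two observations above (nothing here is pulled from the database):

# back out the list price implied by each observation: price + savings
before = {"price": 41.99, "disc": 30, "savings": 18}
during = {"price": 35.99, "disc": 40, "savings": 24}

print(before["price"] + before["savings"])   # ~59.99: list price implied before the promo
print(during["price"] + during["savings"])   # ~59.99: same list price during the promo

# the pre-promo sale price must have been raised to 45.99 for "$10 off" to land at 35.99
print(during["price"] + 10)                   # ~45.99

# so the real saving versus the earlier price is only about $6, not $10
print(before["price"] - during["price"])      # ~6.00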

Confused?  Me too. 

This is apparently termed artificial discounting based on this article:

https://www.news.com.au/technology/australian-online-retailer-kogan-fined-32400-for-raising-prices-before-discounting-goods/news-story/a912f247a63dd2db0bd7a109ba3606ee

https://www.nbcnews.com/business/consumer/fake-sales-trick-customers-major-stores-study-says-n366676


Scraping with Mongo, scrapy, and time series schema in NoSQL

Time series data is challenging to store.  It can be handled with either a SQL or a NoSQL data structure.  Scrapy has guides for using MongoDB, so I went with that for ease of use in the pipeline.

The commands to get data in are complex for me as a new user, especially the query and upsert components.

 

Here is the code I used to push my data from Scrapy into the DB, using a hierarchical “bucketing” approach to the time series observations.

 

self.db[self.collection_name].update(
    {"$and": [
        {'link_relative': item['link_relative']},
        {'item_name': item['item_name']},
        {'click': item['click']}
    ]},
    {"$push": {'observation': item['observation']}},
    upsert=True)

To break this down, the update parameters work like this:

{"$and":

This is the query parameter used to find the matching entry in the DB, i.e. the correct preexisting document.  In this case, it matches on the following criteria:

{'link_relative': item['link_relative']}, 
{'item_name': item['item_name']},
{'click': item['click']}
]},

These 3 fields are together unique to the desired record.  Note that several records can share the same 'item_name', but by using the '$and' operator, all 3 must match or a new document will be created.


The observation data is a dictionary of scraped values that gets pushed into the matching record in Mongo.

 

{"$push": {'observation':item['observation']}},
upsert=True)
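
For context, here is roughly how that call sits inside a Scrapy item pipeline (modeled on the MongoDB pipeline example in the Scrapy docs).  The class name and settings keys are assumptions rather than the exact code from this project, and update_one is just the current PyMongo spelling of the older update() call used above:

import pymongo

class MongoTimeSeriesPipeline:
    """Upsert each scraped item, appending its observation to the time series bucket."""

    def __init__(self, mongo_uri, mongo_db, collection_name):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name

    @classmethod
    def from_crawler(cls, crawler):
        # settings keys are placeholders -- use whatever your project defines
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE"),
            collection_name=crawler.settings.get("MONGO_COLLECTION"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # match on all three keys, push the new observation, create the doc if missing
        self.db[self.collection_name].update_one(
            {"$and": [
                {"link_relative": item["link_relative"]},
                {"item_name": item["item_name"]},
                {"click": item["click"]},
            ]},
            {"$push": {"observation": item["observation"]}},
            upsert=True,
        )
        return item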

 

This yields a data structure in mongo that looks like the following:

    "item_name": "banana hammock",
    "link_relative": "/product/banana_hammocks/mini_hammock",
    "click": "=HYPERLINK(\"https://www.bananhammocks.com/product/mega_hammocks/mini_hammock\",\"GO\")",
    "observation": [
        {
            "price": 52.49,
            "disc": 30,
            "savings": 22.5,
            "time": ISODate("2019-04-14T14:37:19.850Z")
        },...
        {
            "price": 52.49,
            "disc": 30,
            "savings": 22.5,
            "time": ISODate("2019-04-14T18:12:00.989Z")
        }
    ]
}

The top-level document consists of "item_name", "link_relative", and "click".

The first two are self-explanatory; "click" is an Excel-format clickable link that gets polled and inserted into a CSV for use in Excel.  It probably doesn’t need to be stored, since it can be reconstructed at the application level from the relative link, but it is a holdover from storing data to a simple CSV for Excel.
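
If the "click" field were dropped, rebuilding it at the application level would be a one-liner; a sketch, with the base URL as an assumption:

# rebuild the Excel HYPERLINK formula from the relative link (base URL is an assumption)
def make_click(link_relative, base="https://www.bananhammocks.com"):
    return '=HYPERLINK("{}{}","GO")'.format(base, link_relative)

print(make_click("/product/banana_hammocks/mini_hammock"))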

 

That’s it for this post, which is intended to help me understand how I got it working and remind me how to do it for other cases.  If you found this by googling, I hope it helps you.