[TRU Research] Web App Data Schema

Sun Aug 11 12:54:11 PDT 2019

Regarding where to have this discussion - I'm just gonna continue the 
email chain cause I haven't followed where to put the discussion on the 
wiki, but someone feel free to steer it over there if we want to!

A brief update regarding establishing a larger list of employers to 
include in the dataset: basic contact information for all seattle 
businesses, sorted by the North American Industry Classification System, 
is available at 
https://web6.seattle.gov/fas/slimbizsearch/ResultsPage.aspx?NAICList=Top100, 
but it's a huge list of course, with no info on number of employees or 
revenue to filter out the smaller ones. Still, I did send off an email 
about getting a copy of the database just for purposes of 
cross-referencing names and such.
On 8/10/19 6:42 PM, Katie Wilson wrote:
> For “neighborhood” I think it makes sense to use the “CTR Network 
> Areas” as defined here 
> <https://www.seattle.gov/transportation/projects-and-programs/programs/transportation-options-program/commute-trip-reduction-program/draft-2019-2023-networks-and-targets>.
>
> For “industry” I think it makes sense to use the “Employment Sector” 
> categories listed on Page 12 of this CTR strategic plan. 
> <https://www.seattle.gov/Documents/Departments/SDOT/TransportationOptionsProgram/CTR_Draft_Strategic_Plan_Jan2019.pdf>
>
> On the ratings, I think it does make sense to lump "piggy bank" and 
> "brown tortoise" in the same rating (0), and then add a tortoise badge 
> for employers that aren’t even doing the pre-tax thing.
>
> Another simplification option to consider would be to lump together 3 
> and 4 leaves. But let’s leave them separate for now and depending on 
> how things shake out we can easily combine them later.
>
> We don’t have any major sources of data on what benefits employers 
> provide other than:
> — Metro public disclosure request spreadsheet 
> <https://seattletransitpasses-research.pbworks.com/w/page/133438080/First%20Public%20Records%20Request>
> — Our commute survey
> — Info gleaned online from company websites, asking around, glassdoor 
> etc (what I’ve found I’ve added to the relevant tables in the wiki 
> <https://seattletransitpasses-research.pbworks.com/w/page/132177123/Employers>, 
> on CTR employers and “potential poster children” and “likely target 
> assessment” and “hotels”)
>
> Maybe it makes sense to have another string indicating sufficient 
> certainty — when we have two sources, or one very reliable source, we 
> enter an X or whatever, and that gives us the green light to display 
> that data. Also it may not make sense to put a lot of work into 
> categorizing employers into Network Area and Employment Sector until 
> we have reliable data on what benefits they’re offering.
>
> Speaking of Seattle Coffee Works, I spoke with their HR person a few 
> months ago and actually employees have to pay $20/month (pre-tax $) if 
> they want an ORCA card. Still a great deal but not 100% subsidy as 
> reported in the Metro data— which, I then learned, is self-reported by 
> the company. Metro only knows that all those companies are signed up 
> for the Passport program. I noted the real situation on this page 
> <https://seattletransitpasses-research.pbworks.com/w/page/133439169/Potential%20Poster%20Children>. 
> Anyway, the point is we should probably crosscheck the Metro data as 
> much as we can with our survey or other sources of information.
>
> (Also speaking of Seattle Coffee Works they have locations in Capitol 
> Hill & Cascade too 
> <https://www.seattlecoffeeworks.com/our-cafes.aspx>. From talking with 
> the HR person I’m pretty sure all are include in their passport 
> program, and the employees swap around a lot from location to 
> location. They probably use the Ballard location as home base for 
> transit pass purposes since that’s the least expensive zone.)
>
> One project would be to come up with a list of employers that have 
> name recognition (or that we are interested in for some other reason) 
> and put a little work into attaining sufficient certainty. If we 
> posted the list to a page and put a call out on social media and email 
> I bet we’d get some answers.
>
>> On Aug 8, 2019, at 5:26 PM, Stephen DeSanto <rachidian at gmail.com 
>> <mailto:rachidian at gmail.com>> wrote:
>>
>> Hi everyone,
>>
>> I've taken a first pass at the data schema for showing employer 
>> transit benefits in our upcoming web app. In this draft, each 
>> employer record is represented as follows:
>>
>> {
>>     "employer": string,
>>     "industry": [string],
>>     "neighborhood": [string],
>>     "alias": [string],
>>     "rating": int,
>>     "description": string
>>     "badges": [string]
>> }
>>
>> *Employer* is a plain text string.
>> *Industry* is a list of strings (or a single string, if we want to 
>> limit one employer = one industry).
>> *Neighborhood* is treated similarly to industry
>> *Alias* is a list of other names for the same company. For example,
>> *Rating* is a numerical scale that represents the "worker's monthly 
>> cost of an unlimited transit pass". The scale provided during the 
>> meeting went from "4 leaves" to "brown tortoise"; aligning to the 
>> leaves, that gives us a scale of [-1, 0, 1, 2, 3, 4]. We could adjust 
>> this up to 0-5, or lump "piggy bank" and "brown tortoise" in the same 
>> rating.
>> *Description* is a string that describes the employer's transit 
>> benefits, i.e. why they got the rating they did.
>> *Badges* is a list of strings that represent any additional 
>> categories we want to assign to a company (e.g. "industry leader", 
>> "polluter").
>>
>> We can make changes to this schema if it makes it easier to work with 
>> our underlying data visualization platform (e.g. Tableau? 
>> DataTables?), but hopefully this is a suitable starting place.
>>
>> As an example, take a hypothetical record for Seattle Coffee Works.
>>
>> {
>>     "employer": "Seattle Coffee Works",
>>     "industry": ["restaurant"],
>>     "neighborhood": ["cbd", "ballard"],
>>     "alias": ["Ballard Coffee Works"],
>>     "rating": 4,
>>     "description": "Provides 100% ORCA Passport subsidy."
>>     "badges": ["leader"]
>> }
>>
>> *Where Our Data Lives (For Now)*
>>
>> I've also taken a rough chop at getting started with the data. Here, 
>> I've just taken the raw list of ORCA Business Passport employers and 
>> assigned a score based on their subsidy percentage, as an example:
>>
>> https://docs.google.com/spreadsheets/d/1HmOcG7hJLD1G0unCMPcsDnXr4RIA_PMKEE5ne-hhQR8/edit?usp=sharing
>>
>> The spreadsheet contains columns for each item of the employer 
>> record, as well as some additional columns to record the raw data we 
>> have on file for that employer, so we can use that data to 
>> automatically or manually determine an employer's rating.
>>
>> If we have data from other sources not listed (e.g. survey data, City 
>> of Seattle data), the "source_" columns can be renamed or added to 
>> represent that source's data. For example, if I want to add data from 
>> the TRU survey, I might rename "__source_b" to "__TRU Survey", then 
>> include results from that survey in that column for each company. 
>> (The columns beginning with two underscores are ones I don't expect 
>> to be publicly available.)
>>
>> PBworks feels really inadequate for editing large data sets, and I 
>> don't know where else to put it, so it's living in Google Sheets for 
>> now. Set to read-only with the link, for now, but please request 
>> editing permissions so you can add stuff to the sheet.
>>
>> Currently, my expectation is that the spreadsheet will be hand-edited 
>> in Google Sheets, and then when we're ready to put live data in the 
>> web app, we can export the sheet to a flat file, which we can then 
>> import into a format appropriate for the website (big ol' JSON file, 
>> database, whatever). Manual process, but probably fine for a project 
>> of this scale; I'm open to alternatives.
>>
>> *Things To Do Next*
>>
>> Aside from the ORCA Passport data and the data we collected through 
>> TRU survey / legwork (on PBworks), do we have any other data sources 
>> that would provide context for a score?
>>
>> For the data sources we have, we'll have to start filling out the 
>> rest of the spreadsheet, I guess?
>>
>> Also, we will need to determine:
>> a) master list of "industries" we want to support, and
>> b) "industry" field(s) for each employer
>> c) "neighborhood" field(s) for each employer we don't have one for 
>> (or being more precise than what I have now)
>> d) which companies get tagged with which badges
>>
>> Hope that helps.
>>
>> In solidarity,
>>
>> Stephen
>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.transitriders.org/pipermail/research/attachments/20190811/6af0d319/attachment.html>