(* Yoann Padioleau
*
* Copyright (C) 2009, 2010 Facebook
*
* This library is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public License
* version 2.1 as published by the Free Software Foundation, with the
* special exception on linking described in file license.txt.
*
* This library is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the file
* license.txt for more details.
 *)
open Common
open Entity_code

module J = Json_type
module HC = Highlight_code

(*****************************************************************************)
(* Prelude *)
(*****************************************************************************)
(*
* This module provides a generic "database" of semantic information
* on a codebase (a la CIA [1]). The goal is to give access to
 * information computed by a set of global static or dynamic analyses,
 * such as 'what is the number of callers of a certain function', 'what
* is the test coverage of a file', etc. This is mainly used by codemap
* to give semantic visual feedback on the code. See also layer_code.ml
* for complementary semantic information about a codebase.
*
 * update: prolog_code.pl and Prolog may now be the preferred way to
* represent a code database, but for codemap it's still good to use
* this database.
*
* Each programming language analysis library usually provides
* a more powerful database (e.g. analyze_php/database/database_php.mli)
* with more information. Such a database is usually also efficiently stored
* on disk via BerkeleyDB. Nevertheless generic tools like
* codemap can benefit from a shorter and generic version of this
 * database. Moreover, when we have a codebase with multiple languages
 * (e.g. PHP and JavaScript), having a common type can help for some
 * analyses or visualizations.
*
 * Note that by storing this toy database in a JSON format or with Marshal,
 * this database can also easily be read by multiple
 * processes at the same time (there are currently a few problems with
 * concurrent access to Berkeley DB data; for instance a database
 * created by one user cannot even be read by another user ...).
 * This also avoids forcing the user to spend time running all
 * the global analyses on their own codebase. We can factorize the essential
 * results of such long computations in a single file.
*
* An alternative would be to use the TAGS file or information from
 * cscope. But this would require implementing a reader for those
 * two formats. Moreover ctags/cscope do just lexical-based analysis,
 * so they are not a good basis, and they contain only definition->position
* information.
*
* history:
* - started when working for eurosys'06 in patchparse/ in a file called
* c_info.ml
* - extended for eurosys'08 for coccinelle/ in coccinelle/extra/
 * and used it to discover some .c/.h mappings, generate some crazy
 * graphs, and also detect drivers split across multiple files.
* - extended it for aComment in 2008 and 2009, to feed information to some
* inter-procedural analysis.
 * - rewrote it for PHP in Nov 2009
 * - adapted it in Jan 2010 for flib_navigator
 * - made it generic in Aug 2010 for my code/treemap visualizer
 * - added comments about the Prolog database, which may be a better db for
* certain use cases.
*
* history bis:
 * - Before, I was optimizing things by caching the AST in
 *   some xxx_raw files. But that gave lots of small raw files,
 *   which wasted space and was also not good for random
 *   access to the ASTs. So it was better to use Berkeley DB. My experience with
* LFS helped me a little as I was already using berkeley DB and glimpse.
*
 * - I was also using glimpse, and I tried to accelerate coccinelle even more
 *   by generating some mini C files so that glimpse could directly tell us
 *   the toplevel elements to look for. But this generated lots of
 *   very small mini C files, which also wasted lots of disk space.
*
* References:
* [1] CIA, the C Information Abstractor
 *)

(*****************************************************************************)
(* Type *)
(*****************************************************************************)

(* How to store the id of an entity? An int? A name, hoping for few conflicts?
* Using names will increase the size of the db which will slow down
* the loading of the database.
* So it's better to use an id. Moreover at some point we want to provide
 * callers/callees navigation and more entity relationships,
* so we need a real way to reference an entity.
 *)
type entity_id = int

type entity = {
  e_kind: entity_kind;
  e_name: string;
  (* can be empty to save space when e_fullname = e_name *)
  e_fullname: string;
  e_file: Common.filename;
  e_pos: Common2.filepos;

  (* Semantic information that can be leveraged by a code visualizer.
* The fields are set as mutable because usually we compute
* the set of all entities in a first phase and then we
 * do another pass where we adjust the counters based on references from
 * other entities.
*)(* todo: could give more importance when used externally not just
* from another file but from another directory!
* or could refine this int with more information.
   *)
  mutable e_number_external_users: int;

  (* Usually the id of a unit test or pleac file.
*
 * Indeed a simple algorithm to compute this list is:
 * just look at the callers, keep the ones in unit-test or pleac files,
 * then for each caller look at its number of callees, and take
 * the ones with the best ratio (see the sketch after this type definition).
*
* With references to good examples of use, we can offer
 * what Perl programmers have had for years in their function
 * documentation.
 * If there are no examples_of_use then the user can visually
* see that some functions should be unit tested :)
   *)
  mutable e_good_examples_of_use: entity_id list;

  (* todo? code_rank? this is more useful for number_internal_users,
* when we want to know what is the core function in a module,
* even when it's called only once, but by a small wrapper that is
* itself very often called.
   *)
  e_properties: property list;
}
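(* The field comment above describes a simple way to compute
 * e_good_examples_of_use, but this file does not implement it. The function
 * below is only a hedged, illustrative sketch of that idea and is not used
 * elsewhere; the 'callers_of', 'callees_of' and 'is_example_file' oracles
 * are hypothetical parameters, not part of this module.
 *)
let sketch_good_examples_of_use ~callers_of ~callees_of ~is_example_file id =
  callers_of id
  (* keep only the callers living in unit-test or pleac files *)
  |> List.filter (fun caller_id -> is_example_file caller_id)
  (* prefer callers with few callees: they exercise mostly this entity *)
  |> List.map (fun caller_id -> List.length (callees_of caller_id), caller_id)
  |> List.sort compare
  |> List.map snd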
(* Note that because we now use indexed entities, you cannot
 * play with .entities as before. For instance, merging databases
 * requires adjusting all the entity_id internal references.
 *)
type database = {
  (* The common root if the database was built with multiple dirs
* as an argument. Such a root is mostly useful when displaying
* filenames in which case we can strip the root from it
* (e.g. in the treemap browser when we mouse over a rectangle).
   *)
  root: Common.dirname;

  (* Such a list can be used in a search box powered by completion.
   * The int is the total number of times this file is
   * externally referenced. It can be used for instance in the treemap
* to artificially augment the size of what is probably a more
* "important" file.
   *)
  dirs: (Common.filename * int) list;

  (* see also build_top_k_sorted_entities_per_file for dynamically
* computed summary information for a file
   *)
  files: (Common.filename * int) list;

  (* indexed by entity_id *)
  entities: entity array;
}

let empty_database () = {
  root = "";
  dirs = [];
  files = [];
  entities = Array.of_list [];
}

let default_db_name = "PFFF_DB.marshall"

(*****************************************************************************)
(* Json *)
(*****************************************************************************)

(*---------------------------------------------------------------------------*)
(* X -> json *)
(*---------------------------------------------------------------------------*)

let json_of_filepos x =
  J.Array [J.Int x.Common2.l; J.Int x.Common2.c]

let json_of_property x =
  match x with
  | ContainDynamicCall -> J.Array [J.String "ContainDynamicCall"]
  | ContainReflectionCall -> J.Array [J.String "ContainReflectionCall"]
  | TakeArgNByRef i -> J.Array [J.String "TakeArgNByRef"; J.Int i]
  | _ -> raise Todo

let json_of_entity e =
  J.Object [
    "k", J.String (string_of_entity_kind e.e_kind);
    "n", J.String e.e_name;
    "fn", J.String e.e_fullname;
    "f", J.String e.e_file;
    "p", json_of_filepos e.e_pos;
    (* different from type *)
    "cnt", J.Int e.e_number_external_users;
    "u", J.Array (e.e_good_examples_of_use |> List.map (fun id -> J.Int id));
    "ps", J.Array (e.e_properties |> List.map json_of_property);
  ]

let json_of_database db =
  J.Object [
    "root", J.String db.root;
    "dirs", J.Array (db.dirs |> List.map (fun (x, i) ->
      J.Array [J.String x; J.Int i]));
    "files", J.Array (db.files |> List.map (fun (x, i) ->
      J.Array [J.String x; J.Int i]));
    "entities", J.Array (db.entities |> Array.to_list |> List.map json_of_entity);
  ]

(*---------------------------------------------------------------------------*)
(* json -> X *)
(*---------------------------------------------------------------------------*)

let ids_of_json json =
  match json with
  | J.Array xs ->
      xs |> List.map (function
        | J.Int id -> id
        | _ -> failwith "bad json")
  | _ -> failwith "bad json"

let filepos_of_json json =
  match json with
  | J.Array [J.Int l; J.Int c] -> { Common2.l = l; Common2.c = c }
  | _ -> failwith "Bad json"

let property_of_json json =
  match json with
  | J.Array [J.String "ContainDynamicCall"] -> ContainDynamicCall
  | J.Array [J.String "ContainReflectionCall"] -> ContainReflectionCall
  | J.Array [J.String "TakeArgNByRef"; J.Int i] -> TakeArgNByRef i
  | _ -> failwith "property_of_json: bad json"

let properties_of_json json =
  match json with
  | J.Array xs -> xs |> List.map property_of_json
  | _ -> failwith "Bad json"
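(* Illustration only: a hedged sketch of the JSON shape produced by
 * json_of_entity above for a hypothetical entity (field order "k", "n",
 * "fn", "f", "p", "cnt", "u", "ps"). The concrete values are made up and
 * this binding is not used elsewhere in this file.
 *)
let _example_entity_json =
  J.Object [
    "k", J.String "Function";
    "n", J.String "foo";
    "fn", J.String "Bar::foo";
    "f", J.String "lib/bar.php";
    "p", J.Array [J.Int 42; J.Int 0]; (* line 42, column 0 *)
    "cnt", J.Int 3;                   (* 3 external users *)
    "u", J.Array [J.Int 101];         (* entity_id of a good example of use *)
    "ps", J.Array [];                 (* no properties *)
  ]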
(* Reverse of json_of_entity; must follow the same convention for the order
 * of the fields.
 *)
let entity_of_json2 json =
  match json with
  | J.Object [
      "k", J.String e_kind;
      "n", J.String e_name;
      "fn", J.String e_fullname;
      "f", J.String e_file;
      "p", e_pos;
      (* different from type *)
      "cnt", J.Int e_number_external_users;
      "u", ids;
      "ps", properties
    ] ->
      {
        e_kind = entity_kind_of_string e_kind;
        e_name = e_name;
        e_file = e_file;
        e_fullname = e_fullname;
        e_pos = filepos_of_json e_pos;
        e_number_external_users = e_number_external_users;
        e_good_examples_of_use = ids_of_json ids;
        e_properties = properties_of_json properties;
      }
  | _ -> failwith "Bad json"

let entity_of_json a =
  Common.profile_code "Db.entity_of_json" (fun () ->
    entity_of_json2 a)

let database_of_json2 json =
  match json with
  | J.Object [
      "root", J.String db_root;
      "dirs", J.Array db_dirs;
      "files", J.Array db_files;
      "entities", J.Array db_entities
    ] ->
      {
        root = db_root;
        dirs = db_dirs |> List.map (fun json ->
          match json with
          | J.Array [J.String x; J.Int i] -> x, i
          | _ -> failwith "Bad json");
        files = db_files |> List.map (fun json ->
          match json with
          | J.Array [J.String x; J.Int i] -> x, i
          | _ -> failwith "Bad json");
        entities =
          db_entities |> List.map entity_of_json |> Array.of_list;
      }
  | _ -> failwith "Bad json"

let database_of_json json =
  Common.profile_code "Db.database_of_json" (fun () ->
    database_of_json2 json)

(*****************************************************************************)
(* Load/Save *)
(*****************************************************************************)

let load_database2 file =
  pr2 (spf "loading database: %s" file);
  if File_type.is_json_filename file
  then
    (* This code is mostly obsolete. It's more efficient to use Marshal
     * to store a big database. This should be used only when
* one wants to have a readable database.
     *)
    let json =
      Common.profile_code "Json_in.load_json" (fun () ->
        Json_io.load_json file)
    in
    database_of_json json
  else Common2.get_value file

let load_database file =
  Common.profile_code "Db.load_db" (fun () -> load_database2 file)

(* We allow saving in JSON format because it may be useful to let
 * the user edit or read the generated data.
*
* less: could use the more efficient json pretty printer, but really
 * Marshal is probably better. Only biniou could be a valid alternative.
 *)
let save_database database file =
  if File_type.is_json_filename file
  then
    database
    |> json_of_database
    |> Json_io.string_of_json ~compact:false ~recursive:false ~allow_nan:true
    |> Common.write_file ~file
  else Common2.write_value database file
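(* Illustration only, a hedged usage sketch: the filename decides the format,
 * so a .json filename goes through json_of_database/Json_io while anything
 * else (e.g. the default "PFFF_DB.marshall") uses OCaml marshalling. The
 * path below is made up; this binding is not used elsewhere in this file.
 *)
let _example_save_then_load () =
  let db = empty_database () in
  save_database db "/tmp/pfff_example.json";
  load_database "/tmp/pfff_example.json"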
(*****************************************************************************)
(* Entities categories *)
(*****************************************************************************)

(* coupling: if you add a new kind of entity, then
 * don't forget to modify size_font_multiplier_of_categ in code_map/
 *
 * How sure are we that this list is exhaustive? C-c for usedef2.
 *)
let entity_kind_of_highlight_category_def categ =
  match categ with
  | HC.Entity (kind, HC.Def2 _) -> Some kind
  | HC.FunctionDecl _ -> Some Prototype
  | HC.StaticMethod (HC.Def2 _) -> Some Method
  | HC.StructName HC.Def -> Some Type
  (* todo: what about other Def? like Label, Parameter, etc.? *)
  | _ -> None

let is_entity_def_category categ =
  entity_kind_of_highlight_category_def categ <> None

(* less: merge with the other function? *)
let entity_kind_of_highlight_category_use categ =
  match categ with
  | HC.Entity (kind, HC.Use2 _) -> Some kind
  | HC.FunctionDecl _ -> Some Function
  | HC.StaticMethod (HC.Use2 _) -> Some Method
  | HC.StructName HC.Use -> Some Class
  | _ -> None

let matching_def_short_kind_kind short_kind kind =
  match short_kind, kind with
  (* Struct/Union are generated as Type for now in graph_code_clang.ml *)
  | Class, Type -> true
  | Global, GlobalExtern -> true
  | Function, Prototype -> true
  | a, b -> a =*= b

(* See the code of the different highlight_code_xxx.ml to
* know the different possible pairs.
* todo: merge with other functions too?
 *)
let matching_use_categ_kind categ kind =
  match kind, categ with
  | kind1, HC.Entity (kind2, _) when kind1 =*= kind2 -> true

  | Prototype, HC.Entity (Function, _)
  | Constructor, HC.ConstructorMatch _
  | GlobalExtern, HC.Entity (Global, _)
  | Method, HC.StaticMethod _
  | ClassConstant, HC.Entity (Constant, _)
  (* tofix at some point, wrong tokenizer *)
  | Constant, HC.Local _
  | Global, HC.Local _
  | Function, HC.Local _
  | Constructor, HC.Entity (Global, _)
  | Function, HC.Builtin
  | Function, HC.BuiltinCommentColor
  | Function, HC.BuiltinBoolean
  (* because what looks like a constant is actually a partially applied func *)
  | Function, HC.Entity (Constant, _)
  (* function pointers in structure initializers (poor man's OO in C) *)
  | Function, HC.Entity (Global, _)
  (* function calls to a function pointer via direct syntax *)
  | GlobalExtern, HC.Entity (Function, _)
  | Global, HC.UseOfRef
  | Field, HC.UseOfRef
    -> true
  | _ -> false

(* In database_light_xxx we sometimes need, given a 'use', to increment
* the e_number_external_users counter of an entity. Nevertheless
* multiple entities may have the same name in which case looking
* for an entity in the environment will return multiple
 * entities of different kinds. Here we filter out the
 * invalid entities.
 *)
let entity_and_highlight_category_correpondance entity categ =
  let entity_kind_use =
    Common2.some (entity_kind_of_highlight_category_use categ) in
  entity.e_kind = entity_kind_use

(*****************************************************************************)
(* Misc *)
(*****************************************************************************)

(* When we compute the light database for a language we usually start
* by calling a function to get the set of files in this language
* (e.g. Lib_parsing_ml.find_all_ml_files) and then "infer"
* the set of directories used by those files by just calling dirname
* on them. Nevertheless in the search box of the visualizer we
 * want to propose for instance flib/herald even if there is no
 * PHP file directly in flib/herald, only files like flib/herald/lib/foo.php.
 * Having flib/herald/lib is not enough. Enter
 * alldirs_and_parent_dirs_of_relative_dirs,
* which will compute all the directories.
*
* It's a kind of 'find -type d' but reversed, using a set of complete dirs
* as the starting point. In fact we could define a
 * Common.dirs_of_dirs but then directories without any interesting files
* would be listed.
 *)
let alldirs_and_parent_dirs_of_relative_dirs dirs =
  dirs
  |> List.map Common2.inits_of_relative_dir
  |> List.flatten
  |> Common2.uniq_eff
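(* Illustration only, a hedged sketch of the behaviour described above,
 * assuming Common2.inits_of_relative_dir returns every prefix directory of a
 * relative dir, e.g. "flib/herald/lib" |-> ["flib"; "flib/herald";
 * "flib/herald/lib"]. This binding is not used elsewhere in this file.
 *)
let _example_alldirs () =
  (* expected, up to ordering: ["flib"; "flib/herald"; "flib/herald/lib"] *)
  alldirs_and_parent_dirs_of_relative_dirs ["flib/herald/lib"]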
let merge_databases db1 db2 =
  (* assert same root? then we can just add the fields *)
  if db1.root <> db2.root then begin
    pr2 (spf "merge_database: the root differs, %s != %s" db1.root db2.root);
    if not (Common2.y_or_no "Continue ?")
    then failwith "ok we stop"
  end;

  (* entities now contain references to other entities through
   * the index into the entities array. So concatenating 2 entity
   * arrays requires care.
   *)
  let length_entities1 = Array.length db1.entities in
  let db2_entities = db2.entities in
  let db2_entities_adjusted =
    db2_entities |> Array.map (fun e ->
      { e with
        e_good_examples_of_use =
          e.e_good_examples_of_use |> List.map (fun id -> id + length_entities1);
      })
  in
  {
    root = db1.root;
    dirs = (db1.dirs @ db2.dirs)
      |> Common.group_assoc_bykey_eff
      |> List.map (fun (file, xs) -> file, Common2.sum xs);
    files = db1.files @ db2.files; (* should ensure exclusive? *)
    entities = Array.append db1.entities db2_entities_adjusted;
  }

let build_top_k_sorted_entities_per_file2 ~k xs =
  xs
  |> Array.to_list
  |> List.map (fun e -> e.e_file, e)
  |> Common.group_assoc_bykey_eff
  |> List.map (fun (file, xs) ->
       file,
       (xs |> List.sort (fun e1 e2 ->
          (* high first *)
          compare e2.e_number_external_users e1.e_number_external_users)
        |> Common.take_safe k))
  |> Common.hash_of_list

let build_top_k_sorted_entities_per_file ~k xs =
  Common.profile_code "Db.build_sorted_entities" (fun () ->
    build_top_k_sorted_entities_per_file2 ~k xs)

let mk_dir_entity dir n = {
  e_name = Common2.basename dir ^ "/";
  e_fullname = "";
  e_file = dir;
  e_pos = { Common2.l = 1; c = 0 };
  e_kind = Dir;
  e_number_external_users = n;
  e_good_examples_of_use = [];
  e_properties = [];
}

let mk_file_entity file n = {
  e_name = Common2.basename file;
  e_fullname = "";
  e_file = file;
  e_pos = { Common2.l = 1; c = 0 };
  e_kind = File;
  e_number_external_users = n;
  e_good_examples_of_use = [];
  e_properties = [];
}

let mk_multi_dirs_entity name dirs_entities =
  let dirs_fullnames = dirs_entities |> List.map (fun e -> e.e_file) in
  {
    e_name = name ^ "//"; (* hack *)
    e_fullname = ""; (* hack *)
    e_file = Common.join "|" dirs_fullnames;
    e_pos = { Common2.l = 1; c = 0 };
    e_kind = MultiDirs;
    e_number_external_users = (* todo? *) List.length dirs_fullnames;
    e_good_examples_of_use = [];
    e_properties = [];
  }

let multi_dirs_entities_of_dirs es =
  let h = Hashtbl.create 101 in
  es |> List.iter (fun e ->
    Hashtbl.add h e.e_name e
  );
  let keys = Common2.hkeys h in
  keys |> Common.map_filter (fun k ->
    let vs = Hashtbl.find_all h k in
    if List.length vs > 1
    then Some (mk_multi_dirs_entity k vs)
    else None
  )

let files_and_dirs_database_from_files ~root files =
  (* quite similar to what we first do in a database_light_xxx.ml *)
  let dirs = files |> List.map Filename.dirname |> Common2.uniq_eff in
  let dirs = dirs |> List.map (fun s -> Common.readable ~root s) in
  let dirs = alldirs_and_parent_dirs_of_relative_dirs dirs in
  {
    root = root;
    dirs = dirs |> List.map (fun d -> d, 0); (* TODO *)
    files = files |> List.map (fun f -> Common.readable ~root f, 0); (* TODO *)
    entities = [||];
  }

let files_and_dirs_and_sorted_entities_for_completion2
    ~threshold_too_many_entities db =
  let nb_entities = Array.length db.entities in

  let dirs = db.dirs |> List.map (fun (dir, n) -> mk_dir_entity dir n) in
  let files = db.files |> List.map (fun (file, n) -> mk_file_entity file n) in
  let multidirs = multi_dirs_entities_of_dirs dirs in

  let xs =
    multidirs @ dirs @ files @
    (if nb_entities > threshold_too_many_entities
     then begin
       pr2 "Too many entities. Completion just for filenames";
       []
     end
     else
       (db.entities |> Array.to_list |> List.map (fun e ->
          (* we used to return 2 entities per entity by having
* both an entity with the short name and one with the long
* name, but now that we do a suffix search, no need
* to keep the short one
           *)
          if e.e_fullname = ""
          then e
          else { e with e_name = e.e_fullname })))
  in
  (* note: return the dirs and files first so that when offering
* completion the dirs and files will be proposed first
* (could also enforce this rule when building the gtk completion model).
   *)
  xs |> List.map (fun e ->
    (match e.e_kind with
     | MultiDirs -> 100
     | Dir -> 40
     | File -> 20
     | _ -> e.e_number_external_users
    ), e)
  |> Common.sort_by_key_highfirst
  |> List.map snd

let files_and_dirs_and_sorted_entities_for_completion
    ~threshold_too_many_entities a =
  Common.profile_code "Db.sorted_entities" (fun () ->
    files_and_dirs_and_sorted_entities_for_completion2
      ~threshold_too_many_entities a)

(* The e_number_external_users count is not always very accurate for methods
 * when we only do a very trivial class/method analysis for some languages.
 * This helper function can compensate for this approximation.
 *)
let adjust_method_or_field_external_users ~verbose entities =
  (* phase1: collect all method counts *)
  let h_method_def_count = Common2.hash_with_default (fun () -> 0) in
  entities |> Array.iter (fun e ->
    match e.e_kind with
    | Method | Field ->
        let k = e.e_name in
        h_method_def_count#update k (Common2.add1)
    | _ -> ()
  );
  (* phase2: adjust *)
  entities |> Array.iter (fun e ->
    match e.e_kind with
    | Method | Field ->
        let k = e.e_name in
        let nb_defs = h_method_def_count#assoc k in
        if nb_defs > 1 && verbose
        then pr2 ("Adjusting: " ^ e.e_fullname);
        let orig_number = e.e_number_external_users in
        e.e_number_external_users <- orig_number / nb_defs
    | _ -> ()
  );
  ()